BatchJobs loses track of running jobs on SLURM #100

Open
mshvartsman opened this issue Aug 13, 2015 · 3 comments
@mshvartsman

BatchJobs seems to lose track of running jobs on SLURM. This happens with both the CRAN and GitHub versions of BatchJobs. Below is a minimal example: note how waitForJobs() bails out with an error when the jobs "disappear", and showStatus() shows me going from 10 jobs down to 7. I think this may have to do with how the scheduler seems to rename jobs with square brackets depending on whether they are running or pending. I don't know whether this is standard for SLURM or something unusual about our system. If I wait for the jobs to finish, they "reappear". Any hints?

> # modified example from http://www.r-bloggers.com/configuring-the-r-batchjobs-package-for
> library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/ms44/R/x86_64-redhat-linux-gnu-library/3.1/BatchJobs/etc/BatchJobs_global_config.R'
BatchJobs configuration:
  cluster functions: Interactive
  mail.from:
  mail.to:
  mail.start: none
  mail.done: none
  mail.error: none
  default.resources:
  debug: FALSE
  raise.warnings: FALSE
  staged.queries: TRUE
  max.concurrent.jobs: Inf
  fs.timeout: NA

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BatchJobs_1.6 BBmisc_1.9

loaded via a namespace (and not attached):
 [1] base64enc_0.1-3 brew_1.0-6      checkmate_1.6.2 DBI_0.3.1
 [5] digest_0.6.8    fail_1.2        magrittr_1.5    parallel_3.1.2
 [9] RSQLite_1.0.0   sendmailR_1.2-1 stringi_0.5-5   stringr_1.0.0
[13] tools_3.1.2
> setConfig(list(debug=T, cluster.functions = makeClusterFunctionsSLURM('simple.tmpl'), default.resources = list(ntasks=1, ncpus=1, walltime="00:05:00", memory=100)))
> getConfig()
BatchJobs configuration:
  cluster functions: SLURM
  mail.from:
  mail.to:
  mail.start: none
  mail.done: none
  mail.error: none
  default.resources: ntasks=1, ncpus=1, walltime=00:05:00, memory=100
  debug: TRUE
  raise.warnings: FALSE
  staged.queries: TRUE
  max.concurrent.jobs: Inf
  fs.timeout: NA
> starts <- replicate(10, rnorm(100), simplify = FALSE)
> myFun  <- function(start) { median(start) }
> # create a registry
> reg <- makeRegistry(id = "batchtest", file.dir="/scratch/gpfs/ms44/batchtest")
Creating dir: /scratch/gpfs/ms44/batchtest
Saving registry: /scratch/gpfs/ms44/batchtest/registry.RData
> # submit
> ids  <- batchMap(reg, myFun, starts)
Adding 10 jobs to DB.
> testJob(reg)
Testing job with id=1 ...
Creating dir: /tmp/RtmpPOF1e6/58841790e5d8
Saving registry: /tmp/RtmpPOF1e6/58841790e5d8/registry.RData
Saving conf: /tmp/RtmpPOF1e6/58841790e5d8/conf.RData
### Output of new R process starts here ###
Loading required package: BBmisc
Loading required package: methods
Loading registry: /tmp/RtmpPOF1e6/58841790e5d8/registry.RData
Loading conf:
2015-08-13 12:53:11: Starting job on node della4.
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /home/ms44/batchjobs-test
########## Executing jid=1 ##########
Timestamp: 2015-08-13 12:53:11
BatchJobs job:
  Job id: 1
  Fun id: 4ac90f73b6ca67f5ab4e353046970cdf
  Fun formals:
  Name: NA
  Seed: 530690155
  Pars: <unnamed>=-1.12,0.2829...
Setting seed: 530690155
Result:
 num -0.143
NULL
Writing result file: /tmp/RtmpPOF1e6/58841790e5d8/jobs/1-result.RData
2015-08-13 12:53:11: All done.
Setting work back to: /home/ms44/batchjobs-test
Memory usage according to gc:
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 364804 19.5     667722 35.7   379536 20.3
Vcells 532824  4.1    1031040  7.9   605262  4.7
### Output of new R process ends here ###
### Approximate running time: 0.82 secs
[1] -0.1430377
> submitJobs(reg)
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0

$output
character(0)

Saving conf: /scratch/gpfs/ms44/batchtest/conf.RData
Submitting 10 chunks / 10 jobs.
Cluster functions: SLURM.
Auto-mailer settings: start=none, done=none, error=none.
Writing 10 R scripts...
SubmitJobs |+                                                |   0% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/01/1.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105844"

SubmitJobs |+++++                                            |  10% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/02/2.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105845"

SubmitJobs |++++++++++                                       |  20% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/03/3.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105846"

SubmitJobs |+++++++++++++++                                  |  30% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/04/4.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105847"

SubmitJobs |++++++++++++++++++++                             |  40% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/05/5.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105848"

SubmitJobs |++++++++++++++++++++++++                         |  50% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/06/6.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105849"

SubmitJobs |+++++++++++++++++++++++++++++                    |  60% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/07/7.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105850"

SubmitJobs |++++++++++++++++++++++++++++++++++               |  70% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/08/8.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105851"

SubmitJobs |+++++++++++++++++++++++++++++++++++++++          |  80% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/09/9.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105852"

SubmitJobs |++++++++++++++++++++++++++++++++++++++++++++     |  90% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/10/10.sb
OS result:
$exit.code
[1] 0

$output
[1] "Submitted batch job 5105853"

SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Sending 10 submit messages...
Might take some time, do not interrupt this!
> waitForJobs(reg)
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0

$output
 [1] "5105844_[1]" "5105845_[1]" "5105846_[1]" "5105847_[1]" "5105848_[1]"
 [6] "5105849_[1]" "5105850_[1]" "5105851_[1]" "5105852_[1]" "5105853_[1]"

Waiting [S:5 D:5 E:0 R:0] |+++++++++++++++++                 |  50% (00:00:41)OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0

$output
[1] "5105848_1"   "5105849_[1]" "5105850_[1]" "5105851_[1]" "5105852_[1]"
[6] "5105853_[1]"


Error in stop(e) :
  Some jobs disappeared, i.e. were submitted but are now gone. Check your configuration and template file.

> waitForJobs(reg)
> showStatus(reg)
Syncing registry ...
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0

$output
[1] "5105852_[1]" "5105853_[1]" "5105851_1"

Status for 10 jobs at 2015-08-13 12:54:04
Submitted: 10 (100.00%)
Started:    7 ( 70.00%)
Running:    0 (  0.00%)
Done:       7 ( 70.00%)
Errors:     0 (  0.00%)
Expired:    0 (  0.00%)
Time: min=0.00s avg=0.00s max=0.00s

If I wait for the jobs to finish and rerun showStatus, the jobs reappear:

> showStatus(reg)
Syncing registry ...
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0

$output
character(0)

Status for 10 jobs at 2015-08-13 12:56:06
Submitted: 10 (100.00%)
Started:   10 (100.00%)
Running:    0 (  0.00%)
Done:      10 (100.00%)
Errors:     0 (  0.00%)
Expired:    0 (  0.00%)
Time: min=0.00s avg=0.00s max=0.00s
> reduceResultsList(reg)
Reducing 10 results...
reduceResults |++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
$`1`
[1] -0.1430377

$`2`
[1] 0.01326604

$`3`
[1] 0.1663867

$`4`
[1] -0.1135874

$`5`
[1] -0.1887555

$`6`
[1] -0.09046936

$`7`
[1] -0.1852045

$`8`
[1] -0.0301674

$`9`
[1] 0.03331675

$`10`
[1] 0.09482852
@mshvartsman
Author

As far as I can tell, this comes down to how BatchJobs parses the output of squeue and how SLURM formats job ids. I am not a SLURM expert, but the format appears to be JID_SID for running jobs and JID_[SID_RANGE] for pending jobs, where SID is a step id and SID_RANGE is a range of step ids separated by "-". At least on our system, a job defaults to a single step id (1); additional step ids are generated for job arrays, or for calls of srun inside the sbatch script.

So for example, a pending job looks like 123456_[1] for a single job or 123456_[1-50] for a job array, while a running job looks like 123456_1 for a single job or 123456_45 for a job array. BatchJobs stores the id in its database as 123456. Then when it runs squeue, the id it gets back is 123456_1, which does not match, so the job is lost from the database until its result is checked back in.
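
To make the mismatch concrete, here is a toy comparison using the ids from the transcript above (just an illustration, not BatchJobs internals):

stored <- c("5105848", "5105849")           # ids recorded from "Submitted batch job ..."
squeue_out <- c("5105848_1", "5105849_[1]") # ids as squeue -h -o %i reports them here

stored %in% squeue_out
# [1] FALSE FALSE   -- both jobs look "disappeared" to the registry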

A simple fix (mshvartsman/BatchJobs@7e2a01a) strips everything after the underscore, and it works for me with chunks.as.arrayjobs=FALSE. With chunks.as.arrayjobs=TRUE, the jobs still disappear; I don't understand BatchJobs' internals well enough to see why.
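
The idea of the fix, roughly (a sketch of the normalization, not the exact patch):

squeue_out <- c("5105848_1", "5105849_[1]", "5105850_[1]")

# drop the underscore and everything after it, then treat the rest as the job id
as.integer(sub("_.*$", "", squeue_out))
# [1] 5105848 5105849 5105850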

Can anyone take a peek? Can anyone else on SLURM weigh in?

@mllg
Member

mllg commented Sep 30, 2015

I've just tested on our SLURM system; we get just the JID returned. I've patched clusterFunctionsSLURM to take only the leading numbers, which should hopefully have no side effects. I do not have access to a system with array jobs, so I cannot reproduce or test the chunking behaviour ... but if you can produce a similar output for chunked jobs, I can take a look.
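
Roughly this idea (a sketch, not the exact code in the patch):

ids <- c("5105848", "5105848_1", "5105849_[1]")

# keep only the leading run of digits, which is the SLURM job id in all three forms
as.integer(regmatches(ids, regexpr("^[0-9]+", ids)))
# [1] 5105848 5105848 5105849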

@mllg
Member

mllg commented Sep 30, 2015

See 8a0f95d.

@mllg added the bug label Sep 30, 2015