BatchJobs seems to lose track of running jobs on SLURM. This happens with both the CRAN and the GitHub version of BatchJobs. Below is a minimal example. Note how waitForJobs() bails with an error when the jobs "disappear", and showStatus() shows that I go from 10 to 7 jobs. I think this has to do with how the scheduler seems to rename jobs with square brackets depending on whether they are running or not. I don't know if this is standard for SLURM or something unusual about our system. If I wait for the jobs to finish, they "reappear". Any hints?
> # modified example from http://www.r-bloggers.com/configuring-the-r-batchjobs-package-for
> library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/ms44/R/x86_64-redhat-linux-gnu-library/3.1/BatchJobs/etc/BatchJobs_global_config.R'
BatchJobs configuration:
cluster functions: Interactive
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: FALSE
raise.warnings: FALSE
staged.queries: TRUE
max.concurrent.jobs: Inf
fs.timeout: NA
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BatchJobs_1.6 BBmisc_1.9
loaded via a namespace (and not attached):
[1] base64enc_0.1-3 brew_1.0-6 checkmate_1.6.2 DBI_0.3.1
[5] digest_0.6.8 fail_1.2 magrittr_1.5 parallel_3.1.2
[9] RSQLite_1.0.0 sendmailR_1.2-1 stringi_0.5-5 stringr_1.0.0
[13] tools_3.1.2
> setConfig(list(debug=T, cluster.functions = makeClusterFunctionsSLURM('simple.tmpl'), default.resources = list(ntasks=1, ncpus=1, walltime="00:05:00", memory=100)))
> getConfig()
BatchJobs configuration:
cluster functions: SLURM
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources: ntasks=1, ncpus=1, walltime=00:05:00, memory=100
debug: TRUE
raise.warnings: FALSE
staged.queries: TRUE
max.concurrent.jobs: Inf
fs.timeout: NA
> starts <- replicate(10, rnorm(100), simplify = FALSE)
> myFun <- function(start) { median(start) }
> # create a registry
> reg <- makeRegistry(id = "batchtest", file.dir="/scratch/gpfs/ms44/batchtest")
Creating dir: /scratch/gpfs/ms44/batchtest
Saving registry: /scratch/gpfs/ms44/batchtest/registry.RData
> # submit
> ids <- batchMap(reg, myFun, starts)
Adding 10 jobs to DB.
> testJob(reg)
Testing job with id=1 ...
Creating dir: /tmp/RtmpPOF1e6/58841790e5d8
Saving registry: /tmp/RtmpPOF1e6/58841790e5d8/registry.RData
Saving conf: /tmp/RtmpPOF1e6/58841790e5d8/conf.RData
### Output of new R process starts here ###
Loading required package: BBmisc
Loading required package: methods
Loading registry: /tmp/RtmpPOF1e6/58841790e5d8/registry.RData
Loading conf:
2015-08-13 12:53:11: Starting job on node della4.
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /home/ms44/batchjobs-test
########## Executing jid=1 ##########
Timestamp: 2015-08-13 12:53:11
BatchJobs job:
Job id: 1
Fun id: 4ac90f73b6ca67f5ab4e353046970cdf
Fun formals:
Name: NA
Seed: 530690155
Pars: <unnamed>=-1.12,0.2829...
Setting seed: 530690155
Result:
num -0.143
NULL
Writing result file: /tmp/RtmpPOF1e6/58841790e5d8/jobs/1-result.RData
2015-08-13 12:53:11: All done.
Setting work back to: /home/ms44/batchjobs-test
Memory usage according to gc:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 364804 19.5 667722 35.7 379536 20.3
Vcells 532824 4.1 1031040 7.9 605262 4.7
### Output of new R process ends here ###
### Approximate running time: 0.82 secs
[1] -0.1430377
> submitJobs(reg)
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0
$output
character(0)
Saving conf: /scratch/gpfs/ms44/batchtest/conf.RData
Submitting 10 chunks / 10 jobs.
Cluster functions: SLURM.
Auto-mailer settings: start=none, done=none, error=none.
Writing 10 R scripts...
SubmitJobs |+ | 0% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/01/1.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105844"
SubmitJobs |+++++ | 10% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/02/2.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105845"
SubmitJobs |++++++++++ | 20% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/03/3.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105846"
SubmitJobs |+++++++++++++++ | 30% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/04/4.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105847"
SubmitJobs |++++++++++++++++++++ | 40% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/05/5.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105848"
SubmitJobs |++++++++++++++++++++++++ | 50% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/06/6.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105849"
SubmitJobs |+++++++++++++++++++++++++++++ | 60% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/07/7.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105850"
SubmitJobs |++++++++++++++++++++++++++++++++++ | 70% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/08/8.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105851"
SubmitJobs |+++++++++++++++++++++++++++++++++++++++ | 80% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/09/9.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105852"
SubmitJobs |++++++++++++++++++++++++++++++++++++++++++++ | 90% (00:00:00)OS cmd: sbatch /scratch/gpfs/ms44/batchtest/jobs/10/10.sb
OS result:
$exit.code
[1] 0
$output
[1] "Submitted batch job 5105853"
SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Sending 10 submit messages...
Might take some time, do not interrupt this!
> waitForJobs(reg)
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0
$output
[1] "5105844_[1]" "5105845_[1]" "5105846_[1]" "5105847_[1]" "5105848_[1]"
[6] "5105849_[1]" "5105850_[1]" "5105851_[1]" "5105852_[1]" "5105853_[1]"
Waiting [S:5 D:5 E:0 R:0] |+++++++++++++++++ | 50% (00:00:41)OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0
$output
[1] "5105848_1" "5105849_[1]" "5105850_[1]" "5105851_[1]" "5105852_[1]"
[6] "5105853_[1]"
Error in stop(e) :
Some jobs disappeared, i.e. were submitted but are now gone. Check your configuration and template file.
> waitForJobs(reg)
> showStatus(reg)
Syncing registry ...
OS cmd: squeue -h -o %i -u $USER
OS result:
$exit.code
[1] 0
$output
[1] "5105852_[1]" "5105853_[1]" "5105851_1"
Status for 10 jobs at 2015-08-13 12:54:04
Submitted: 10 (100.00%)
Started: 7 ( 70.00%)
Running: 0 ( 0.00%)
Done: 7 ( 70.00%)
Errors: 0 ( 0.00%)
Expired: 0 ( 0.00%)
Time: min=0.00s avg=0.00s max=0.00s
If I wait for the jobs to finish and rerun showStatus(), the jobs reappear.
As far as I can tell, this has to do with how BatchJobs parses the output of squeue and how SLURM formats job ids. I am not a SLURM expert, but the format seems to be JID_SID for running jobs and JID_[SID_RANGE] for pending jobs, where SID is a step id and SID_RANGE is a range of step ids separated by "-". At least on our system, a job defaults to having a single step id (1); additional step ids are generated for job arrays, or for calls to srun inside the sbatch script.
So, for example, a pending job shows up as 123456_[1] for a single job or 123456_[1-50] for a job array, while a running job shows up as 123456_1 for a single job or 123456_45 for a job array. BatchJobs records the job in its database as 123456, so when it later runs squeue and gets back 123456_1, the job is "lost" from the database until its result is checked back in.
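To make the mismatch concrete, here is a small standalone R sketch (the squeue strings are invented to mirror the forms shown above; this is not BatchJobs code): stripping the bracketed/underscore suffix reduces every form squeue prints to the base job id that BatchJobs stores.

# hypothetical 'squeue -h -o %i' output: running, pending, array, and plain forms
squeue.out <- c("5105848_1", "5105849_[1]", "5105850_[1-50]", "5105851")
# drop everything from the first underscore onward to recover the base job id
gsub("_.*$", "", squeue.out)
# [1] "5105848" "5105849" "5105850" "5105851"

These normalized ids match what BatchJobs wrote to its database at submission time, which is why stripping the suffix makes the jobs "reappear".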
A simple fix (mshvartsman/BatchJobs@7e2a01a) strips everything after the underscore, and it seems to work for me with chunks.as.arrayjobs=F. With chunks.as.arrayjobs=T, the jobs still disappear; I don't understand BatchJobs' internals well enough to say why.
Can anyone take a peek? Can anyone else on SLURM weigh in?
I've just tested on our SLURM system; we get only the plain JID back. I've patched clusterFunctionsSLURM to take just the leading numbers, which should hopefully have no side effects. I do not have access to a system with array jobs, so I cannot reproduce or test the chunk issue. But if you can produce similar output for chunked jobs, I can take a peek at it.
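For reference, a minimal sketch in plain R of what "take the leading numbers" could look like (the squeue strings are made up from the output above; this is only an illustration, not necessarily the exact patch):

# hypothetical squeue output mixing running, pending, and plain ids
squeue.out <- c("5105848_1", "5105849_[1]", "5105850")
# keep only the leading run of digits, i.e. the base job id
as.integer(regmatches(squeue.out, regexpr("^[0-9]+", squeue.out)))
# [1] 5105848 5105849 5105850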