Exceeding cluster time limit & understanding # of rounds #69
That time limit is unfortunate!
The default is 6 rounds, with 243Mbp of the genome processed in round 6. You could shorten the total number of rounds by using the -genomeSampleSizeMax option. In case those three commands are all in the same job, it would also be a good idea to split them up into separate submissions.
Unfortunately, I'm already running those three commands in separate scripts, so round 6 is getting the full 72 hours to itself. Based on your response, it seems like my run time for round 6 isn't abnormal for plant genomes of these sizes (540-707 Mbp). I know RepeatModeler works best with assembled genomes, and all three of mine have been assembled and polished: all three have BUSCO scores around 93%, and two have N50s over 2.2 Mbp.

I read through the description on your website (http://www.repeatmasker.org/RepeatModeler/) as well as the usage from the help menu, and I don't see a breakdown of how to use -genomeSampleSizeMax effectively. For example, my job will time out in ~12 hours, and this is the last line in my .out file: "55% completed, 46:7:34 (hh:mm:ss) est. time remaining". Does this mean that ~134 Mbp (55% of 243 Mbp) of the genome has been processed? If so, I should be able to find out what percentage it times out at and set -genomeSampleSizeMax a tiny bit lower than that. Let me know if that sounds feasible. Because I haven't run RepeatModeler to completion, I don't know whether round 6 is the last step in the pipeline, and I don't want to leave no time for whatever needs to run next.

Final question: here are the sample stats printed right before round 6 started its "all-by-other comparisons": -- Sample Stats: Thanks for pointing out the sample size from round 5; now I see how the sample sizes for the rounds work out. I did not understand that before, so thank you for pointing that out. It looks like the 6 rounds together cover 60.37% of my genome. Does this mean that any genome larger than 403 Mbp will only have a fraction of its genome processed? If so, has anyone thought of workarounds, or is it not really an issue? I've never done it before, but I'm wondering if I can split my assembly FASTA files in half and run them separately. Any help would be greatly appreciated. Thank you!
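For what it's worth, the arithmetic in the question above can be sanity-checked with the per-round sample sizes discussed in this thread (40 Mbp for the RepeatScout round, then 3/9/27/81/243 Mbp for the RECON rounds). This is a back-of-envelope sketch, not output from RepeatModeler itself:

```shell
#!/bin/sh
# Per-round sample sizes (Mbp) as described in this thread:
# 40 (RepeatScout) then 3, 9, 27, 81, 243 (RECON rounds 2-6).
TOTAL=$(( 40 + 3 + 9 + 27 + 81 + 243 ))
echo "total sampled by default: ${TOTAL} Mbp"   # genomes above this are subsampled

# The "55% completed" line refers to round 6 alone: 55% of 243 Mbp.
DONE_R6=$(awk 'BEGIN { printf "%.0f", 0.55 * 243 }')
echo "round 6 processed so far: ~${DONE_R6} Mbp"

# 403 Mbp covering 60.37% of the genome implies a genome of roughly:
GENOME=$(awk 'BEGIN { printf "%.0f", 403 / 0.6037 }')
echo "implied genome size: ~${GENOME} Mbp"
```

The implied genome size lands inside the 540-707 Mbp range mentioned above, so the 60.37% figure is internally consistent with a 403 Mbp default sample.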
That statistic is based on the current round, so it means 134Mbp of round 6 have been processed. Rounds 1-5, totaling 160Mbp, had already been completed by that point. That figure is not quite exact, actually: the first 40Mbp sampled for RepeatScout can overlap the samples used for RECON analysis in rounds 2+, so the total unique sequence processed may be a bit less than 160Mbp.
Good catch.
It does. There is an underlying assumption there: that repetitive elements in the genome will be widespread and frequent enough to show up even in a sample. We do suggest increasing
This will not give you the best results. Information from previous rounds informs later rounds, so running two batches of the genome will give you a very redundant pair of libraries, as each run usually independently discovers approximately the same elements. This might be okay if all you are doing is masking, but it will cause issues down the road if you do any work with the library itself or if you do annotation instead of masking.
Thank you for the quick reply; I read issue #65. Based on what you've told me, unfortunately I'm going to have to lower -genomeSampleSizeMax in order to get this job completed within my university's limitations. I'm still a little confused about how I'm going to pull this off using -recoverDir. I want the 6th round to finish, but it seems like I can't risk the LTRPipeline + clustering steps starting and not finishing. Can I run two separate -recoverDir jobs to get everything completed?

- Job #1: -recoverDir without -LTRStruct, to finish round 6
- Job #2: -recoverDir with -LTRStruct

Can I use -recoverDir on a completed job, or is it only for the rounds of RepeatScout + RECON steps? I suspect I can't, based on issue #65, but I don't think I can fit round 6, the LTRPipeline, and the clustering steps all in the same run...
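For concreteness, the two-job split being asked about would look something like the sketch below. The commands are echoed rather than executed, RM_output_dir is a placeholder for the actual RM_* output directory, and whether -recoverDir can resume past the RECON rounds is exactly the open question in this thread:

```shell
#!/bin/sh
# Hypothetical two-job resubmission plan (placeholders, not a verified recipe).
RECOVER_DIR=RM_output_dir   # placeholder for the real RM_* directory
DB=My_species_01

# Job 1: resume and finish round 6 only (no LTR pipeline).
echo "RepeatModeler -pa 10 -recoverDir ${RECOVER_DIR} -database ${DB}"

# Job 2: a second resume that adds -LTRStruct, so the LTR pipeline and
# final clustering get a fresh 72-hour allocation of their own.
echo "RepeatModeler -pa 10 -recoverDir ${RECOVER_DIR} -LTRStruct -database ${DB}"
```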
I don't think you can finish round 6 at all, if it would take 130 hours.
OK, I understand that the round starts over when you use -recoverDir, and that I also need to lower -genomeSampleSizeMax because of my 72-hour limit. I apologize for not being clear that I understood those parts. Because this job takes so much time on my cluster, I want to ask one final question before submitting it. In your experience, how does the LTR pipeline compare with round 6 in terms of time? I'm just trying to estimate how much I should lower the -genomeSampleSizeMax option. Thank you again for all your help.
I'm sorry to say that I have no data for this on hand. I can look around for a few previous run results, but I believe it depends on genome composition rather than just size, so it will be hard to extrapolate.
For those hitting wall-time limits: try more threads! My initial run used only 4. For reference, I'm using a Red Hat Linux HPC cluster, running on a single node with 32 cores and 256G of memory.
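If it helps anyone sizing the -pa value mentioned in this thread, here is a minimal sketch for picking it from the node's core count. It assumes each parallel search batch drives roughly four rmblast threads, which is an assumption worth checking against your RepeatModeler version's README:

```shell
#!/bin/sh
# Derive a -pa value from available cores, assuming ~4 rmblast threads
# per parallel batch (an assumption; verify against your version's docs).
CORES=$(nproc)
PA=$(( CORES / 4 ))
if [ "$PA" -lt 1 ]; then PA=1; fi
echo "suggested -pa: ${PA}"
```

On a 32-core node like the one above, this suggests -pa 8, which keeps the job from oversubscribing the node.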
Thank you for developing such a useful program. This is my first time running RepeatModeler, and I'm still working out how to use it to the fullest extent. I'm really glad you have the "-recoverDir" option; it's working perfectly for me. All three of my genome assemblies made it to round 6 before hitting the 72-hour time limit on my university's cluster. Here is the issue: after restarting round 6 last night, I checked just now and it's only at 5%, with an estimated 130 hours to go. This round will never finish with my currently available resources.
I think I'm running the job at full capacity. I have a maximum of 40 threads/cores available per job submission, so I ran all three genomes with the "-pa 10" option.
Here are the lines in my script for the job I've submitted (I'm not including the variable paths):
```
$RepeatModeler/BuildDatabase -engine rmblast -name My_species_01 -dir $pilon
$RepeatModeler/RepeatModeler -pa 10 -LTRStruct -database My_species_01
$RepeatModeler/RepeatModeler -pa 10 -recoverDir $recoverDir -srand 1584548300 -LTRStruct -database My_species_01
```
I think I followed the manual and usage properly, but I wanted to post what I was submitting just in case. I should also add that the genome sizes for my three assemblies are between 540 and 707 Mbp.
I'm not finding this documented specifically, but how many rounds are there supposed to be? Is it possible to stop after 5 rounds? I plan on using RepeatMasker next, followed by PurgeHaplotigs (which over-purges all three of these assemblies, which is why I'm using RepeatModeler/RepeatMasker first). Any insight into these issues would be appreciated. Thank you!