Does --overwrite 0 recover RepeatModeler in progress? #252

Closed
kaede0e opened this issue Jan 26, 2022 · 13 comments
Labels
question Further information is requested

Comments

@kaede0e

kaede0e commented Jan 26, 2022

Hello,
Thanks for developing this comprehensive TE discovery pipeline. We are annotating TEs de novo in multiple plant genomes, which has taken much more computational time than we initially expected.
I managed to finish one genome (~220 Mb, ~23% TE content, 8 days of CPU time), but I am struggling to finish the pipeline for the others. For most of the genomes, the job times out in the middle of the RepeatModeler step. I tried running RepeatModeler separately to see whether that would be quicker, and discovered that it has a -recoverDir option to resume from where the previous run left off. So I was wondering whether the EDTA pipeline can recover results from a RepeatModeler run in progress instead of restarting from the beginning (I've been running the command: EDTA.pl --step final --overwrite 0).

Sincerely,
Kaede

@oushujun
Owner

Hi Kaede,

EDTA will pick up the RepeatModeler result if its final product, $genome.RM.consensi.fa, is detected. Otherwise it will run ${repeatmodeler}RepeatModeler -engine ncbi -pa $threads -database $genome.masked 2>/dev/null, which does not include the -recoverDir option and so probably won't reuse unfinished runs. You may try adding this parameter to the command on line 493 of your copy of EDTA.

Best,
Shujun
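
The decision described in this reply could be sketched as a dry-run shell helper. This is an illustration under stated assumptions, not EDTA's actual code: `plan_repeatmodeler` is a hypothetical function, and choosing the most recently modified `RM_*` directory as the run to resume is an assumption. The RepeatModeler flags are the ones quoted in this thread. The helper only prints what would be run.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of EDTA): prints the planned action.
# Usage: plan_repeatmodeler <genome> <threads>
plan_repeatmodeler() {
  genome=$1
  threads=$2

  # Case 1: the final product exists and is non-empty, so reuse it.
  if [ -s "$genome.RM.consensi.fa" ]; then
    echo "reuse $genome.RM.consensi.fa"
    return 0
  fi

  # Case 2: look for an earlier RM_* working directory.
  # Assumption: the most recently modified one is the run to resume.
  newest=""
  for d in RM_*/; do
    [ -d "$d" ] || continue
    if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
      newest=$d
    fi
  done

  cmd="RepeatModeler -engine ncbi -pa $threads -database $genome.masked"
  if [ -n "$newest" ]; then
    # Pass one explicit directory name rather than a glob such as RM_*.
    cmd="$cmd -recoverDir ${newest%/}"
  fi
  echo "$cmd"
}
```

Printing the command instead of executing it makes it possible to review the chosen directory before committing days of compute to it.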

@kaede0e
Author

kaede0e commented Jan 31, 2022

Hi Shujun,
Oh I see, thanks for the clarification. I'll try adding it.

Sincerely,
Kaede

@oushujun oushujun added the question Further information is requested label Jan 31, 2022
@oushujun
Owner

Please let me know if it works!

Shujun

@kaede0e
Author

kaede0e commented Feb 8, 2022

Hi Shujun,

Hmm, it seems RepeatModeler didn't run properly when I added --recoverDir RM_* to line 493. The previous run was in the middle of round-6, so I was hoping to pick up from there. But the job finished in a few hours (instead of days), and the files in my RM_* directory indicate that round-6 is incomplete. I didn't find the consensi.fa etc. files that are supposed to be there:
drwxr-xr-x 2 kaedeh 408K Jan 29 05:53 round-1
drwxr-xr-x 4 kaedeh 40K Jan 29 06:01 round-2
drwxr-xr-x 4 kaedeh 112K Jan 29 06:57 round-3
drwxr-xr-x 4 kaedeh 412K Jan 29 14:59 round-4
-rw-r--r-- 1 kaedeh 51M Feb 1 17:40 families.stk
drwxr-xr-x 4 kaedeh 1.5M Feb 1 17:46 round-5
drwxr-xr-x 2 kaedeh 796K Feb 6 16:57 round-6

The dates are all odd. This is the log file from the job, and I doubt that it properly completed all 6 rounds of RepeatModeler.

Tue Feb 8 04:04:35 PST 2022 Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                            Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

cat: 'RM_*/consensi.fa': No such file or directory
RepeatModeler is finished, but no consensi.fa files found.

I did get the genome.mod.EDTA.TElib.fa and genome.mod.EDTA.intact.gff3 output files (not empty), but I wonder whether this result ignores the RepeatModeler output.

What do you think? I guess the recoverDir extension did not work...
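
A quick way to check the state described above is to test each working directory for the top-level consensi.fa that the EDTA log looks for, and to report the highest round-N directory present. The sketch below is a minimal illustration: `rm_status` is a hypothetical helper, and treating a non-empty consensi.fa as the sign of a finished run is an assumption based on the log message in this comment.

```shell
#!/bin/sh
# Hypothetical helper: summarise one RepeatModeler working directory.
# Usage: rm_status <RM_dir>
rm_status() {
  dir=$1
  last=0
  for r in "$dir"/round-*; do
    [ -d "$r" ] || continue
    n=${r##*round-}
    case $n in
      ''|*[!0-9]*) continue ;;  # skip names that are not round-<number>
    esac
    if [ "$n" -gt "$last" ]; then
      last=$n
    fi
  done
  # Assumption: a non-empty top-level consensi.fa marks a finished run.
  if [ -s "$dir/consensi.fa" ]; then
    echo "$dir: complete (consensi.fa present, last round $last)"
  else
    echo "$dir: incomplete (no consensi.fa, last round $last)"
  fi
}
```

Calling it in a loop such as `for d in RM_*; do rm_status "$d"; done` gives one line per working directory.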

@oushujun
Owner

oushujun commented Feb 8, 2022 via email

@kaede0e
Author

kaede0e commented Feb 9, 2022

Hi Shujun,

The -recoverDir option does work if I run RepeatModeler separately. The command line looks like this:
RepeatModeler -database ${genus}_whole_genome -recoverDir RM_271804.WedFeb21527082022 -pa 1

I was going to make a copy of this unfinished run and test it, but I realized that the intermediate files were deleted by my first attempt (line 494: rm $genome.masked.nhr $genome.masked.nin $genome.masked.nnd $genome.masked.nni $genome.masked.nog $genome.masked.nsq), so I am missing the -database argument and can't redo it unless I restart from the beginning... Is there a way to retrieve these files, or do I need to restart?

Thanks,
Kaede

@oushujun
Owner

Hi Kaede,

These files can be regenerated with the indexing command:
${repeatmodeler}BuildDatabase -name $genome.masked -engine ncbi $genome.masked;

Best,
Shujun
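
Before attempting a recovery run, it may help to confirm whether the index files listed earlier in the thread are still present, and to rebuild them only if not. The following is a sketch under assumptions: `missing_db_files` and `rebuild_hint` are hypothetical helpers, the six extensions are the ones named in the rm command quoted above, and the rebuild command is only printed, not run. The masked FASTA itself must still exist, since it is the input to BuildDatabase.

```shell
#!/bin/sh
# Hypothetical helper: list the index files named in this thread that are missing.
# Usage: missing_db_files <database_name>
missing_db_files() {
  db=$1
  missing=""
  for ext in nhr nin nnd nni nog nsq; do
    [ -e "$db.$ext" ] || missing="$missing $db.$ext"
  done
  # Strip the leading space; prints an empty line when nothing is missing.
  echo "${missing# }"
}

# Dry run: print the rebuild command only when at least one file is missing.
# Usage: rebuild_hint <database_name>
rebuild_hint() {
  db=$1
  if [ -n "$(missing_db_files "$db")" ]; then
    echo "BuildDatabase -name $db -engine ncbi $db"
  fi
}
```

For example, `rebuild_hint genome.fa.masked` prints the BuildDatabase command from this reply when any index file is absent and prints nothing otherwise.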

@oushujun
Owner

oushujun commented Apr 6, 2022

@kaede0e has this been resolved?

@kaede0e
Author

kaede0e commented Apr 6, 2022

No, but we decided to move forward with running EDTA chromosome by chromosome to fit our computational resources.

@oushujun
Owner

oushujun commented Apr 6, 2022 via email

@kaede0e
Author

kaede0e commented Apr 6, 2022

Hi Shujun, I don't fully understand why running EDTA chromosome by chromosome will create false positives. Is it the raw TE library filtering stage that could misidentify TEs if I only use one chromosome at a time? If I combine the curated libraries from each chromosome without the additional steps mentioned in this pan-genome pipeline, why would there be false positives?

@oushujun
Owner

oushujun commented Apr 6, 2022 via email

@afurches

afurches commented Sep 26, 2024

For those hitting wall time limits, I recommend running RepeatModeler independently in recovery mode as described above, with the number of threads increased to the maximum supported by your node.

By using 32 threads, I was able to finish round 6 of my LINE analysis without needing to decrease the sample size, details here.
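
A minimal sketch of that suggestion is below. It assumes a Slurm-style scheduler that exports SLURM_CPUS_PER_TASK and falls back to the node's online CPU count otherwise. `detect_threads` and `recovery_command` are hypothetical helpers, and the command is printed rather than run. How the -pa value maps to CPU cores may differ between RepeatModeler versions, so check the help text of your installed version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: choose a thread count for the current node.
detect_threads() {
  if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
    # Assumption: running under Slurm with --cpus-per-task set.
    echo "$SLURM_CPUS_PER_TASK"
  elif command -v nproc >/dev/null 2>&1; then
    nproc
  else
    getconf _NPROCESSORS_ONLN
  fi
}

# Dry run: print a recovery command in the form used earlier in this thread.
# Usage: recovery_command <database> <RM_dir>
recovery_command() {
  database=$1
  recover_dir=$2
  echo "RepeatModeler -database $database -recoverDir $recover_dir -pa $(detect_threads)"
}
```

With SLURM_CPUS_PER_TASK=32 exported, `recovery_command mydb RM_dir` prints the command with `-pa 32`; the database and directory names here are placeholders.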
