Does --overwrite 0 recover RepeatModeler in progress? #252

kaede0e · 2022-01-26T19:05:05Z

Hello,
Thanks for developing this comprehensive TE discovery pipeline. We are annotating TEs de novo in multiple plant genomes, which has been taking a lot more computational time than we initially expected.
I managed to finish one genome (~220 Mb, ~23% TE content, 8 days of CPU time), but I am struggling to finish the pipeline for the others. For most of the genomes, the job seems to time out in the middle of the RepeatModeler step. I tried running RepeatModeler separately to see whether that would be quicker, and discovered that it has a -recoverDir option to start from where the previous run left off. So I was wondering whether the EDTA pipeline can recover results from a RepeatModeler run in progress instead of restarting from the beginning (I've been running the command: EDTA.pl --step final --overwrite 0).

Sincerely,
Kaede

oushujun · 2022-01-31T15:51:28Z

Hi Kaede,

EDTA will pick up the RepeatModeler result if its final product, $genome.RM.consensi.fa, is detected. Otherwise it will run ${repeatmodeler}RepeatModeler -engine ncbi -pa $threads -database $genome.masked 2>/dev/null, which does not include the -recoverDir option and so probably won't reuse unfinished runs. You could try adding this parameter to the command on line 493 of your copy of EDTA.
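As a rough illustration of the check described above, the sketch below tests for the final product before deciding whether a full RepeatModeler run is needed. The file name genome.fa.mod and the thread count are hypothetical placeholders, and the RepeatModeler command is only printed, not executed; this is not EDTA's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; substitute your own genome file and thread count.
genome="${genome:-genome.fa.mod}"
threads="${threads:-8}"

rm_result_exists() {
    # True when the final consensus library is present and non-empty.
    [ -s "$1.RM.consensi.fa" ]
}

if rm_result_exists "$genome"; then
    echo "Found $genome.RM.consensi.fa, so the RepeatModeler step can be skipped."
else
    # Printed rather than executed, so the sketch is safe to run anywhere.
    echo "Would run: RepeatModeler -engine ncbi -pa $threads -database $genome.masked"
fi
```

Running this in the EDTA working directory before resubmitting a job shows which branch the pipeline is likely to take.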

Best,
Shujun

kaede0e · 2022-01-31T18:39:21Z

Hi Shujun,
Oh I see, thanks for the clarification. I'll try adding it.

Sincerely,
Kaede

oushujun · 2022-01-31T23:28:39Z

Please let me know if it works!

Shujun

kaede0e · 2022-02-08T17:37:04Z

Hi Shujun,

Hmm, it seems RepeatModeler didn't run properly when I added the --recoverDir RM_* option to line 493. The previous run had stopped in round-6, so I was hoping to pick up from there. But the job finished in a few hours (instead of days), and the files in my RM_* directory indicate that round-6 is incomplete. I didn't find consensi.fa and the other files that are supposed to be there:
drwxr-xr-x 2 kaedeh 408K Jan 29 05:53 round-1
drwxr-xr-x 4 kaedeh 40K Jan 29 06:01 round-2
drwxr-xr-x 4 kaedeh 112K Jan 29 06:57 round-3
drwxr-xr-x 4 kaedeh 412K Jan 29 14:59 round-4
-rw-r--r-- 1 kaedeh 51M Feb 1 17:40 families.stk
drwxr-xr-x 4 kaedeh 1.5M Feb 1 17:46 round-5
drwxr-xr-x 2 kaedeh 796K Feb 6 16:57 round-6

The dates all look odd. Below is the log from the job; I doubt that it properly completed all 6 rounds of RepeatModeler.

Tue Feb 8 04:04:35 PST 2022 Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                            Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

cat: 'RM_*/consensi.fa': No such file or directory
RepeatModeler is finished, but no consensi.fa files found.

I did get the genome.mod.EDTA.TElib.fa and genome.mod.EDTA.intact.gff3 output files (not empty), but I wonder whether these results simply ignore the RepeatModeler output.

What do you think? I guess the recoverDir extension did not work...
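A quick way to confirm what the listing and log above suggest is to inspect each RM_* directory directly. This is a minimal sketch, assuming only the layout shown in this thread (round-N subdirectories plus a consensi.fa file once a run completes); the function name check_rm_dir is made up for illustration.

```shell
#!/usr/bin/env bash
check_rm_dir() {
    local dir="$1" last
    # A non-empty consensi.fa is treated as the sign of a completed run.
    if [ -s "$dir/consensi.fa" ]; then
        echo "complete"
        return 0
    fi
    # Otherwise report the highest-numbered round-N directory, if any.
    last=$(find "$dir" -maxdepth 1 -type d -name 'round-*' \
        | sed 's/.*round-//' | sort -n | tail -n 1)
    echo "incomplete (last round: ${last:-none})"
    return 1
}

# Report on every RM_* directory in the current working directory.
for d in RM_*/; do
    [ -d "$d" ] || continue
    printf '%s: ' "$d"
    check_rm_dir "$d"
done
```

An "incomplete" result means there is no consensi.fa for EDTA to collect, which matches the "no consensi.fa files found" message in the log.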

oushujun · 2022-02-08T17:41:08Z

Maybe you want to try this parameter on RepeatModeler itself first to make sure it will pick up from where it stopped. You can make a copy of the unfinished run and test it. You may find more discussion on their GitHub.

Shujun
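Copying the unfinished run before testing could look something like the sketch below. The helper name backup_rm_dir and the .bak suffix are illustrative choices, not part of EDTA or RepeatModeler.

```shell
#!/usr/bin/env bash
backup_rm_dir() {
    # $1 = unfinished RM_* directory; the copy is written next to it as <dir>.bak
    local src="${1%/}" dest
    dest="${src}.bak"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    # -R copies recursively, -p preserves timestamps and permissions.
    cp -Rp "$src" "$dest" && echo "$dest"
}

# Example (hypothetical directory name):
#   backup_rm_dir RM_12345.example
```

Keeping an untouched copy means a failed recovery attempt can be retried without rerunning the earlier rounds.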


kaede0e · 2022-02-09T17:47:03Z

Hi Shujun,

The -recoverDir option does work if I run RepeatModeler separately. The command line looks like this:
RepeatModeler -database ${genus}_whole_genome -recoverDir RM_271804.WedFeb21527082022 -pa 1

I was going to make a copy of this unfinished run and test it, but realized that the intermediate database files were deleted during my first attempt (line 494: rm $genome.masked.nhr $genome.masked.nin $genome.masked.nnd $genome.masked.nni $genome.masked.nog $genome.masked.nsq). That leaves me without the files for the -database argument, so I can't retry unless I restart from the beginning... Is there a way to retrieve these files, or do I need to restart?

Thanks,
Kaede

oushujun · 2022-02-10T13:32:05Z

Hi Kaede,

These files can be regenerated with the indexing command:
${repeatmodeler}BuildDatabase -name $genome.masked -engine ncbi $genome.masked;
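To see whether the index actually needs rebuilding, a check along these lines may help. It is only a sketch: the extension list is taken from the rm command quoted earlier in this thread, the database name is a hypothetical placeholder, and the BuildDatabase command is printed rather than executed. The masked FASTA itself ($genome.masked) must still exist for the rebuild to work.

```shell
#!/usr/bin/env bash
db_index_complete() {
    # $1 = database name (for EDTA, $genome.masked)
    local db="$1" ext missing=0
    for ext in nhr nin nnd nni nog nsq; do
        if [ ! -e "$db.$ext" ]; then
            echo "missing: $db.$ext"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical database name; substitute your own $genome.masked.
db="${db:-genome.fa.mod.masked}"
if db_index_complete "$db"; then
    echo "Index files for $db are all present."
else
    echo "Would run: BuildDatabase -name $db -engine ncbi $db"
fi
```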

Best,
Shujun

oushujun · 2022-04-06T07:14:59Z

@kaede0e was this resolved?

kaede0e · 2022-04-06T16:12:42Z

No, but we decided to move forward with running EDTA chromosome by chromosome to fit our computational resources.

oushujun · 2022-04-06T16:15:18Z

You will need the pan-genome method to combine the sublibraries and control false positives. Check out this work: https://github.com/HuffordLab/NAM-genomes/tree/master/te-annotation

Shujun


kaede0e · 2022-04-06T16:34:40Z

Hi Shujun, I don't fully understand why running EDTA chromosome by chromosome would create false positives. Is it that the raw TE library filtering stage could misidentify TEs when it only sees one chromosome at a time? If I combine the curated libraries from each chromosome without the additional steps in this pan-genome pipeline, why would there be false positives?

oushujun · 2022-04-06T16:38:03Z

Each EDTA run will have some false positives (FP) that cannot be fully removed, most of them low copy. Combining multiple runs will accumulate these FP, and the pan-genome module can effectively control them.

Shujun


afurches · 2024-09-26T15:41:10Z

For those hitting wall time limits, I recommend running RepeatModeler independently in recovery mode as described above, but increasing the number of threads to the maximum supported by your node.

By using 32 threads, I was able to finish round 6 of my LINE analysis without needing to decrease the sample size, details here.
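A sketch of how the recovery command might be assembled with the node's full core count is below. It assumes a Slurm scheduler (SLURM_CPUS_PER_TASK) with a fallback to the cores the operating system reports; the database and directory names are hypothetical, and the command is printed for review rather than executed. The -pa option is the one used earlier in this thread; other RepeatModeler releases may name the thread option differently, so check RepeatModeler -help for your version.

```shell
#!/usr/bin/env bash
# Assumption: SLURM_CPUS_PER_TASK is set inside a Slurm allocation; otherwise
# fall back to the number of online cores reported by the operating system.
threads="${SLURM_CPUS_PER_TASK:-$(getconf _NPROCESSORS_ONLN)}"

build_resume_cmd() {
    # $1 = database name, $2 = unfinished RM_* directory, $3 = thread count
    echo "RepeatModeler -database $1 -recoverDir $2 -pa $3"
}

# Hypothetical names; the command is printed instead of being executed.
build_resume_cmd "my_genome_db" "RM_12345.example" "$threads"
```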

oushujun added the question Further information is requested label Jan 31, 2022

oushujun closed this as completed Aug 22, 2023

This was referenced Sep 26, 2024

EDTA_raw.pl LINE analysis stalls #504

Closed

How to run EDTA in large genomes (>10Gb)? #61

Closed
