Does --overwrite 0 recover RepeatModeler in progress? #252
Comments
Hi Kaede, EDTA will pick up the RepeatModeler result if its final product Best, |
Hi Shujun, Sincerely, |
Please let me know if it works! Shujun |
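Hi Shujun,

Hmm, it seems that RepeatModeler didn't run properly when I added the --recoverDir RM_* option to line 493. The run had stopped during round-6, so I was hoping to pick up from there. But the job finished in a few hours (instead of days), and the files in my RM_* directory indicate that round-6 is incomplete. I didn't find consensi.fa and the other files that are supposed to be there:

drwxr-xr-x 2 kaedeh 408K Jan 29 05:53 round-1
drwxr-xr-x 4 kaedeh 40K Jan 29 06:01 round-2
drwxr-xr-x 4 kaedeh 112K Jan 29 06:57 round-3
drwxr-xr-x 4 kaedeh 412K Jan 29 14:59 round-4
-rw-r--r-- 1 kaedeh 51M Feb 1 17:40 families.stk
drwxr-xr-x 4 kaedeh 1.5M Feb 1 17:46 round-5
drwxr-xr-x 2 kaedeh 796K Feb 6 16:57 round-6

The dates are all odd. This is the log from the job, and I doubt that it properly completed all 6 rounds of RepeatModeler:

Tue Feb 8 04:04:35 PST 2022 Perform EDTA final steps to generate a non-redundant comprehensive TE library:
Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.
cat: 'RM_*/consensi.fa': No such file or directory
RepeatModeler is finished, but no consensi.fa files found.

I did get the genome.mod.EDTA.TElib.fa and genome.mod.EDTA.intact.gff3 output files (not empty), but I wonder whether they were produced without the RepeatModeler output.

What do you think? I guess the recoverDir option did not work... |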
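A quick way to confirm what the log above suggests is to check each RM_* directory for a non-empty consensi.fa and see which round it reached. The following is a minimal sketch, assuming the RM_*/round-N layout shown in the listing; the function names are hypothetical and are not part of EDTA or RepeatModeler.

```python
from pathlib import Path


def consensi_status(workdir="."):
    """Map each RM_* directory under workdir to True if it holds a non-empty consensi.fa."""
    status = {}
    for rm_dir in sorted(Path(workdir).glob("RM_*")):
        if not rm_dir.is_dir():
            continue
        consensi = rm_dir / "consensi.fa"
        status[rm_dir.name] = consensi.is_file() and consensi.stat().st_size > 0
    return status


def last_round(rm_dir):
    """Return the highest N among round-N subdirectories of rm_dir, or None if there are none."""
    rounds = []
    for path in Path(rm_dir).glob("round-*"):
        suffix = path.name.split("-", 1)[1]
        if path.is_dir() and suffix.isdigit():
            rounds.append(int(suffix))
    return max(rounds) if rounds else None


if __name__ == "__main__":
    # Report every RM_* directory found in the current working directory.
    for name, finished in consensi_status(".").items():
        print(name, "consensi.fa present:", finished, "last round:", last_round(name))
```

A directory that reports a last round of 6 but no consensi.fa matches the situation described here: the round directories exist, but the final library was never written.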
Maybe you want to try this parameter on RepeatModeler first to make sure it
will pick up from where it stopped. You can make a copy of the unfinished
run and test it. You may find more discussions on their github.
Shujun
…On Tue, Feb 8, 2022 at 12:37 PM kaede0e ***@***.***> wrote:
Hi Shujun,
Hmm it seems to indicate that the RepeatModeler didn't run properly when
I've added --recoverDir RM_* extension to the line 493. I was in the
process of round-6, so was hoping to pick up from there. But the job was
done in a few hours (instead of days) and the files in my RM_* are
indicating incompletion of round-6. I didn't find the consensi.fa etc.
files that are supposed to be there:
drwxr-xr-x 2 kaedeh 408K Jan 29 05:53 round-1
drwxr-xr-x 4 kaedeh 40K Jan 29 06:01 round-2
drwxr-xr-x 4 kaedeh 112K Jan 29 06:57 round-3
drwxr-xr-x 4 kaedeh 412K Jan 29 14:59 round-4
-rw-r--r-- 1 kaedeh 51M Feb 1 17:40 families.stk
drwxr-xr-x 4 kaedeh 1.5M Feb 1 17:46 round-5
drwxr-xr-x 2 kaedeh 796K Feb 6 16:57 round-6
The dates are all odd and this is the log file I had from the job and I
doubt that it properly completed all 6 rounds of RepeatModeler.
Tue Feb 8 04:04:35 PST 2022 Perform EDTA final steps to generate a
non-redundant comprehensive TE library:
Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.
cat: 'RM_*/consensi.fa': No such file or directory
RepeatModeler is finished, but no consensi.fa files found.
I did get genome.mod.EDTA.TElib.fa and genome.mod.EDTA.intact.gff3 output
files (not empty) but wonder if this was the result ignoring RepeatModeler
output.
What do you think? I guess the recoverDir extension did not work...
|
Hi Shujun,

The -recoverDir option does work if I run RepeatModeler separately. The command line looks like this:

I was going to make a copy of this unfinished run and test it, but realized that the intermediate files (line 494: rm $genome.masked.nhr $genome.masked.nin $genome.masked.nnd $genome.masked.nni $genome.masked.nog $genome.masked.nsq) were deleted by my first attempt. So I am missing the input for the -database argument, and I can't redo it unless I restart from the beginning...

Is there a way to retrieve these files, or do I need to restart?

Thanks, |
Hi Kaede, These files should be able to be regenerated by the indexing command: Best, |
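For readers following along, the two steps discussed here (rebuilding the deleted database files, then resuming RepeatModeler) can be sketched as below. This is not the command from the original comment. It assumes the database is named after the masked genome file, as the deleted $genome.masked.* files suggest, and that the masked FASTA has that same name. BuildDatabase -name, -database and -recoverDir are RepeatModeler options; the parallelism flag is -threads in recent releases and -pa in older ones, so check RepeatModeler -h for your version. The function name and the example directory name are hypothetical.

```python
import shlex


def recovery_commands(masked_genome, rm_dir, threads=32):
    """Build (but do not run) the commands to re-index the genome and resume RepeatModeler."""
    # Step 1: regenerate the database files that were deleted after the first attempt.
    build_db = ["BuildDatabase", "-name", masked_genome, masked_genome]
    # Step 2: resume from the unfinished RM_* directory.
    # Assumption: a RepeatModeler release that accepts -threads (older ones use -pa).
    resume = [
        "RepeatModeler",
        "-database", masked_genome,
        "-recoverDir", rm_dir,
        "-threads", str(threads),
    ]
    return [build_db, resume]


if __name__ == "__main__":
    # "RM_example" is a placeholder for the real RM_* directory name.
    for cmd in recovery_commands("genome.masked", "RM_example"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them makes it easy to review them, and to try them on a copy of the unfinished run first, as suggested earlier in the thread.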
@kaede0e has this been resolved? |
No, but we decided to move forward with running EDTA chromosome by chromosome to fit our computational resources. |
You will need the pan-genome method to combine sublibraries to control
false positives. Check out this work:
https://github.com/HuffordLab/NAM-genomes/tree/master/te-annotation
Shujun
…On Wed, Apr 6, 2022 at 9:12 AM kaede0e ***@***.***> wrote:
No, but we decided to move forward with running EDTA chromosome by
chromosome to fit our computational resources.
|
Hi Shujun, I don't fully understand why performing EDTA chromosome by chromosome will create false positives. Is it the raw TE library filtering stage that could misidentify TEs if I only use one chromosome at a time? If I combine the curated libraries from each chromosome without the additional steps mentioned in this pan-genome pipeline, why will there be false positives? |
Each EDTA run produces some false positives (FPs) that cannot be fully
removed. Most of them are low-copy. Combining multiple runs inflates these
FPs, and the pan module can effectively control them.
Shujun
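To make the argument above concrete, here is a toy illustration with made-up family IDs and counts. It is not the pan module's actual procedure, and the copy-number cutoff is a hypothetical parameter. It only shows that distinct low-copy FPs from separate runs add up when libraries are simply concatenated, and that a copy-number filter applied to the combined set can remove them.

```python
def combined_false_positives(per_run_fps):
    """Union the spurious family IDs from every run; distinct FPs accumulate."""
    combined = set()
    for fps in per_run_fps:
        combined |= set(fps)
    return combined


def drop_low_copy(copy_counts, min_copies):
    """Keep families whose copy count reaches min_copies (a hypothetical cutoff)."""
    return {family for family, copies in copy_counts.items() if copies >= min_copies}


if __name__ == "__main__":
    # Made-up example: three per-chromosome runs, two distinct spurious families each.
    runs = [{"fp_a", "fp_b"}, {"fp_c", "fp_d"}, {"fp_e", "fp_f"}]
    print(len(combined_false_positives(runs)))  # six spurious families after merging

    # Made-up copy counts: real TE families are multi-copy, spurious ones are low-copy.
    counts = {"te_1": 40, "te_2": 12, "fp_a": 1, "fp_c": 2}
    print(sorted(drop_low_copy(counts, min_copies=3)))
```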
…On Wed, Apr 6, 2022 at 9:34 AM kaede0e ***@***.***> wrote:
Hi Shujun, I don't fully understand why performing EDTA chromosome by
chromosome will create false positives. Is it the raw TE library filtering
stage that could misidentify TEs if I only use one chromosome at a time?
If I combine the curated libraries from each chromosome without the
additional steps mentioned in this pan-genome pipeline, why will there be
false positives?
|
For those hitting wall time limits, I recommend running RepeatModeler independently in recovery mode as described above, but with the number of threads increased to the maximum supported by your node. By using 32 threads, I was able to finish round 6 of my LINE analysis without needing to decrease the sample size, details here. |
Hello,
Thanks for developing this comprehensive TE discovery pipeline. We are currently aiming to annotate TEs de novo in multiple plant genomes, which has been taking a lot more computational time than we initially expected.
I managed to finish one genome (~220 Mb, ~23% TE content, 8 days of CPU time), but I am struggling to finish the pipeline for the others. For most of the genomes, it seems to time out in the middle of the RepeatModeler step. I tried running RepeatModeler separately to investigate whether that might do the job quicker, and discovered that it has a -recoverDir option to start from where the previous run left off. So, I was wondering whether the EDTA pipeline can recover results from a RepeatModeler run in progress instead of restarting from the beginning (I've been running the command: EDTA.pl --step final --overwrite 0).
Sincerely,
Kaede