EDTA_raw.pl LINE analysis stalls #504

Closed
afurches opened this issue Sep 18, 2024 · 3 comments

Comments

@afurches

afurches commented Sep 18, 2024

Hi,

I am running EDTA_raw.pl --type line and the analysis never finishes. It produced a round-6 directory with extensive contents, but I can see that the last time any files were written in the /LINE/ directory was over 10 hours ago. I checked the rmod.log and confirmed round 6 never finished. I also logged into the compute node and used top to see that the program eleredef is continuously running and using about 13% CPU and RepeatModeler is using about 0.2% memory.
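The "last file written" check described above can be scripted rather than done by eye. The following is a minimal sketch, not part of EDTA or RepeatModeler; the function names `newest_mtime` and `hours_since_last_write` are hypothetical, and the directory to inspect (for example the /LINE/ working directory) is whatever path you pass in.

```python
import os
import time


def newest_mtime(root):
    """Return the newest file modification time (epoch seconds) under root.

    Returns None when the tree contains no readable files.
    """
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                # The file may vanish while a job is still running; skip it.
                continue
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def hours_since_last_write(root, now=None):
    """Hours elapsed since any file under root was last modified, or None."""
    newest = newest_mtime(root)
    if newest is None:
        return None
    if now is None:
        now = time.time()
    return (now - newest) / 3600.0
```

Calling `hours_since_last_write()` on the RepeatModeler working directory gives a single number to compare against how long earlier rounds took. A long quiet period alone does not prove a hang, since some steps may run for hours without writing output.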

I found a few similar reports of stalling on the RepeatModeler GitHub, so the problem seems to lie with that program, but I can't find a solution. The contents of my rmod.log are not consistent with the example of a successful run posted by RepeatModeler.

RepeatModeler Version 2.0.3
===========================
Using output directory = /TEs/edta/ref.fa.mod.EDTA.raw/LINE/RM_92059.MonSep161431592024
Search Engine = rmblast 2.14.1+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.2
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1726511518
Database = ptricho_v4_noMTnoCP.fa.mod   - Sequences = 46
  - Bases = 392162179
Storage Throughput = fair ( 532.70 MB/s )


RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40030921 bp ( 40002558 non ambiguous )
   - Num Contigs Represented = 23
   - Sequence extraction : 00:00:17 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: 00:09:35 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 3000000 bp
   - Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:07 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:00:24 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 3000217 bp
       Num Contigs Represented = 19
       Non ambiguous bp:
             Initial: 3000217 bp
             After Masking: 2558809 bp
             Masked: 14.71 %
 -- Input Database Coverage: 3000217 bp out of 392162179 bp ( 0.77 % )
Sampling Time: 00:00:34 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 00:00:16 (hh:mm:ss) Elapsed Time, 4792 HSPs Collected
Round Time: 00:03:08 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 3
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 9000000 bp
   - Sequence extraction : 00:00:04 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:23 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:01:38 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 9024666 bp
       Num Contigs Represented = 19
       Non ambiguous bp:
             Initial: 9013128 bp
             After Masking: 6816277 bp
             Masked: 24.37 %
 -- Input Database Coverage: 12024883 bp out of 392162179 bp ( 3.07 % )
Sampling Time: 00:02:06 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 00:01:32 (hh:mm:ss) Elapsed Time, 18178 HSPs Collected
Round Time: 00:08:18 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 4
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 27000000 bp
   - Sequence extraction : 00:00:12 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:01:12 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:06:00 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 27045737 bp
       Num Contigs Represented = 23
       Non ambiguous bp:
             Initial: 27028912 bp
             After Masking: 19570070 bp
             Masked: 27.60 %
 -- Input Database Coverage: 39070620 bp out of 392162179 bp ( 9.96 % )
Sampling Time: 00:07:27 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 00:13:10 (hh:mm:ss) Elapsed Time, 85637 HSPs Collected
Round Time: 03:06:26 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 5
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 81000000 bp
   - Sequence extraction : 00:00:32 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:03:50 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:33:57 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 81130463 bp
       Num Contigs Represented = 32
       Non ambiguous bp:
             Initial: 81013744 bp
             After Masking: 53211635 bp
             Masked: 34.32 %
 -- Input Database Coverage: 120201083 bp out of 392162179 bp ( 30.65 % )
Sampling Time: 00:38:24 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 01:17:10 (hh:mm:ss) Elapsed Time, 272900 HSPs Collected
Round Time: 15:01:23 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 6
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 243000000 bp
   - Sequence extraction : 00:01:40 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:11:22 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 05:18:09 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 243427817 bp
       Num Contigs Represented = 44
       Non ambiguous bp:
             Initial: 243012899 bp
             After Masking: 144141161 bp
             Masked: 40.69 %
 -- Input Database Coverage: 363628900 bp out of 392162179 bp ( 92.72 % )
Sampling Time: 05:31:26 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 09:17:38 (hh:mm:ss) Elapsed Time, 923986 HSPs Collected
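To confirm from a log like the one above which round never finished, one option is to look for a `Round Time:` line after each `RepeatModeler Round #` header. This is a rough sketch based only on the line formats visible in this log; `round_times` and `last_round_finished` are hypothetical helper names, and other RepeatModeler versions may format their logs differently.

```python
import re

ROUND_RE = re.compile(r"^RepeatModeler Round # (\d+)")
ROUND_TIME_RE = re.compile(r"^Round Time: (\d+):(\d{2}):(\d{2})")


def round_times(log_text):
    """Map round number -> round time in seconds (None if no 'Round Time' line)."""
    times = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = ROUND_RE.match(line)
        if header:
            current = int(header.group(1))
            times[current] = None
            continue
        finished = ROUND_TIME_RE.match(line)
        if finished and current is not None:
            hours, minutes, seconds = (int(x) for x in finished.groups())
            times[current] = hours * 3600 + minutes * 60 + seconds
    return times


def last_round_finished(log_text):
    """Return (last_round_number, has_round_time), or None if no round was logged.

    In the log above, round 1 reports a RepeatScout time instead of a
    'Round Time' line, so this check is only meaningful for round 2 onward.
    """
    times = round_times(log_text)
    if not times:
        return None
    last = max(times)
    return last, times[last] is not None
```

Applied to the log in this post, the last round found is 6 and it has no `Round Time:` line. That matches the observation that round 6 stopped after the all-by-other comparison step.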

Have you seen this before or have any suggestions?

Thanks,
Anna

@afurches
Author

Hi, I was able to finish the analysis successfully by running RepeatModeler independently in recovery mode (-recoverDir) and by raising the thread setting well above the recommended number (-pa 31). Details are in #252 (comment).
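The exact command is not given in this thread, so the following is only a sketch of how such a recovery run might be assembled from the options named above (`-recoverDir`, `-pa`) plus `-database`. The helper `recovery_command` is hypothetical. The example values are the database name and output directory printed in the rmod.log header; whether they are the right paths for your working directory is an assumption.

```python
import shlex


def recovery_command(database, recover_dir, parallel_jobs):
    """Build an argument list for resuming RepeatModeler from a previous run.

    database      -- name of the existing RepeatModeler database
    recover_dir   -- the RM_* output directory of the stalled run
    parallel_jobs -- value passed to -pa
    """
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    return [
        "RepeatModeler",
        "-database", database,
        "-recoverDir", recover_dir,
        "-pa", str(parallel_jobs),
    ]


if __name__ == "__main__":
    cmd = recovery_command(
        "ptricho_v4_noMTnoCP.fa.mod",
        "/TEs/edta/ref.fa.mod.EDTA.raw/LINE/RM_92059.MonSep161431592024",
        31,
    )
    # Print a shell-quoted command line for review; nothing is executed here.
    print(shlex.join(cmd))
```

Printing the command instead of running it makes it easy to check the paths before committing a compute node to a multi-hour job.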

@oushujun
Owner

oushujun commented Sep 28, 2024 via email

Thank you for sharing your experience. I am glad it worked out! Shujun

On Thu, Sep 26, 2024 at 11:44 AM afurches wrote:
> Hi, I was able to finish the analysis successfully by running RepeatModeler independently, using recovery mode (-recoverDir) and by dramatically increasing the threads over the recommended number (-pa 31). Details here (#252 (comment)).

@tinyfallen

tinyfallen commented Dec 4, 2024

I also found the LINE step very time-consuming: it took ~27 h of a ~35 h total run, using version 2.2.2 with --anno 1 and --sensitive 1. The task completes successfully, but could this step be optimized further?
In addition, I found that the -pa parameter has been deprecated since RepeatModeler 2.0.4, when -threads was introduced. Currently, Anaconda provides RepeatModeler v2.0.6 and RepeatMasker v4.1.5 (while the official website provides v4.1.7). The -pa parameter can make thread allocation and utilization misleading: EDTA_raw.pl in EDTA v2.2.2 seems to use four times the requested threads because it passes -pa $threads in the RepeatModeler steps.
Could you update these EDTA dependencies in the yaml file and modify the scripts to take advantage of these recent updates?
I am currently busy annotating several genomes. If an update is planned, I may wait for it so that all TE annotation steps are done with the same version of EDTA and stay comparable. If not, I will continue with v2.2.2.
Your excellent tools help me a lot! Many thanks!
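To illustrate the thread accounting described above, here is a hedged sketch of how a wrapper could turn a total thread budget into RepeatModeler arguments. It is not EDTA's code. The 2.0.4 cutoff and the factor of four come from the observations in this comment, and `thread_args` and `effective_threads` are hypothetical names.

```python
def parse_version(version):
    """Turn a dotted version string such as '2.0.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def effective_threads(parallel_jobs, cores_per_job=4):
    """Approximate cores used when -pa is given (assumes 4 cores per search job)."""
    return parallel_jobs * cores_per_job


def thread_args(version, total_threads, cores_per_job=4):
    """Choose between -threads and -pa for a given total thread budget.

    Assumption: -threads is available from RepeatModeler 2.0.4 onward, and
    each -pa job uses cores_per_job cores, as reported in this thread.
    """
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    if parse_version(version) >= (2, 0, 4):
        return ["-threads", str(total_threads)]
    # Older versions: divide the budget so jobs * cores_per_job stays within it.
    jobs = max(1, total_threads // cores_per_job)
    return ["-pa", str(jobs)]
```

Under these assumptions, a 32-thread budget becomes `-pa 8` for version 2.0.3 and `-threads 32` for version 2.0.6, while passing the budget straight through as `-pa 32` would correspond to roughly 128 cores.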
