EDTA_raw.pl LINE analysis stalls #504

afurches · 2024-09-18T16:52:26Z

Hi,

I am running EDTA_raw.pl --type line and the analysis never finishes. The run produced a round-6 directory with extensive contents, but the last write to any file in the /LINE/ directory was over 10 hours ago. I checked rmod.log and confirmed that round 6 never finished. I also logged into the compute node and used top: the program eleredef is running continuously at about 13% CPU, and RepeatModeler is using about 0.2% of memory.
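The "last write" check described above can be scripted so it is easy to repeat while the job runs. This is a minimal sketch, not part of EDTA or RepeatModeler; the function names are hypothetical and the directory to scan is whatever LINE output directory the run uses.

```python
import os
import time


def newest_mtime(root):
    """Return (path, mtime) of the most recently modified file under root, or None."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file disappeared between listing and stat
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def hours_since_last_write(root, now=None):
    """Hours since any file under root was last modified (None if root has no files)."""
    found = newest_mtime(root)
    if found is None:
        return None
    now = time.time() if now is None else now
    return (now - found[1]) / 3600.0
```

A long gap on its own does not prove a hang, since a single step may run for hours without writing, so it is worth reading the result together with top and rmod.log.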

I found a few similar reports of stalling on the RepeatModeler GitHub, so the problem seems to lie with that program, but I can't find a solution. The contents of my rmod.log are not consistent with this example of a successful run posted by RepeatModeler.

RepeatModeler Version 2.0.3
===========================
Using output directory = /TEs/edta/ref.fa.mod.EDTA.raw/LINE/RM_92059.MonSep161431592024
Search Engine = rmblast 2.14.1+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.2
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1726511518
Database = ptricho_v4_noMTnoCP.fa.mod   - Sequences = 46
  - Bases = 392162179
Storage Throughput = fair ( 532.70 MB/s )


RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40030921 bp ( 40002558 non ambiguous )
   - Num Contigs Represented = 23
   - Sequence extraction : 00:00:17 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: 00:09:35 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 3000000 bp
   - Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:07 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:00:24 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 3000217 bp
       Num Contigs Represented = 19
       Non ambiguous bp:
             Initial: 3000217 bp
             After Masking: 2558809 bp
             Masked: 14.71 %
 -- Input Database Coverage: 3000217 bp out of 392162179 bp ( 0.77 % )
Sampling Time: 00:00:34 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 00:00:16 (hh:mm:ss) Elapsed Time, 4792 HSPs Collected
Round Time: 00:03:08 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 3
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 9000000 bp
   - Sequence extraction : 00:00:04 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:23 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:01:38 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 9024666 bp
       Num Contigs Represented = 19
       Non ambiguous bp:
             Initial: 9013128 bp
             After Masking: 6816277 bp
             Masked: 24.37 %
 -- Input Database Coverage: 12024883 bp out of 392162179 bp ( 3.07 % )
Sampling Time: 00:02:06 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 00:01:32 (hh:mm:ss) Elapsed Time, 18178 HSPs Collected
Round Time: 00:08:18 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 4
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 27000000 bp
   - Sequence extraction : 00:00:12 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:01:12 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:06:00 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 27045737 bp
       Num Contigs Represented = 23
       Non ambiguous bp:
             Initial: 27028912 bp
             After Masking: 19570070 bp
             Masked: 27.60 %
 -- Input Database Coverage: 39070620 bp out of 392162179 bp ( 9.96 % )
Sampling Time: 00:07:27 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 00:13:10 (hh:mm:ss) Elapsed Time, 85637 HSPs Collected
Round Time: 03:06:26 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 5
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 81000000 bp
   - Sequence extraction : 00:00:32 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:03:50 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 00:33:57 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 81130463 bp
       Num Contigs Represented = 32
       Non ambiguous bp:
             Initial: 81013744 bp
             After Masking: 53211635 bp
             Masked: 34.32 %
 -- Input Database Coverage: 120201083 bp out of 392162179 bp ( 30.65 % )
Sampling Time: 00:38:24 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 01:17:10 (hh:mm:ss) Elapsed Time, 272900 HSPs Collected
Round Time: 15:01:23 (hh:mm:ss) Elapsed Time


RepeatModeler Round # 6
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 243000000 bp
   - Sequence extraction : 00:01:40 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:11:22 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
   - TE Masking time 05:18:09 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 243427817 bp
       Num Contigs Represented = 44
       Non ambiguous bp:
             Initial: 243012899 bp
             After Masking: 144141161 bp
             Masked: 40.69 %
 -- Input Database Coverage: 363628900 bp out of 392162179 bp ( 92.72 % )
Sampling Time: 05:31:26 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
Comparison Time: 09:17:38 (hh:mm:ss) Elapsed Time, 923986 HSPs Collected
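To see at a glance which rounds in a log like the one above completed, the round headers and "Round Time" lines can be parsed. This is a rough sketch that assumes only the two line formats visible in this log; the function name is hypothetical.

```python
import re

ROUND_RE = re.compile(r"^RepeatModeler Round # (\d+)")
TIME_RE = re.compile(r"^Round Time: (\d+):(\d{2}):(\d{2})")


def round_times(log_text):
    """Map round number -> round time in seconds, or None if no 'Round Time' line was logged."""
    rounds = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = ROUND_RE.match(line)
        if header:
            current = int(header.group(1))
            rounds[current] = None  # filled in if a Round Time line follows
            continue
        timing = TIME_RE.match(line)
        if timing and current is not None:
            hours, minutes, seconds = (int(x) for x in timing.groups())
            rounds[current] = hours * 3600 + minutes * 60 + seconds
    return rounds
```

On this log, round 1 also maps to None because no "Round Time" line was printed for it, so the round of interest is the last one without a time. The logged round times grow from about 3 minutes in round 2 to about 15 hours in round 5, and round 6 samples three times as many bases as round 5.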

Have you seen this before or have any suggestions?

Thanks,
Anna


afurches · 2024-09-26T15:43:43Z

Hi, I was able to finish the analysis successfully by running RepeatModeler independently in recovery mode (-recoverDir) and by dramatically increasing the threads over the recommended number (-pa 31). Details here (#252 comment).
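For readers who want to try the same approach, a recovery command along these lines could be assembled as below. This is only a sketch: the exact invocation is in the linked comment and is not reproduced here. The database name and output directory are copied from the rmod.log header above, and whether the database must be given as a full path is an assumption to check locally. The helper name is hypothetical.

```python
import shlex


def repeatmodeler_recover_cmd(database, recover_dir, parallel_jobs):
    """Build (but do not run) a RepeatModeler command that resumes a previous run."""
    return [
        "RepeatModeler",
        "-database", database,        # name of the existing RepeatModeler database
        "-recoverDir", recover_dir,   # RM_* directory left behind by the stalled run
        "-pa", str(parallel_jobs),    # number of parallel search jobs
    ]


# Values taken from the log header in this thread; adjust them to your own run.
cmd = repeatmodeler_recover_cmd(
    "ptricho_v4_noMTnoCP.fa.mod",
    "/TEs/edta/ref.fa.mod.EDTA.raw/LINE/RM_92059.MonSep161431592024",
    31,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a job script, or the list can be passed to subprocess.run once the paths have been verified.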

oushujun · 2024-09-28T02:03:14Z

Thank you for sharing your experience. I am glad it worked out! Shujun

tinyfallen · 2024-12-04T15:04:35Z

I also found the LINE step very time-consuming: it took about 27 h of a roughly 35 h total run, using EDTA v2.2.2 with --anno 1 and --sensitive 1. The task completes successfully, but could this step be optimized further?
In addition, I found that the -pa parameter has been deprecated since RepeatModeler 2.0.4 in favor of -threads. Anaconda currently provides RepeatModeler v2.0.6 and RepeatMasker v4.1.5 (the official website provides v4.1.7). The -pa parameter can lead to misleading thread allocation and utilization: EDTA_raw.pl in EDTA v2.2.2 passes -pa $threads in its RepeatModeler steps, so it seems to use four times the requested number of threads.
Would you update these EDTA dependencies in the yaml file and modify the scripts to take advantage of these recent updates?
I am currently annotating several genomes. If an update is planned, I may wait for it so that all TE annotation steps are run with the same EDTA version and remain comparable. If not, I will continue with v2.2.2.
Your excellent tools help me a lot! Many thanks!
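The thread arithmetic behind this concern can be sketched as follows. The factor of 4 cores per -pa job is taken from the 4-fold figure reported above and should be checked against the documentation of the installed RepeatModeler version. The function names are hypothetical.

```python
CORES_PER_PA_JOB = 4  # assumption: each -pa job occupies 4 cores, per the 4-fold figure above


def cores_used_with_pa(pa_jobs, cores_per_job=CORES_PER_PA_JOB):
    """Cores occupied when RepeatModeler is started with -pa <pa_jobs>."""
    return pa_jobs * cores_per_job


def pa_for_thread_budget(threads, cores_per_job=CORES_PER_PA_JOB):
    """Largest -pa value that stays within a given thread budget (at least 1)."""
    return max(1, threads // cores_per_job)


print(cores_used_with_pa(31))    # cores implied by -pa 31 under this assumption
print(pa_for_thread_budget(32))  # -pa value that fits a 32-thread allocation
```

Under this assumption, passing the user's thread count directly as -pa requests four times the allocated cores, whereas dividing by four first keeps the job within its allocation.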

afurches closed this as completed Sep 26, 2024

tinyfallen mentioned this issue Dec 5, 2024, in TIR-learner overutilizes threads #523 (open)
