Clustering step of LTR pipeline fails #260

sjd028 · 2024-10-23T20:24:35Z

Describe the issue

When running RepeatModeler, I consistently hit the same failure at the clustering step of the LTR pipeline. The error message is essentially the same as in #241. However, shortening the sequence identifiers to fewer than 13 characters, as described in #241, does not help. I have also tried shortening the genome name and the database name, and I still get exactly the same error.

I have tried three different genomes, and all of them give the same error. The RECON/RepeatScout pipeline seems to work fine, and I get a -families.fa file containing the consensus families, excluding LTRs.

This is the error report I am getting in the stderr file:
LTRPipeline : Error - could not open /home/sjd028/RepeatModelerTesting/AterTest/RM_1178777.SatOct51620362024/LTR_2708924.WedOct91432322024/clusters.dat! at /opt/RepeatModeler/LTRPipeline line 333.

This is the error I am getting in the stdout file:
LTR Structural Analysis

Running LtrHarvest... : 00:35:17 (hh:mm:ss) Elapsed Time
Running Ltr_retriever... : 00:43:56 (hh:mm:ss) Elapsed Time
Aligning instances... : 00:04:37 (hh:mm:ss) Elapsed Time
Clustering...LTRPipeline: Error - could not cluster MAFFT results.
: 00:00:00 (hh:mm:ss) Elapsed Time
LTRPipeline Time: 01:23:53 (hh:mm:ss) Elapsed Time
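
Because the stderr message points at a missing clusters.dat inside an LTR_* directory, one quick check is whether that file was ever written. This is a minimal sketch, assuming the RM_*/LTR_* layout shown in the error path. `report_clusters` is a hypothetical helper, not part of RepeatModeler, and the intermediate files may already have been cleaned up unless the run kept them.

```shell
#!/usr/bin/env bash
# Sketch: report whether each LTR_* working directory under a RepeatModeler
# run directory contains a non-empty clusters.dat.

# report_clusters is a hypothetical helper, not part of RepeatModeler.
# Usage: report_clusters /path/to/RM_run_directory
report_clusters() {
    run_dir="$1"
    for ltr_dir in "$run_dir"/LTR_*/; do
        # Skip the unexpanded pattern when no LTR_* directory exists.
        [ -d "$ltr_dir" ] || continue
        # -s is true only if the file exists and is larger than zero bytes.
        if [ -s "${ltr_dir}clusters.dat" ]; then
            echo "${ltr_dir}clusters.dat: present"
        else
            echo "${ltr_dir}clusters.dat: missing or empty"
        fi
    done
}
```

A "missing or empty" result only confirms that the clustering step produced nothing. It does not say why.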

Reproduction steps
I ran RepeatModeler from a Singularity container on a computing cluster, giving the job 8 cores at 16 GB per core. This is the command I used:
singularity run $dfam RepeatModeler -database AterDbTest1 -threads 20 -LTRStruct

I tried three different genomes:
Drosophila melanogaster: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000778455.1/
Abscondita terminalis (firefly): https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_013368085.1/
Lamprigera yunnana (firefly): https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_013368075.1/

Log output
File structure output:

AterDbTest1-rmod.log:
AterDbTest1-rmod.log
slurm (computing cluster job manager) output file:
slurm.hpc-4.272297.stdout.txt

Host system
This was run on a computing cluster running Linux. More info:
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: Rocky
Description: Rocky Linux release 8.9 (Green Obsidian)
Release: 8.9
Codename: GreenObsidian

Singularity version: apptainer version 1.3.1-1.el8
The singularity container was downloaded on July 2, 2024

sjd028 · 2024-11-05T18:43:08Z

Additional info about host system:

The Dfam TETools container was installed using Singularity. RepeatModeler is version 2.0.5, and the TETools package is version 1.88.

rmhubley · 2024-11-12T22:32:42Z

First of all, you are allocating 8 cores for your job but telling RepeatModeler it has access to 20. I am surprised your job wasn't killed sooner, while it was running rmblast, but it could be that MAFFT is overallocating cores and the job is getting killed. MAFFT is also memory intensive, so I would double-check that you are actually giving your job 8x16GB; that should be adequate, but perhaps it is getting less than that? Finally, you can rerun the LTR analysis separately for testing purposes like so: "./LTRPipeline -debug -threads # genome.fa" (NOTE: for this command you give it the original genome in FASTA format). This will generate more on-screen logging of what it is doing at each stage and keep additional files in the LTR_######## output directory.
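
A minimal sketch of the core-count point, assuming the job is submitted through Slurm as in the report: take the thread count from the allocation instead of hard-coding 20. The `#SBATCH` values mirror the 8 cores at 16 GB per core described above. `pick_threads` and the `CONTAINER.sif` placeholder are hypothetical names, and `SLURM_CPUS_PER_TASK` is only set when the job requests `--cpus-per-task`.

```shell
#!/usr/bin/env bash
#SBATCH --cpus-per-task=8     # assumption: cores requested via --cpus-per-task
#SBATCH --mem-per-cpu=16G     # 8 cores x 16 GB per core, as in the report

# pick_threads is a hypothetical helper, not part of RepeatModeler.
# It returns the core count Slurm allocated, or 1 when the variable is
# unset, so the tools are never told about cores the job does not have.
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads="$(pick_threads)"

# Print the command for inspection; remove the leading 'echo' to run it.
# $dfam is the container image variable from the original command;
# CONTAINER.sif is only a placeholder used when it is unset.
echo singularity run "${dfam:-CONTAINER.sif}" RepeatModeler \
    -database AterDbTest1 -threads "$threads" -LTRStruct
```

After the job finishes, `sacct -j <jobid> --format=JobID,AllocCPUS,ReqMem,MaxRSS,State` shows the cores and memory Slurm granted and the peak memory used, which is one way to confirm the 8x16GB allocation.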

sjd028 added the bug label Oct 23, 2024
