Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with vg call on graph generated from cactus #2546

Open
RenzoTale88 opened this issue Nov 25, 2019 · 11 comments
Open

Issues with vg call on graph generated from cactus #2546

RenzoTale88 opened this issue Nov 25, 2019 · 11 comments

Comments

@RenzoTale88
Copy link

RenzoTale88 commented Nov 25, 2019

Hello,
I'm writing again to ask support on the vg call stage for some alignments.
I've generated cactus alingments for 5 different mammalian genomes, and converted that to vg as described in issue #2514. After that, I've proceeded as follow for each sample analised:

sample=sample1
# Map the reads to the graph
vg map -R $sample -N $sample -S 0 -u 1 -t 4 -x all.xg -g all.gcsa my_lib1_R1.fastq.gz my_lib1_R2.fastq.gz

# combine different libraries into one gam file
while read p; do cat ${p} && rm ${p}; done < mappedlist.txt | vg gamsort --threads $NSLOTS -p -i ALIGN/${sample}/${sample}.gam.gai - > ALIGN/${sample}/${sample}.gam

# Filter the alignments
vg filter ALIGN/${sample}/${sample}.gam -r 0.90 -fu -m 1 -q 15 -D 999 -x all.xg > FILTER/$sample/$sample.filtered.gam

# Chunk the gam
vg chunk -x /exports/cmvm/eddie/eb/groups/prendergast_roslin/Andrea/GraphGenomes/NewGraph_03112019/GRAPH_CACTUS/all.xg -a FILTER/${sample}/${sample}.sorted.gam -P ./LISTS/PATHS.txt -c 50 -g -s 2500000 -o 100000 -b FILTER/${sample}/${sample}_call_chunk -t ${NSLOTS} -E FILTER/${sample}/${sample}.chunklist -f

# augment every chunk
while read p; do
reads=$( echo $p | awk '{print $4}' )
bname=$(basename -s ".gam" $reads)
vg augment ./FILTER/$sample/${bname}.vg $reads -p -C -t $NSLOTS -A ./FILTER/$sample/${bname}.aug.gam > ./FILTER/$sample/${bname}.aug.vg
vg index ./FILTER/$sample/${bname}.aug.vg -x ./FILTER/$sample/${bname}.aug.xg
vg snarls ./FILTER/$sample/${bname}.aug.xg > ./FILTER/$sample/${bname}.aug.snarls
vg pack -x ./FILTER/$sample/${bname}.aug.xg -g ./FILTER/$sample/${bname}.aug.gam -Q 15 -o ./FILTER/$sample/${bname}.aug.pack
done < FILTER/${sample}/${sample}.chunklist 

# Do variant call
while read p; do
chroms=$( echo $p | awk '{print $1}' )
bpi=$( echo $p | awk '{print $2}' )
bpe=$( echo $p | awk '{print $3}' )
reads=$( echo $p | awk '{print $4}' )
clength=$( awk -v var=$chroms '$1==var {print $2}' ./LISTS/lengths.txt )
bname=$(basename -s ".gam" $reads)
vg call ./FILTER/$sample/${bname}.aug.xg -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t $NSLOTS -p $chroms -l $clength -o $bpi | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz
done < FILTER/${sample}/${sample}.chunklist 

I've performed all the stages a first time using VG 1.19.0. The code works fine up the augmentation, then fails with the following error during vg call:

what():  cyclic reference path (breed.1) not supported by caller

I've also tried to perform the augmentation and calling using vg 1.20.0, and got the error attached.
stacktrace.txt

I've also tried more stringent filtering on each chunk as specified here #2474, but it didn't worked.

Is there anything I can try or that I did wrong? I've tested the mapping/calling pipeline on another graph genome, generated from vcf instead of cactus, and it worked fine.

Thank you in advance for your help,

Andrea

@glennhickey
Copy link
Contributor

glennhickey commented Nov 25, 2019 via email

@RenzoTale88
Copy link
Author

RenzoTale88 commented Nov 25, 2019

Hi Glenn,
thank you for your quick reply and your help!
Would it be ok to share the data using email? Data are large and it would probably be easier using something like GD or onedrive.

PS one more detail: I'm getting the error that you find in the attached file when I use the vg graph, not the xg. With the Xg the error is different (see below). Not sure if it is of any help.

Crash report for vg v1.20.0 "Ginestra"
Stack trace (most recent call last) in thread 180842:
#13   Object "", at 0xffffffffffffffff, in 
#12   Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x1b6933e, in __clone
#11   Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x10a4f5a, in start_thread
#10   Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x1ac3a95, in gomp_thread_start
#9    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0xd41cb3, in vg::SnarlManager::for_each_top_level_snarl_parallel(std::function<void (vg::Snarl const*)> const&) const [clone ._omp_fn.0]
#8    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0xaf32e7, in std::_Function_handler<void (vg::Snarl const*), vg::GraphCaller::call_top_level_snarls(bool)::{lambda(vg::Snarl const*)#1}>::_M_invoke(std::_Any_data const&, vg::Snarl const*
#7    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0xaf9cd1, in vg::LegacyCaller::call_snarl(vg::Snarl const&)
#6    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0xaf2fd5, in vg::LegacyCaller::get_ref_position(vg::Snarl const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
#5    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x14c4751, in xg::XG::get_handle_of_step(handlegraph::step_handle_t const&) const
#4    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0xab456f, in sdsl::enc_vector<sdsl::coder::elias_delta, 128u, (unsigned char)0>::operator[](unsigned long) const
#3    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x1ad8a31, in __assert_fail
#2    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x1ad89bb, in __assert_fail_base
#1    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x1ae53e0, in abort
#0    Object "/exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/vg/vg-1.20.0", at 0x10aa1d7, in raise

@glennhickey
Copy link
Contributor

Sure. My email's my username here @gmail.com.

Making a link here https://transfer.sh/ may also work for chunk-sized file.s

@RenzoTale88
Copy link
Author

Ok I've just shared the data through onedrive. Let me know if you manage to get them.

Andrea

@glennhickey
Copy link
Contributor

This should be corrected once #2548 is merged.

@RenzoTale88
Copy link
Author

Just compiled vg and tested it, and it seems to work fine also on another sample that I chunked. Thank you very much again for your help!

Andrea

@RenzoTale88
Copy link
Author

After making some more tests, I've seen that vg call now fails only with some chunks.
I think that this is related to the computation of the snarls for a specific chunk, which lead to a core dumped error (without using a significan amount of memory or cores).
Unfortunaltely the job is failing without creating an error log that I can share with you which would make everything easier.

This happens when I run the following commands:

vg augment sample1_call_chunk_7_1_16799120_19300994.vg sample1_call_chunk_7_1_16799120_19300994.gam -p -C -t 4 -A sample1_call_chunk_7_1_16799120_19300994.aug.gam > sample1_call_chunk_7_1_16799120_19300994.aug.vg 
vg index -x sample1_call_chunk_7_1_16799120_19300994.aug.xg sample1_call_chunk_7_1_16799120_19300994.aug.vg 
vg snarls sample1_call_chunk_7_1_16799120_19300994.aug.xg > sample1_call_chunk_7_1_16799120_19300994.aug.snarls 

I've tried to calculate the snarls on both the augmented vg, pg and xg, without success.

Is it somewhat related to the fact that I'm computing the snarls on the wrong dataset? Or do I have to tweak some parameters?

Thanks again for the help

Andrea

PS: @glennhickey if you want to test them, I've uploaded the data in a subfolder in the OneDrive folder I've shared with you yesterday

@glennhickey
Copy link
Contributor

It also crashes on sample1_call_chunk_7_1_16799120_19300994.vg so augmentation is not a direct factor. vg snarls is coming up with a bad root, which leads it to recurse too much and crash. vg snarls uses the longest path in the graph for a root if possible. In this case, the longest path is fragmented, presumably by vg chunk, and that sets it on the wrong track. It should use the path you chunked on. I can find this path by running

for i in `vg paths -Lv sample1_call_chunk_7_1_16799120_19300994.vg` ;do vg mod -r $i sample1_call_chunk_7_1_16799120_19300994.vg | vg validate - ; done

and looking for the only path from vg paths -Lv sample1_call_chunk_7_1_16799120_19300994.vg that doesn't have an error associated with it. If we call this path.1, all other paths can be removed by running

vg mod sample1_call_chunk_7_1_16799120_19300994.vg -r path.1 > sample1_call_chunk_7_1_16799120_19300994.fix.vg

and vg snarls runs fine on this graph. The snarls produced will also be valid on the original graph fwiw.

In summary, if chunking on a graph with multiple overlapping paths with vg chunk -Pp it is prudent to filter the other paths out before snarls using vg mod -r, as they can be invalid. vg snarls can certainly be fixed up to be more robust to this either by checking discontinuous paths or providing a hint option.

Broken subpaths are themselves a bug, and would be resolved by #2506.

@RenzoTale88
Copy link
Author

RenzoTale88 commented Nov 28, 2019

So, I've changed my code so that it runs an additional step when computing the snarls. The code is now:

vg augment sample1_call_chunk_7_1_16799120_19300994.vg $reads -p -C -t 4 -A sample1_call_chunk_7_1_16799120_19300994.aug.gam > sample1_call_chunk_7_1_16799120_19300994.aug.vg
vg index sample1_call_chunk_7_1_16799120_19300994.aug.vg -x sample1_call_chunk_7_1_16799120_19300994.aug.xg
vg mod -r $tgtpath sample1_call_chunk_7_1_16799120_19300994.aug.vg | vg snarls - > sample1_call_chunk_7_1_16799120_19300994.aug.snarls
vg pack -x sample1_call_chunk_7_1_16799120_19300994.aug.xg -g sample1_call_chunk_7_1_16799120_19300994.aug.gam -Q 5 -o sample1_call_chunk_7_1_16799120_19300994.aug.pack
vg call sample1_call_chunk_7_1_16799120_19300994.aug.vg -r sample1_call_chunk_7_1_16799120_19300994.aug.snarl -s sample1 -k sample1_call_chunk_7_1_16799120_19300994.aug.pack -t 4 -p path.1 -l $path1length -o 16799120 | vcf-sort | bgzip -c > sample1_call_chunk_7_1_16799120_19300994.vcf.gz

When I use this code, the software works fine and call the variants also in other paths (path1.1, path2.1, path3.1 etc.). But if I use the augmented XG graph, it will fail with the error code attached.

stacktrace_vg_crash_M8Hhch.txt

Is that right? Does the removal of the paths when computing the snarls affect downstream analysis?

Also, I've just seen that #2506 has been merged. If I compile the newest version of VG, will it fix the problem when chunking the datasets?

Thanks again for the support

Andrea

@glennhickey
Copy link
Contributor

That looks okay. Those paths you are removing are broken, so their removal can only help. It will also save you memory, especially while working with .vg where paths are expensive to load. Note that removing these paths has no effect on the topology of your graph.

#2506 is not merged. But if it were, it should fix this problem. But again, if you're not using the other paths then there's no harm removing them.

@RenzoTale88
Copy link
Author

Perfect, I'll proceed as recommended!
Thank you again for your help,
Andrea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants