Difference between running batch and each step individually #935

Open
shashwatsahay opened this issue Dec 21, 2024 · 0 comments

Version Info

Tested on v0.9.10, v0.9.11, v0.9.12

What is this all about

When I run the individual steps of the pipeline, I get different results compared to using the batch command.
Using batch leads to noisy results.

When running the commands individually:

[image]

When using the batch command:

[image]

Step by Step difference

I have not included the coverage commands here, as I use multiple normals and the outputs didn't look much different except in the antitarget regions.

Binning

Individual:
> Detected file format: bed
> Detected file format: bed
> Estimated read length 101.0
> Wrote /tmp/tmp0od03n9s.bed with 100 regions
> Splitting large targets
> Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.target.bed with 204770 regions
> Skipping untargeted chromosomes MT
> Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.antitarget.bed with 134163 regions

Batch:
> Detected file format: bed
> Splitting large targets
> Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.target.bed with 232655 regions
> Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.antitarget.bed with 38328 regions
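To put numbers on the binning difference, here is a minimal Python sketch that pulls the region counts out of the "Wrote ... regions" lines and compares the two modes. The regular expression and the helper name `region_counts` are my own for illustration and are not part of CNVkit. The ratio of bin counts is only a rough proxy for the difference in bin size.

```python
import re

# "Wrote ..." lines copied from the binning logs in this issue.
INDIVIDUAL_LOG = """\
Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.target.bed with 204770 regions
Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.antitarget.bed with 134163 regions
"""
BATCH_LOG = """\
Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.target.bed with 232655 regions
Wrote Agilent_SureSelect_XT_HS2_All_Exon_V8_Regions.antitarget.bed with 38328 regions
"""

# Hypothetical pattern: captures the bin type and the region count.
WROTE = re.compile(r"Wrote \S+\.(target|antitarget)\.bed with (\d+) regions")


def region_counts(log_text):
    """Return a dict such as {'target': n, 'antitarget': m} from a log excerpt."""
    return {kind: int(count) for kind, count in WROTE.findall(log_text)}


individual = region_counts(INDIVIDUAL_LOG)
batch = region_counts(BATCH_LOG)

# Ratio of individual to batch bin counts for each bin type.
ratios = {kind: individual[kind] / batch[kind] for kind in individual}

for kind in ("target", "antitarget"):
    print(f"{kind}: individual={individual[kind]} "
          f"batch={batch[kind]} ratio={ratios[kind]:.2f}")
```

The individual run produces about 3.5 times as many antitarget bins as the batch run, while the target bin counts differ by only about 12%, so the antitarget binning is where the two modes diverge most.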

Reference

Individual:
> Targets: 9665 (4.72%) bins failed filters (log2 < -5.0, log2 > 5.0, spread > 1.0)
> Antitargets: 18937 (14.11%) bins failed filters
> Wrote reference.cnn with 338933 regions

Batch:
> Targets: 13067 (5.616%) bins failed filters (log2 < -5.0, log2 > 5.0, spread > 1.0)
> Antitargets: 1894 (4.764%) bins failed filters
> Wrote reference.cnn with 272408 regions
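As a sanity check, the reference size should equal the number of target bins plus the number of antitarget bins. The sketch below uses only counts that appear in the logs of this issue; the dictionary layout and the function name `implied_antitargets` are made up for illustration.

```python
# Counts copied from the binning and reference logs in this issue.
runs = {
    "individual": {"target": 204770, "antitarget": 134163,
                   "reference": 338933, "failed_antitarget": 18937},
    "batch": {"target": 232655, "antitarget": 38328,
              "reference": 272408, "failed_antitarget": 1894},
}


def implied_antitargets(run):
    """Antitarget bin count implied by the reference size and target count."""
    return run["reference"] - run["target"]


for name, run in runs.items():
    implied = implied_antitargets(run)
    # Failure rate recomputed against the implied antitarget count.
    pct_failed = 100 * run["failed_antitarget"] / implied
    print(f"{name}: logged antitargets={run['antitarget']} "
          f"implied={implied} failed={pct_failed:.2f}%")
```

For the individual run the implied count matches the binning log (134163). For the batch run the reference implies 39753 antitarget bins, not the 38328 reported at the binning step, and the logged 4.764% failure rate only works out against 39753. The fix log ("Keeping 37859 of 39753 bins") agrees with that number, so the batch reference may have been built from a different antitarget file than the one shown in the binning log. That is only an inference from the numbers here.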

Difference in fix

Individual:
> Processing target: tumour
> Keeping 195105 of 204770 bins
> Correcting for GC bias...
> Correcting for density bias...
> Processing antitarget: tumour
> Keeping 115226 of 134163 bins
> Correcting for GC bias...
> Correcting for RepeatMasker bias...
> Antitargets are 3.42 x more variable than targets

Batch:
> Processing target: tumour
> Keeping 219588 of 232655 bins
> Correcting for GC bias...
> Correcting for density bias...
> Processing antitarget: tumour
> Keeping 37859 of 39753 bins
> Correcting for GC bias...
> Correcting for RepeatMasker bias...
> Antitargets are 1.39 x more variable than targets

Difference in Segment

This one is a bit different, because at some point the segment call, even when run simply as
cnvkit.py segment tumor.cnr -o tumor.cns
still starts smoothing by default. That is pretty strange, as I thought --smooth-cbs was an opt-in feature. Or is this something else?

Individual Batch
> Segmenting with method 'cbs', significance threshold 0.0001, in 1 processes
> Smoothing overshot at 8 / 233 indices: (-30.268828150561895, -0.21054706747579377) vs. original (-27.9209, 0.53479)
> Smoothing overshot at 10 / 595 indices: (-29.16372209174013, 1.8425761663484446) vs. original (-27.9546, -0.028386)
> Segmenting with method 'cbs', significance threshold 0.0001, in 1 processes
> Dropped 3 / 13645 bins on chromosome 1
> Dropped 2 / 11956 bins on chromosome 1
> Dropped 1 / 9698 bins on chromosome 5
> Dropped 2 / 10534 bins on chromosome 12
> Dropped 48 / 126 bins on chromosome Y
> Dropped 254 / 375 bins on chromosome Y

Then there are a bunch of post-processing steps in batch mode that aren't documented as part of the batch pipeline in the stable release version of the Read the Docs documentation, such as segmetrics and the call step that filters based on CI.

CI filtering

Individual:
> Applying filter 'ci'
> Filtered by 'ci' from 59 to 34 rows
> Wrote tumor.ci.cns with 34 regions

Batch:
> Applying filter 'ci'
> Filtered by 'ci' from 729 to 395 rows
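To check whether the CI filter itself behaves differently between the two modes, here is a small Python sketch comparing the fraction of segments it retains. It uses only the row counts logged above; the names `ci_filter` and `retained_fraction` are illustrative.

```python
# (rows before filter, rows after filter), copied from the 'ci' filter logs.
ci_filter = {
    "individual": (59, 34),
    "batch": (729, 395),
}


def retained_fraction(before, after):
    """Fraction of segments that survive the filter."""
    return after / before


for mode, (before, after) in ci_filter.items():
    print(f"{mode}: {before} -> {after} "
          f"({retained_fraction(before, after):.1%} retained)")

# How many more segments the batch run has going into the filter.
segment_ratio = ci_filter["batch"][0] / ci_filter["individual"][0]
print(f"batch has {segment_ratio:.2f}x as many segments before filtering")
```

The filter keeps a similar share of segments in both modes (roughly 58% and 54%), but the batch run enters it with about 12 times as many segments. On these numbers, the noise seems to arise at or before segmentation rather than in the CI filtering step.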

This was followed by median centering and a t-test (p-t-test), and finally ended with bintest.

Bintest

Individual:
> Ignoring 115226 off-target bins
> Significant hits in 7141/195105 bins (3.66%)

Batch:
> Ignoring 37859 off-target bins
> Significant hits in 5976/219588 bins (2.72%)

Overall I see two possible causes for the difference:

  1. The binning step, which in batch uses target and antitarget instead of autobin, if I am not wrong (refer to my comment on batch hybrid: Use autobin for target and antitarget bin sizes #302)

#302 (comment)

  2. Or it could be due to the automatic smoothing in the segment step, which I don't understand how it is even happening.