feat: run variant index step #188

Merged
merged 12 commits into from
Oct 19, 2023

ireneisdoomed
Contributor

This PR includes minor fixes to generate a variant index that is based on the variant annotation.

  • Change the path of the GWAS Catalog study locus so that all study locus records sit under the same parent directory.
  • Refactor unique_variants_in_locus: a simpler, more efficient way of extracting all variants of interest from a StudyLocus dataset.
    • The resulting set contains the variantId of each lead variant plus the variantIds of the other relevant variants in each locus.
  • Refactor the step logic: filtering of the variant annotation now happens inside VariantIndex.from_variant_annotation, so the step itself is lighter.
  • Downgrade Hail to 0.2.122 to fix the bug reported in #3088.
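
For readers unfamiliar with the refactor, the idea behind the two middle bullets can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the real implementation works on Spark DataFrames, and the `locus` field name, the row layout, and the helper `filter_variant_annotation` are hypothetical stand-ins.

```python
from __future__ import annotations

from typing import Any, Iterable


def unique_variants_in_locus(study_locus_rows: Iterable[dict[str, Any]]) -> set[str]:
    """Collect every variantId of interest from study locus records.

    Assumed (hypothetical) row layout: each row has a lead `variantId`
    and an optional `locus` list whose items also carry a `variantId`.
    """
    variants: set[str] = set()
    for row in study_locus_rows:
        # The lead variant of the locus is always of interest.
        variants.add(row["variantId"])
        # Other relevant variants in the locus; the field may be missing or None.
        for tag in row.get("locus") or []:
            variants.add(tag["variantId"])
    return variants


def filter_variant_annotation(
    annotation_rows: Iterable[dict[str, Any]],
    study_locus_rows: Iterable[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Keep only annotation rows whose variantId appears in a study locus.

    Hypothetical stand-in for the filtering that now lives inside
    VariantIndex.from_variant_annotation.
    """
    wanted = unique_variants_in_locus(study_locus_rows)
    return [row for row in annotation_rows if row["variantId"] in wanted]
```

Keeping the filter next to the set extraction is what lets the calling step shrink to reading inputs and writing the output.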

Important notes

  • This variant index is based on the latest StudyLocus datasets. Right now these are:
    • gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/study_locus/catalog_study_locus_updated_locus/: the latest PICSed version of the GWAS Catalog top hits
    • gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/study_locus/finngen_study_locus: the latest PICSed version of FinnGen's summary statistics, generated by @DSuveges
  • Given these inputs, the current variant index contains 754,306 variants.
  • The job runs in ~3 min.

@codecov-commenter

Codecov Report

Merging #188 (2e812f5) into main (6389ff1) will increase coverage by 0.20%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   90.74%   90.95%   +0.20%     
==========================================
  Files          67       67              
  Lines        1459     1459              
==========================================
+ Hits         1324     1327       +3     
+ Misses        135      132       -3     
Files                               Coverage Δ
src/otg/dataset/study_locus.py      95.94% <100.00%> (-0.06%) ⬇️
src/otg/dataset/variant_index.py    100.00% <100.00%> (ø)
src/otg/common/session.py           75.00% <0.00%> (ø)
src/otg/variant_index.py            68.75% <0.00%> (+4.04%) ⬆️

... and 1 file with indirect coverage changes

Contributor

@DSuveges DSuveges left a comment

All looks good. Things got much nicer.

@DSuveges DSuveges merged commit c021e6b into main Oct 19, 2023
1 check passed
@tskir tskir deleted the il-variant-index branch November 2, 2023 09:01
3 participants