Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: redesign StudyLocusOverlap model to include overlapping variants in struct #384

Closed
wants to merge 9 commits into from

Conversation

ireneisdoomed
Copy link
Contributor

This PR includes significant modifications to the StudyLocusOverlap dataset schema:

  1. The schema has been restructured to consolidate leftLocus and rightLocus information. Instead of having separate entries for each variant in overlapping StudyLocusId pairs, the new schema represents these in a unified manner*.
    • Each StudyLocusId pair now corresponds to a single row, simplifying the overlap representation between leftStudyLocusId and rightStudyLocusId.
    • Changed the statistics field, that included some of the stats both left and right for leftLocus and rightLocus. These follow same schema and have the same content as locus, with the only exception that the variants in these arrays are just the overlapping ones (therefore, both arrays contain the same variants).
  2. Adaptation of other components that depended on overlaps:
    • Coloc and Ecaviar now anticipate the grouped schema
    • L2GGoldStandard.filter_unique_associations is now simpler with the grouped schema

*Roughly, we have shifted from this schema:

leftStudyLocusId
rightStudyLocusId
tagVariantId
statistics
--- leftPosteriorProbability
--- rightPosteriorProbability
...

to this one

leftStudyLocusId
rightStudyLocusId
leftLocus
--- variantId
--- posteriorProbability
rightLocus
--- variantId
--- posteriorProbability
...

Pending to be run ATM

@codecov-commenter
Copy link

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (9b81fef) 85.64%.
Report is 35 commits behind head on dev.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #384      +/-   ##
==========================================
- Coverage   85.67%   85.64%   -0.04%     
==========================================
  Files          89       91       +2     
  Lines        2101     2138      +37     
==========================================
+ Hits         1800     1831      +31     
- Misses        301      307       +6     
Files Coverage Δ
src/airflow/dags/common_airflow.py 90.38% <ø> (ø)
src/airflow/dags/finngen_preprocess.py 100.00% <100.00%> (ø)
src/otg/dataset/l2g_feature_matrix.py 82.92% <ø> (+7.31%) ⬆️
src/otg/dataset/l2g_gold_standard.py 90.32% <ø> (-0.31%) ⬇️
src/otg/dataset/study_locus.py 95.71% <100.00%> (-0.44%) ⬇️
src/otg/dataset/study_locus_overlap.py 100.00% <ø> (ø)
src/otg/datasource/eqtl_catalogue/study_index.py 85.00% <ø> (ø)
src/otg/datasource/finngen/study_index.py 100.00% <100.00%> (ø)
src/otg/datasource/finngen/summary_stats.py 100.00% <100.00%> (ø)
src/otg/datasource/gwas_catalog/study_index.py 100.00% <ø> (ø)
... and 10 more

@ireneisdoomed
Copy link
Contributor Author

Proposed changes are memory intensive and not scalable by the time each pair of overlapping studyloci carried thousands hundreds of common variants + their statistics.
Example of a failing job. By not exploding data, I generate tasks that are unbalanced and cannot scale by raising more infrastructure.
On top of that, overlaps require to be exploded to be used downstream by eCAVIAR and COLOC.
For all these reasons, we have decided to keep the schema in an exploded version as it is now.

@ireneisdoomed ireneisdoomed deleted the il-group-overlaps branch July 15, 2024 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants