feat: adding logic to flag gwas catalog studies based on curation #347

Merged
DSuveges merged 46 commits into dev from ds_3173_study_curation on Jan 10, 2024

Conversation

DSuveges
Contributor

@DSuveges DSuveges commented Dec 14, 2023

The main feature of this PR is logic to manage GWAS Catalog study curation, but it has ripple effects on several parts of the infrastructure.

Main bits being touched:

  • The GWAS Catalog study class has a method to update study information based on an optional curation table.
  • This method updates the study type and adds analysis flags and quality controls.
  • There's another function to extract the curation table for the next round of curation, flagging new studies. (Curation is only applied to studies with summary statistics.)
  • The updated study index regulates the number of studies ingested for clumping.
  • The study index dataset gains a function to extract the studies eligible for clumping.
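
To make the curation step in the list above more concrete, here is a minimal, Spark-free sketch. Everything in it is an assumption for illustration only: the `Study` record, the `annotate_from_curation` helper, and the use of `studyType`, `analysisFlags` and `qualityControls` as curation keys are hypothetical stand-ins and not the PR's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Study:
    """Hypothetical, simplified study record (not the real schema)."""

    study_id: str
    study_type: str = "gwas"
    analysis_flags: tuple[str, ...] = ()
    quality_controls: tuple[str, ...] = ()


def annotate_from_curation(
    studies: list[Study], curation: dict[str, dict] | None
) -> list[Study]:
    """Return studies updated from an optional curation table.

    `curation` maps a study ID to a dict with the optional (assumed) keys
    "studyType", "analysisFlags" and "qualityControls".
    """
    if curation is None:
        # No curation table provided: return the studies unchanged.
        return list(studies)

    annotated = []
    for study in studies:
        row = curation.get(study.study_id)
        if row is None:
            # Study not covered by the curation table: keep it as-is.
            annotated.append(study)
            continue
        annotated.append(
            replace(
                study,
                study_type=row.get("studyType", study.study_type),
                analysis_flags=study.analysis_flags
                + tuple(row.get("analysisFlags", ())),
                quality_controls=study.quality_controls
                + tuple(row.get("qualityControls", ())),
            )
        )
    return annotated
```

In this sketch the number of studies never changes: curation only rewrites fields on studies it mentions, and passing `None` is a no-op.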

@codecov-commenter

codecov-commenter commented Dec 14, 2023

Codecov Report

Attention: 91 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (1dc24fc) 85.84%.
Report is 47 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #347      +/-   ##
==========================================
+ Coverage   85.67%   85.84%   +0.17%     
==========================================
  Files          89       96       +7     
  Lines        2101     2593     +492     
==========================================
+ Hits         1800     2226     +426     
- Misses        301      367      +66     
Files                                           Coverage Δ
src/airflow/dags/common_airflow.py              90.38% <100.00%> (ø)
src/airflow/dags/finngen_preprocess.py          100.00% <100.00%> (ø)
src/airflow/dags/gwas_catalog_harmonisation.py  43.47% <ø> (ø)
src/airflow/dags/gwas_curation_update.py        100.00% <100.00%> (ø)
src/otg/common/session.py                       87.50% <100.00%> (+0.32%) ⬆️
src/otg/dataset/dataset.py                      91.80% <100.00%> (ø)
src/otg/dataset/l2g_feature_matrix.py           82.92% <ø> (+7.31%) ⬆️
src/otg/dataset/study_locus.py                  96.20% <100.00%> (+0.04%) ⬆️
src/otg/datasource/finngen/study_index.py       100.00% <100.00%> (ø)
src/otg/datasource/finngen/summary_stats.py     100.00% <100.00%> (ø)
... and 23 more

@DSuveges DSuveges marked this pull request as ready for review December 15, 2023 11:14
@DSuveges DSuveges requested a review from d0choa December 15, 2023 11:14
@DSuveges DSuveges linked an issue Dec 15, 2023 that may be closed by this pull request
DSuveges and others added 8 commits December 15, 2023 14:55
* feat: draft of gwas catalog preprocess inclusion dag

* ci: new changelog and release notes templates  (#357)

Templates for CHANGELOG and release notes. To be fully tested on the next release.

---------

Co-authored-by: David Ochoa <[email protected]>
Co-authored-by: David Ochoa <[email protected]>
Contributor

@ireneisdoomed ireneisdoomed left a comment

This PR is really complex because it touches on many things, and the processes resemble one another.
I've left quite a lot of comments and am happy to go through them in person.
Many of them are about making the process more interpretable.

What I understand from the PR is:

  • For associations, the black list is applied when generating StudyLocus.
  • The study index will contain all studies, and two separate files will keep track of a white list and a black list.

@@ -124,21 +124,22 @@ def _create_merged_config(

     def read_parquet(
         self: Session,
-        path: str,
+        path: str | list[str],
         schema: StructType,
         **kwargs: bool | float | int | str | None,
     ) -> DataFrame:
         """Reads parquet dataset with a provided schema.
Contributor

Suggested change
-"""Reads parquet dataset with a provided schema.
+"""Reads a parquet or a list of parquet files with a provided schema.

Contributor Author

Fixed.
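
With `path: str | list[str]`, callers can pass either a single location or several. One common way to handle such a union is to normalise it to a list up front; the helper below is a hypothetical illustration (the name `normalise_paths` is not from this PR, and how the PR forwards the argument to Spark is not shown here). For reference, PySpark's `DataFrameReader.parquet` accepts multiple paths as positional arguments.

```python
from __future__ import annotations


def normalise_paths(path: str | list[str]) -> list[str]:
    """Return `path` as a list, whether one path or several were given."""
    if isinstance(path, str):
        # A single path becomes a one-element list.
        return [path]
    # Copy so that the caller's list is never mutated downstream.
    return list(path)
```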

         schema: StructType,
         **kwargs: bool | float | int | str | None,
     ) -> DataFrame:
         """Reads parquet dataset with a provided schema.

         Args:
-            path (str): parquet dataset path
+            path (str | list[str]): parquet dataset path
Contributor

Suggested change
-path (str | list[str]): parquet dataset path
+path (str | list[str]): path to the parquet file or list of parquet files

         **kwargs: bool | float | int | str | None,
     ) -> Self:
         """Reads a parquet file into a Dataset with a given schema.

         Args:
             session (Session): Spark session
-            path (str): Path to the parquet file
+            path (str | list[str]): Path to the parquet file
Contributor

Suggested change
-path (str | list[str]): Path to the parquet file
+path (str | list[str]): Path to the parquet file or list of parquet files

@@ -72,14 +72,14 @@ def get_schema(cls: type[Self]) -> StructType:
     def from_parquet(
         cls: type[Self],
         session: Session,
-        path: str,
+        path: str | list[str],
         **kwargs: bool | float | int | str | None,
     ) -> Self:
         """Reads a parquet file into a Dataset with a given schema.
Contributor

Suggested change
-"""Reads a parquet file into a Dataset with a given schema.
+"""Reads a parquet or a list of parquet files into a Dataset with a given schema.

@@ -139,3 +139,66 @@ def study_type_lut(self: StudyIndex) -> DataFrame:
             DataFrame: A dataframe containing `studyId` and `studyType` columns.
         """
         return self.df.select("studyId", "studyType")
+
+    def get_eligible_gwas_study_ids(self: StudyIndex) -> list[str]:
Contributor

This assumes the qualityControls column is present, which is not guaranteed. If it is missing, we'd filter out studies that are eligible.

I'd suggest checking that the column exists first, something like:

# Keep GWAS studies only; apply the QC filter only when the column exists.
filtered_df = self.df.filter(f.col("studyType") == "gwas")
if "qualityControls" in self.df.columns:
    filtered_df = filtered_df.filter(
        f.col("qualityControls").isNull() | (f.size(f.col("qualityControls")) == 0)
    )

# Select studyId before distinct() so duplicates are dropped per study, not per row.
return [row["studyId"] for row in filtered_df.select("studyId").distinct().collect()]

Contributor Author

Yes, that's right.
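
The eligibility rule discussed in this thread can also be written without Spark, which makes the edge cases easy to check. The sketch below is an illustration under assumptions, not the merged code: rows are plain dicts, and a missing, null or empty `qualityControls` value is treated as "no QC flags".

```python
from __future__ import annotations


def get_eligible_gwas_study_ids(rows: list[dict]) -> list[str]:
    """Return sorted, de-duplicated IDs of GWAS studies with no QC flags.

    A missing, null or empty "qualityControls" value all count as "no flags".
    """
    eligible = {
        row["studyId"]
        for row in rows
        if row.get("studyType") == "gwas" and not row.get("qualityControls")
    }
    # Sort for a deterministic result; a set has no guaranteed order.
    return sorted(eligible)
```

Treating the three "no flags" cases the same way is the point of the review comment: a study should not become ineligible just because the column was never populated.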

]

curation_columns = [
"studyId",
Contributor

These would need to be changed according to PR opentargets/curation#17 (review)

assert isinstance(
mock_gwas_study_index.annotate_from_study_curation(mock_study_curation),
StudyIndexGWASCatalog,
), f"When applied None to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}"
Contributor

Suggested change
-), f"When applied None to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}"
+), f"When applied a study metadata table to curation function the returned type was: {type(mock_gwas_study_index.annotate_from_study_curation(mock_study_curation))}"

zero_return_count = mock_gwas_study_index.annotate_from_study_curation(
None
).df.count()
return_count = mock_gwas_study_index.annotate_from_study_curation(
Contributor

return_count and zero_return_count are the same. Is this intended?

Contributor Author

The zero_return_count asserts that the number of returned studies won't change even if no curation table is provided to the curation function. It might be an overly cautious test, but that's why it's tested under the same function.

)
]

assert expected == observed
Contributor

Up to here you are testing annotate_from_study_curation. Ideally you could group these tests in a test class.

Contributor Author

I think this is fine.

"metadata": {}
},
{
"name": "analysisFlags",
Contributor

This name can change depending on opentargets/curation#17 (review)

@ireneisdoomed
Contributor

Sorry for the stream of comments, I did the review from VSCode and something got stuck.

@DSuveges DSuveges requested a review from ireneisdoomed January 8, 2024 12:21
@d0choa
Collaborator

d0choa commented Jan 9, 2024

@ireneisdoomed I went through this PR with Daniel. We identified a couple of things we really want to fix now and some that will come with follow-up PRs. There are a bunch of stylistic things (e.g. variable names) that we are not sure improve much, so we might skip those for now. There is definitely some refactoring material in this PR, but there is enough critical logic to try to merge for now.

Collaborator

@d0choa d0choa left a comment

As discussed before, there is a lot of business logic here. Some parts are more robust than others, but there is an overall benefit in merging this logic and working on separate PRs for further improvements.

There is also some semantic debate about what exactly is metadata and how we distinguish our curation of study metadata from the GWAS Catalog association curation. Let's try not to forget about this because it might be confusing for people starting to work on the project.

@DSuveges DSuveges dismissed ireneisdoomed’s stale review January 10, 2024 11:49

We went through the PR with David and addressed these comments where it was necessary.

@DSuveges DSuveges merged commit 77dee8e into dev Jan 10, 2024
3 checks passed
@DSuveges DSuveges deleted the ds_3173_study_curation branch January 10, 2024 11:51
Development

Successfully merging this pull request may close these issues.

Managing GWAS Catalog study QC/flags
4 participants