feat: adding logic to flag gwas catalog studies based on curation #347

Merged 46 commits from ds_3173_study_curation into dev on Jan 10, 2024
Commits (46)
017094a
feat: adding logic to flag gwas catalog studies based on curation
DSuveges Dec 14, 2023
93076fc
feat: adding curation step
DSuveges Dec 14, 2023
c6383f9
Merge branch 'dev' of https://github.com/opentargets/genetics_etl_pyt…
DSuveges Dec 14, 2023
aaab399
fix: fixing import
DSuveges Dec 14, 2023
0ede97a
feat: updating study schema
DSuveges Dec 14, 2023
b5908ff
test: adding study curation test
DSuveges Dec 14, 2023
3534c8c
test: improving test
DSuveges Dec 14, 2023
7ed4d08
fix: hopefully broken test got fixed
DSuveges Dec 14, 2023
96ef6f6
fix: maybe now
DSuveges Dec 14, 2023
540ecb7
test: increasing test coverage on study curation logic
DSuveges Dec 15, 2023
43f59db
refactor: move eligibility test function to study dataset
DSuveges Dec 15, 2023
8f53469
test: adding test for new study fucntion
DSuveges Dec 15, 2023
03578bc
test: completing test coverage on the curation funcitons
DSuveges Dec 15, 2023
0d3545d
Merge branch 'dev' into ds_3173_study_curation
DSuveges Dec 15, 2023
8a86854
Merge branch 'dev' into ds_3173_study_curation
DSuveges Dec 15, 2023
0183452
feat: adding interogator functions to study index
DSuveges Dec 15, 2023
63db7f6
Merge branch 'ds_3173_study_curation' of https://github.com/opentarge…
DSuveges Dec 15, 2023
3d03158
feat: adding step to get eligible list
DSuveges Dec 18, 2023
367b10f
Merge branch 'dev' of https://github.com/opentargets/genetics_etl_pyt…
DSuveges Dec 18, 2023
22b55a2
feat: Updating GWAS Catalog ingest DAG (#363)
DSuveges Dec 18, 2023
dc2e461
Merge branch 'dev' of https://github.com/opentargets/genetics_etl_pyt…
DSuveges Dec 18, 2023
e57d66a
feat: prototyping dag
DSuveges Dec 21, 2023
09762ce
fix: yaml format might help
DSuveges Dec 21, 2023
955b77e
fix: updating regex to make it applicable to different sources
DSuveges Dec 23, 2023
6523780
chore: updating branch from origin/dev
DSuveges Dec 30, 2023
3ed2720
feat: adding DAG for complete GWAS ingestion
DSuveges Jan 2, 2024
ea0e277
chore: dealing with curation table column name changes
DSuveges Jan 3, 2024
f4f7f7b
chore: update from main
DSuveges Jan 3, 2024
fed0998
test: removing test - no longer relevant
DSuveges Jan 3, 2024
af24dfe
docs: fixing docstring
DSuveges Jan 3, 2024
0547be1
refactor: splitting clumping step into two
DSuveges Jan 3, 2024
cec29a0
refactor: adjusting configuration to the split clumping step
DSuveges Jan 3, 2024
150347c
refactor: updating config to the splitted clumping step
DSuveges Jan 3, 2024
41b9ac6
feat: adding DAG to update curation table
DSuveges Jan 4, 2024
281e1c4
Merge branch 'dev' of https://github.com/opentargets/genetics_etl_pyt…
DSuveges Jan 5, 2024
a2d0505
fix: adding back eqtl catalog config
DSuveges Jan 8, 2024
95ccb7e
chore: final touches
DSuveges Jan 8, 2024
62823cc
Merge branch 'dev' into ds_3173_study_curation
DSuveges Jan 8, 2024
9c58077
docs: adding spacer for ld clump step
DSuveges Jan 8, 2024
aa56084
Merge branch 'ds_3173_study_curation' of https://github.com/opentarge…
DSuveges Jan 8, 2024
4adcbb3
docs: addign spacer for window based clumpingj
DSuveges Jan 8, 2024
a4017e9
docs: addign spacer for window based clumping
DSuveges Jan 8, 2024
80bccd6
Merge branch 'dev' into ds_3173_study_curation
DSuveges Jan 9, 2024
cd9823d
Merge branch 'dev' into ds_3173_study_curation
DSuveges Jan 9, 2024
0887139
fix: final touches proise
DSuveges Jan 9, 2024
1dc24fc
Merge branch 'dev' into ds_3173_study_curation
DSuveges Jan 9, 2024
13 changes: 7 additions & 6 deletions config/datasets/gcp.yaml
@@ -11,16 +11,16 @@ anderson: gs://genetics-portal-input/v2g_input/andersson2014/enhancer_tss_associ
javierre: gs://genetics-portal-input/v2g_input/javierre_2016_preprocessed.parquet
jung: gs://genetics-portal-raw/pchic_jung2019/jung2019_pchic_tableS3.csv
thurman: gs://genetics-portal-input/v2g_input/thurman2012/genomewideCorrs_above0.7_promoterPlusMinus500kb_withGeneNames_32celltypeCategories.bed8.gz
-catalog_associations: ${datasets.inputs}/v2d/gwas_catalog_v1.0.2-associations_e110_r2023-11-24.tsv
+catalog_associations: ${datasets.inputs}/v2d/gwas_catalog_v1.0.2-associations_e110_r2023-12-21.tsv
catalog_studies:
# To get a complete representation of all GWAS Catalog studies, we need to
# ingest the list of unpublished studies from a different file.
-  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-studies-r2023-11-24.tsv
-  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-unpublished-studies-r2023-11-24.tsv
+  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-studies-r2023-12-21.tsv
+  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-unpublished-studies-r2023-12-21.tsv
catalog_ancestries:
-  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-ancestries-r2023-11-24.tsv
-  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-unpublished-ancestries-r2023-11-24.tsv
-catalog_sumstats_lut: ${datasets.inputs}/v2d/harmonised_list-r2023-11-24a.txt
+  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-ancestries-r2023-12-21.tsv
+  - ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-unpublished-ancestries-r2023-12-21.tsv
+catalog_sumstats_lut: ${datasets.inputs}/v2d/harmonised_list-r2023-12-21.txt
ukbiobank_manifest: gs://genetics-portal-input/ukb_phenotypes/neale2_saige_study_manifest.190430.tsv
l2g_gold_standard_curation: ${datasets.inputs}/l2g/gold_standard/curation.json
gene_interactions: ${datasets.inputs}/l2g/interaction # 23.09 data
@@ -40,6 +40,7 @@ v2g: ${datasets.outputs}/v2g
ld_index: ${datasets.outputs}/ld_index
catalog_study_index: ${datasets.study_index}/catalog
catalog_study_locus: ${datasets.study_locus}/catalog_study_locus
gwas_catalog_study_curation: ${datasets.inputs}/v2d/GWAS_Catalog_study_curation.tsv
finngen_study_index: ${datasets.study_index}/finngen
finngen_summary_stats: ${datasets.summary_statistics}/finngen
from_sumstats_study_locus: ${datasets.study_locus}/from_sumstats
6 changes: 0 additions & 6 deletions config/step/clump.yaml

This file was deleted.

6 changes: 6 additions & 0 deletions config/step/gwas_catalog_curation_update.yaml
@@ -0,0 +1,6 @@
_target_: otg.gwas_catalog_study_curation.GWASCatalogStudyCurationStep
catalog_study_files: ${datasets.catalog_studies}
catalog_ancestry_files: ${datasets.catalog_ancestries}
catalog_sumstats_lut: ${datasets.catalog_sumstats_lut}
gwas_catalog_study_curation_file: ${datasets.gwas_catalog_study_curation}
gwas_catalog_study_curation_out: ???
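
For context, `???` is the OmegaConf/Hydra marker for a mandatory value: the step will not run until the field is supplied, typically as a command-line override. Below is a minimal sketch (not part of this PR; the override path in the comment is a made-up example) of how the marker behaves:

from omegaconf import OmegaConf

# "???" marks a mandatory value in OmegaConf/Hydra configs; it must be supplied
# at launch time, e.g. via an override such as
#   step.gwas_catalog_study_curation_out=gs://some-bucket/curation_update.tsv  (hypothetical path)
cfg = OmegaConf.create({"gwas_catalog_study_curation_out": "???"})
print(OmegaConf.is_missing(cfg, "gwas_catalog_study_curation_out"))  # True
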
4 changes: 3 additions & 1 deletion config/step/gwas_catalog_ingestion.yaml
@@ -1,8 +1,10 @@
-_target_: otg.gwas_catalog.GWASCatalogStep
+_target_: otg.gwas_catalog_ingestion.GWASCatalogIngestionStep
catalog_study_files: ${datasets.catalog_studies}
catalog_ancestry_files: ${datasets.catalog_ancestries}
catalog_associations_file: ${datasets.catalog_associations}
catalog_sumstats_lut: ${datasets.catalog_sumstats_lut}
variant_annotation_path: ${datasets.variant_annotation}
catalog_studies_out: ${datasets.catalog_study_index}
catalog_associations_out: ${datasets.catalog_study_locus}
gwas_catalog_study_curation_file: ${datasets.gwas_catalog_study_curation}
inclusion_list_path: ???
10 changes: 10 additions & 0 deletions config/step/gwas_study_inclusion.yaml
@@ -0,0 +1,10 @@
_target_: otg.gwas_catalog_study_inclusion.GWASCatalogInclusionGenerator
catalog_study_files: ${datasets.catalog_studies}
catalog_ancestry_files: ${datasets.catalog_ancestries}
catalog_associations_file: ${datasets.catalog_associations}
variant_annotation_path: ${datasets.variant_annotation}
gwas_catalog_study_curation_file: ${datasets.gwas_catalog_study_curation}
harmonised_study_file: ???
criteria: ???
inclusion_list_path: ???
exclusion_list_path: ???
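
To illustrate what this step is configured to produce (a rough sketch only, not the PR's GWASCatalogInclusionGenerator, which operates on the Spark-based study index): studies that pass the chosen criteria go to an inclusion list and the rest to an exclusion list. The field name isFlagged below is a hypothetical stand-in for whatever curation flag the real implementation checks.

# Hypothetical sketch of inclusion/exclusion list generation.
def split_studies(studies, criteria, harmonised_ids):
    """Split study identifiers into (inclusion, exclusion) lists.

    studies: iterable of dicts with at least a "studyId" key
    criteria: "curation" or "summary_stats"
    harmonised_ids: set of study identifiers with harmonised summary statistics
    """
    included, excluded = [], []
    for study in studies:
        if criteria == "curation":
            eligible = not study.get("isFlagged", False)  # assumed flag name
        else:  # "summary_stats"
            eligible = study["studyId"] in harmonised_ids
        (included if eligible else excluded).append(study["studyId"])
    return included, excluded

inc, exc = split_studies(
    [{"studyId": "GCST001"}, {"studyId": "GCST002", "isFlagged": True}],
    criteria="curation",
    harmonised_ids=set(),
)
print(inc, exc)  # ['GCST001'] ['GCST002']
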
5 changes: 5 additions & 0 deletions config/step/ld_based_clumping.yaml
@@ -0,0 +1,5 @@
_target_: otg.ld_based_clumping.LdBasedClumpingStep
study_locus_input_path: ???
ld_index_path: ???
study_index_path: ???
clumped_study_locus_output_path: ???
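
For orientation, LD-based clumping merges associated variants that are in high linkage disequilibrium with a more significant lead variant into that lead's locus. The toy sketch below (plain Python, not the repository's Spark-based LdBasedClumpingStep) shows the idea; the variant names and the r2 threshold are illustrative.

# Toy LD-based clumping: variants in high LD (r2 >= threshold) with a more
# significant, already-selected lead are absorbed into that lead's clump.
def ld_clump(p_values, r2, threshold=0.5):
    """p_values: dict variant -> p-value; r2: dict (variant_a, variant_b) -> r2."""
    leads = []
    for variant in sorted(p_values, key=p_values.get):  # most significant first
        tagged = any(
            r2.get((variant, lead), r2.get((lead, variant), 0.0)) >= threshold
            for lead in leads
        )
        if not tagged:
            leads.append(variant)
    return leads

print(ld_clump({"rs1": 1e-10, "rs2": 1e-6, "rs3": 1e-7}, {("rs1", "rs2"): 0.9}))
# ['rs1', 'rs3']  (rs2 is tagged by rs1)
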
5 changes: 5 additions & 0 deletions config/step/window_based_clumping.yaml
@@ -0,0 +1,5 @@
_target_: otg.window_based_clumping.WindowBasedClumpingStep
summary_statistics_input_path: ???
study_locus_output_path: ???
inclusion_list_path: ???
locus_collect_distance: null
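
Window-based clumping, by contrast, needs no LD information: within a fixed distance window only the most significant variant is kept. A minimal sketch follows (illustrative only; the real WindowBasedClumpingStep runs on Spark summary-statistics datasets, and the 500 kb window is an assumed default):

# Illustrative window-based clumping over (position, p_value) hits on one chromosome.
def window_clump(hits, window=500_000):
    """Keep the best hit per window: walk hits by ascending p-value and drop
    anything within `window` base pairs of an already selected hit."""
    selected = []
    for pos, pval in sorted(hits, key=lambda h: h[1]):
        if all(abs(pos - kept_pos) > window for kept_pos, _ in selected):
            selected.append((pos, pval))
    return sorted(selected)

print(window_clump([(1_000_000, 1e-12), (1_200_000, 1e-8), (3_000_000, 5e-9)]))
# [(1000000, 1e-12), (3000000, 5e-09)]
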
5 changes: 0 additions & 5 deletions docs/python_api/step/clump.md

This file was deleted.

5 changes: 5 additions & 0 deletions docs/python_api/step/gwas_catalog_curation.md
@@ -0,0 +1,5 @@
---
title: Apply in-house curation on GWAS Catalog studies
---

::: otg.gwas_catalog_study_curation.GWASCatalogStudyCurationStep
5 changes: 5 additions & 0 deletions docs/python_api/step/gwas_catalog_inclusion.md
@@ -0,0 +1,5 @@
---
title: Generate inclusion and exclusion lists for GWAS Catalog study ingestion
---

::: otg.gwas_catalog_study_inclusion.GWASCatalogInclusionGenerator
5 changes: 5 additions & 0 deletions docs/python_api/step/ld_clump.md
@@ -0,0 +1,5 @@
---
title: LD-based clumping
---

::: otg.ld_based_clumping.LdBasedClumpingStep
5 changes: 5 additions & 0 deletions docs/python_api/step/window_based_clumping.md
@@ -0,0 +1,5 @@
---
title: Window-based clumping
---

::: otg.window_based_clumping.WindowBasedClumpingStep
7 changes: 2 additions & 5 deletions src/airflow/dags/common_airflow.py
@@ -19,8 +19,7 @@
from pathlib import Path

# Code version. It has to be repeated here as well as in `pyproject.toml`, because Airflow isn't able to look at files outside of its `dags/` directory.
OTG_VERSION = "1.0.0"

OTG_VERSION = "0.0.0"

# Cloud configuration.
GCP_PROJECT = "open-targets-genetics-dev"
@@ -29,7 +28,6 @@
GCP_DATAPROC_IMAGE = "2.1"
GCP_AUTOSCALING_POLICY = "otg-etl"


# Cluster init configuration.
INITIALISATION_BASE_PATH = (
f"gs://genetics_etl_python_playground/initialisation/{OTG_VERSION}"
@@ -40,18 +38,17 @@
f"{INITIALISATION_BASE_PATH}/install_dependencies_on_cluster.sh"
]


# CLI configuration.
CLUSTER_CONFIG_DIR = "/config"
CONFIG_NAME = "config"
PYTHON_CLI = "cli.py"


# Shared DAG construction parameters.
shared_dag_args = {
"owner": "Open Targets Data Team",
"retries": 0,
}

shared_dag_kwargs = {
"tags": ["genetics_etl", "experimental"],
"start_date": pendulum.now(tz="Europe/London").subtract(days=1),
1 change: 0 additions & 1 deletion src/airflow/dags/gwas_catalog_harmonisation.py
@@ -99,7 +99,6 @@ def submit_jobs(**kwargs: Any) -> None:
],
)

# list_inputs >>
(
[list_inputs, list_outputs]
>> create_to_do_list()
181 changes: 139 additions & 42 deletions src/airflow/dags/gwas_catalog_preprocess.py
@@ -5,92 +5,189 @@

import common_airflow as common
from airflow.models.dag import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.operators.gcs import GCSListObjectsOperator
from airflow.utils.task_group import TaskGroup
from airflow.utils.trigger_rule import TriggerRule

CLUSTER_NAME = "otg-preprocess-gwascatalog"
AUTOSCALING = "otg-preprocess-gwascatalog"

SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised"
RELEASEBUCKET = "gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX"
RELEASEBUCKET_NAME = "genetics_etl_python_playground"
SUMMARY_STATS_BUCKET_NAME = "open-targets-gwas-summary-stats"
SUMSTATS = "gs://open-targets-gwas-summary-stats/harmonised"
MANIFESTS_PATH = f"{RELEASEBUCKET}/manifests/"


def upload_harmonized_study_list(
concatenated_studies: str, bucket_name: str, object_name: str
) -> None:
"""This function uploads file to GCP.

Args:
concatenated_studies (str): Concatenated list of harmonized summary statistics.
bucket_name (str): Bucket name
object_name (str): Name of the object
"""
hook = GCSHook(gcp_conn_id="google_cloud_default")
hook.upload(
bucket_name=bucket_name,
object_name=object_name,
data=concatenated_studies,
encoding="utf-8",
)


with DAG(
dag_id=Path(__file__).stem,
description="Open Targets Genetics — GWAS Catalog preprocess",
default_args=common.shared_dag_args,
**common.shared_dag_kwargs,
):
with TaskGroup(group_id="summary_stats_preprocessing") as summary_stats_group:
summary_stats_window_clumping = common.submit_step(
# Getting the list of folders (one per GWAS study with harmonised summary statistics)
list_harmonised_sumstats = GCSListObjectsOperator(
task_id="list_harmonised_parquet",
bucket=SUMMARY_STATS_BUCKET_NAME,
prefix="harmonised",
match_glob="**/_SUCCESS",
)

# Upload the resulting list to a bucket:
upload_task = PythonOperator(
task_id="uploader",
python_callable=upload_harmonized_study_list,
op_kwargs={
"concatenated_studies": '{{ "\n".join(ti.xcom_pull( key="return_value", task_ids="list_harmonised_parquet")) }}',
"bucket_name": RELEASEBUCKET_NAME,
"object_name": "output/python_etl/parquet/XX.XX/manifests/harmonised_sumstats.txt",
},
)
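# The "concatenated_studies" value above is an Airflow Jinja template: at run
# time it pulls the list of "**/_SUCCESS" object paths returned by the
# "list_harmonised_parquet" task via XCom and joins them with newlines, so the
# uploaded manifest holds one harmonised study path per line.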

# Processing curated GWAS Catalog data:
with TaskGroup(group_id="curation_processing") as curation_processing:
# Generate inclusion list:
curation_calculate_inclusion_list = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="clump",
task_id="catalog_sumstats_window_clumping",
step_id="gwas_study_inclusion",
task_id="catalog_curation_inclusion_list",
other_args=[
f"step.input_path={SUMSTATS}",
f"step.clumped_study_locus_path={RELEASEBUCKET}/study_locus/window_clumped/from_sumstats/catalog",
"step.criteria=curation",
f"step.inclusion_list_path={MANIFESTS_PATH}manifest_curation",
f"step.exclusion_list_path={MANIFESTS_PATH}exclusion_curation",
f"step.harmonised_study_file={MANIFESTS_PATH}harmonised_sumstats.txt",
],
)
summary_stats_ld_clumping = common.submit_step(

# Ingest curated associations from GWAS Catalog:
curation_ingest_data = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="clump",
task_id="catalog_sumstats_ld_clumping",
step_id="gwas_catalog_ingestion",
task_id="ingest_curated_gwas_catalog_data",
other_args=[f"step.inclusion_list_path={MANIFESTS_PATH}manifest_curation"],
)

# Run LD-annotation and clumping on curated data:
curation_ld_clumping = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="ld_based_clumping",
task_id="catalog_curation_ld_clumping",
other_args=[
f"step.input_path={RELEASEBUCKET}/study_locus/window_clumped/from_sumstats/catalog",
"step.ld_index_path={RELEASEBUCKET}/ld_index",
"step.study_index_path={RELEASEBUCKET}/study_index/catalog",
"step.clumped_study_locus_path={RELEASEBUCKET}/study_locus/ld_clumped/from_sumstats/catalog",
f"step.study_locus_input_path={RELEASEBUCKET}/study_locus/catalog_curated",
f"step.ld_index_path={RELEASEBUCKET}/ld_index",
f"step.study_index_path={RELEASEBUCKET}/study_index/catalog",
f"step.clumped_study_locus_output_path={RELEASEBUCKET}/study_locus/ld_clumped/catalog_curated",
],
trigger_rule=TriggerRule.ALL_DONE,
)
summary_stats_pics = common.submit_step(

# Do PICS-based fine-mapping:
curation_pics = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="pics",
task_id="catalog_sumstats_pics",
task_id="catalog_curation_pics",
other_args=[
"step.study_locus_ld_annotated_in={RELEASEBUCKET}/study_locus/ld_clumped/from_sumstats/catalog",
"step.picsed_study_locus_out={RELEASEBUCKET}/credible_set/from_sumstats/catalog",
f"step.study_locus_ld_annotated_in={RELEASEBUCKET}/study_locus/ld_clumped/catalog_curated",
f"step.picsed_study_locus_out={RELEASEBUCKET}/credible_set/catalog_curated",
],
trigger_rule=TriggerRule.ALL_DONE,
)
summary_stats_window_clumping >> summary_stats_ld_clumping >> summary_stats_pics

with TaskGroup(group_id="curation_preprocessing") as curation_group:
parse_study_and_curated_assocs = common.submit_step(
# Define order of steps:
(
curation_calculate_inclusion_list
>> curation_ingest_data
>> curation_ld_clumping
>> curation_pics
)

# Processing summary statistics from GWAS Catalog:
with TaskGroup(
group_id="summary_satistics_processing"
) as summary_satistics_processing:
# Generate inclusion study lists:
summary_stats_calculate_inclusion_list = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="gwas_catalog_ingestion",
task_id="catalog_ingestion",
step_id="gwas_study_inclusion",
task_id="catalog_sumstats_inclusion_list",
other_args=[
"step.criteria=summary_stats",
f"step.inclusion_list_path={MANIFESTS_PATH}manifest_sumstats",
f"step.exclusion_list_path={MANIFESTS_PATH}exclusion_sumstats",
f"step.harmonised_study_file={MANIFESTS_PATH}harmonised_sumstats.txt",
],
)

curation_ld_clumping = common.submit_step(
# Run window-based clumping:
summary_stats_window_based_clumping = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="clump",
task_id="catalog_curation_ld_clumping",
step_id="window_based_clumping",
task_id="catalog_sumstats_window_clumping",
other_args=[
"step.input_path={RELEASEBUCKET}/study_locus/catalog_curated",
"step.ld_index_path={RELEASEBUCKET}/ld_index",
"step.study_index_path={RELEASEBUCKET}/study_index/catalog",
"step.clumped_study_locus_path={RELEASEBUCKET}/study_locus/ld_clumped/catalog_curated",
f"step.summary_statistics_input_path={SUMSTATS}",
f"step.study_locus_output_path={RELEASEBUCKET}/study_locus/window_clumped/from_sumstats/catalog",
f"step.inclusion_list_path={MANIFESTS_PATH}manifest_sumstats",
],
trigger_rule=TriggerRule.ALL_DONE,
)

curation_pics = common.submit_step(
# Run LD-based clumping:
summary_stats_ld_clumping = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="ld_based_clumping",
task_id="catalog_sumstats_ld_clumping",
other_args=[
f"step.study_locus_input_path={RELEASEBUCKET}/study_locus/window_clumped/from_sumstats/catalog",
f"step.ld_index_path={RELEASEBUCKET}/ld_index",
f"step.study_index_path={RELEASEBUCKET}/study_index/catalog",
f"step.clumped_study_locus_output_path={RELEASEBUCKET}/study_locus/ld_clumped/from_sumstats/catalog",
],
)

# Run PICS finemapping:
summary_stats_pics = common.submit_step(
cluster_name=CLUSTER_NAME,
step_id="pics",
task_id="catalog_curation_pics",
task_id="catalog_sumstats_pics",
other_args=[
"step.study_locus_ld_annotated_in={RELEASEBUCKET}/study_locus/ld_clumped/catalog_curated",
"step.picsed_study_locus_out={RELEASEBUCKET}/credible_set/catalog_curated",
f"step.study_locus_ld_annotated_in={RELEASEBUCKET}/study_locus/ld_clumped/from_sumstats/catalog",
f"step.picsed_study_locus_out={RELEASEBUCKET}/credible_set/from_sumstats/catalog",
],
trigger_rule=TriggerRule.ALL_DONE,
)
parse_study_and_curated_assocs >> curation_ld_clumping >> curation_pics

# Order of steps within the group:
(
summary_stats_calculate_inclusion_list
>> summary_stats_window_based_clumping
>> summary_stats_ld_clumping
>> summary_stats_pics
)

# DAG description:
(
common.create_cluster(
CLUSTER_NAME, autoscaling_policy=AUTOSCALING, num_workers=5
)
>> common.install_dependencies(CLUSTER_NAME)
>> [summary_stats_group, curation_group]
>> common.delete_cluster(CLUSTER_NAME)
>> list_harmonised_sumstats
>> upload_task
>> curation_processing
>> summary_satistics_processing
)