Commit

Merge pull request #2 from EBISPOT/Lizzy-suggestions
Lizzy suggestions
jiyue1214 authored Oct 28, 2024
2 parents 8ae9c61 + a80d6ef commit ada15c6
Showing 12 changed files with 55 additions and 56 deletions.
22 changes: 11 additions & 11 deletions docs/Explanation/output-file-explanation.md
@@ -38,31 +38,31 @@ This step lifts variants to the desired genome assembly (GRCh38) using the proce

## Orientation of palindromic variants outputs
This step infers the strand orientation of palindromic variants using a strand consensus approach. It comprises two processes: `./modules/local/ten_percent_counts.nf` and `./modules/local/ten_percent_counts_sum.nf`.
- `./ten_sc/ten_percent_chr*.sc`: For variants in each chromosome, this output summarizes the number of:
- Forward strand variant
- `./ten_sc/ten_percent_chr*.sc`: For variants in each chromosome, this output summarises the number of:
- Forward strand variants
- Reverse strand variants
- Palindormic variant
- Palindromic variants
- No VCF record found
- Invalid variant for harmonisation
- Invalid variants for harmonisation
- `ten_percent_total_strand_count.tsv`: This file provides an overview across all chromosomes, detailing the counts of each variant type. It calculates the strand consensus ratio and infers the mode for palindromic variants:
- forward: Infers that the palindromic variant is on the forward strand
- reverse: Infers that the palindromic variant is on the reverse strand
drop: Indicates that the strand of the palindromic variant cannot be inferred, and these variants are dropped from harmonization.
- drop: Indicates that the strand of the palindromic variant cannot be inferred, and these variants are dropped from harmonisation.

## Harmonising the variants outputs
## Harmonising the variant outputs
This process aligns each variant with the reference and makes necessary changes to the corresponding values. It is the product of the process `./modules/local/harmonization.nf`
- `harmonization/chr*.merged.hm`: Contains the harmonized results for each chromosome.
- `harmonization/chr*.merged.log.tsv.gz`: This log file summarises the number of variants for each `hm_code` that appears in this chromosome. Please refer to this [file](../Introduction/Harmonising-the-variants.mdx) for more information about the `hm_code`.

## Quality control outputs
The QC process involves filtering out variants that lack valid values in essential columns. It includes results from the process `./modules/local/qc.nf`.
- `qc/harmonised.tsv`: This file includes the harmonized results from all chromosomes.
- `qc/harmonised.qc.tsv`: This is harmonised result only contains variants without missing values in [essential columns](../Tutorials/Preparing-Input-Files#data-requirement).
- `qc/harmonised.tsv`: This file includes the harmonised results from all chromosomes.
- `qc/harmonised.qc.tsv`: This file includes the harmonised results of variants without missing values in [essential columns](../Tutorials/Preparing-Input-Files#data-requirement).
- `qc/report.txt`: This file documents the rows that were removed during this QC step.
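
As a rough illustration of the filtering described above, the sketch below separates rows that have a missing value in an essential column from those that do not. It is not the code in `qc.nf`: the function name `split_by_qc`, the column names (taken from the standard headers) and the set of missing-value markers are assumptions made for this example.

```python
import csv
import io

# Assumed essential columns and missing-value markers (illustrative only).
ESSENTIAL = ("chromosome", "base_pair_location", "p_value")
MISSING = {"", "NA", "NaN", "None"}


def split_by_qc(tsv_text, essential=ESSENTIAL):
    """Return (kept_rows, removed_rows) for a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept, removed = [], []
    for row in reader:
        # A row fails QC if any essential column holds a missing marker.
        if any((row.get(col) or "").strip() in MISSING for col in essential):
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


example = (
    "chromosome\tbase_pair_location\teffect_allele\tother_allele\tp_value\trsid\n"
    "1\t693730\tA\tG\t0.1\tNA\n"
    "1\tNA\tA\tG\t0.2\trs1\n"
)
kept, removed = split_by_qc(example)
print(len(kept), "kept;", len(removed), "removed")
```

In this toy input the first row is kept even though its `rsid` is `NA`, because `rsid` is not treated as essential here; the second row is removed because its position is missing.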

## Final outputs
The `final` folder contains well-compressed and organized final outputs:
- `final/random_name.h.tsv.gz`: Bgzip-compressed and sorted harmonization results from all chromsomes.
The `final` folder contains well-compressed and organised final outputs:
- `final/random_name.h.tsv.gz`: Bgzip-compressed and sorted harmonisation results from all chromosomes.
- `final/random_name.h.tsv.gz-meta.yaml`: Updated YAML file for `final/random_name.h.tsv.gz`.
- `final/random_name.h.tsv.gz.tbi`: Tabix file for `final/random_name.h.tsv.gz`
- `final/random_name.running.log`: Summary log recording information about the pipeline, reference, genome build results, inferred strand of palindromic SNPs, and the number as well as percentage of variants that were successfully harmonized or failed.
- `final/random_name.running.log`: Summary log recording information about the pipeline, reference, genome build results, inferred strand of palindromic SNPs, and the number as well as percentage of variants that were successfully harmonised or failed.
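
Because bgzip output is also valid gzip, the final file can be inspected with ordinary gzip tooling. The sketch below reads the first few rows with the Python standard library; the helper name `read_first_rows` and the demo file are hypothetical, and the demo columns are only a small subset chosen for illustration.

```python
import csv
import gzip
import os
import tempfile


def read_first_rows(path, limit=5):
    """Read up to `limit` rows from a (b)gzip-compressed TSV as dictionaries."""
    rows = []
    # "rt" opens the compressed file in text mode; bgzip files are valid gzip.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            rows.append(row)
            if len(rows) >= limit:
                break
    return rows


# Build a tiny stand-in file so the example runs without pipeline output.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as out:
    out.write("chromosome\tbase_pair_location\tp_value\n22\t15925047\t0.77\n")

rows = read_first_rows(demo_path)
print(rows)
```

This reads the file sequentially. Region queries that use the `.tbi` index need a tabix-aware tool, which is outside the scope of this sketch.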
28 changes: 14 additions & 14 deletions docs/Explanation/output-folder-structure.md
@@ -13,12 +13,12 @@ sidebar_position: 2
| 22 | 15925047 | A | G | -0.00477642 | 0.0164749 | 0.089851 | 0.77 | rs376238049 | ref_rs376238049 | 0.02 | lo | 0.03 | 12 | 22_15925047_G_A |

</details>
* The harmonized result file represents the harmonised [mandatory columns](https://www.ebi.ac.uk/gwas/docs/summary-statistics-format) in a specific order, followed by the remaining columns from the original file in their original order.
* All values in this file reflect the harmonized results.
* The harmonised result file represents the harmonised [mandatory columns](https://www.ebi.ac.uk/gwas/docs/summary-statistics-format) in a specific order, followed by the remaining columns from the original file in their original order.
* All values in this file reflect the harmonised results.
* All the variants in this file are sorted by chr and position and compressed using bgzip
* In addition to the columns from the original file, two extra columns are included:
* `hm_coordinate_conversion` Describes how this variant was mapped to the target genome.
* `harmonisation code` A code assigned to each record indicating the harmonization process that was applied.
* `harmonisation code` A code assigned to each record indicating the harmonisation process that was applied.
* Please refer to this [page](../Reference-guide/Hm_code.md) for more detailed information.

### YAML file for harmonised sumstat
@@ -36,15 +36,15 @@ sidebar_position: 2
is_sorted: true
```
</details>
This YAML file provides metadata about the harmonized result, including:
This YAML file provides metadata about the harmonised result, including:

* Whether the file is harmonized and sorted
* The reference used for harmonization
* Whether the file is harmonised and sorted
* The reference used for harmonisation
* The current genome build and coordinate system
* The md5sum for file integrity verification

### Tabix file for final harmonised sumstst
A tabix index file of the harmonisation result for quick data retrieve purposes
### Tabix file for final harmonised sumstat
A tabix index file of the harmonisation result for quick data retrieval purposes

### Running log summarising the whole harmonisation process
<details>
@@ -140,17 +140,17 @@ Result SUCCESS_HARMONIZATION
```
</details>
The running log file provides detailed information about the harmonization process, including:
The running log file provides detailed information about the harmonisation process, including:

* The pipeline version and the date of harmonization
* The pipeline version and the date of harmonisation
* The reference VCF file and dbSNP version used
* A summary of the genome build mapping results, reporting the number and percentage of variants dropped during this step
* The orientation inferred for palindromic variants and the strand consensus ratio
* The number and percentage of variants successfully harmonized for each `hm_code`
* The number and percentage of variants that failed to be harmonized for each `hm_code`
* The number and percentage of variants successfully harmonised for each `hm_code`
* The number and percentage of variants that failed to be harmonised for each `hm_code`

:::info[Harmonised result before April 2023]
Starting in April 2023, with the release of the GWAS-SSF standard by the GWAS-Catalog, we began retaining only the harmonized results in the final `*.h.tsv` file to ensure consistency with the input file and reduce redundancy.
Starting in April 2023, with the release of the GWAS-SSF standard by the GWAS-Catalog, we began retaining only the harmonised results in the final `*.h.tsv` file to ensure consistency with the input file and reduce redundancy.

For files harmonized before this date, you will see two outputs for each summary statistic: one harmonized result (`*.h.tsv.gz`) and one YAML file (`*.h.tsv.gz-meta.yaml`). The harmonization process remains the same, but there is a slight difference in how data is represented in the `*.h.tsv.gz`. In these older harmonized files, the harmonized values are listed in columns starting with `hm_`, such as `hm_chromosome`.
For files harmonised before this date, you will see two outputs for each summary statistic: one harmonised result (`*.h.tsv.gz`) and one YAML file (`*.h.tsv.gz-meta.yaml`). The harmonisation process remains the same, but there is a slight difference in how data is represented in the `*.h.tsv.gz`. In these older harmonised files, the harmonised values are listed in columns starting with `hm_`, such as `hm_chromosome`.
:::
4 changes: 2 additions & 2 deletions docs/Introduction/Genome-Build-Mapping.md
@@ -7,12 +7,12 @@ sidebar_position: 2
The first step in harmonizing variant data is updating the genomic coordinates to the desired assembly (GRCh38). The pipeline follows a systematic approach to ensure high-confidence mapping of each variant's position:

### Step 1: Mapping by rsID using Ensembl (v95)
The pipeline first attempts to update the variant's base pair location by mapping its rsID to the variants from the Ensembl reference database. If successful, the field `hm_coordinate_conversion` is set to `rs` to these variants in the output file [`${chr}.merges`](../Explanation/output-file-explanation#genome-build-mapping-outputs), indicating that the variant's position was determined through rsID-based mapping.
The pipeline first attempts to update the variant's base pair location by mapping its rsID to the variants from the Ensembl reference database. If successful, the field `hm_coordinate_conversion` is set to `rs` for these variants in the output file [`${chr}.merges`](../Explanation/output-file-explanation#genome-build-mapping-outputs), indicating that the variant's position was determined through rsID-based mapping.

### Step 2: Liftover to the latest genome build
For variants where rsID mapping is not possible, the pipeline uses the UCSC LiftOver tool to lift the coordinates from an older genome build to GRCh38. If successful, the field `hm_coordinate_conversion` is set to `lo`, indicating that the base pair location was updated by lifting over the original coordinates.

### Step 3: Variant removal
If neither rsID mapping nor liftover is successful, the variant is removed from the file and stored in [`unmapped`](../Explanation/output-file-explanation#genome-build-mapping-outputs) output. This ensures that only high-confidence variants with validated genomic positions are retained in the final dataset.

This process is recorded in the `hm_coordinate_conversion` field in the harmonised data file to provide traceability for how each variant's genomic position was determined.
This process is recorded in the `hm_coordinate_conversion` field in the harmonised data file to provide traceability for how each variant's genomic position was determined.
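
The three steps above amount to a simple fallback chain, which can be sketched as follows. This is an illustration rather than the pipeline's implementation: `map_position`, the `rsid_lookup` dictionary and the `liftover` callable are hypothetical stand-ins for the Ensembl lookup and the UCSC LiftOver tool.

```python
def map_position(rsid, chrom, pos, rsid_lookup, liftover):
    """Return (chromosome, position, hm_coordinate_conversion), or None if unmapped."""
    # Step 1: prefer the position recorded for the rsID in the reference.
    if rsid is not None and rsid in rsid_lookup:
        new_chrom, new_pos = rsid_lookup[rsid]
        return new_chrom, new_pos, "rs"
    # Step 2: fall back to lifting over the original coordinates.
    lifted = liftover(chrom, pos)
    if lifted is not None:
        return lifted[0], lifted[1], "lo"
    # Step 3: neither route worked, so the variant belongs in the unmapped output.
    return None


# Toy stand-ins for the reference lookup and the liftover tool.
toy_lookup = {"rs1": ("1", 1000)}


def toy_liftover(chrom, pos):
    # Pretend only chromosome 2 can be lifted, with a fixed offset.
    return (chrom, pos + 100) if chrom == "2" else None


results = [
    map_position("rs1", "1", 900, toy_lookup, toy_liftover),
    map_position(None, "2", 500, toy_lookup, toy_liftover),
    map_position("rs9", "3", 700, toy_lookup, toy_liftover),
]
print(results)
```

Returning `None` for the third variant mirrors the removal step: the caller would write that record to the unmapped output instead of the harmonised file.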
4 changes: 2 additions & 2 deletions docs/Introduction/Orientation-of-palindromic-variants.md
@@ -16,7 +16,7 @@ The pipeline uses the following method to infer the orientation of palindromic v
* `forward/(forward + reverse)`, or
* `reverse/(forward + reverse)`.

This rate reflects the proportion of variants aligning to the forward strand or (reverse strand).
This rate reflects the proportion of variants aligning to the forward strand (or reverse strand).

### Step 3: Inferring the strand of palindromic variants
The consensus rate is then used to infer the orientation of palindromic variants. To minimize sampling bias and ensure accurate orientation, the pipeline applies the following thresholds:
@@ -27,4 +27,4 @@ The consensus rate is then used to infer the orientation of palindromic variants
- If the recalculated rate is > 0.99, the palindromic variants are inferred to be aligned to the forward (or reverse) strand and harmonised accordingly.
- If the recalculated rate is ≤ 0.99, the palindromic variants are dropped from harmonisation to prevent errors.
- If the consensus rate is ≤ 0.9
- The palindromic variants are excluded from further harmonisation steps to ensure data integrity.
- The palindromic variants are excluded from further harmonisation steps to ensure data integrity.
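
The rate calculation and the two threshold rules visible in this excerpt can be sketched as below. The function names are hypothetical, and the intermediate thresholds that decide when a rate is recalculated are not shown here, so they are deliberately left out of the sketch.

```python
def consensus_rates(forward, reverse):
    """Return (forward_rate, reverse_rate) from counts of non-palindromic variants."""
    total = forward + reverse
    if total == 0:
        raise ValueError("no forward or reverse strand variants to count")
    return forward / total, reverse / total


def is_excluded_outright(forward, reverse, floor=0.9):
    """True when the consensus rate is <= floor for both strands."""
    return max(consensus_rates(forward, reverse)) <= floor


def mode_from_recalculated(forward, reverse, threshold=0.99):
    """Apply the recalculated-rate rule: infer a strand only above the threshold."""
    forward_rate, reverse_rate = consensus_rates(forward, reverse)
    if forward_rate > threshold:
        return "forward"
    if reverse_rate > threshold:
        return "reverse"
    return "drop"


print(mode_from_recalculated(995, 5))
print(mode_from_recalculated(980, 20))
print(is_excluded_outright(85, 15))
```

The comparisons are strict (`>`) for inference and inclusive (`<=`) for exclusion, matching the wording of the rules above.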
6 changes: 3 additions & 3 deletions docs/Introduction/Overview.md
@@ -11,8 +11,8 @@ The `gwas-sumstats-harmoniser` is a pipeline designed to standardise variant da

2. Palindromic Variant Orientation: Inferring strand orientation of palindromic variants using a strand consensus approach.

3. Variant Harmonization: Matching and aligning variants with those in a reference dataset to ensure allele consistency and orientation to the forward strand.
3. Variant Harmonisation: Matching and aligning variants with those in a reference dataset to ensure allele consistency and orientation to the forward strand.

4. Quality control: Removing variants that containing missing value in essential columns (chromosome, base pair location, or p-value).
4. Quality control: Removing variants missing any essential column value (chromosome, base pair location, or p-value).

![nextflow workflow](../img/Harmonisation.png)
![nextflow workflow](../img/Harmonisation.png)
2 changes: 1 addition & 1 deletion docs/Reference-guide/Hm_code.md
@@ -9,7 +9,7 @@ sidebar_position: 1
| lo | liftover base pair location to the target genome build (GRCh 38) |

`hm_code`
| Code | Description of Harmonization Process |
| Code | Description of Harmonisation Process |
|------|-------------------------------------------------------------------------|
| 1 | Palindromic; Infer strand; Forward strand; Alleles correct |
| 2 | Palindromic; Infer strand; Forward strand; Flipped alleles |
4 changes: 2 additions & 2 deletions docs/Reference-guide/Usefult-link.md
@@ -10,7 +10,7 @@ sidebar_position: 2

## Sumstat formatting tools
1. [UI formatting tool: SSF-morph](https://ebispot.github.io/gwas-sumstats-tools-ssf-morph/)
2. [CLI formatting tool: gwas-sumstats-tools](https://github.com/EBISPOT/gwas-sumstats-tools-ssf-morph)
2. [CLI formatting tool: gwas-sumstats-tools](https://github.com/EBISPOT/gwas-sumstats-tools)

## Nextflow documentation:
1. [Nextflow documentation](https://www.nextflow.io/docs/latest/index.html)
1. [Nextflow documentation](https://www.nextflow.io/docs/latest/index.html)
13 changes: 6 additions & 7 deletions docs/Tutorials/Preparing-Input-Files.mdx
@@ -29,13 +29,12 @@ The pipeline requires a tab-separated values (TSV) file with a standardised head
* other allele: The non-effect allele.
* p-value: The statistical significance of the variant.

This is the minimum requirement to run the pipeline. However, if you have beta,
odds_ratio, z-score or effect_allele_frequency, you have also give them standard headers to make sure they can be recognise correctly by the pipeline to be harmonised. Please refer to [GWAS Catalog webiste](https://www.ebi.ac.uk/gwas/docs/summary-statistics-format) for more inforation about standard headers.
This is the minimum requirement to run the pipeline. However, if you have beta, odds_ratio, hazard_ratio, z_score or effect_allele_frequency, these should also be given as standard headers to ensure they are recognised by the pipeline. Please refer to [GWAS Catalog website](https://www.ebi.ac.uk/gwas/docs/summary-statistics-format) for more information about standard headers.

Ensure that required columns do not contain missing values. Non-required fields may contain pandas-recognised missing value markers (e.g., NA, NaN, None) and will be processed without issue.

<details>
<summary>Sumstat inpus file example: <code>gwas_sumstat_name.tsv</code></summary>
<summary>Sumstats input file example: <code>gwas_sumstat_name.tsv</code></summary>
```tsv
chromosome base_pair_location effect_allele other_allele p_value rsid
1 693730 A G 0.1 NA
@@ -51,7 +50,7 @@ The process for preparing your input data depends on the number of summary stati

* For a few sumstats requiring significant modifications: We recommend using our online formatter tool, [`SSF-morph`](https://ebispot.github.io/gwas-sumstats-tools-ssf-morph/), to prepare your input files. This tool simplifies the reformatting process and ensures compatibility with the pipeline.

* For a large number of sumstats already in TSV format: You can customize the header recognition directly in the pipeline code. This allows you to quickly adapt the pipeline to recognize different header formats without manually editing each file.
* For a large number of sumstats already in TSV format: You can customize the header recognition directly in the pipeline code. This allows you to quickly adapt the pipeline to recognise different header formats without manually editing each file.

<details>
<summary>Customise your header recognition</summary>
@@ -72,7 +71,7 @@ The process for preparing your input data depends on the number of summary stati
The pipeline requires a YAML file for each sumstat, containing essential metadata.

<details>
<summary>YAML inpus file example: <code>gwas_sumstat_name.tsv-meta.yaml</code></summary>
<summary>YAML input file example: <code>gwas_sumstat_name.tsv-meta.yaml</code></summary>
```YAML
# Study meta-data
date_metadata_last_modified: 2023-02-09
@@ -95,6 +94,6 @@ The pipeline requires a YAML file for each sumstat, containing essential metadat
While all fields in the YAML file are required for the pipeline to run, **only** the <Highlight color="#25c2a0"> genome_assembly</Highlight> and <Highlight color="#25c2a0">coordinate_system</Highlight> fields must be accurate for proper harmonisation.

#### Preparing the YAML data:
You can copy the example YAML file below to create your own. Make sure to adjust the `genome_assembly` and `coordinate_system` fields based on your dataset.
You can copy the example YAML file above to create your own. Make sure to adjust the `genome_assembly` and `coordinate_system` fields based on your dataset.
* The default value for `coordinate_system` is `1-based`.
* There is **no default** value for `genome_assembly`, so you must specify it according to your data.
* There is **no default** value for `genome_assembly`, so you must specify it according to your data.