Commit

Update Preparing-Reference-Files.mdx
Highlight the memory requirement
provide more detailed information on the profile
jiyue1214 authored Nov 1, 2024
1 parent 33d93e4 commit 60b82c8
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions docs/Tutorials/Preparing-Reference-Files.mdx
@@ -5,6 +5,8 @@ sidebar_position: 3

A reference file is a dataset that contains detailed information about known genetic variants, including at least their genomic positions, reference alleles, and alternative alleles. These reference files are crucial for variant harmonisation, as they determine both whether and how your variants will be harmonised.

You can prepare the reference files needed in the harmonisation pipeline by either downloading them from our FTP server **OR** creating your own custom reference files.

## Download from FTP

The current reference files used for harmonisation by the GWAS Catalog are variations in VCF format from Ensembl release 95 (released in January 2019). To run the pipeline, you need both the `VCF` and corresponding `tabix` files, as well as `parquet` files.
@@ -44,6 +46,8 @@ Parameters Explained:
| `--reference` | Prepares the reference model. Downloads VCF files matching `homo_sapiens-${chr}.vcf.gz` from `--remote_vcf_location`, then generates **tabix** and **parquet** files. |
| `--ref` | Specifies the directory where reference files will be stored. |
| `--remote_vcf_location` | Defines the source of the reference VCF files. Default is `ftp://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens`. In this case, it points to Ensembl release 109. |
| `-profile` | Specifies the execution environment profile (e.g., `cluster`,`conda`). |
| `-profile` | Selects a profile, which is a set of configuration attributes applied during pipeline execution. Available profiles include `test` (a quick local run), `standard` (local executor), and `executor` (based on `./config/executor.config`, which uses SLURM by default). Container options are `conda`, `docker`, and `singularity`. |
| `--chrom` | Runs the pipeline for a specific chromosome. For example, to download only `homo_sapiens-chr22.vcf.gz`, use `--chrom 22`. |
| `--chromlist` | Runs the pipeline for multiple chromosomes. For example, use `--chromlist 22,X,Y` to prepare references for chr22, chrX, and chrY. |

Additionally, the pipeline will download all available VCF files in the specified folder by default. If you want to run only specific chromosomes, use the `--chrom` option. For example, to download only `homo_sapiens-chr22.vcf.gz`, use `--chrom 22`; for multiple chromosomes, use `--chromlist 22,X,Y`.
Preparing these references generally requires a large amount of memory. If you choose to run the pipeline with your own reference files, we recommend using an HPC environment. For example, preparing references for Ensembl release 95 requires around **50 GB** of memory.
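
The mapping from a `--chromlist` value to the VCF files that get downloaded can be sketched as follows. This is only an illustration of the `homo_sapiens-chr22.vcf.gz` naming shown in the table, not the pipeline's own code; the helper name `reference_vcf_names` is hypothetical.

```python
def reference_vcf_names(chromlist: str) -> list[str]:
    """Map a comma-separated chromosome list (e.g. "22,X,Y") to VCF file names.

    Hypothetical helper: it only mirrors the file-name pattern given in the
    parameter table and does not download or validate anything.
    """
    names = []
    for chrom in chromlist.split(","):
        chrom = chrom.strip()
        if not chrom:
            continue  # skip empty entries, such as one left by a trailing comma
        names.append(f"homo_sapiens-chr{chrom}.vcf.gz")
    return names


if __name__ == "__main__":
    # Print the file names that correspond to `--chromlist 22,X,Y`.
    for name in reference_vcf_names("22,X,Y"):
        print(name)
```

Passing a single chromosome, as with `--chrom 22`, is the one-element case of the same mapping.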
