Update Preparing-Reference-Files.mdx

Highlight the memory requirement and provide more detailed information on the profile
EBISPOT · Nov 1, 2024 · 60b82c8
1 parent 33d93e4
Showing 1 changed file with 6 additions and 2 deletions.
diff --git a/docs/Tutorials/Preparing-Reference-Files.mdx b/docs/Tutorials/Preparing-Reference-Files.mdx
@@ -5,6 +5,8 @@ sidebar_position: 3
 
 A reference file is a dataset that contains detailed information about known genetic variants, including at least their genomic positions, reference alleles, and alternative alleles. These reference files are crucial for variant harmonisation, as they determine both whether and how your variants will be harmonised.
 
+You can prepare the reference files needed in the harmonisation pipeline by either downloading them from our FTP server **OR** creating your own custom reference files.
+
 ## Download from FTP
 
 The current reference files used for harmonisation by the GWAS Catalog are variations in VCF format from Ensembl release 95 (released in January 2019). To run the pipeline, you need both the `VCF` and corresponding `tabix` files, as well as `parquet` files.
@@ -44,6 +46,8 @@ Parameters Explained:
 | `--reference`            | Prepares the reference model. Downloads VCF files matching `homo_sapiens-${chr}.vcf.gz` from `--remote_vcf_location`, then generates **tabix** and **parquet** files.    |
 | `--ref`                  | Specifies the directory where reference files will be stored.                                                                                                           |
 | `--remote_vcf_location`  | Defines the source of the reference VCF files. Default is `ftp://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens`. In this case, it points to Ensembl release 109. |
-| `-profile`               | Specifies the execution environment profile (e.g., `cluster`,`conda`).                                                                                                |
+| `-profile`               | Selects one or more profiles, each a set of configuration attributes applied during pipeline execution. Available profiles include `test` (quick local run), `standard` (local executor), and `executor` (based on `./config/executor.config`, SLURM by default). Software environment options are `conda`, `docker`, and `singularity`. |
+| `--chrom`                | Runs the pipeline for a specific chromosome. For example, to download only `homo_sapiens-chr22.vcf.gz`, use `--chrom 22`. |
+| `--chromlist`            | Runs the pipeline for multiple chromosomes. For example, use `--chromlist 22,X,Y` to prepare references for chr22, chrX and chrY. |
 
-Additionally, the pipeline will download all available VCF files in the specified folder by default. If you want to run only specific chromosomes, use the `--chrom` option. For example, to download only `homo_sapiens-chr22.vcf.gz`, use `--chrom 22`; for multiple chromosomes, use `--chromlist 22,X,Y`.
+Preparing these references generally requires a large amount of memory. If you choose to run the pipeline with your own reference files, we recommend using an HPC environment. For example, preparing references for Ensembl release 95 requires around **50 GB** of memory.
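
As a rough illustration of how a `--chromlist` value relates to the downloaded files, the sketch below expands a comma-separated list into the VCF file names the table describes. The `homo_sapiens-chr${chr}.vcf.gz` pattern is inferred from the `--chrom 22` example above; the function names are hypothetical and are not part of the pipeline.

```python
def parse_chromlist(chromlist: str) -> list[str]:
    """Split a comma-separated chromosome list such as '22,X,Y'."""
    # Ignore stray whitespace and empty entries (e.g. a trailing comma).
    return [c.strip() for c in chromlist.split(",") if c.strip()]


def expected_vcf_names(chromlist: str) -> list[str]:
    """Return the VCF file names expected for each requested chromosome.

    Assumption: files follow the pattern shown in the --chrom example,
    i.e. homo_sapiens-chr<CHR>.vcf.gz.
    """
    return [f"homo_sapiens-chr{c}.vcf.gz" for c in parse_chromlist(chromlist)]


if __name__ == "__main__":
    for name in expected_vcf_names("22,X,Y"):
        print(name)
```

A check like this can help confirm which files to look for in the `--ref` directory after a partial run, assuming the naming pattern holds for the chosen `--remote_vcf_location`.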