FCS GX input
Required inputs for fcs.py screen genome:
-
Genome sequence file:
--fasta
-
Taxonomy identifier corresponding to source organism:
--tax-id
-
FCS-GX database location:
--gx-db
-
Usage:
fcs.py screen genome --help
Optional inputs for fcs.py screen genome:
Required inputs for fcs.py clean genome:
-
Genome sequence file:
--fasta
-
FCS-GX action report *.fcs_gx_report.txt: Final contamination report with contaminant cleaning actions.
-
Usage:
fcs.py clean genome --help
Optional inputs for any fcs.py command:
-
--no-report-analytics: disable usage reporting
The genome sequence file should be provided in FASTA format, optionally compressed with gzip. There is currently no support for running FCS-GX on FASTQ-formatted reads directly.
Each sequence in the file must have a definition line beginning with '>' and a unique identifier (SeqID), e.g., >contig001 or >contig002. The SeqIDs should:
- Be less than 50 characters long
- Only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
- Be unique within a genome
The sequences themselves are subject to the following:
- All sequences must be less than 2 Gbp.
- Stretches of 10 Ns or more will be split during FCS-GX alignment steps.
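The SeqID and sequence-length rules above can be checked before a run. The following is a minimal sketch, not part of FCS-GX: the function name `check_fasta` and the exact wording of the messages are illustrative assumptions, and FCS-GX itself may apply these rules differently.

```python
import re

# Allowed SeqID characters per the rules above: letters, digits, - _ . : * #
SEQID_PATTERN = re.compile(r"[A-Za-z0-9_.:*#-]+")
MAX_SEQID_LEN = 50            # SeqIDs must be shorter than 50 characters
MAX_SEQ_LEN = 2_000_000_000   # each sequence must be shorter than 2 Gbp


def check_fasta(lines):
    """Return a list of problems found in an iterable of FASTA lines.

    Hypothetical helper for illustration; not an FCS-GX function.
    """
    problems = []
    seen_ids = set()
    seq_lengths = {}   # record label -> accumulated sequence length
    current = None     # label of the record being read

    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The SeqID is the text between '>' and the first whitespace.
            seqid = re.split(r"\s", line[1:], maxsplit=1)[0]
            if not SEQID_PATTERN.fullmatch(seqid):
                problems.append(f"line {line_no}: empty SeqID or disallowed characters in {seqid!r}")
            if len(seqid) >= MAX_SEQID_LEN:
                problems.append(f"line {line_no}: SeqID {seqid!r} is not under 50 characters")
            if seqid in seen_ids:
                problems.append(f"line {line_no}: duplicate SeqID {seqid!r}")
            seen_ids.add(seqid)
            current = f"{seqid} (line {line_no})"
            seq_lengths[current] = 0
        elif current is None:
            problems.append(f"line {line_no}: sequence data before any '>' definition line")
        else:
            seq_lengths[current] += len(line)

    for label, length in seq_lengths.items():
        if length >= MAX_SEQ_LEN:
            problems.append(f"{label}: sequence is not under 2 Gbp")
    return problems
```

For a gzip-compressed file, the lines can be supplied with `gzip.open(path, "rt")` from the standard library; an empty returned list means no problems were detected by this sketch.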
How to find your organism tax-id of interest in NCBI taxonomy:
- Go to NCBI Taxonomy.
- Enter the organism name in the search box and hit 'Enter'
- Click on the species hyperlink
- The NCBI Taxonomy ID is listed near the top of the page
- If the exact species is unavailable, search for a similar species or genus. Using these tax-ids will allow FCS-GX to set the proper source organism.
Note: providing a high-level tax-id (e.g., --tax-id=33208) to set multiple FCS-GX divisions as primary is not supported.
There are two FCS-GX databases:
- The small 'test-only' database should only be used for testing whether FCS-GX is set up properly.
- The complete 'all' database should be used for normal FCS-GX runs.
s5cmd is a very fast downloading tool for S3 objects. The GX database is available as S3 objects in the us-east-1 region. It takes approximately five minutes to download the entire GX database using this tool in a cloud VM (e.g., n2d-highmem-64 in us-central1-a, or r6i.24xlarge in us-east-2).
A copy of the GX database is hosted on Amazon Web Services (AWS) under the Open Data Sponsorship Program (ODP) with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative. For more information, please refer to the Registry of Open Data on AWS.
To download the FCS-GX database, do the following:
-
Download and install s5cmd:
curl -LO https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz
tar -xvf s5cmd_2.0.0_Linux-64bit.tar.gz ./s5cmd
-
Create space in tmpfs for the database files:
sudo mkdir /my_tmpfs
sudo mount -t tmpfs tmpfs /my_tmpfs -o size=470G
-
Copy the GX test database from the S3 bucket to tmpfs:
./s5cmd --no-sign-request cp --part-size 50 --concurrency 50 s3://ncbi-fcs-gx/gxdb/test-only/test-only.* /my_tmpfs/test-only/
-
Copy the complete GX database from S3 to tmpfs:
./s5cmd --no-sign-request cp --part-size 50 --concurrency 50 s3://ncbi-fcs-gx/gxdb/latest/all.* /my_tmpfs/gxdb/
-
Run the following command to download the 'test-only' db. fcs.py will create the necessary directory under the LOCAL_DB path.
SOURCE_DB_MANIFEST="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only/test-only.manifest"
LOCAL_DB="/path/to/db/folder"
python3 fcs.py db get --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/test-only"
-
Run the following command to download the 'all' db.
⚠️ Make sure the directory name is different for 'all' and 'test-only':
SOURCE_DB_MANIFEST="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest"
LOCAL_DB="/path/to/db/folder"
python3 fcs.py db get --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
Note: fcs.py can download the database from either of two sources. The first is the NCBI FTP site, used in the examples above. The second is the public S3 bucket. To use the S3 bucket, set SOURCE_DB_MANIFEST as follows:
-
For 'test-only' db:
SOURCE_DB_MANIFEST="https://ncbi-fcs-gx.s3.amazonaws.com/gxdb/test-only/test-only.manifest"
-
For 'all' db:
SOURCE_DB_MANIFEST="https://ncbi-fcs-gx.s3.amazonaws.com/gxdb/latest/all.manifest"
-
-
You may check if the 'all' database is downloaded successfully to $LOCAL_DB:
ls "$LOCAL_DB/gxdb"
all.README.txt all.assemblies.tsv all.blast_div.tsv.gz all.gxi all.gxs all.manifest all.meta.jsonl all.seq_info.tsv.gz all.taxa.tsv
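The same listing can be checked programmatically. This is a minimal sketch that assumes the nine file names shown in the listing above are the complete expected set; that set may differ between database releases, and the helper name `missing_db_files` is hypothetical.

```python
from pathlib import Path

# File suffixes taken from the 'ls' listing above (assumed complete for this sketch).
EXPECTED_SUFFIXES = [
    "README.txt",
    "assemblies.tsv",
    "blast_div.tsv.gz",
    "gxi",
    "gxs",
    "manifest",
    "meta.jsonl",
    "seq_info.tsv.gz",
    "taxa.tsv",
]


def missing_db_files(db_dir, prefix="all"):
    """Return the expected database file names that are absent from db_dir."""
    db_dir = Path(db_dir)
    return [
        f"{prefix}.{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (db_dir / f"{prefix}.{suffix}").is_file()
    ]
```

An empty list means every file in the listing is present. This only confirms that the files exist; it says nothing about whether their contents are intact.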
-
Caching the database
FCS-GX requires the database to be available in RAM. If you do not have access to a tmpfs- or ramfs-backed filesystem, you can skip this (Caching the database) step. FCS-GX will still require a high-memory server, but will compensate by memory-mapping the database at the beginning of the run and thereby caching it to memory on the go. While it may take a little extra time, it won't require sudo permissions.
If you do have access to a tmpfs- or ramfs-backed filesystem, e.g., /dev/shm, you can copy the downloaded databases to RAM to ensure they are available in successive runs on the same server. You can use the 'db get' command to copy files locally between disk and RAM. To do this, create a space in tmpfs:
sudo mkdir /my_tmpfs
sudo mount -t tmpfs tmpfs /my_tmpfs -o size=470G
Next, run the following command to cache the 'test-only' db:
python3 fcs.py db get --mft "$LOCAL_DB/test-only/test-only.manifest" --dir /my_tmpfs/test-only
Run the following command to cache the 'all' db:
python3 fcs.py db get --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
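Before caching, it can help to confirm that the target filesystem has room for the database. The sketch below is an assumption-laden convenience, not something fcs.py provides: the helper names are hypothetical, and it only compares the total size of the files in the downloaded directory against the free space reported for the target.

```python
import shutil
from pathlib import Path


def db_size_bytes(db_dir):
    """Total size in bytes of the regular files directly inside db_dir."""
    return sum(p.stat().st_size for p in Path(db_dir).iterdir() if p.is_file())


def fits_in(db_dir, target_dir):
    """True if the files in db_dir would fit in the free space of target_dir."""
    free = shutil.disk_usage(target_dir).free
    return db_size_bytes(db_dir) <= free
```

For example, `fits_in("/path/to/db/folder/gxdb", "/my_tmpfs")` would compare the downloaded 'all' database against the tmpfs mount created above.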
-
Check the integrity of the database
Before running FCS-GX, it is a good idea to check the integrity of the database and to check whether the source db has changed.
Run the following to check if there are any differences between the source 'all' db and the downloaded 'all' db:
python3 fcs.py db check --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
You can also check if there are any differences between the downloaded 'all' db and the cached 'all' db:
python3 fcs.py db check --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
If you see differences in the files, you can run 'db get' with the same parameters for --mft and --dir as above. This will import the files that are different.
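As a rough, independent sanity check between two copies of the database (for example the downloaded and cached directories), file names and sizes can be compared directly. This sketch is not how 'db check' works internally, which is not described here; the helper names are hypothetical, and matching sizes do not prove that file contents are identical.

```python
from pathlib import Path


def file_sizes(directory):
    """Map each regular file name in directory to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(directory).iterdir() if p.is_file()}


def size_differences(source_dir, copy_dir):
    """Return sorted names of files that are missing from one side or differ in size."""
    src = file_sizes(source_dir)
    dst = file_sizes(copy_dir)
    return sorted(name for name in src.keys() | dst.keys() if src.get(name) != dst.get(name))
```

An empty result suggests the two directories hold the same set of files with the same sizes; 'db check' remains the authoritative comparison.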
--out-basename
: Output files' basename. Default is {fasta-basename}.{tax-id}.
--out-dir
: Output directory. (default: .)
--generate-logfile
: Redirect stdout and stderr to {out-basename}.summary.txt. (default: False)
--debug
: Produce a more verbose logfile for troubleshooting
The --env-file parameter points to a file of optional environment variables that control some of the features in FCS-GX.
-
GX_NUM_CORES
controls the number of CPU cores.
-
GX_ALIGN_EXCLUDE_TAXA
excludes alignment to particular tax-ids. Multiple tax-ids may be provided as a comma-separated list. ⚠️ This only works for the exact tax-ids explicitly in the GX database. For instance, setting GX_ALIGN_EXCLUDE_TAXA=33208 will not exclude all metazoan hits.
-
GX_EXTRA_CONTAM_DIVS
identifies additional contaminants for a given gx division when screening a metagenome. Multiple divisions may be provided as a comma-separated list.
-
GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD
defines the percentage cutoff for ignoring prokaryote-in-prokaryote contamination, i.e., GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD=5 means that identified prokaryote contamination from a division that is less than 5% of the total genome size will be reported as REVIEW_RARE and will not be automatically cleaned by fcs.py clean genome. The default is 1%.
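The percentage cutoff described for GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD comes down to simple arithmetic. The sketch below only illustrates that comparison under the description above; the function name is hypothetical, and the actual FCS-GX decision may take other factors into account.

```python
def below_same_kingdom_threshold(contam_bp, genome_bp, threshold_pct=1.0):
    """True if contamination from one division is under threshold_pct of the genome.

    contam_bp:     total length (bp) attributed to the contaminant division
    genome_bp:     total genome size (bp)
    threshold_pct: cutoff in percent; 1.0 mirrors the documented default
    """
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return 100.0 * contam_bp / genome_bp < threshold_pct
```

For a 1 Mbp genome with 40 kbp of same-kingdom contamination (4%), the check is true with a threshold of 5 and false with the default of 1.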
For example, to screen a genome using 8 CPU cores while excluding alignments to Toxoplasma gondii, use the following:
cat env.txt
GX_NUM_CORES=8
GX_ALIGN_EXCLUDE_TAXA=5811
python3 ./fcs.py --env-file env.txt screen genome --fasta ./GCA_000006565.2_TGA4_genomic.fna.gz --out-dir ./gx_out/ --gx-db "$LOCAL_DB/gxdb" --tax-id 508771 --generate-logfile T
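When many genomes are screened, the env file and command line can be generated from a script. This is a minimal sketch using only the flags that appear in this document; the helper names `write_env_file` and `screen_genome_command` are hypothetical, and the script path `./fcs.py` is an assumption taken from the example above.

```python
from pathlib import Path


def write_env_file(path, settings):
    """Write KEY=VALUE lines (one per setting) to path and return the text written."""
    text = "".join(f"{key}={value}\n" for key, value in settings.items())
    Path(path).write_text(text)
    return text


def screen_genome_command(fasta, tax_id, gx_db, out_dir=".", env_file=None,
                          fcs_script="./fcs.py"):
    """Build the argument list for 'fcs.py screen genome' from documented flags."""
    cmd = ["python3", fcs_script]
    if env_file is not None:
        # --env-file is placed before the 'screen genome' subcommand, as in the example.
        cmd += ["--env-file", str(env_file)]
    cmd += [
        "screen", "genome",
        "--fasta", str(fasta),
        "--tax-id", str(tax_id),
        "--gx-db", str(gx_db),
        "--out-dir", str(out_dir),
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with paths.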
To screen a metagenome and identify additional contaminants from the "anml:insects" division, use the following:
cat env.txt
GX_EXTRA_CONTAM_DIVS="anml:insects"
python3 ./fcs.py --env-file env.txt screen genome --fasta ../GCA_022076465.1_ASM2207646v1_genomic.fna.gz --tax-id=1163772 --out-dir ./gx_out/ --gx-db "$LOCAL_DB/gxdb" --generate-logfile T
Please create an Issue if you encounter any problems.
For all other questions or comments, please contact us at [email protected]