FCS GX input

Eric Tvedte edited this page Apr 17, 2024 · 2 revisions

Required inputs for fcs.py screen genome:

Optional inputs for fcs.py screen genome:

Required inputs for fcs.py clean genome:

Optional inputs for any fcs.py command:

  • --no-report-analytics: disable usage reporting

Genome sequence file

The genome sequence file should be provided in FASTA format, optionally compressed with gzip. There is currently no support for running FCS-GX on FASTQ-formatted reads directly.

Definition lines

Each sequence in the file must have a definition line that begins with '>' followed by a unique identifier (SeqID), e.g., >contig001 or >contig002. The SeqIDs should:

  • Be less than 50 characters long
  • Only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
  • Be unique within a genome
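The SeqID rules above can be checked before a run with a short script. The following is a minimal sketch, not part of FCS-GX: the function names are hypothetical, and it assumes that the SeqID is the first whitespace-delimited token after '>', which is the usual FASTA convention but is not stated on this page.

```python
import re

# Characters permitted by the rules above: letters, digits, - _ . : * #
SEQID_PATTERN = re.compile(r"[A-Za-z0-9\-_.:*#]+")
MAX_SEQID_LENGTH = 50  # SeqIDs must be shorter than this


def seqid_from_defline(defline):
    """Return the SeqID, assumed to be the first token after '>'."""
    if not defline.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    tokens = defline[1:].split()
    return tokens[0] if tokens else ""


def find_seqid_problems(deflines):
    """Return (seqid, problem) tuples for the given definition lines."""
    problems = []
    seen = set()
    for defline in deflines:
        seqid = seqid_from_defline(defline)
        if len(seqid) >= MAX_SEQID_LENGTH:
            problems.append((seqid, "too long"))
        if not SEQID_PATTERN.fullmatch(seqid):
            problems.append((seqid, "invalid characters"))
        if seqid in seen:
            problems.append((seqid, "duplicate"))
        seen.add(seqid)
    return problems
```

To use it on a real file, collect the lines that start with '>' (opening the file with `gzip.open(path, "rt")` if it is gzip-compressed) and pass them to `find_seqid_problems`; an empty list means no rule was violated.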

Genome sequences

  • All sequences must be less than 2 Gbp.
  • Stretches of 10 Ns or more will be split during FCS-GX alignment steps.
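To make the two sequence rules concrete, here is a small illustrative sketch. It is not how FCS-GX itself is implemented, which this page does not describe. The function names are hypothetical, "2 Gbp" is read as 2 x 10^9 bases, and treating lowercase 'n' the same as 'N' is an assumption.

```python
import re

MAX_SEQUENCE_LENGTH = 2_000_000_000  # "2 Gbp", read here as 2 x 10^9 bases
N_RUN = re.compile(r"[Nn]{10,}")     # a stretch of 10 or more Ns


def is_short_enough(sequence_length):
    """Return True if a sequence of this length is under the 2 Gbp limit."""
    return sequence_length < MAX_SEQUENCE_LENGTH


def split_on_n_runs(sequence):
    """Split a sequence at every run of 10 or more Ns, dropping empty pieces."""
    return [piece for piece in N_RUN.split(sequence) if piece]
```

A run of nine Ns leaves the sequence in one piece, while a run of ten or more yields separate pieces on either side of the gap.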

Taxonomy identifier

How to find your organism tax-id of interest in NCBI taxonomy:

  1. Go to NCBI Taxonomy.
  2. Enter the organism name in the search box and press Enter.
  3. Click on the species hyperlink.
  4. The NCBI Taxonomy ID is listed near the top of the page.
  5. If the exact species is unavailable, search for a similar species or genus. Using these tax-ids will allow FCS-GX to set the proper source organism.

⚠️ Setting a high-level taxonomic rank (e.g., Metazoa, --tax-id=33208) in order to treat multiple FCS-GX divisions as primary is not supported.

FCS-GX database location

There are two FCS-GX databases:

  1. The small 'test-only' database should only be used for testing whether FCS-GX is set up properly.
  2. The complete 'all' database should be used for normal FCS-GX runs.

⚠️ Keep the two databases in separate folders.

For Cloud users:

A copy of the GX database is hosted on Amazon Web Services (AWS) as S3 objects in the us-east-1 region, under the Open Data Sponsorship Program (ODP) with support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative. For more information, please refer to the Registry of Open Data on AWS.

s5cmd is a very fast download tool for S3 objects. With it, downloading the entire GX database to a cloud VM (e.g., n2d-highmem-64 in us-central1-a, or r6i.24xlarge in us-east-2) takes approximately five minutes. To download the FCS-GX database, do the following:

  1. Download and install s5cmd:

    curl -LO https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz
    tar -xvf s5cmd_2.0.0_Linux-64bit.tar.gz
    ./s5cmd
    
  2. Create space in tmpfs for the database files:

    sudo mkdir /my_tmpfs
    sudo mount -t tmpfs tmpfs /my_tmpfs -o size=470G
    
  3. Copy the GX test database from the S3 bucket to tmpfs:

    ./s5cmd --no-sign-request cp --part-size 50 --concurrency 50 's3://ncbi-fcs-gx/gxdb/test-only/test-only.*' /my_tmpfs/test-only/
    
  4. Copy the complete GX database from S3 to tmpfs:

    ./s5cmd --no-sign-request cp --part-size 50 --concurrency 50 's3://ncbi-fcs-gx/gxdb/latest/all.*' /my_tmpfs/gxdb/
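Before copying the database, it can help to confirm that the target filesystem has enough free space. The sketch below uses only the Python standard library. The function name is hypothetical, and because the size of the database is not stated on this page, the required size is left as a parameter.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes
```

For example, `has_free_space("/my_tmpfs", needed)` would be called after the mount step, with `needed` set to the database size in bytes that you expect to copy.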
    

For non-cloud users:

  1. Run the following command to download the 'test-only' db. fcs.py will create the necessary directory under the LOCAL_DB path:

    SOURCE_DB_MANIFEST="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only/test-only.manifest"
    LOCAL_DB="/path/to/db/folder"
    python3 fcs.py db get --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/test-only" 
    
  2. Run the following command to download the 'all' db.

    ⚠️ Make sure the directory name is different for 'all' and 'test-only':

    SOURCE_DB_MANIFEST="https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest"
    LOCAL_DB="/path/to/db/folder"
    python3 fcs.py db get --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb" 
    

    Note: fcs.py can download the database from two sources. The first is the NCBI FTP site, as used in the examples above. The second is the public S3 bucket. To use the S3 bucket, set SOURCE_DB_MANIFEST as follows:

    • For 'test-only' db:

      SOURCE_DB_MANIFEST="https://ncbi-fcs-gx.s3.amazonaws.com/gxdb/test-only/test-only.manifest"

    • For 'all' db:

      SOURCE_DB_MANIFEST="https://ncbi-fcs-gx.s3.amazonaws.com/gxdb/latest/all.manifest"

  3. To confirm that the 'all' database was downloaded successfully, list the contents of $LOCAL_DB/gxdb:

    ls "$LOCAL_DB/gxdb"
    
     all.README.txt
     all.assemblies.tsv
     all.blast_div.tsv.gz
     all.gxi
     all.gxs
     all.manifest
     all.meta.jsonl
     all.seq_info.tsv.gz
     all.taxa.tsv
    
  4. Caching the database

    FCS-GX requires the database to be available in RAM. If you do not have access to a tmpfs- or ramfs-backed filesystem, you can skip this caching step. FCS-GX will still require a high-memory server, but it will memory-map the database at the beginning of the run and cache it in memory as the run proceeds. This may take a little extra time, but it does not require sudo permissions.

    If you do have access to a tmpfs- or ramfs-backed filesystem, e.g., /dev/shm, you can copy the downloaded databases to RAM so that they remain available for successive runs on the same server. The 'db get' command can copy files locally between disk and RAM. To do this, first create space in tmpfs:

     sudo mkdir /my_tmpfs
     sudo mount -t tmpfs tmpfs /my_tmpfs -o size=470G
    

    Next, run the following command to cache the 'test-only' db:

     python3 fcs.py db get --mft "$LOCAL_DB/test-only/test-only.manifest" --dir /my_tmpfs/test-only
    

    Run the following command to cache the 'all' db:

     python3 fcs.py db get --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
    
  5. Check the integrity of the database

    Before running FCS-GX, it is a good idea to check the integrity of the database and whether the source db has changed.

    Run the following to check if there are any differences between the source 'all' db and the downloaded 'all' db:

    python3 fcs.py db check --mft "$SOURCE_DB_MANIFEST" --dir "$LOCAL_DB/gxdb"
    

    You can also check if there are any differences between the downloaded 'all' db and the cached 'all' db:

    python3 fcs.py db check --mft "$LOCAL_DB/gxdb/all.manifest" --dir /my_tmpfs/gxdb
    

    If you see differences in the files, rerun 'db get' with the same --mft and --dir values as above. This will import the files that are different.
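As a quick complement to the listing in step 3, the presence of the expected files can be checked with a script. This is a minimal sketch with a hypothetical function name. The file suffixes are taken from the example listing above, and using them with any prefix other than 'all' is an assumption. It only tests that the files exist, so it does not replace 'db check'.

```python
from pathlib import Path

# File suffixes taken from the example 'ls' listing for the 'all' database.
EXPECTED_SUFFIXES = [
    "README.txt", "assemblies.tsv", "blast_div.tsv.gz", "gxi", "gxs",
    "manifest", "meta.jsonl", "seq_info.tsv.gz", "taxa.tsv",
]


def missing_db_files(db_dir, prefix="all"):
    """Return the expected database file names that are absent from db_dir."""
    db_path = Path(db_dir)
    expected = [f"{prefix}.{suffix}" for suffix in EXPECTED_SUFFIXES]
    return [name for name in expected if not (db_path / name).is_file()]
```

An empty list from `missing_db_files("/path/to/db/folder/gxdb")` means every file in the example listing is present.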

Other input parameters

--out-basename: Basename for output files. (default: {fasta-basename}.{tax-id})

--out-dir: Output directory. (default: .)

--generate-logfile: Redirect stdout and stderr to {out-basename}.summary.txt. (default: False)

--debug: Produce a more verbose logfile for troubleshooting.

Environment variables

The --env-file parameter points to a file of optional environment variables that control some FCS-GX features.

  • GX_NUM_CORES controls the number of CPU cores.

  • GX_ALIGN_EXCLUDE_TAXA excludes alignment to particular tax-ids. Multiple tax-ids may be provided as a comma-separated list.

    ⚠️ This only works for tax-ids that appear explicitly in the GX database. For instance, setting GX_ALIGN_EXCLUDE_TAXA=33208 will not exclude all metazoan hits.

  • GX_EXTRA_CONTAM_DIVS identifies additional contaminants for a given GX division when screening a metagenome. Multiple divisions may be provided as a comma-separated list.

  • GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD defines the percentage cutoff for ignoring prokaryote-in-prokaryote contamination. For example, GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD=5 means that prokaryote contamination from a division that amounts to less than 5% of the total genome size will be reported as REVIEW_RARE and will not be automatically cleaned by fcs.py clean genome. The default is 1%.
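The threshold arithmetic described in the last item can be worked through with a short sketch. The function names are hypothetical and this is not FCS-GX code. It only computes whether a division's contamination falls below the cutoff, which per the description above is the case reported as REVIEW_RARE.

```python
def contamination_percent(contaminant_bases, genome_bases):
    """Return contamination from one division as a percentage of genome size."""
    return 100.0 * contaminant_bases / genome_bases


def is_below_threshold(contaminant_bases, genome_bases, threshold_percent=1.0):
    """Return True if the contamination is under the cutoff (default 1%)."""
    return contamination_percent(contaminant_bases, genome_bases) < threshold_percent
```

For a 1,000,000 bp genome with 40,000 bp of prokaryote contamination from one division, the contamination is 4%. That is below a threshold of 5, but not below the default of 1.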

For example, to screen a genome using 8 CPU cores while excluding alignments to Toxoplasma gondii, use the following:

cat env.txt
GX_NUM_CORES=8
GX_ALIGN_EXCLUDE_TAXA=5811

python3 ./fcs.py --env-file env.txt screen genome --fasta ./GCA_000006565.2_TGA4_genomic.fna.gz --out-dir ./gx_out/ --gx-db "$LOCAL_DB/gxdb" --tax-id 508771 --generate-logfile T

To screen a metagenome and identify additional contaminants from the "anml:insects" division, use the following:

cat env.txt
GX_EXTRA_CONTAM_DIVS="anml:insects"

python3 ./fcs.py --env-file env.txt screen genome --fasta ../GCA_022076465.1_ASM2207646v1_genomic.fna.gz --tax-id=1163772 --out-dir ./gx_out/ --gx-db "$LOCAL_DB/gxdb" --generate-logfile T
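When runs are scripted, the env file and the command line from the examples above can be assembled programmatically. The following is a minimal sketch with hypothetical helper names. It uses only the flags and variables that appear in the examples on this page and does not run anything itself.

```python
def format_env_file(variables):
    """Return env-file text with one NAME=value line per variable."""
    return "".join(f"{name}={value}\n" for name, value in variables.items())


def build_screen_command(fasta, tax_id, gx_db, out_dir, env_file=None):
    """Return the 'screen genome' command as an argument list."""
    command = ["python3", "./fcs.py"]
    if env_file is not None:
        # --env-file precedes the subcommand, as in the examples above.
        command += ["--env-file", env_file]
    command += [
        "screen", "genome",
        "--fasta", fasta,
        "--tax-id", str(tax_id),
        "--gx-db", gx_db,
        "--out-dir", out_dir,
    ]
    return command
```

The text from `format_env_file({"GX_NUM_CORES": 8, "GX_ALIGN_EXCLUDE_TAXA": 5811})` can be written to env.txt, and the list returned by `build_screen_command` can be passed to `subprocess.run(command, check=True)`.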