
Running PHoeNIx

Jill V. Hagey, PhD edited this page Mar 20, 2023 · 70 revisions

You should have already set up your config file so that Nextflow knows how to run the programs within PHoeNIx. If you haven't, please review the config setup section of the install page.

Input Parameters

The following are the possible parameters you can pass to PHoeNIx. You can get this screen by running:

nextflow run cdcgov/phoenix -r v1.0.0 --help

Pipeline Workflow

Note that for PHX v1.0.0 the output argument --outdir CANNOT be a relative path. If you want to put the output in the directory you are in, prepend $PWD to the directory name, like: --outdir $PWD/results. PHX >=1.1.0 allows relative paths for inputs and outputs.

Input: -entry PHOENIX or -entry CDC_PHOENIX

The full PHoeNIx pipeline (-entry PHOENIX or -entry CDC_PHOENIX) only runs on Illumina paired-end reads. Multiple samples can be run using a samplesheet.csv file.

nextflow run cdcgov/phoenix -profile <docker/singularity/custom> -entry PHOENIX --input samplesheet.csv --kraken2db $PATH_TO_DB

Samplesheet Input

You will need to create a samplesheet with information about the samples you would like to analyze before running the pipeline. Use the --input parameter to specify its location. It must be a comma-separated file (csv) with at least 3 columns and a header row, as shown in the example below. DO NOT HAVE ANY SPACES IN THIS FILE. Do make sure the paths are full paths and not relative. For best results use the automated samplesheet creation scripts described in the automated section below.

--input '[path to samplesheet file]'

Reads Samplesheet

The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below.

A final samplesheet file consisting of paired-end data may look something like the one below.

sample,fastq_1,fastq_2
SAMPLE_1,$PATH/AEG588A1_S1_L002_R1_001.fastq.gz,$PATH/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_2,$PATH/AEG588A2_S2_L002_R1_001.fastq.gz,$PATH/AEG588A2_S2_L002_R2_001.fastq.gz
SAMPLE_3,$PATH/AEG588A3_S3_L002_R1_001.fastq.gz,$PATH/AEG588A3_S3_L002_R2_001.fastq.gz

Column  | Description
sample  | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_).
fastq_1 | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".
fastq_2 | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".

An example samplesheet has been provided with the pipeline and can be used for testing.
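
The samplesheet rules described above can be checked before launching a run. The following is a minimal sketch and is not part of PHoeNIx: the function name check_samplesheet is hypothetical, and the "full path" test simply assumes POSIX-style paths that begin with /.

```python
import csv
import io

REQUIRED_HEADER = ["sample", "fastq_1", "fastq_2"]
VALID_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_samplesheet(text):
    """Return a list of problems found in the samplesheet text (empty if none)."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["samplesheet is empty"]
    # The first three header columns are strictly required; extra columns are allowed.
    if rows[0][:3] != REQUIRED_HEADER:
        problems.append(f"first three header columns must be {REQUIRED_HEADER}")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) < 3:
            problems.append(f"line {line_no}: fewer than 3 columns")
            continue
        if any(" " in field for field in row):
            problems.append(f"line {line_no}: contains a space")
        for path in row[1:3]:
            # Assumption: a full path is one that starts with "/".
            if not path.startswith("/"):
                problems.append(f"line {line_no}: not a full path: {path}")
            if not path.endswith(VALID_SUFFIXES):
                problems.append(f"line {line_no}: bad extension: {path}")
    return problems
```

To use it, read the file and print whatever comes back, for example `print(check_samplesheet(open("samplesheet.csv").read()))`. An empty list means none of the checks above failed; it does not confirm that the FastQ files exist.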

Automated Samplesheet Creation

A script is available to create a samplesheet from a directory of fastq files. The script will search 1 directory deep and attempt to determine sample id names and pairing/multilane information and will automatically create a samplesheet.

- Please review the samplesheet for accuracy before using it in the pipeline.
phoenix/bin/create_samplesheet.sh <directory of fastq files> > samplesheet.csv

You can change the name of the samplesheet.csv above to anything you want.

Outputs

Output file structure

The output of PHoeNIx is structured like the following:

📦results
┣ 📂SRR17250615
┃ ┣ 📂AMRFinder
┃ ┃ ┣ 📜SRR17250615_AMRFinder_Organism.csv
┃ ┃ ┣ 📜SRR17250615_all_mutations.tsv
┃ ┃ ┗ 📜SRR17250615_amr_genes.tsv
┃ ┣ 📂ANI
┃ ┃ ┣ 📂fastANI
┃ ┃ ┃ ┗ 📜SRR17250615.fastANI.txt
┃ ┃ ┣ 📂mash_dist
┃ ┃ ┃ ┣ 📜SRR17250615.txt
┃ ┃ ┃ ┗ 📜SRR17250615_best_MASH_hits.txt
┃ ┃ ┗ 📜SRR17250615.ani.txt
┃ ┣ 📂Assembly
┃ ┃ ┣ 📜SRR17250615.assembly.gfa.gz
┃ ┃ ┣ 📜SRR17250615.bbmap_filtered.log
┃ ┃ ┣ 📜SRR17250615.contigs.fa.gz
┃ ┃ ┣ 📜SRR17250615.filtered.scaffolds.fa.gz
┃ ┃ ┣ 📜SRR17250615.renamed.scaffolds.fa.gz
┃ ┃ ┣ 📜SRR17250615.scaffolds.fa.gz
┃ ┃ ┗ 📜SRR17250615.spades.log
┃ ┣ 📂BUSCO*
┃ ┃ ┣ 📜SRR17250615-auto-busco.batch_summary.txt
┃ ┃ ┣ 📜short_summary.generic.bacteria_odb10.SRR17250615.filtered.scaffolds.fa.json
┃ ┃ ┣ 📜short_summary.generic.bacteria_odb10.SRR17250615.filtered.scaffolds.fa.txt
┃ ┃ ┣ 📜short_summary.specific.enterobacterales_odb10.SRR17250615.filtered.scaffolds.fa.json
┃ ┃ ┗ 📜short_summary.specific.enterobacterales_odb10.SRR17250615.filtered.scaffolds.fa.txt
┃ ┣ 📂fastp_trimd
┃ ┃ ┣ 📜SRR17250615.fastp.html
┃ ┃ ┣ 📜SRR17250615.fastp.json
┃ ┃ ┣ 📜SRR17250615.singles.fastq.gz
┃ ┃ ┣ 📜SRR17250615_1.trim.fastq.gz
┃ ┃ ┣ 📜SRR17250615_2.trim.fastq.gz
┃ ┃ ┣ 📜SRR17250615_raw_read_counts.txt
┃ ┃ ┣ 📜SRR17250615_singles.fastp.html
┃ ┃ ┣ 📜SRR17250615_singles.fastp.json
┃ ┃ ┗ 📜SRR17250615_trimmed_read_counts.txt
┃ ┣ 📂fastqc
┃ ┃ ┣ 📜SRR17250615_1_fastqc.html
┃ ┃ ┣ 📜SRR17250615_1_fastqc.zip
┃ ┃ ┣ 📜SRR17250615_2_fastqc.html
┃ ┃ ┗ 📜SRR17250615_2_fastqc.zip
┃ ┣ 📂gamma_ar
┃ ┃ ┣ 📜SRR17250615_ResGANNCBI_20210507_srst2.gamma
┃ ┃ ┗ 📜SRR17250615_ResGANNCBI_20210507_srst2.psl
┃ ┣ 📂gamma_hv
┃ ┃ ┣ 📜SRR17250615_HyperVirulence_20220414.gamma
┃ ┃ ┗ 📜SRR17250615_HyperVirulence_20220414.psl
┃ ┣ 📂gamma_pf
┃ ┃ ┣ 📜SRR17250615_PF-Replicons_20220414.gamma
┃ ┃ ┗ 📜SRR17250615_PF-Replicons_20220414.psl
┃ ┣ 📂kraken2_asmbld*
┃ ┃ ┣ 📂krona
┃ ┃ ┃ ┣ 📜SRR17250615_asmbld.html
┃ ┃ ┃ ┗ 📜SRR17250615_asmbld.krona
┃ ┃ ┣ 📜SRR17250615.asmbld_summary.txt
┃ ┃ ┣ 📜SRR17250615.classified.fastq.gz
┃ ┃ ┣ 📜SRR17250615.kraken2_asmbld.classifiedreads.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_asmbld.report.txt
┃ ┃ ┣ 📜SRR17250615.mpa
┃ ┃ ┗ 📜SRR17250615.unclassified.fastq.gz
┃ ┣ 📂kraken2_asmbld_weighted
┃ ┃ ┣ 📂krona
┃ ┃ ┃ ┣ 📜SRR17250615_wtasmbld.html
┃ ┃ ┃ ┗ 📜SRR17250615_wtasmbld.krona
┃ ┃ ┣ 📜SRR17250615.kraken2_wtasmbld.report.txt
┃ ┃ ┗ 📜SRR17250615.wtasmbld_summary.txt
┃ ┣ 📂kraken2_trimd
┃ ┃ ┣ 📂krona
┃ ┃ ┃ ┣ 📜SRR17250615_trimd.html
┃ ┃ ┃ ┗ 📜SRR17250615_trimd.krona
┃ ┃ ┣ 📜SRR17250615.classified_1.fastq.gz
┃ ┃ ┣ 📜SRR17250615.classified_2.fastq.gz
┃ ┃ ┣ 📜SRR17250615.kraken2_trimd.classifiedreads.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_trimd.report.txt
┃ ┃ ┣ 📜SRR17250615.mpa
┃ ┃ ┣ 📜SRR17250615.trimd_summary.txt
┃ ┃ ┣ 📜SRR17250615.unclassified_1.fastq.gz
┃ ┃ ┗ 📜SRR17250615.unclassified_2.fastq.gz
┃ ┣ 📂mlst
┃ ┃ ┣ 📜SRR17250615.tsv
┃ ┃ ┣ 📜SRR17250615_combined.tsv
┃ ┃ ┗ 📜SRR17250615_srst2.mlst*
┃ ┣ 📂quast
┃ ┃ ┗ 📜SRR17250615_report.tsv
┃ ┣ 📂removedAdapters
┃ ┃ ┗ 📜SRR17250615.bbduk.log
┃ ┣ 📂srst2*
┃ ┃ ┗ 📜SRR17250615__fullgenes__ResGANNCBI_20210507_srst2__results.txt
┃ ┣ 📜SRR17250615.synopsis
┃ ┣ 📜SRR17250615.tax
┃ ┣ 📜SRR17250615_Assembly_ratio_20210819.txt
┃ ┣ 📜SRR17250615_GC_content_20210819.txt
┃ ┗ 📜SRR17250615_summaryline.tsv
┣ 📂multiqc
┃ ┣ 📂multiqc_data
┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.txt
┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.txt
┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_sequence_length_distribution_plot_1.txt
┃ ┃ ┣ 📜multiqc.log
┃ ┃ ┣ 📜multiqc_data.json
┃ ┃ ┣ 📜multiqc_fastqc.txt
┃ ┃ ┣ 📜multiqc_general_stats.txt
┃ ┃ ┗ 📜multiqc_sources.txt
┃ ┣ 📂multiqc_plots
┃ ┃ ┣ 📂pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1_pc.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.pdf
┃ ┃ ┃ ┗ 📜mqc_fastqc_sequence_length_distribution_plot_1.pdf
┃ ┃ ┣ 📂png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1_pc.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.png
┃ ┃ ┃ ┗ 📜mqc_fastqc_sequence_length_distribution_plot_1.png
┃ ┃ ┗ 📂svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1_pc.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.svg
┃ ┃ ┃ ┗ 📜mqc_fastqc_sequence_length_distribution_plot_1.svg
┃ ┗ 📜multiqc_report.html
┣ 📂pipeline_info
┃ ┣ 📜execution_report_2022-06-16_09-34-32.html
┃ ┣ 📜execution_timeline_2022-06-16_09-34-32.html
┃ ┣ 📜execution_trace_2022-06-16_09-34-32.txt
┃ ┣ 📜pipeline_dag_2022-06-16_09-34-32.svg
┃ ┣ 📜samplesheet.valid.csv
┃ ┗ 📜software_versions.yml
┗ 📜Phoenix_Output_Report.tsv

This is the file tree for running one sample. An asterisk (*) designates files and directories that are only generated when you run the pipeline with -entry CDC_PHOENIX.
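
As a quick sanity check after a run, the per-sample directories from the tree can be compared with what is actually on disk. This is only a sketch: the directory names are copied from the example tree for one sample, the helper names missing_dirs and list_subdirs are hypothetical, and other PHoeNIx versions may produce a different layout.

```python
from pathlib import Path

# Per-sample directories taken from the example file tree.
COMMON_DIRS = [
    "AMRFinder", "ANI", "Assembly", "fastp_trimd", "fastqc",
    "gamma_ar", "gamma_hv", "gamma_pf", "kraken2_asmbld_weighted",
    "kraken2_trimd", "mlst", "quast", "removedAdapters",
]
# Directories marked with * in the tree (only with -entry CDC_PHOENIX).
CDC_ONLY_DIRS = ["BUSCO", "kraken2_asmbld", "srst2"]


def list_subdirs(sample_dir):
    """Return the names of the directories directly inside sample_dir."""
    return [p.name for p in Path(sample_dir).iterdir() if p.is_dir()]


def missing_dirs(present, cdc_entry=False):
    """Return expected directory names that are absent from `present`."""
    expected = COMMON_DIRS + (CDC_ONLY_DIRS if cdc_entry else [])
    present = set(present)
    return [name for name in expected if name not in present]
```

For example, `missing_dirs(list_subdirs("results/SRR17250615"), cdc_entry=True)` would list any expected directory that was not created for that sample.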

Output File Overview

The following is an explanation of the files that are output:

  • ANI - Output of FastANI and Mash dist
  • AMRFinder - Output of AMRFinder: AR gene calls and point mutations
  • Assembly - Assembly output from SPAdes and filtering/header renaming steps.
  • BUSCO - Output from BUSCO run on scaffolds summarizing assembly completeness.
  • fastp_trimd - Output of raw reads filtering and stats for trimmed, raw and unpaired reads.
  • fastqc - Raw read QC
  • GAMMA
    • gamma_ar - Output of GAMMA hits from curated AR database
    • gamma_hv - Output of GAMMA hits from hypervirulence gene database
    • gamma_pf - Output of GAMMA-S hits from plasmid finder database
  • Kraken2
    • kraken2_trimd - Output of Kraken2 run on trimmed reads and Krona plots
    • kraken2_asmbld - Output of Kraken2 run on the assembly and Krona plots
    • kraken2_asmbld_weighted - Output of Kraken2 run on the assembly weighted by sequence length and Krona plots
  • mlst - Output of MLST scans for assembly files against traditional PubMLST typing schemes
  • quast - Assembly QC metrics
  • removedAdapters - Output of BBDUK step to remove adapters
  • srst2 - Output from SRST2 after mapping trimmed reads to a curated AR database.
  • Sample Specific Files - Files that summarize the results for a sample
  • Run Specific Files - A file that summarizes all samples in the run. A good first place to start.
  • MultiQC - Aggregate report describing results and FastQC from the whole pipeline
  • Pipeline information - Report metrics generated during the workflow execution

ANI

Output files
  • ANI/
    • *.ani.txt: Output of FastANI, comparing the assembly with the top 20 closest genomes (determined via the Mash distance). The first two columns name the query and reference genomes; the remaining columns are the ANI estimate, the number of sequence fragments that were aligned as orthologous matches, and the total number of sequence fragments from the assembly. For further details see the FastANI documentation.
    • ANI/fastANI
      • *.fastANI.txt: This is a reformatted version of *.ani.txt that lists matches in order of ANI and includes the top match information as the first line of the file, to be extracted in downstream processes for reporting.
    • ANI/mash_dist
      • *.txt: Output of mash dist. For further details see the Mash documentation.
      • *_best_MASH_hits.txt: A list of the top 20 matches found via mash dist that is passed to FastANI to calculate ANI.

FastANI is developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. FastANI avoids expensive sequence alignments and uses Mashmap as its MinHash-based sequence mapping engine to compute the orthologous mappings and alignment identity estimates.
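
The columns described for *.ani.txt can be read with a few lines of Python. This sketch assumes the file is standard FastANI output: tab-separated, no header row, with the columns query, reference, ANI, matched fragments, and total fragments. The function name parse_fastani and the aligned_pct field are hypothetical, and whether aligned_pct corresponds to the %COV value PHoeNIx reports is an assumption.

```python
def parse_fastani(text):
    """Parse FastANI tab-separated output into a list of dicts, best ANI first."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query, reference, ani, matched, total = line.split("\t")[:5]
        matched, total = int(matched), int(total)
        hits.append({
            "query": query,
            "reference": reference,
            "ani": float(ani),
            "matched_fragments": matched,
            "total_fragments": total,
            # Share of query fragments that mapped (assumed meaning of coverage).
            "aligned_pct": 100.0 * matched / total,
        })
    # Highest ANI first, mirroring the ordering described for *.fastANI.txt.
    return sorted(hits, key=lambda hit: hit["ani"], reverse=True)
```

Calling `parse_fastani(open("SRR17250615.ani.txt").read())[0]` would then give the closest reference genome by ANI.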

AMRFinder

Output files
  • AMRFinder/
    • *_AMRFinder_Organism.csv: This file just contains the organism (if found) to be passed to amrfinder using the --organism parameter. Read more about the organism option in AMRFinder's documentation.
    • *_all_mutations.tsv: File generated by passing the --mutation_all argument to AMRFinder. Read more about the mutation option in AMRFinder's documentation.
    • *_amr_genes.tsv: The AR gene calls by AMRFinder. Only the point mutations are reported in Phoenix_Output_Report.tsv.

AMRFinder and the accompanying database identify acquired antimicrobial resistance genes in bacterial protein and/or assembled nucleotide sequences as well as known resistance-associated point mutations for several taxa. AMRFinderPlus has added select members of additional classes of genes such as virulence factors, biocide, heat, acid, and metal resistance genes.

Assembly

Output files
  • Assembly/
    • *.assembly.gfa.gz: Contains SPAdes assembly graph and scaffolds paths in GFA 1.0 format
    • *.bbmap_filtered.log: The log file of bbmap, which is used to remove scaffolds shorter than 500 bp.
    • *.contigs.fa.gz: Contains contigs generated by SPAdes
    • *.filtered.scaffolds.fa.gz: Scaffolds file with sequences shorter than 500 bp removed.
    • *.renamed.scaffolds.fa.gz: Same as the *.filtered.scaffolds.fa.gz file, but headers contain the sample name.
    • *.scaffolds.fa.gz: Contains scaffolds generated by SPAdes
    • *.spades.log: SPAdes log

SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines. For further reading and documentation see the SPAdes manual. SPAdes scaffold files are used for downstream analysis.

BUSCO - only run with -entry CDC_PHOENIX

BUSCO output is based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs, thus the BUSCO metric is complementary to technical metrics like N50.

Output files
  • BUSCO/
    • *-auto-busco.batch_summary.txt:
    • short_summary.generic.*.filtered.scaffolds.fa.json: Contains a summary of the results in JSON form.
    • short_summary.generic.*.filtered.scaffolds.fa.txt: Contains a plain text summary of the results in BUSCO notation.
    • short_summary.specific.*.filtered.scaffolds.fa.json: Contains a summary of the results in JSON form.
    • short_summary.specific.*.filtered.scaffolds.fa.txt: Contains a plain text summary of the results in BUSCO notation.

For further reading and documentation see the BUSCO Users Guide.

Fastp

Output files
  • fastp_trimd/
    • *.fastp.html: Html output of fastp run on raw reads.
    • *.fastp.json: Same as the html output, just in json format
    • *.singles.fastq.gz: Unpaired reads that passed the QC filters when running fastp on the raw reads.
    • *_1.trim.fastq.gz: Forward reads from paired-end reads that passed the QC filters of fastp.
    • *_2.trim.fastq.gz: Reverse reads from paired-end reads that passed the QC filters of fastp.
    • *_raw_read_counts.txt: Parsed *.fastp.json on raw reads and custom stat calculations.
    • *_singles.fastp.html: Html output of fastp run on unpaired reads.
    • *_singles.fastp.json: Same as the html output, just in json format.
    • *_trimmed_read_counts.txt: Parsed *.fastp.json on trimmed reads and single reads with custom stat calculations.

FastP is a tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance. For further reading and documentation see the Fastp documentation.
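
The Q30 percentages that appear in the *.synopsis file can be approximated from a *.fastp.json report. The sketch below relies on the per-read sections that fastp writes (read1_before_filtering, read2_before_filtering, and the matching *_after_filtering sections), each holding total_bases and q30_bases. The function names are hypothetical, the 90%/70% thresholds are taken from the example synopsis, and the exact calculation and comparison PHoeNIx uses are assumptions.

```python
import json


def q30_percent(fastp_report, section):
    """Percent of bases at or above Q30 for one section of a fastp JSON report."""
    stats = fastp_report[section]
    return 100.0 * stats["q30_bases"] / stats["total_bases"]


def q30_status(percent, threshold):
    """SUCCESS if the percentage meets the threshold, otherwise WARNING (assumed rule)."""
    return "SUCCESS" if percent >= threshold else "WARNING"


def load_fastp(path):
    """Load a fastp JSON report from disk."""
    with open(path) as handle:
        return json.load(handle)
```

A typical call would be `q30_status(q30_percent(load_fastp("SRR17250615.fastp.json"), "read1_before_filtering"), 90)`.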

FastQC

Output files
  • fastqc/
    • *_fastqc.html: FastQC report containing quality metrics.
    • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

GAMMA

Output files
  • gamma_ar/
    • *_ResGANNCBI_*.gamma: Output of GAMMA that are the best matches from the curated AR gene database.
    • *_ResGANNCBI_*.psl: blat output in psl format
  • gamma_hv/
    • *_HyperVirulence_*.gamma: Output of GAMMA that are the best matches from the hypervirulence database.
    • *_HyperVirulence_*.psl: blat output in psl format
  • gamma_pf/
    • *_PF-Replicons_*.gamma: Output of GAMMA-S that are the best matches from the plasmid finder database without translating them.
    • *_PF-Replicons_*.psl: blat output in psl format

GAMMA (Gene Allele Mutation Microbial Assessment) is a command line tool that finds gene matches in microbial genomic data using protein coding (rather than nucleotide) identity, and then translates and annotates the match by providing the type (i.e., mutant, truncation, etc.) and a translated description (i.e., Y190S mutant, truncation at residue 110, etc.). Because microbial gene families often have multiple alleles and existing databases are rarely exhaustive, GAMMA is helpful in both identifying and explaining how unique alleles differ from their closest known matches. GAMMA-S (Gene Allele Mutation Microbial Assessment-Sequence) finds best matches from a gene database without translating them, so it finds the best match by nucleotide sequence rather than by the translated protein sequence. For further reading and documentation see GAMMA's GitHub.

Kraken2

Output files
  • kraken2_asmbld/
    • krona/
      • *_asmbld.html: Interactive hierarchical chart of kraken2's taxa calls on the assembly that can be viewed with any modern web browser.
      • *_asmbld.krona: Krona file used to make the *_asmbld.html file.
    • *.asmbld_summary.txt: The kraken2 best hit for the scaffolds.
    • *.classified.fastq.gz: The sequences that were able to be classified by kraken2.
    • *.kraken2_asmbld.classifiedreads.txt: Standard Kraken2 output on assembly scaffolds.
    • *.kraken2_asmbld.report.txt: Kraken2 report for assembly scaffolds.
    • *.mpa: Converted Kraken report style output to a mpa (MetaPhlAn)-style TEXT file. Used downstream to collect final stats.
    • *.unclassified.fastq.gz: The sequences that were unable to be classified by kraken2.
  • kraken2_asmbld_weighted/
    • krona/
      • *_wtasmbld.html: Interactive hierarchical chart of kraken2's taxa calls on the weighted assembly that can be viewed with any modern web browser.
      • *_wtasmbld.krona: Krona file used to make the *_wtasmbld.html file.
    • *.kraken2_wtasmbld.report.txt: Kraken2 report for weighted assembly.
    • *.wtasmbld_summary.txt: The kraken2 best hit for the weighted assembly.
  • kraken2_trimd/
    • krona/
      • *_trimd.html: Interactive hierarchical chart of kraken2's taxa calls on the trimmed reads that can be viewed with any modern web browser.
      • *_trimd.krona: Krona file used to make the *_trimd.html file.
    • *.classified_1.fastq.gz: The forward reads that were able to be classified by kraken2.
    • *.classified_2.fastq.gz: The reverse reads that were able to be classified by kraken2.
    • *.kraken2_trimd.classifiedreads.txt: Standard Kraken2 output on trimmed reads.
    • *.kraken2_trimd.report.txt: Kraken2 report for trimmed reads.
    • *.mpa: Converted Kraken report style output to a mpa (MetaPhlAn)-style TEXT file. Used downstream to collect final stats.
    • *.trimd_summary.txt: The kraken2 best hit for the trimmed reads.
    • *.unclassified_1.fastq.gz: The forward reads that were unable to be classified by kraken2.
    • *.unclassified_2.fastq.gz: The reverse reads that were unable to be classified by kraken2.

Kraken2 is a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. For further reading and documentation see Kraken2's GitHub. Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. The resulting interactive charts are self-contained and can be viewed with any modern web browser. For further reading and documentation see Krona's GitHub.
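
The Kraken2 report files (*.report.txt) can be scanned for genera that make up a large share of the sample, which is the idea behind the contamination lines in the *.synopsis file. The sketch assumes the standard tab-separated Kraken2 report layout, where the first column is the percentage, the third-from-last column is the rank code, and the last column is the indented name. The function name genera_above is hypothetical, the 25% default comes from the example synopsis, and how PHoeNIx applies that threshold internally is an assumption.

```python
def genera_above(report_text, threshold=25.0):
    """Return (genus, percent) pairs from a Kraken2 report above the threshold."""
    found = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # not a report row
        percent = float(fields[0])   # float() tolerates the leading padding
        rank = fields[-3]            # "G" marks a genus-level row
        name = fields[-1].strip()    # names are indented by taxonomic depth
        if rank == "G" and percent > threshold:
            found.append((name, percent))
    return found
```

If the returned list holds more than one genus, the sample may be worth a closer look for contamination.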

MLST

Output files
  • mlst/
    • All files will contain all schemes relevant to the identified taxonomy (e.g., Acinetobacter baumannii and Escherichia coli will have 2 schemes each)
    • *.tsv: Output of MLST that contains the filename, matching PubMLST scheme name, ST (sequence type), and allele IDs.
      • This output has the following allele markers:
        • '~' : full length novel allele
        • '?' : partial match (>min_cov & > min_ID). Default min_cov = 10, Default min_ID=95%
        • '-' : Allele is missing

Example output of a novel allele:

source_file  Database  ST  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9  locus_10
sample_A.filtered.scaffolds.fa   koxytoca        -       gapA(16)        infB(~28)       mdh(63) pgi(~37)        phoE(~7)        rpoB(20)        tonB(40?)

Example output of a partial allele match:

source_file  Database  ST  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9  locus_10
sample_B.filtered.scaffolds.fa   klebsiella      -     gapA(3) infB(3?) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)

Example output of missing allele:

source_file  Database  ST  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9  locus_10
sample_C.filtered.scaffolds.fa   klebsiella      -     gapA(3) infB(3) mdh(-)  pgi(1)  phoE(1) rpoB(1) tonB(79)
  • *_srst2.mlst: Output of srst2 MLST that contains Sample, database, ST, mismatches, uncertainty, depth, maxMAF as well as all loci for the sample/database.
    • This output has the following allele markers:
      • '*' : Full length match with 1+ SNP (Novel)
      • '?' : edge depth is below N or average depth is below X (Default edge_depth = 2, Default average_depth = 5)
      • '-' : No allele assigned, usually because no alleles achieved >90% coverage

Example of novel allele:

Sample  database        ST      mismatches      uncertainty     depth   maxMAF  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_D Klebsiella_pneumoniae   NF*     phoE_594/1snp   -       28.0804285714   0.25    gapA(3) infB(3) mdh(88) pgi(1)  phoE(594*)      rpoB(1) tonB(79)

Example output of low edge depth allele:

Sample  database        ST      mismatches      uncertainty     depth   maxMAF  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_E Klebsiella_pneumoniae   NF*?    gapA_178/31holes;mdh_88/19holes;pgi_1/1snp;phoE_594/2snp7holes;tonB_79/24holes  gapA_178/edge0.0;infB_3/edge1.0;mdh_88/edge0.0;pgi_1/edge1.0;phoE_594/edge1.0;tonB_79/edge0.0   4.65714285714   0.5 gapA(178*?)      infB(3?)        mdh(88*?)       pgi(1*?)        phoE(594*?)     rpoB(1) tonB(79*?)

Example output of missing allele:

Sample  database        ST      mismatches      uncertainty     depth   maxMAF  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_F Escherichia_coli#1      10      0       -       32.0842857143   0.0666666666667 adk(10) fumC(11)        gyrB(4) icd(8)  mdh(8)  purA(8) recA(2)
sample_F Escherichia_coli#2      NF       0       -       24.11425        0.1     dinB(8) icdA(-) pabB(7) polB(3) putP(7) trpA(1) trpB(4) uidA(2)
  • *_combined.tsv: Combines the output of MLST and srst2 MLST, if available, and simplifies the reasoning given when a type cannot be assigned.

Examples of what above entries look like when passed through clean up script:

Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_A standard/srst2  2023-01-17      koxytoca     Novel_allele      gapA(16) infB(~28) mdh(63) pgi(~37) phoE(~7) rpoB(20) tonB(40?)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_B standard/srst2  2023-01-17      klebsiella     Novel_allele     gapA(3) infB(3?) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_C standard/srst2  2023-01-17      klebsiella     Novel_allele     gapA(3) infB(3) mdh(-)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_D srst2   2023-01-17      klebsiella      Novel_allele    gapA(3) infB(3) mdh(88) pgi(1)  phoE(594*)      rpoB(1) tonB(79)
sample_D standard        2023-01-17      klebsiella      258     gapA(3) infB(3) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_E srst2   2023-01-17      klebsiella      Novel_allele    gapA(178*?)     infB(3?)        mdh(88*?)       pgi(1*?)        phoE(594*?)     rpoB(1) tonB(79*?)
sample_E standard        2023-01-17      klebsiella      258     gapA(3) infB(3) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_F standard/srst2  2023-01-17      ecoli_2(Pasteur)        Novel_allele       dinB(8) icdA(-) pabB(7) polB(3) putP(7) trpA(1) trpB(4) uidA(2)
sample_F standard/srst2  2023-01-17      ecoli(Achtman)  10      adk(10) fumC(11)        gyrB(4) icd(8)  mdh(8)  purA(8) recA(2)
- If both assembly MLST (MLST) and read MLST (srst2 MLST) are run and they disagree while still hitting the same database, their results are placed on separate lines. If they agree, the Source column (column 2) indicates that it is a match.

MLST scans assembly files against traditional PubMLST typing schemes. srst2_MLST scans read files against traditional PubMLST typing schemes.
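
The allele markers described in this section can be separated from the allele number with a small parser. This is a sketch only: parse_allele is a hypothetical helper, and it handles the single-allele fields shown in the examples (such as infB(~28) or gapA(178*?)), not every form the tools may emit.

```python
import re

# locus name, then the call inside parentheses, e.g. "infB(~28)"
ALLELE_RE = re.compile(r"^(?P<locus>[^()]+)\((?P<call>[^()]*)\)$")

# Marker meanings as listed for the MLST and srst2 outputs.
MARKERS = {
    "~": "full-length novel allele (MLST)",
    "*": "full-length match with 1+ SNP, novel (srst2)",
    "?": "partial match (MLST) or low depth (srst2)",
    "-": "allele missing or not assigned",
}


def parse_allele(field):
    """Split a field like 'tonB(40?)' into (locus, allele_number, markers)."""
    match = ALLELE_RE.match(field.strip())
    if match is None:
        raise ValueError(f"not an allele field: {field!r}")
    call = match.group("call")
    markers = [char for char in call if char in MARKERS]
    number = "".join(char for char in call if char.isdigit())
    # allele_number is None when only a marker such as '-' is present.
    return match.group("locus"), number or None, markers
```

A field with an empty marker list is a clean, exact allele call; any marker means the locus deserves a second look before trusting the ST.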

QUAST

Output files
  • quast/
    • *_report.tsv: tab-separated version of the summary, suitable for spreadsheets.

QUAST evaluates genome assemblies. For further reading and documentation see QUAST's Manual.

removedAdapters

Output files
  • removedAdapters/
    • *.bbduk.log: log file for the bbduk run.

BBDUK was developed to combine most common data-quality-related trimming, filtering, and masking operations into a single high-performance tool. For further reading and documentation see BBDUK's Manual.

srst2 - only run with -entry CDC_PHOENIX

Output files
  • srst2/
    • *__fullgenes__*__results.txt: Tab-delimited report of the AR genes that SRST2 detected when mapping the trimmed reads to the curated AR database. The output format is explained further in the SRST2 documentation.

SRST2 performs Short Read Sequence Typing for Bacterial Pathogens. For further reading and documentation see SRST2's GitHub.

Sample Specific Files

Output files
  • *.synopsis: This file contains a summary of stats for the sample and will provide warnings and alerts for metrics that don't meet quality standards. This is an example output:
---------- Checking SRR12352153 for successful completion on Fri Sep 16 15:27:15 EDT 2022 ----------
Summarized                    : SUCCESS  : Fri Sep 16 15:27:15 EDT 2022
FASTQs                        : SUCCESS  : R1: 165249420bps R2: 165400291bps
RAW_READ_COUNTS               : SUCCESS  : 1445464 individual reads found in sample (722732 paired reads)
RAW_Q30_R1%                   : WARNING  : Q30_R1% at 88% (Threshold is 90%)
RAW_Q30_R2%                   : SUCCESS  : Q30_R2% at 71% (Threshold is 70%)
TRIMMED_BPS                   : SUCCESS  : R1: 103339444bps R2: 84715304bps Unpaired: 28753074bps
TRIMMED_READ_COUNTS           : SUCCESS  : 1147210 individual reads found in sample (494219 paired reads, 158772 singled reads)
TRIMMED_Q30_R1%               : SUCCESS  : Q30_R1% at 98% (Threshold is 90%)
TRIMMED_Q30_R2%               : SUCCESS  : Q30_R2% at 96% (Threshold is 70%)
KRAKEN2_CLASSIFY_READS        : SUCCESS  : 32.86% Klebsiella pneumoniae with 1.63% unclassified reads
KRAKEN2_READS_CONTAM          : SUCCESS  : Only one genus has been found above the 25% threshold
ASSEMBLY                      : SUCCESS  : 195 scaffolds found
SCAFFOLD_TRIM                 : SUCCESS  : 123 scaffolds remain. 72 were removed due to shortness
KRAKEN2_CLASSIFY_WEIGHTED     : SUCCESS  : Klebsiella(97.59%) pneumoniae(97.28%) with 0.00% unclassified scaffolds
KRAKEN2_WEIGHTED_CONTAM       : SUCCESS  : Only one genus has been found above the 25% threshold
QUAST                         : SUCCESS  : #-123 length-5608577 n50-177421 %GC-57.19
QUAST_GC_Content              : SUCCESS  : %GC-57.19 is within 56.09796-58.11529 (2.58*0.39096stdevs) away from the mean of 57.10662.
TAXA-ANI_REFSEQ               : SUCCESS  : Klebsiella pneumoniae
ASSEMBLY_RATIO(SD)            : SUCCESS  : 1.0004x(.0074-SD) against K.pneumoniae
COVERAGE                      : ALERT    : 38.65x coverage based on trimmed reads (Target:40x)
FASTANI_REFSEQ                : SUCCESS  : 99.85%ID-94.76%COV-Klebsiella pneumoniae(Klebsiella_pneumoniae_GCF_001855315.1_ASM185531v1_genomic.fna.gz)
MLST-KLEBSIELLA               : SUCCESS  : ST147
GAMMA_AR                      : SUCCESS  : 26 AR gene(s) found from ResGANNCBI_20210507
AMRFINDER                     : SUCCESS  : 5 point mutation(s) found
PLASMID_REPLICONS             : SUCCESS  : 10 replicon(s) found from SRR12352153_PF-Replicons
HYPERVIRULENCE                : SUCCESS  : No hypervirulence genes were found from SRR12352153_HyperVirulence
Auto Pass/FAIL                : PASS     : Minimum Requirements met for coverage(30x)/ratio_stdev(<2.58)/min_length(>1000000) to pass auto QC filtering
---------- SRR12352153 completed as WARNING ----------
WARNINGS: out of line with what is expected and MAY cause problems downstream.
ALERT: something to note, does not mean it is a poor-quality assembly.
  • *.tax: This file contains the best taxa id. This is an example output:
(ANI_REFSEQ)-99.85%ID-94.76%COV-SRR12352153.fastANI.txt
D:	Bacteria
P:	Proteobacteria
C:	Gammaproteobacteria
O:	Enterobacterales
F:	Enterobacteriaceae
G:	Klebsiella
s:	pneumoniae
  • *_Assembly_ratio_20210819.txt: This file contains information on the assembly ratio and standard deviation for the sample. This is an example output:
Tax: Klebsiella pneumoniae
NCBI_TAXID: 573
Species_St.Dev: 264827
Isolate_St.Devs: .0074
Actual_length: 5608577
Expected_length: 5606613
Ratio: 1.0004
  • *_GC_content_20210819.txt: This file contains the GC content statistics (standard deviation, minimum, maximum, mean, and genome count) for the species assigned to the sample. This is an example output:
Tax: Klebsiella pneumoniae
NCBI_TAXID: 573
Species_GC_StDev: 0.3909562608027466
Species_GC_Min: 41.6
Species_GC_Max: 66.5
Species_GC_Mean: 57.10662490502695
Species_GC_Count: 11319
  • *_summaryline.tsv: This is a one line summary that contains the columns:
    • ID - The name of the sample ID, which is determined from the samplesheet.
    • Auto_QC_Outcome - The result of the Auto Pass/FAIL check, either PASS or FAIL.
    • Warning_Count - The number of warnings for the sample. Warnings can be viewed in the *.synopsis file.
    • Estimated_Coverage - Estimated coverage as determined by (total trimmed bases / assembly length)
    • Genome_Length - Length of the assembled genome in base pairs.
    • Assembly_Ratio_(STDev) - The calculated assembly ratio (assembly size / median genome size of species) with the sample's standard deviation. Standard deviation is only calculated when there are >=10 reference genomes for that taxa.
    • #of_Scaffolds>500bp - The number of scaffolds in the genome that are >500bp, those <500bp were filtered out of downstream analysis.
    • GC_% - % of G/C in the assembled genome.
    • Species - The Taxa determined by either FastANI or Kraken2.
    • Taxa_Confidence - The confidence value for the taxonomic assignment, which depends on the method used to determine taxa (FastANI, Kraken2_Weighted, or Kraken2_Trimd).
    • Taxa_Source - This column will say which method was used to determine taxonomy. PHoeNIx will assign taxonomy based on the best match from FastANI that compares genomes from RefSeq. If FastANI fails PHoeNIx will fall back on the taxonomic assignment from Kraken2_Weighted and if no assembly was created then it will use Kraken2_Trimd.
    • Kraken2_Trimd - Taxa determined by running kraken2 on the cleaned reads. The percent of reads per genus/species is presented in parentheses next to the respective taxa level.
    • Kraken2_Weighted - Taxonomic assignment based on the assembly (scaffolds); the % is generated by weighting the scaffolds by their length. The percent per genus/species is presented in parentheses next to the respective taxa level.
    • MLST_Scheme_1 - Primary MLST scheme used.
    • MLST_1 - Primary MLST alleles.
    • MLST_Scheme_2 - If there is a secondary scheme it will be listed here.
    • MLST_2 - If there was a secondary scheme then those MLST alleles are listed here.
    • GAMMA_Beta_Lactam_Resistance_Genes - GAMMA hits against our custom database that combines AMRFinderPlus, ARG-ANNOT, and ResFinder filtered to only report the beta lactam genes.
    • GAMMA_Other_AR_Genes - Same as above only non-beta lactam genes.
    • AMRFinder_Point_Mutations - Point mutations as determined by AMRFinderPlus.
    • Hypervirulence_Genes - GAMMA hits against the database of hypervirulence genes from Russo et al.
    • Plasmid_Incompatibility_Replicons - GAMMA hits against the PlasmidFinder database.
    • Auto_QC_Failure_Reason - The reason the sample failed auto QC.

The *_summaryline.tsv files from all samples are combined to create the full Phoenix_Output_Report.tsv.
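
Several of the QC numbers in the example files above can be reproduced by hand, which helps when interpreting them. The functions below are a sketch with hypothetical names, and the formulas are inferred from the example values and column descriptions rather than taken from the pipeline code. In particular, using the absolute difference for the standard deviation count and including unpaired bases in the coverage total are assumptions.

```python
def assembly_ratio(actual_length, expected_length):
    """Assembly size divided by the expected genome size for the species."""
    return actual_length / expected_length


def isolate_stdevs(actual_length, expected_length, species_stdev):
    """How many species standard deviations the assembly is from the expected size."""
    return abs(actual_length - expected_length) / species_stdev


def gc_window(mean, stdev, n_stdevs=2.58):
    """Accepted %GC range: the species mean plus or minus n_stdevs deviations."""
    return mean - n_stdevs * stdev, mean + n_stdevs * stdev


def estimated_coverage(total_trimmed_bases, assembly_length):
    """Total trimmed bases divided by the assembly length."""
    return total_trimmed_bases / assembly_length


# Values copied from the example outputs for SRR12352153.
ratio = assembly_ratio(5608577, 5606613)
stdevs = isolate_stdevs(5608577, 5606613, 264827)
low, high = gc_window(57.10662490502695, 0.3909562608027466)
coverage = estimated_coverage(103339444 + 84715304 + 28753074, 5608577)
print(round(ratio, 4), round(stdevs, 4), round(low, 5), round(high, 5))
```

With these inputs the ratio rounds to 1.0004, the deviation count to 0.0074, and the GC window to 56.09796 through 58.11529, in line with the example files. The coverage comes out at about 38.656, close to the 38.65x in the synopsis; the exact rounding used by the pipeline is not documented here.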

Run Specific Files

Output files
  • Phoenix_Output_Report.tsv: A file that is a combination of all *_summaryline.tsv files that is a good overview of the entire run.
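
Because Phoenix_Output_Report.tsv is tab-separated, it is easy to pull out the samples that failed auto QC. The sketch below assumes the header row uses the column names listed for *_summaryline.tsv (ID, Auto_QC_Outcome, Auto_QC_Failure_Reason); failed_samples is a hypothetical helper, not part of the pipeline.

```python
import csv
import io


def failed_samples(report_text):
    """Return (sample ID, failure reason) for every row marked FAIL."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [
        (row["ID"], row.get("Auto_QC_Failure_Reason", ""))
        for row in reader
        if row["Auto_QC_Outcome"] == "FAIL"
    ]
```

For a finished run, `failed_samples(open("results/Phoenix_Output_Report.tsv").read())` gives a short list to review first.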

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualized in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.