
Standalone_RAPT_doc

RAPT-release edited this page Oct 3, 2024 · 6 revisions

As of December 2024, NCBI's pilot tool, the Read Assembly and Annotation Pipeline Tool (RAPT), will no longer be available. We encourage you to check out NCBI’s suite of assembly and annotation tools, including the genome assembler SKESA, the taxonomic assignment tool ANI, and the Prokaryotic Genome Annotation Pipeline (PGAP).


This page contains the prerequisites and instructions for running Stand-alone RAPT on a local machine using the run_rapt.py Python interface. "Local" means the same machine on which run_rapt.py is downloaded and executed. It can be a physical machine on premises or, more conveniently, a cloud VM instance. Some basic knowledge of Unix/Linux commands, SKESA, and PGAP is useful. Please see our wiki page for References, Licenses and FAQs.

System Requirements

Stand-alone RAPT runs on the same machine where run_rapt.py is launched, so that machine must satisfy the following minimum requirements:

  • At least 4 GB memory per CPU core

  • At least 8 CPU cores and 32 GB memory

  • Linux OS preferred. Windows 10 (Pro or Enterprise edition) will also work, but extra configuration is required.

  • 100 GB free storage space on disk

  • Internet connection (see Additional information section below for firewall requirements)

  • Container runner installed (currently one of Docker/Podman/Singularity). Docker is recommended. Below is one way to install Docker on an Ubuntu 18.04 LTS machine. Your operating system may require different commands. Please visit Docker for details.

    ~$ sudo snap install docker
    ~$ sudo apt update
    ~$ sudo apt install -y docker.io
    ~$ sudo usermod -aG docker $USER
    ~$ exit 

    Log back in to your machine so that the new group membership takes effect, then run:

    ~$ docker run hello-world
    
  • On some systems you may need to increase ulimit -n. ulimit -n 8192 has worked well for us.

  • Python 3 installed. Below is one way to install Python 3 on a Debian/Ubuntu Linux machine:

    ~$ sudo apt install python3
    

You are now ready to run RAPT.
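The requirements listed above can be checked before the first run with a short script. The following is a minimal sketch, not part of RAPT: the function names `check_requirements` and `probe_local_machine` are hypothetical, the thresholds are copied from the list above, and memory and disk sizes are converted with 1 GB = 10^9 bytes, which is an assumption. The probe relies on `os.sysconf` and the `resource` module, so it is expected to work on Linux only.

```python
import os
import resource
import shutil

# Thresholds copied from the requirements list above.
MIN_CORES = 8
MIN_MEM_GB = 32
MIN_MEM_PER_CORE_GB = 4
MIN_FREE_DISK_GB = 100
SUGGESTED_OPEN_FILES = 8192  # "ulimit -n 8192 has worked well for us"
RUNNERS = ("docker", "podman", "singularity")


def check_requirements(cores, mem_gb, free_disk_gb, runners_found, open_files):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"{cores} CPU cores found, at least {MIN_CORES} required")
    if mem_gb < MIN_MEM_GB:
        problems.append(f"{mem_gb:.0f} GB memory found, at least {MIN_MEM_GB} required")
    if cores > 0 and mem_gb / cores < MIN_MEM_PER_CORE_GB:
        problems.append(f"less than {MIN_MEM_PER_CORE_GB} GB memory per CPU core")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"{free_disk_gb:.0f} GB free disk, at least {MIN_FREE_DISK_GB} required")
    if not runners_found:
        problems.append("no container runner (docker, podman, singularity) on PATH")
    if open_files < SUGGESTED_OPEN_FILES:
        problems.append(f"open-file limit is {open_files}; consider ulimit -n {SUGGESTED_OPEN_FILES}")
    return problems


def probe_local_machine(path="."):
    """Collect the values check_requirements() needs (Linux/Unix only)."""
    cores = os.cpu_count() or 0
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(path).free
    runners = [name for name in RUNNERS if shutil.which(name)]
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        soft_limit = SUGGESTED_OPEN_FILES  # an unlimited setting counts as sufficient
    return cores, mem_bytes / 1e9, free_bytes / 1e9, runners, soft_limit


if __name__ == "__main__":
    for problem in check_requirements(*probe_local_machine()):
        print("WARNING:", problem)
```

The checks are kept in a pure function so that they can be reused with numbers taken from a cloud VM specification rather than from the machine the script runs on. The open-file message is advisory, since the limit only needs raising on some systems.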

Get the RAPT command-line interface

  1. Log into the machine where you wish to run RAPT
  2. At your command prompt, download the latest release by executing the following commands:
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.5.5/rapt-v0.5.5.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz

Now you should have the run_rapt.py script in your current directory.

  3. Run ./run_rapt.py -h to see the Stand-alone RAPT usage information:
~$ ./run_rapt.py -h
usage: run_rapt.py [-h] [-a ACXN | -q FASTQ | -v | --test] [--organism ORGA]
                   [--strain STRAIN] [--skesa-only] [--no-usage-reporting]
                   [--stop-on-errors] [--auto-correct-tax] [-o OUTDIR]
                   [--refdata-dir REFDATA_HUB] [--pgap-ref PGAP_REF]
                   [--ani-ref ANI_REF] [-c MAXCPU] [-m MAXMEM]
                   [-D {docker,podman,singularity}] [-n DOCKER_NETWORK]

Read Assembly and Annotation Pipeline Tool (RAPT)

optional arguments:
  -h, --help            show this help message and exit
  -a ACXN, --submitacc ACXN
                        Run RAPT on an SRA run accession (sra_acxn).
  -q FASTQ, --submitfastq FASTQ
                        Run RAPT on Illumina reads in FASTQ or FASTA format.
                        The file must be readable from the computer that runs
                        RAPT. If forward and reverse readings are in two
                        separate files, specify as
                        "path/to/forward.fastq,path/to/reverse.fastq", or
                        "path/to/forward.fastq,reverse.fastq" if they are in
                        the same directory. The --organism argument is
                        mandatory for this type of input, while the --strain
                        argument is optional.
  -v, --version         Display the current RAPT version
  --test                Run a test suite. When RAPT does not produce the
                        expected results, it may be helpful to use this
                        command to ensure RAPT is functioning normally.
  --organism ORGA       Specify the binomial name or, if the species is
                        unknown, the genus for the sequenced organism. This
                        identifier must be valid in NCBI Taxonomy.
  --strain STRAIN       Specify the strain of the organism
  --skesa-only          Only assemble sequences to contigs, but do not
                        annotate.
  --no-usage-reporting  Prevents usage report back to NCBI. By default, RAPT
                        sends usage information back to NCBI for statistical
                        analysis. The information collected are a unique
                        identifier for the RAPT process, the machine IP
                        address, the start and end time of RAPT, and its three
                        modules: SKESA, taxcheck and PGAP. No personal or
                        project-specific information (such as the input data)
                        are collected
  --stop-on-errors      Do not run PGAP annotation pipeline when the genome
                        sequence is misassigned or contaminated
  --auto-correct-tax    If the genome sequence is misassigned or contaminated
                        and ANI predicts an organism with HIGH confidence, use
                        it for PGAP instead of the one provided by the user
  -o OUTDIR, --output-dir OUTDIR
                        Directory to store results and logs. If omitted, use
                        current directory
  --refdata-dir REFDATA_HUB
                        Specify a location to store reference data used by
                        RAPT. If omitted, use output directory
  --pgap-ref PGAP_REF   Full path to pre-downloaded PGAP reference data
                        tarball, if applicable. File is usually named like
                        input-<PGAP-BUILD>.prod.tgz
  --ani-ref ANI_REF     Full path to pre-downloaded ANI reference data
                        tarball, if applicable. File is usually named like
                        input-<PGAP-BUILD>.prod.ani.tgz
  -c MAXCPU, --cpus MAXCPU
                        Specify the maximal CPU cores the container should
                        use.
  -m MAXMEM, --memory MAXMEM
                        Specify the maximal memory (number in GB) the
                        container should use.
  -D {docker,podman,singularity}, --docker {docker,podman,singularity}
                        Use specified docker compatible program to run RAPT
                        image
  -n DOCKER_NETWORK, --network DOCKER_NETWORK
                        Specify the network the container should use. Note:
                        this parameter is passed directly to the --network
                        parameter to the container. RAPT does not check the
                        validity of the argument.
~$
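When RAPT is launched from another script, the argument rules in the usage text above can be enforced before the call. The sketch below is an illustration only: `build_rapt_command` is a hypothetical helper, it covers just the `-a`, `-q`, `--organism`, `--strain` and `-o` options, and it encodes two rules taken from the help output (`-a` and `-q` are mutually exclusive, and `--organism` is mandatory with `-q`).

```python
import shlex


def build_rapt_command(acxn=None, fastq=None, organism=None, strain=None,
                       outdir=None, script="./run_rapt.py"):
    """Build an argument list for run_rapt.py from the documented options."""
    # -a and -q sit in the same mutually exclusive group in the usage text.
    if (acxn is None) == (fastq is None):
        raise ValueError("specify exactly one of acxn (-a) or fastq (-q)")
    cmd = [script]
    if acxn is not None:
        cmd += ["-a", acxn]
    else:
        # The help text states that --organism is mandatory for FASTQ/FASTA input.
        if not organism:
            raise ValueError("--organism is mandatory when using -q")
        cmd += ["-q", fastq]
    if organism:
        cmd += ["--organism", organism]
    if strain:
        cmd += ["--strain", strain]
    if outdir:
        cmd += ["-o", outdir]
    return cmd


if __name__ == "__main__":
    cmd = build_rapt_command(fastq="path/to/myreads.fastq",
                             organism="Mycoplasma pirum", strain="ATCC 25960")
    # shlex.join (Python 3.8+) quotes arguments that contain spaces.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be passed to `subprocess.run` without a shell, so organism and strain names containing spaces need no extra quoting.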

Try an example

To run RAPT, you need the Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be located in a fasta or fastq file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA).
Important: 1. Only reads sequenced on Illumina machines can be used by RAPT. 2. The reads provided should be from a single isolate.

Starting from an SRA run

To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for Mycoplasma pirum.
This example takes about 20 minutes to complete on a 16-CPU 64 GB machine (time may vary depending on the configuration of the computer).
Run the following command:

~$ ./run_rapt.py -a SRR3496277
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/raptout_c2f0732658/verbose.log.
~$ 

All output files and logs will be located in a subdirectory named raptout_xxxxxxxxxx under the current directory, where xxxxxxxxxx is the RUNID generated by run_rapt.py and is unique to the execution. Please note that the duration of the job depends on the size of the genome and may be several hours.
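Because run_rapt.py returns to the prompt immediately, a wrapper script may want to recover the log location from the launch message. The sketch below assumes the message keeps the wording shown in the example above and that the path contains no spaces; `verbose_log_path` and `run_id` are hypothetical helper names.

```python
import re
from pathlib import PurePosixPath

# Matches the path that follows "track the verbose log file" in the launch message.
_LOG_RE = re.compile(r"track the verbose log file (\S+verbose\.log)")


def verbose_log_path(launcher_output):
    """Return the verbose.log path named in the launch message, or None."""
    match = _LOG_RE.search(launcher_output)
    return match.group(1) if match else None


def run_id(log_path):
    """Return the RUNID if the log sits in a raptout_<RUNID> directory, else None."""
    dirname = PurePosixPath(log_path).parent.name
    prefix = "raptout_"
    return dirname[len(prefix):] if dirname.startswith(prefix) else None
```

When an output directory is given with -o, the log sits in that directory instead, so `run_id` returns None in that case.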

Starting from fastq or fasta files

You can use fastq or fasta files for single- or paired-end reads produced by an Illumina sequencer as input to RAPT.

  • Data should be short reads, from an Illumina sequencer.
  • Both single- and paired-end reads are acceptable.
  • Paired-end reads can be provided in one file with the two reads of a pair adjacent to each other (interleaved) in the file, or as two files, with the forward reads in one file and the reverse reads in the other.
  • The files should be in fastq or fasta format.
  • The quality scores are not necessary.
  • The files may be gzipped.
  • The files should be on the local file system.
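A quick look at the first record can catch a wrong file before a long run is started. The sketch below is a rough check and not the validation RAPT itself performs: it assumes that FASTQ records start with `@` and FASTA records with `>`, detects gzip compression from the two-byte gzip magic number, and uses the hypothetical function name `sniff_read_format`.

```python
import gzip


def _open_text(path):
    """Open a plain or gzipped file in text mode, based on the gzip magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def sniff_read_format(path):
    """Return 'fastq', 'fasta', or None, judged from the first non-blank line."""
    with _open_text(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith("@"):
                return "fastq"
            if line.startswith(">"):
                return "fasta"
            return None
    return None  # empty file
```

The check does not tell Illumina reads from other technologies and does not verify that paired files contain matching reads.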

The organism name assigned to the reads needs to be provided on the command line. It can be the binomial name (genus and species), or the genus only if the species is not defined. The organism name must be known to the NCBI Taxonomy. The strain can optionally be provided with the --strain parameter.
Here is an example command, providing a single fastq file:

~$ ./run_rapt.py -q path/to/myreads.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960"
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/raptout_a2b2345678/verbose.log.
~$ 

Here is an example command, providing forward and reverse reads in separate gzipped fasta files, and specifying a name for the output directory:

~$ ./run_rapt.py -q path/to/myforwardreads.fasta.gz,myreversereads.fasta.gz --organism "Mycoplasma pirum" --strain "ATCC 25960" -o ATCC_25960_out
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/ATCC_25960_out/verbose.log.
~$ 

Useful options

  • --auto-correct-tax:
    When set: if the step that evaluates the taxonomic assignment of the assembly (see the Average Nucleotide Identity tool) determines with high confidence that the species that best fits your assembly is different from the name you provided on input, the annotation process will use the ANI-chosen scientific name, and the ANI-chosen name will appear in the output files. When not set (default): the annotation process will use the name provided on input, regardless of the results of ANI.

  • --stop-on-errors:
    When set: if the step that evaluates the taxonomic assignment of the assembly (see the Average Nucleotide Identity tool) determines with high confidence that the species that best fits your assembly is different from the name you provided on input, or that the assembly is contaminated, RAPT will stop after the taxonomy check and no annotation results will be produced.

Review the output

  • Assembly results:

    • skesa_out.fa: multifasta file of the assembled contigs produced by SKESA
    • assembly_stat_report.tsv: assembly statistics (contig count, base count, min contig length, max contig length, contig N50, contig L50)
  • Taxonomy verification (see more details):

    • ani-tax-report.txt: results of the taxonomy check on the assembled sequences reported by the Average Nucleotide Identity tool, in text format. The possible statuses are:
      CONFIRMED: The organism name associated with the input reads, and assigned to the genome, has been confirmed by ANI. The submitted genus matches the genus of the ANI-predicted organism.
      MISASSIGNED: The organism has been found to be misassigned to the genome. The submitted genus does not match the genus of the ANI-predicted organism.
      INCONCLUSIVE: The organism cannot be identified (due to lack of a close enough type assembly in GenBank)
      CONTAMINATED: The genome is contaminated with sequences from an organism other than the submitted organism.
    • ani-tax-report.xml: same as above, in XML format
  • PGAP annotation results in multiple formats (see a detailed description of the annotation output files):

    • annot.gbk: annotated genome in GenBank flat file format
    • annot.gff: annotated genome in GFF3 format
    • annot.sqn: annotated genome in ASN format
    • annot.faa: multifasta file of the proteins annotated on the genome
    • calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found.
  • CheckM completeness results:

    • checkm.txt: Annotated assembly completeness and contamination as calculated by CheckM. See a full description of the file format at this location. Note: 1) The CheckM calculation is performed on the proteins produced by PGAP, 2) the set of markers used by CheckM is determined by the species associated with the genome (as provided on input or as overridden by ANI).
  • Execution logs:

    • concise.log: log file containing the major events in a run
    • verbose.log: all messages, with time stamps
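After a run finishes, the file names listed above can be used to see which stages produced output. The following is a minimal sketch under two assumptions: all files sit directly in the output directory, and every listed file is expected (which would not hold for a --skesa-only run or a run stopped by --stop-on-errors). `missing_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# File names taken from the output list above, grouped by pipeline stage.
EXPECTED_OUTPUTS = {
    "assembly": ["skesa_out.fa", "assembly_stat_report.tsv"],
    "taxonomy": ["ani-tax-report.txt", "ani-tax-report.xml"],
    "annotation": ["annot.gbk", "annot.gff", "annot.sqn", "annot.faa", "calls.tab"],
    "checkm": ["checkm.txt"],
    "logs": ["concise.log", "verbose.log"],
}


def missing_outputs(outdir):
    """Map each stage to the expected files that are absent from outdir."""
    root = Path(outdir)
    missing = {}
    for stage, names in EXPECTED_OUTPUTS.items():
        absent = [name for name in names if not (root / name).is_file()]
        if absent:
            missing[stage] = absent
    return missing


if __name__ == "__main__":
    for stage, names in missing_outputs(".").items():
        print(f"{stage}: missing {', '.join(names)}")
```

An empty result means every listed file is present. It says nothing about the content of the files, so concise.log remains the place to look for errors.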

Additional information

Reference data

The default location for reference data is the current working directory. RAPT will detect whether the proper version of reference data is available and automatically download it if it is not found. Downloaded reference data are stored in a version-named sub-directory, such as input-2020-07-09.build4716, so that multiple versions of reference data can exist side-by-side. Users who run RAPT regularly may want to store the reference data in a dedicated location, in which case the --refdata-dir switch can be used to specify a location other than the current directory:

~$ ./run_rapt.py -q path/to/myreads.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960" --refdata-dir path/to/refdata-dir
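Since several versions of the reference data can sit side by side, it can be useful to see which ones are already present in a dedicated directory. The sketch below assumes only what is stated above, namely that each version lives in a sub-directory whose name starts with `input-`; `list_refdata_versions` is a hypothetical helper name.

```python
from pathlib import Path


def list_refdata_versions(refdata_dir):
    """Return the names of version-named reference data sub-directories, sorted."""
    root = Path(refdata_dir)
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and entry.name.startswith("input-")
    )
```

Names of the form input-2020-07-09.build4716 begin with an ISO date, so a plain sort also orders them chronologically.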

Advanced options

  • -D docker|podman|singularity
    If multiple runners are installed (highly discouraged), you can specify which runner to use. If a full path to the binary is provided, RAPT will use that path. Otherwise, the binary must be on your $PATH so that RAPT can find it.
  • -c MAXCPU, --cpus MAXCPU
    Specify the maximal number of CPUs the container should use.
  • -m MAXMEM, --memory MAXMEM
    Specify the maximal amount of memory (in GB) the container should use. Note: Singularity does not support dynamic resource limitation, so the -c and -m options have no effect when Singularity is the runner.
  • --network NETWORK
    Specify the network the container should use. Note: this option is only supported for Docker.

Network considerations (Docker-Specific)

Some users may find that their local default configuration of Docker cannot send or receive data. To test network function, first bring up an interactive shell, where xxxxx specifies the Docker image:

    > docker run -it xxxxx bash

Then check for network access from within this interactive Docker shell. One way to do that:

    > curl -o /dev/null https://www.ncbi.nlm.nih.gov ; echo $?

A response of 0 indicates successful retrieval; any other response indicates failure. In case of failure, the Docker networking documentation may be helpful. Some users may find that using the host network is workable on a local machine without other Docker instances running (that is, invoking docker run --network host -it xxxxx bash). Once a working network has been identified or configured, it can be specified using the --network advanced option to run_rapt.py, described above.
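The same retrieval test can be scripted where Python 3 is available, for example on the host, to compare host and container connectivity. This is a sketch only: `can_fetch` is a hypothetical helper, and whether Python is present inside the RAPT image is not stated here, so the curl check remains the reference. Unlike plain curl, `urllib` also treats HTTP error statuses (4xx and 5xx) as failures.

```python
import urllib.error
import urllib.request


def can_fetch(url="https://www.ncbi.nlm.nih.gov", timeout=10):
    """Return True if the URL can be retrieved, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP error codes.
        return False


if __name__ == "__main__":
    print("network OK" if can_fetch() else "network check failed")
```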

Firewall requirements

It will be necessary to allow egress to the NCBI network services used by RAPT for SRA access.

A complete list of SRA network resources is available at Firewall and Routing Information. RAPT may use the servers listed below; under normal circumstances, only www.ncbi.nlm.nih.gov and locate.ncbi.nlm.nih.gov will see traffic. If firewall modification is cumbersome (for example, due to a need to communicate with a systems group who may not respond immediately), it is safest to request access to sra-download.ncbi.nlm.nih.gov as well.

For convenience, the relevant information is reproduced below. The SRA information above is authoritative in case of any disagreement.

The general rule for SRA Toolkit tools is that they will use the https (443) port for TCP communications.

Servers used by RAPT

www.ncbi.nlm.nih.gov
    130.14.29.110

locate.ncbi.nlm.nih.gov
    130.14.29.113

sra-download.ncbi.nlm.nih.gov
    130.14.250.24
    130.14.250.25
    130.14.250.26
    130.14.250.27
    165.112.9.231
    165.112.9.232

Subnets

130.14.0.0/16, netmask 255.255.0.0
165.112.9.0/24, netmask 255.255.255.0
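When firewall rules are written as subnets, it helps to confirm that every server address listed above falls inside them. The sketch below uses Python's standard `ipaddress` module with the addresses and subnets copied from this page; `in_allowed_subnets` is a hypothetical helper name, and the SRA firewall page remains the authoritative source if addresses change.

```python
import ipaddress

# Subnets and server addresses copied from the lists above.
ALLOWED_SUBNETS = [
    ipaddress.ip_network("130.14.0.0/16"),
    ipaddress.ip_network("165.112.9.0/24"),
]
RAPT_SERVERS = {
    "www.ncbi.nlm.nih.gov": ["130.14.29.110"],
    "locate.ncbi.nlm.nih.gov": ["130.14.29.113"],
    "sra-download.ncbi.nlm.nih.gov": [
        "130.14.250.24", "130.14.250.25", "130.14.250.26",
        "130.14.250.27", "165.112.9.231", "165.112.9.232",
    ],
}


def in_allowed_subnets(ip):
    """Return True if the address belongs to one of the listed subnets."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ALLOWED_SUBNETS)


if __name__ == "__main__":
    for host, addresses in RAPT_SERVERS.items():
        for ip in addresses:
            status = "covered" if in_allowed_subnets(ip) else "NOT covered"
            print(f"{host} {ip}: {status}")
```

All addresses listed on this page fall inside the two subnets, so allowing egress on TCP port 443 to those subnets covers the servers RAPT may contact.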

If you have other questions, please visit our FAQs page.