Standalone_RAPT_doc
As of December 2024, NCBI's pilot tool, the Read assembly and Annotation Pipeline Tool (RAPT), will no longer be available. We encourage you to check out NCBI's suite of assembly and annotation tools, including the genome assembler SKESA, the taxonomic assignment tool ANI, and the Prokaryotic Genome Annotation Pipeline (PGAP).
This page contains the prerequisites and instructions for running Stand-alone RAPT on a local machine using the run_rapt.py Python interface. "Local" means the same machine on which run_rapt.py is downloaded and executed. It can be a physical on-premises machine or, more conveniently, a cloud VM instance.
Some basic knowledge of Unix/Linux commands, SKESA, and PGAP is useful.
Please see our wiki page for References, Licenses and FAQs.
- System Requirements
- Get the RAPT command-line interface
- Try RAPT
- Review the output
- Additional information
Stand-alone RAPT runs on the same machine where run_rapt.py is launched, so that machine must satisfy the following minimal requirements:
- At least 4 GB memory per CPU core
- At least 8 CPU cores and 32 GB memory
- Linux OS preferred; Windows 10 (Pro or Enterprise edition) will also work, but extra configuration is required.
- 100 GB free storage space on disk
- Internet connection (see the Additional information section below for firewall requirements)
- Container runner installed (currently one of Docker/Podman/Singularity). Docker is recommended. Below is one way to install Docker on an Ubuntu Linux 18.04 LTS machine. Your operating system may require different commands. Please visit Docker for details.
~$ sudo snap install docker
~$ sudo apt update
~$ sudo apt install -y docker.io
~$ sudo usermod -aG docker $USER
~$ exit
Log back into your machine, then run:
~$ docker run hello-world
- On some systems you may need to increase ulimit -n. A value of 8192 (ulimit -n 8192) has worked well for us.
- Python 3 installed. Below is one way to install Python 3 on a Linux machine:
~$ sudo apt install python3
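The requirements above can be checked before a run with a short script. The following is a minimal sketch, not part of RAPT: the function names are hypothetical, the thresholds are copied from the list above, and the memory probe assumes a Linux system where os.sysconf exposes SC_PHYS_PAGES. Memory is reported in decimal GB, so a machine sold as "32 GB" should pass.

```python
import os
import shutil

# Thresholds copied from the requirements list above.
MIN_CPUS = 8
MIN_MEM_GB = 32
MIN_MEM_PER_CPU_GB = 4
MIN_DISK_GB = 100
RUNNERS = ("docker", "podman", "singularity")


def find_problems(cpus, mem_gb, free_disk_gb, runner):
    """Return a list of unmet requirements; an empty list means all checks pass."""
    problems = []
    if cpus < MIN_CPUS:
        problems.append(f"{cpus} CPU cores found, at least {MIN_CPUS} needed")
    if mem_gb < MIN_MEM_GB:
        problems.append(f"{mem_gb:.0f} GB memory found, at least {MIN_MEM_GB} needed")
    if cpus > 0 and mem_gb / cpus < MIN_MEM_PER_CPU_GB:
        problems.append(f"less than {MIN_MEM_PER_CPU_GB} GB memory per CPU core")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"{free_disk_gb:.0f} GB free disk, at least {MIN_DISK_GB} needed")
    if runner is None:
        problems.append("no container runner (docker/podman/singularity) on PATH")
    return problems


def probe_machine(path="."):
    """Collect the values find_problems() expects (Linux only for the memory probe)."""
    cpus = os.cpu_count() or 0
    mem_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    free_bytes = shutil.disk_usage(path).free
    runner = next((name for name in RUNNERS if shutil.which(name)), None)
    return cpus, mem_bytes / 1e9, free_bytes / 1e9, runner
```

Calling find_problems(*probe_machine()) on the target machine returns the unmet requirements. The check is advisory only; it does not verify the Internet connection, the ulimit setting, or that the container runner actually works.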
You are now ready to run RAPT.
- Log into the machine where you wish to run RAPT
- At your command prompt, download the latest release by executing the following commands:
~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.5.5/rapt-v0.5.5.tar.gz
~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz
Now you should have the run_rapt.py script in your current directory.
- Run ./run_rapt.py -h to see the Stand-alone RAPT usage information:
~$ ./run_rapt.py -h
usage: run_rapt.py [-h] [-a ACXN | -q FASTQ | -v | --test] [--organism ORGA]
                   [--strain STRAIN] [--skesa-only] [--no-usage-reporting]
                   [--stop-on-errors] [--auto-correct-tax] [-o OUTDIR]
                   [--refdata-dir REFDATA_HUB] [--pgap-ref PGAP_REF]
                   [--ani-ref ANI_REF] [-c MAXCPU] [-m MAXMEM]
                   [-D {docker,podman,singularity}] [-n DOCKER_NETWORK]

Read Assembly and Annotation Pipeline Tool (RAPT)

optional arguments:
  -h, --help            show this help message and exit
  -a ACXN, --submitacc ACXN
                        Run RAPT on an SRA run accession (sra_acxn).
  -q FASTQ, --submitfastq FASTQ
                        Run RAPT on Illumina reads in FASTQ or FASTA format.
                        The file must be readable from the computer that runs
                        RAPT. If forward and reverse readings are in two
                        separate files, specify as
                        "path/to/forward.fastq,path/to/reverse.fastq", or
                        "path/to/forward.fastq,reverse.fastq" if they are in
                        the same directory. The --organism argument is
                        mandatory for this type of input, while the --strain
                        argument is optional.
  -v, --version         Display the current RAPT version
  --test                Run a test suite. When RAPT does not produce the
                        expected results, it may be helpful to use this
                        command to ensure RAPT is functioning normally.
  --organism ORGA       Specify the binomial name or, if the species is
                        unknown, the genus for the sequenced organism. This
                        identifier must be valid in NCBI Taxonomy.
  --strain STRAIN       Specify the strain of the organism
  --skesa-only          Only assemble sequences to contigs, but do not
                        annotate.
  --no-usage-reporting  Prevents usage report back to NCBI. By default, RAPT
                        sends usage information back to NCBI for statistical
                        analysis. The information collected are a unique
                        identifier for the RAPT process, the machine IP
                        address, the start and end time of RAPT, and its three
                        modules: SKESA, taxcheck and PGAP. No personal or
                        project-specific information (such as the input data)
                        are collected
  --stop-on-errors      Do not run PGAP annotation pipeline when the genome
                        sequence is misassigned or contaminated
  --auto-correct-tax    If the genome sequence is misassigned or contaminated
                        and ANI predicts an organism with HIGH confidence, use
                        it for PGAP instead of the one provided by the user
  -o OUTDIR, --output-dir OUTDIR
                        Directory to store results and logs. If omitted, use
                        current directory
  --refdata-dir REFDATA_HUB
                        Specify a location to store reference data used by
                        RAPT. If omitted, use output directory
  --pgap-ref PGAP_REF   Full path to pre-downloaded PGAP reference data
                        tarball, if applicable. File is usually named like
                        input-<PGAP-BUILD>.prod.tgz
  --ani-ref ANI_REF     Full path to pre-downloaded ANI reference data
                        tarball, if applicable. File is usually named like
                        input-<PGAP-BUILD>.prod.ani.tgz
  -c MAXCPU, --cpus MAXCPU
                        Specify the maximal CPU cores the container should
                        use.
  -m MAXMEM, --memory MAXMEM
                        Specify the maximal memory (number in GB) the
                        container should use.
  -D {docker,podman,singularity}, --docker {docker,podman,singularity}
                        Use specified docker compatible program to run RAPT
                        image
  -n DOCKER_NETWORK, --network DOCKER_NETWORK
                        Specify the network the container should use. Note:
                        this parameter is passed directly to the --network
                        parameter to the container. RAPT does not check the
                        validity of the argument.
~$
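When RAPT is launched from another script, the rules in the usage text can be applied before the command is started. The sketch below is a hypothetical helper (build_rapt_command is not part of RAPT) that assembles an argument list using only flags shown in the help output. It enforces two rules stated there: -a and -q are alternatives, and --organism is mandatory with -q. Passing --organism and --strain only alongside -q is a simplifying assumption of this sketch.

```python
def build_rapt_command(acxn=None, fastq=None, organism=None, strain=None,
                       outdir=None, script="./run_rapt.py"):
    """Build an argument list for run_rapt.py from the documented flags."""
    # -a and -q are mutually exclusive in the usage line.
    if (acxn is None) == (fastq is None):
        raise ValueError("provide exactly one of acxn (-a) or fastq (-q)")
    # The help text makes --organism mandatory for FASTQ/FASTA input.
    if fastq is not None and not organism:
        raise ValueError("--organism is mandatory when using -q")

    cmd = [script]
    if acxn is not None:
        cmd += ["-a", acxn]
    else:
        cmd += ["-q", fastq, "--organism", organism]
        if strain:
            cmd += ["--strain", strain]
    if outdir:
        cmd += ["-o", outdir]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True). Building a list rather than a single string avoids shell quoting problems with organism names that contain spaces.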
To run RAPT, you need the Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be located in a fasta or fastq file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA).
Important: 1. Only reads sequenced on Illumina machines can be used by RAPT. 2. The reads provided should be from a single isolate.
To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for Mycoplasma pirum.
This example takes about 20 minutes to complete on a 16-CPU, 64 GB machine (time may vary depending on the configuration of the computer).
Run the following command:
~$ ./run_rapt.py -a SRR3496277
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/raptout_c2f0732658/verbose.log.
~$
All output files and logs will be located in a subdirectory named raptout_xxxxxxxxxx under the current directory, where xxxxxxxxxx is the RUNID generated by run_rapt.py, unique to the execution. Please note that the duration of the job depends on the size of the genome and may be several hours.
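Because the RUNID is generated at launch, a script that wants to follow progress has to locate the output directory first. The following is a minimal sketch with hypothetical helper names. It assumes the default naming (raptout_ followed by the RUNID) and picks the most recently modified match, which is only a heuristic if several runs share the same directory.

```python
from pathlib import Path


def latest_rapt_outdir(base="."):
    """Return the most recently modified raptout_* directory under base, or None."""
    candidates = [p for p in Path(base).glob("raptout_*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def tail_log(outdir, n=10):
    """Return the last n lines of verbose.log in the given output directory."""
    log = Path(outdir) / "verbose.log"
    lines = log.read_text(errors="replace").splitlines()
    return lines[-n:]
```

For example, tail_log(latest_rapt_outdir()) returns the latest messages of the newest run started from the current directory.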
You can use fastq or fasta files for single- or paired-end reads produced by an Illumina sequencer as input to RAPT.
- Data should be short reads, from an Illumina sequencer.
- Both single- and paired-end reads are acceptable.
- Paired-end reads can be provided in one file with the two reads of a pair adjacent to each other (interleaved) in the file, or as two files, with the forward reads in one file and the reverse reads in the other.
- The files should be in fastq or fasta format.
- The quality scores are not necessary.
- The files may be gzipped.
- The files should be on the local file system.
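A quick sanity check of an input file against the list above can catch a wrong path or format before a long run starts. This is a minimal sketch with a hypothetical function name; it only looks at the gzip magic bytes and the first non-blank character (">" for fasta, "@" for fastq) and does not validate the whole file or confirm that the reads come from an Illumina sequencer.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def sniff_reads_format(path):
    """Return (kind, gzipped) where kind is 'fasta', 'fastq', 'unknown' or 'empty'."""
    with open(path, "rb") as fh:
        gzipped = fh.read(2) == GZIP_MAGIC
    opener = gzip.open if gzipped else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            if line.strip():
                # The first non-blank line is a header in both formats.
                marker = line.lstrip()[0]
                kind = {">": "fasta", "@": "fastq"}.get(marker, "unknown")
                return kind, gzipped
    return "empty", gzipped
```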
The organism name assigned to the reads must be provided on the command line. It can be the binomial name (genus and species), or the genus only if the species is not defined. The organism name must be known to the NCBI Taxonomy. The strain can optionally be provided with the --strain parameter.
Here is an example command, providing a single fastq file:
~$ ./run_rapt.py -q path/to/myreads.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960"
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/raptout_a2b2345678/verbose.log.
~$
Here is an example command, providing forward and reverse reads in separate gzipped fasta files, and specifying a name for the output directory:
~$ ./run_rapt.py -q path/to/myforwardreads.fasta.gz,myreversereads.fasta.gz --organism "Mycoplasma pirum" --strain "ATCC 25960" -o ATCC_25960_out
RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/ATCC_25960_out/verbose.log.
~$
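The comma-separated form of the -q argument used above follows the rule given in the usage text: a second file given without a directory is taken to be in the same directory as the first. The sketch below illustrates that rule with a hypothetical helper. It is an interpretation of the help text, not RAPT's own parsing code, and it uses POSIX-style paths.

```python
import posixpath


def split_reads_argument(arg):
    """Expand a -q value into one or two read file paths."""
    parts = arg.split(",")
    if len(parts) == 1:
        return [parts[0]]
    if len(parts) != 2:
        raise ValueError("expected one file, or forward and reverse files")
    forward, reverse = parts
    # A bare file name for the reverse reads inherits the forward file's directory.
    if not posixpath.dirname(reverse):
        reverse = posixpath.join(posixpath.dirname(forward), reverse)
    return [forward, reverse]
```

With the example above, "path/to/myforwardreads.fasta.gz,myreversereads.fasta.gz" expands to the two files under path/to/.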
- --auto-correct-tax:
  When set: if the step that evaluates the taxonomic assignment of the assembly (see the Average Nucleotide Identity tool) determines with high confidence that the species that best fits your assembly is different from the name you provided on input, the annotation process will use the ANI-chosen scientific name, and the ANI-chosen name will appear in the output files.
  When not set (default): the annotation process will use the name provided on input, regardless of the results of ANI.
- --stop-on-errors:
  When set: if the step that evaluates the taxonomic assignment of the assembly (see the Average Nucleotide Identity tool) determines with high confidence that the species that best fits your assembly is different from the name you provided on input, or that the assembly is contaminated, RAPT will stop after the taxonomy check and no annotation results will be produced.
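The effect of the two switches can be summarized as a small decision function. This is only a sketch of the behavior described above, with hypothetical names; it is not RAPT's implementation. In particular, the text does not say what happens when both switches are set, so the choice to let --stop-on-errors take precedence is an assumption.

```python
def plan_after_taxcheck(ani_confident_mismatch, contaminated,
                        auto_correct_tax=False, stop_on_errors=False):
    """Return what happens after the taxonomy check, per the descriptions above.

    ani_confident_mismatch: ANI found, with high confidence, a better-fitting
    species than the name provided on input.
    contaminated: the taxonomy check flagged the assembly as contaminated.
    """
    # Assumption: when both switches are set, stopping wins.
    if stop_on_errors and (ani_confident_mismatch or contaminated):
        return "stop"
    if auto_correct_tax and ani_confident_mismatch:
        return "annotate_with_ani_name"
    # Default: annotate with the name provided on input.
    return "annotate_with_input_name"
```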
- Assembly results:
  - skesa_out.fa: multifasta file of the assembled contigs produced by SKESA
  - assembly_stat_report.tsv: assembly statistics (contig count, base count, min contig length, max contig length, contig N50, contig L50)
- Taxonomy verification (see more details):
  - ani-tax-report.txt: results of the taxonomy check on the assembled sequences reported by the Average Nucleotide Identity tool, in text format. The possible statuses are:
    - CONFIRMED: The organism name associated with the input reads, and assigned to the genome, has been confirmed by ANI. The submitted genus matches the genus of the ANI-predicted organism.
    - MISASSIGNED: The organism has been found to be misassigned to the genome. The submitted genus does not match the genus of the ANI-predicted organism.
    - INCONCLUSIVE: The organism cannot be identified (due to lack of a close enough type assembly in GenBank).
    - CONTAMINATED: The genome is contaminated with sequences from an organism other than the submitted organism.
  - ani-tax-report.xml: same as above, in XML format
- PGAP annotation results in multiple formats (see a detailed description of the annotation output files):
  - annot.gbk: annotated genome in GenBank flat file format
  - annot.gff: annotated genome in GFF3 format
  - annot.sqn: annotated genome in ASN.1 format
  - annot.faa: multifasta file of the proteins annotated on the genome
  - calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found.
- CheckM completeness results:
  - checkm.txt: Annotated assembly completeness and contamination as calculated by CheckM. See a full description of the file format at this location. Note: 1) The CheckM calculation is performed on the proteins produced by PGAP, 2) the set of markers used by CheckM is determined by the species associated with the genome (as provided on input or as overridden by ANI).
- Execution logs:
  - concise.log: log file containing the major events in a run
  - verbose.log: all messages, with time stamps
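The statistics listed for assembly_stat_report.tsv can be recomputed from skesa_out.fa as a cross-check. The sketch below uses the conventional definitions (N50 is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half the total; L50 is the number of contigs needed to get there). The function names are hypothetical, and RAPT's own report may be computed differently in edge cases.

```python
def read_contig_lengths(fasta_path):
    """Return the sequence length of every record in a multifasta file."""
    lengths = []
    current = None
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def contig_stats(lengths):
    """Compute contig count, base count, min/max length, N50 and L50."""
    if not lengths:
        raise ValueError("no contigs")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:  # reached half of all assembled bases
            n50, l50 = length, rank
            break
    return {
        "contig_count": len(ordered),
        "base_count": total,
        "min_length": ordered[-1],
        "max_length": ordered[0],
        "n50": n50,
        "l50": l50,
    }
```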
The default location for reference data is the current working directory. RAPT will detect whether the proper version of reference data is available and automatically download it if it is not found. Downloaded reference data are stored in a version-named sub-directory, such as input-2020-07-09.build4716, so that multiple versions of reference data can exist side-by-side. Users who run RAPT regularly may want to store the reference data in a dedicated location, in which case the --refdata-dir switch can be used to specify a location other than the current directory:
~$ ./run_rapt.py -q path/to/myreads.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960" --refdata-dir path/to/refdata-dir
- -D docker|podman|singularity
  If multiple runners are installed (highly discouraged), you can specify which runner to use. If a full path to the binary is provided, RAPT will use that path. Otherwise, the runner must be on your $PATH so that RAPT can find it.
- -c MAXCPU, --cpus MAXCPU
  Specify the maximal number of CPUs the container should use.
- -m MAXMEM, --memory MAXMEM
  Specify the maximal amount of memory (in GB) the container should use.
  Note: Singularity does not support dynamic resource limitation, so the -c and -m options have no effect with it.
- --network NETWORK
  Specify the network the container should use. Note: only supported for Docker.
Some users may find that their local default configuration of Docker cannot send or receive data. To test network function, first bring up an interactive shell:
> docker run -it xxxxx bash
where xxxxx specifies the Docker image. Then check for network access from within this interactive Docker shell. One way to do that:
> curl -o /dev/null https://www.ncbi.nlm.nih.gov ; echo $?
A response of 0 indicates successful retrieval; any other response indicates failure.
If this indicates failure, the Docker networking documentation may be helpful. Some users may find that using the host network is workable on a local machine without other Docker instances running (that is, invoking docker run --network host -it xxxxx bash). Once a working network has been identified or configured, it can be specified using the --network advanced option to run_rapt.py, above.
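Before debugging container networking, it can help to confirm that the host itself can reach NCBI on port 443. The following is a minimal sketch with a hypothetical function name. It tests only whether a TCP connection can be opened, so unlike the curl command above it does not exercise TLS or HTTP, and a success on the host says nothing about networking inside the container.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

For example, can_connect("www.ncbi.nlm.nih.gov") returning False on the host points to a firewall or DNS problem outside Docker.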
It will be necessary to allow egress to the NCBI network services used by RAPT for SRA access.
A complete list of SRA network resources is available at Firewall and Routing Information. RAPT may use the servers listed below; under normal circumstances, only www.ncbi.nlm.nih.gov and locate.ncbi.nlm.nih.gov will see traffic. If firewall modification is cumbersome (for example, because it requires a request to a systems group that may not respond immediately), it is safest to request access to sra-download.ncbi.nlm.nih.gov as well.
For convenience, the relevant information is reproduced below. The SRA information above is authoritative in case of any disagreement.
The general rule for SRA Toolkit tools is that they will use the https (443) port for TCP communications.
- www.ncbi.nlm.nih.gov: 130.14.29.110
- locate.ncbi.nlm.nih.gov: 130.14.29.113
- sra-download.ncbi.nlm.nih.gov: 130.14.250.24, 130.14.250.25, 130.14.250.26, 130.14.250.27, 165.112.9.231, 165.112.9.232
- 130.14.0.0/16, netmask 255.255.0.0
- 165.112.9.0/24, netmask 255.255.255.0
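When a firewall rule is written for the two network ranges rather than for individual addresses, it is easy to verify that every address listed above is covered. The sketch below does this with Python's standard ipaddress module. The addresses and ranges are copied from the list above, and the helper name is hypothetical.

```python
import ipaddress

# Individual addresses and network ranges copied from the list above.
ADDRESSES = [
    "130.14.29.110", "130.14.29.113",
    "130.14.250.24", "130.14.250.25", "130.14.250.26", "130.14.250.27",
    "165.112.9.231", "165.112.9.232",
]
RANGES = ["130.14.0.0/16", "165.112.9.0/24"]


def covering_range(ip, ranges=RANGES):
    """Return the first CIDR range that contains ip, or None if none does."""
    addr = ipaddress.ip_address(ip)
    for cidr in ranges:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None
```

All eight listed addresses fall inside one of the two ranges, so allowing HTTPS (port 443) egress to the ranges covers every listed server.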
If you have other questions, please visit our FAQs page.