diff --git a/CHANGES.md b/CHANGES.md index 71d85e70..546aea61 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -1,13 +1,18 @@ +STAR 2.7.5a 2020/06/16 +====================== +**Major new features:** * Implemented STARsolo quantification for Smart-seq with --soloType SmartSeq option. * Implemented --readFilesManifest option to input a list of input read files. + +**Minor features and bug fixes:** * Change in STARsolo SJ output behavior: junctions are output even if reads do not match genes. * Fixed a bug with solo SJ output for large genomes. -* N characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq). +* N-characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq). * SJ.out.tab is sym-linked as features.tsv for Solo SJ output. * Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3. * Issue #883: Patch for FreeBSD in SharedMemory and Makefile improvements. * Issue #902: Fixed seg-fault for STARsolo CB/UB SAM attributes output with --soloFeatures GeneFull --outSAMunmapped Within options. -* Issue #934: Fixed a problem with annotated junctions that was casuing very rare seg-faults. +* Issue #934: Fixed a problem with annotated junctions that was causing very rare seg-faults. * Issue #936: Throw an error if an empty whitelist is provided to STARsolo. 
STAR 2.7.4a 2020/06/01 diff --git a/README.md b/README.md index c0d45393..2f37d619 100644 --- a/README.md +++ b/README.md @@ -35,9 +35,9 @@ Download the latest [release from](https://github.com/alexdobin/STAR/releases) a ```bash # Get latest STAR source from releases -wget https://github.com/alexdobin/STAR/archive/2.7.4a.tar.gz -tar -xzf 2.7.4a.tar.gz -cd STAR-2.7.4a +wget https://github.com/alexdobin/STAR/archive/2.7.5a.tar.gz +tar -xzf 2.7.5a.tar.gz +cd STAR-2.7.5a # Alternatively, get STAR source using git git clone https://github.com/alexdobin/STAR.git diff --git a/RELEASEnotes.md b/RELEASEnotes.md index 912e8e5d..42dac326 100644 --- a/RELEASEnotes.md +++ b/RELEASEnotes.md @@ -1,5 +1,27 @@ +STAR 2.7.5a 2020/06/16 +====================== +**Major new features: +~ support for Plate-based (Smart-seq) scRNA-seq +~ manifest file to list the input reads FASTQ files** + +* Typical STAR command for mapping and quantification of plate-based (Smart-seq) scRNA-seq will look like: +``` + --soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded +``` +For detailed description, see [Plate-based (Smart-seq) scRNA-seq](docs/STARsolo.md#plate-based-Smart-seq-scRNA-seq) + +* The convenient way to list a large number of reads FASTQ files and their IDs is to create a file manifest and supply it in `--readFilesManifest /path/to/manifest.tsv`. The manifest file should contain 3 tab-separated columns. For paired-end reads: +``` +Read1-file-name \t Read2-file-name \t File-id +``` +For single-end reads, the 2nd column should contain the dash - : +``` +Read1-file-name \t - \t File-id +``` +File-id can be any string without spaces. File-id will be added as ReadGroup tag (*RG:Z:*) for each read in the SAM/BAM output. If File-id starts with *ID:*, it can contain several fields separated by tab, and all the fields will be copied verbatim into SAM *@RG* header line. 
+ -STAR 2.7.3a 2020/06/01 +STAR 2.7.4a 2020/06/01 ====================== This release fixes multiple bugs and issues. The biggest issue fixed was a seg-fault for small genome which previously required scaling down `--genomeSAindexNbases`. Such scaling is still recommended but is no longer required. diff --git a/bin/Linux_x86_64/STAR b/bin/Linux_x86_64/STAR index 715e0f47..d9a1d8f7 100755 Binary files a/bin/Linux_x86_64/STAR and b/bin/Linux_x86_64/STAR differ diff --git a/bin/Linux_x86_64/STARlong b/bin/Linux_x86_64/STARlong index 9eca93eb..3aa10ea0 100755 Binary files a/bin/Linux_x86_64/STARlong and b/bin/Linux_x86_64/STARlong differ diff --git a/bin/Linux_x86_64_static/STAR b/bin/Linux_x86_64_static/STAR index 62d4e17e..1d74c39b 100755 Binary files a/bin/Linux_x86_64_static/STAR and b/bin/Linux_x86_64_static/STAR differ diff --git a/bin/Linux_x86_64_static/STARlong b/bin/Linux_x86_64_static/STARlong index d3d9c030..ef982fd7 100755 Binary files a/bin/Linux_x86_64_static/STARlong and b/bin/Linux_x86_64_static/STARlong differ diff --git a/doc/STARmanual.pdf b/doc/STARmanual.pdf index de983e54..0b345c7d 100644 Binary files a/doc/STARmanual.pdf and b/doc/STARmanual.pdf differ diff --git a/docs/STARsolo.md b/docs/STARsolo.md index fecaa87a..4b972e5d 100644 --- a/docs/STARsolo.md +++ b/docs/STARsolo.md @@ -1,6 +1,11 @@ -STARsolo: mapping, demultiplexing and quantification for single cell RNA-seq +**STARsolo**: mapping, demultiplexing and quantification for single cell RNA-seq ================================================================================= +Major updates in STAR 2.7.5a (2020/06/16) +--------------------------------------- +* [**Smart-seq scRNA-seq process:**](#plate-based-Smart-seq-scRNA-seq) + * STARsolo now supports the plate-based (a.k.a. Smart-seq) scRNA-seq technologies. 
+ Major updates in STAR 2.7.3a (Oct 8 2019) ----------------------------------------- * **Output enhancements:** @@ -169,9 +174,9 @@ Basic cell filtering * Recent versions of CellRanger switched to more advanced filtering done with the EmptyDrop tool developed by [Lun et al](https://doi.org/10.1186/s13059-019-1662-y). To obtain filtered counts similar to recent CellRanger versions, we need to run this tools on **raw** STARsolo output ------------------- +--------------------------------------------------- Quantification of different transcriptomic features ------------------------ +--------------------------------------------------- * In addition to the gene counts (deafult), STARsolo can calculate counts for other transcriptomic features: * pre-mRNA counts, useful for single-nucleus RNA-seq. This counts all read that overlap gene loci, i.e. included both exonic and intronic reads: ``` @@ -209,19 +214,46 @@ BAM tags --outSAMtype BAM SortedByCoordinate ``` +-------------------------------- +Different scRNA-seq technologies +-------------------------------- +### Plate-based (Smart-seq) scRNA-seq +Plate-based (Smart-seq) scRNA-seq technologies produce separate FASTQ files for each cell. Cell barcodes are not incorporated in the read sequences, and there are no UMIs. A typical STAR command for mapping and quantification of these files will look like: +``` +--soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded +``` + +* STARsolo `--soloType SmartSeq` option produces cell/gene (and other [features](#quantification-of-different-transcriptomic-features)) +count matrices, using rules similar to the droplet-based technologies. The differences are: (i) individual cells correspond to different FASTQ files, there are no Cell Barcode sequences, and "Cell IDs" have to be provided as input; (ii) there are no UMI sequences, but reads can be deduplicated if they have identical start/end coordinates. 
+ +* The convenient way to list all the FASTQ files and Cell IDs is to create a file manifest and supply it in `--readFilesManifest /path/to/manifest.tsv`. The manifest file should contain 3 tab-separated columns. For paired-end reads: +``` +Read1-file-name \t Read2-file-name \t Cell-id +``` +For single-end reads, the 2nd column should contain the dash - : +``` +Read1-file-name \t - \t Cell-id +``` +Cell-id can be any string without spaces. Cell-id will be added as ReadGroup tag (*RG:Z:*) for each read in the SAM/BAM output. If Cell-id starts with *ID:*, it can contain several fields separated by tab, and all the fields will be copied verbatim into SAM *@RG* header line. +* Deduplication based on read start/end coordinates can be done with `--soloUMIdedup Exact` option. To avoid deduplication (e.g. for single-end reads) use `--soloUMIdedup NoDedup`. Both deduplication options can be used together `--soloUMIdedup Exact NoDedup` and will produce two columns in the *matrix.mtx* output. +* Common Smart-seq protocols are unstranded and thus will require `--soloStrand Unstranded` option. If your protocol is stranded, you can choose the proper `--soloStrand Forward` (default) or `--soloStrand Reverse` options. + ------------------------------------------------------------- ------------------------------------------------------ -------------------------------------------------- -For completenes, all parameters that control STARsolo output are listed again below with defaults and short descriptions: +All parameters that control STARsolo output are listed again below with defaults and short descriptions: --------------------------------------- ``` soloType None string(s): type of single-cell RNA-seq - CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium + CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium. CB_UMI_Complex ... 
one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop. + CB_samTagOut ... output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate] + SmartSeq ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases) soloCBwhitelist - - string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with + string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file. + None ... no whitelist: all cell barcodes are allowed soloCBstart 1 int>0: cell barcode start base @@ -243,9 +275,9 @@ soloBarcodeReadLength 1 soloCBposition - strings(s) position of Cell Barcode(s) on the barcode read. Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2. - Format for each barcode: startAnchor_startDistance_endAnchor_endDistance - start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end - start(end)Distance is the distance from the CB start(end) to the Anchor base + Format for each barcode: startAnchor_startPosition_endAnchor_endPosition + start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end + start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base String for different barcodes are separated by space. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloCBposition 0_0_2_-1 3_1_3_8 @@ -281,13 +313,13 @@ soloFeatures Gene Gene ... genes: reads match the gene transcript SJ ... splice junctions: reported in SJ.out.tab GeneFull ... 
full genes: count all reads overlapping genes' exons and introns - Transcript3p ... quantification of transcript for 3' protocols soloUMIdedup 1MM_All string(s): type of UMI deduplication (collapsing) algorithm 1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once) 1MM_Directional ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017). Exact ... only exactly matching UMIs are collapsed + NoDedup ... no deduplication of UMIs, count all reads. Allowed for --soloType SmartSeq soloUMIfiltering - string(s) type of UMI filtering @@ -300,8 +332,10 @@ soloOutFileNames Solo.out/ features.tsv barcodes.tsv soloCellFilter CellRanger2.2 3000 0.99 10 string(s): cell filtering type and parameters - CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count - TopCells ... only report top cells by UMI count, followed by the excat number of cells + CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count + TopCells ... only report top cells by UMI count, followed by the exact number of cells None ... do not output filtered cells +soloOutFormatFeaturesGeneField3 "Gene Expression" + string(s): field 3 in the Gene features.tsv file. If "-", then no 3rd field is output. ``` diff --git a/extras/doc-latex/STARmanual.tex b/extras/doc-latex/STARmanual.tex index 8a7d0bc8..656a92ab 100644 --- a/extras/doc-latex/STARmanual.tex +++ b/extras/doc-latex/STARmanual.tex @@ -179,10 +179,7 @@ \subsection{Basic options.} \end{itemize} -\subsection{Advanced options.} -There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}. 
- -\subsubsection{Mapping multiple files in one run.} +\subsection{Mapping multiple files in one run.} Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that distinct read groups can be used in \opt{outSAMattrRGline} command to keep track of reads from different files. For single-end reads use a comma separated list (no spaces around commas), e.g.: \opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq} @@ -199,7 +196,7 @@ \subsubsection{Mapping multiple files in one run.} Note that this list is separated by commas surrounded by spaces (unlike \opt{readFilesIn} list). Another option for mapping multiple reads files, especially convenient for a very large number of files, is to create a file manifest and supply it in \opt{readFilesManifest} \optv{/path/to/manifest.tsv}. -The manifest file should contain 3 tab-separated columns, paired-end reads: +The manifest file should contain 3 tab-separated columns. For paired-end reads: \ofilen{read1-file-name $tab$ read2-file-name $tab$ read-group-line} @@ -211,6 +208,9 @@ \subsubsection{Mapping multiple files in one run.} If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it. If read-group-line starts with ID:, it can contain several fields separated by $tab$, and all the fields will be copied verbatim into SAM @RG header line. +\subsection{Advanced options.} +There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}. + \subsubsection{Using annotations at the mapping stage.} Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. 
The genome indices can be generated with or without another set of annotations/junctions. In the latter case the new junctions will added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory. @@ -476,9 +476,6 @@ \section{Output in transcript coordinates.} Note, that STAR first aligns reads to entire genome, and only then searches for concordance between alignments and transcripts.This approach offers certain advantages compared to the alignment to transcriptome only, by not forcing the alignments to annotated transcripts. Note that \opt{outFilterMultimapNmax} filter only applies to genomic alignments. If an alignment passes this filter, it is converted to all possible transcriptomic alignments and all of them are output. - - - By default, the output satisfies RSEM requirements: soft-clipping or indels are not allowed. Use \opt{quantTranscriptomeBan} \optv{Singleend} to allow insertions, deletions ans soft-clips in the transcriptomic alignments, which can be used by some expression quantification software (e.g. eXpress). \section{Counting number of reads per gene.} diff --git a/extras/doc-latex/parametersDefault.tex b/extras/doc-latex/parametersDefault.tex index c717ef3d..f9be9a87 100644 --- a/extras/doc-latex/parametersDefault.tex +++ b/extras/doc-latex/parametersDefault.tex @@ -926,4 +926,7 @@ \optOpt{TopCells} \optOptLine{only report top cells by UMI count, followed by the exact number of cells} \optOpt{None} \optOptLine{do not output filtered cells} \end{optOptTable} +\optName{soloOutFormatFeaturesGeneField3} + \optValue{"Gene Expression"} + \optLine{string(s): field 3 in the Gene features.tsv file. 
If "-", then no 3rd field is output.} \end{optTable} diff --git a/extras/docker/Dockerfile b/extras/docker/Dockerfile index a13260ef..cf54d14c 100755 --- a/extras/docker/Dockerfile +++ b/extras/docker/Dockerfile @@ -2,7 +2,7 @@ FROM debian:stretch-slim MAINTAINER dobin@cshl.edu -ARG STAR_VERSION=2.7.4a +ARG STAR_VERSION=2.7.5a ENV PACKAGES gcc g++ make wget zlib1g-dev unzip
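
Note on the `--readFilesManifest` format introduced in this diff: the RELEASEnotes.md and docs/STARsolo.md hunks describe a 3-column, tab-separated manifest, with `-` in the 2nd column for single-end reads. The sketch below is one possible way to generate such a file and is not part of STAR. The helper name `manifest_lines` and the sample file names are hypothetical, and rejecting whitespace in plain IDs is an assumption drawn from the "any string without spaces" wording.

```python
from typing import Iterable, List, Optional, Tuple


def manifest_lines(samples: Iterable[Tuple[str, Optional[str], str]]) -> List[str]:
    """Build manifest rows: read1 <tab> read2-or-dash <tab> file/cell ID.

    Hypothetical helper, not part of STAR. Each sample is a tuple of
    (read1 path, read2 path or None for single-end, ID string).
    """
    rows = []
    for read1, read2, file_id in samples:
        # Assumption: a plain ID (not starting with "ID:") must not contain
        # whitespace, following the "any string without spaces" wording.
        if not file_id.startswith("ID:") and any(ch.isspace() for ch in file_id):
            raise ValueError(f"ID contains whitespace: {file_id!r}")
        # Single-end reads take a dash in the 2nd column.
        rows.append("\t".join([read1, read2 if read2 else "-", file_id]))
    return rows


if __name__ == "__main__":
    # Hypothetical input files: one paired-end cell and one single-end cell.
    example = [
        ("cellA_1.fq", "cellA_2.fq", "cellA"),
        ("cellB.fq", None, "cellB"),
    ]
    with open("manifest.tsv", "w") as handle:
        handle.write("\n".join(manifest_lines(example)) + "\n")
```

The resulting `manifest.tsv` would then be passed as `--readFilesManifest manifest.tsv`, together with the other options shown in the release notes.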