Read for 2.7.5a

alexdobin · Jun 16, 2020 · e67d668
1 parent 502dd0c
commit e67d668
Showing 12 changed files with 88 additions and 27 deletions.
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,13 +1,18 @@
+STAR 2.7.5a 2020/06/16
+======================
+**Major new features:**
 * Implemented STARsolo quantification for Smart-seq with --soloType SmartSeq option.
 * Implemented --readFilesManifest option to input a list of input read files.
+
+**Minor features and bug fixes:**
 * Change in STARsolo SJ output behavior: junctions are output even if reads do not match genes.
 * Fixed a bug with solo SJ output for large genomes.
-* N characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq).
+* N-characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq).
 * SJ.out.tab is sym-linked as features.tsv for Solo SJ output.
 * Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3.
 * Issue #883: Patch for FreeBSD in SharedMemory and Makefile improvements.
 * Issue #902: Fixed seg-fault for STARsolo CB/UB SAM attributes output with --soloFeatures GeneFull --outSAMunmapped Within options.
-* Issue #934: Fixed a problem with annotated junctions that was casuing very rare seg-faults.
+* Issue #934: Fixed a problem with annotated junctions that was causing very rare seg-faults.
 * Issue #936: Throw an error if an empty whitelist is provided to STARsolo.
 
 STAR 2.7.4a 2020/06/01

diff --git a/README.md b/README.md
@@ -35,9 +35,9 @@ Download the latest [release from](https://github.com/alexdobin/STAR/releases) a
 
 ```bash
 # Get latest STAR source from releases
-wget https://github.com/alexdobin/STAR/archive/2.7.4a.tar.gz
-tar -xzf 2.7.4a.tar.gz
-cd STAR-2.7.4a
+wget https://github.com/alexdobin/STAR/archive/2.7.5a.tar.gz
+tar -xzf 2.7.5a.tar.gz
+cd STAR-2.7.5a
 
 # Alternatively, get STAR source using git
 git clone https://github.com/alexdobin/STAR.git

diff --git a/RELEASEnotes.md b/RELEASEnotes.md
@@ -1,5 +1,27 @@
+STAR 2.7.5a 2020/06/16
+======================
+**Major new features:  
+~ support for Plate-based (Smart-seq) scRNA-seq  
+~ manifest file to list the input reads FASTQ files**
+
+* A typical STAR command for mapping and quantification of plate-based (Smart-seq) scRNA-seq looks like:
+```
+ --soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded
+```
+For a detailed description, see [Plate-based (Smart-seq) scRNA-seq](docs/STARsolo.md#plate-based-Smart-seq-scRNA-seq)
+
+* A convenient way to list a large number of read FASTQ files and their IDs is to create a file manifest and supply it with `--readFilesManifest /path/to/manifest.tsv`. The manifest file should contain 3 tab-separated columns. For paired-end reads:
+```
+Read1-file-name \t Read2-file-name \t File-id
+```
+For single-end reads, the 2nd column should contain a dash (-):
+```
+Read1-file-name \t - \t File-id
+```
+File-id can be any string without spaces. File-id will be added as the ReadGroup tag (*RG:Z:*) for each read in the SAM/BAM output. If File-id starts with *ID:*, it can contain several fields separated by tabs, and all the fields will be copied verbatim into the SAM *@RG* header line.
+
 
-STAR 2.7.3a 2020/06/01
+STAR 2.7.4a 2020/06/01
 ======================
 This release fixes multiple bugs and issues.  
 The biggest issue fixed was a seg-fault for small genomes, which previously required scaling down `--genomeSAindexNbases`. Such scaling is still recommended but is no longer required.  

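Reviewer note on the manifest format added to RELEASEnotes.md above: the three-column layout can be generated with a few lines of Python. This is a minimal sketch, not part of STAR. The function name `write_manifest` and the example file names are hypothetical, and the sketch covers only the simple case where the third column is a single ID without whitespace (not the multi-field form that starts with `ID:`).

```python
import io


def write_manifest(rows, handle):
    """Write manifest rows as 3 tab-separated columns.

    rows: iterable of (read1, read2, file_id) tuples; read2 is None
    for single-end reads and is written as a dash, as the notes describe.
    """
    for read1, read2, file_id in rows:
        # The release notes say the ID can be any string without spaces;
        # this sketch rejects any whitespace to keep the columns unambiguous.
        if any(ch.isspace() for ch in file_id):
            raise ValueError(f"ID must not contain whitespace: {file_id!r}")
        mate = read2 if read2 else "-"
        handle.write("\t".join([read1, mate, file_id]) + "\n")


# Hypothetical paired-end example with two cells.
paired = [
    ("cellA_R1.fastq", "cellA_R2.fastq", "cellA"),
    ("cellB_R1.fastq", "cellB_R2.fastq", "cellB"),
]
buffer = io.StringIO()
write_manifest(paired, buffer)
print(buffer.getvalue(), end="")
```

Writing to a file handle instead of `io.StringIO` produces a file that could then be passed to `--readFilesManifest`.
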
diff --git a/bin/Linux_x86_64/STAR b/bin/Linux_x86_64/STAR
diff --git a/bin/Linux_x86_64/STARlong b/bin/Linux_x86_64/STARlong
diff --git a/bin/Linux_x86_64_static/STAR b/bin/Linux_x86_64_static/STAR
diff --git a/bin/Linux_x86_64_static/STARlong b/bin/Linux_x86_64_static/STARlong
diff --git a/doc/STARmanual.pdf b/doc/STARmanual.pdf
diff --git a/docs/STARsolo.md b/docs/STARsolo.md
@@ -1,6 +1,11 @@
-STARsolo: mapping, demultiplexing and quantification for single cell RNA-seq
+**STARsolo**: mapping, demultiplexing and quantification for single cell RNA-seq
 =================================================================================
 
+Major updates in STAR 2.7.5a (2020/06/16)
+---------------------------------------
+* [**Smart-seq scRNA-seq process:**](#plate-based-Smart-seq-scRNA-seq)
+    * STARsolo now supports the plate-based (a.k.a. Smart-seq) scRNA-seq technologies.
+
 Major updates in STAR 2.7.3a (Oct 8 2019)
 -----------------------------------------
 * **Output enhancements:**
@@ -169,9 +174,9 @@ Basic cell filtering
 * Recent versions of CellRanger switched to more advanced filtering done with the EmptyDrop tool developed by [Lun et al](https://doi.org/10.1186/s13059-019-1662-y). To obtain filtered counts similar to recent CellRanger versions, we need to run this tool on **raw** STARsolo output
 
 
-------------------
+---------------------------------------------------
 Quantification of different transcriptomic features
------------------------
+---------------------------------------------------
 * In addition to the gene counts (default), STARsolo can calculate counts for other transcriptomic features:
     * pre-mRNA counts, useful for single-nucleus RNA-seq. This counts all reads that overlap gene loci, i.e. includes both exonic and intronic reads:
         ```
@@ -209,19 +214,46 @@ BAM tags
     --outSAMtype BAM SortedByCoordinate
     ```
 
+--------------------------------
+Different scRNA-seq technologies
+--------------------------------
+### Plate-based (Smart-seq) scRNA-seq
+Plate-based (Smart-seq) scRNA-seq technologies produce separate FASTQ files for each cell. Cell barcodes are not incorporated in the read sequences, and there are no UMIs. A typical STAR command for mapping and quantification of these files looks like:
+```
+--soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded
+```
+
+* STARsolo `--soloType SmartSeq` option produces cell/gene (and other [features](#quantification-of-different-transcriptomic-features))
+count matrices, using rules similar to the droplet-based technologies. The differences are: (i) individual cells correspond to different FASTQ files, there are no Cell Barcode sequences, and "Cell IDs" have to be provided as input; (ii) there are no UMI sequences, but reads can be deduplicated if they have identical start/end coordinates.
+
+* A convenient way to list all the FASTQ files and Cell IDs is to create a file manifest and supply it with `--readFilesManifest /path/to/manifest.tsv`. The manifest file should contain 3 tab-separated columns. For paired-end reads:
+```
+Read1-file-name \t Read2-file-name \t Cell-id
+```
+For single-end reads, the 2nd column should contain a dash (-):
+```
+Read1-file-name \t - \t Cell-id
+```
+Cell-id can be any string without spaces. Cell-id will be added as the ReadGroup tag (*RG:Z:*) for each read in the SAM/BAM output. If Cell-id starts with *ID:*, it can contain several fields separated by tabs, and all the fields will be copied verbatim into the SAM *@RG* header line.
+* Deduplication based on read start/end coordinates can be done with the `--soloUMIdedup Exact` option. To avoid deduplication (e.g. for single-end reads) use `--soloUMIdedup NoDedup`. Both deduplication options can be used together (`--soloUMIdedup Exact NoDedup`) and will produce two columns in the *matrix.mtx* output.
+* Common Smart-seq protocols are unstranded and thus will require the `--soloStrand Unstranded` option. If your protocol is stranded, you can choose the proper `--soloStrand Forward` (default) or `--soloStrand Reverse` option.
+
 -------------------------------------------------------------
 ------------------------------------------------------
 --------------------------------------------------
-For completenes, all parameters that control STARsolo output are listed again below with defaults and short descriptions:
+All parameters that control STARsolo output are listed again below with defaults and short descriptions:
 ---------------------------------------
 ```
 soloType                    None
     string(s): type of single-cell RNA-seq
-                            CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium
+                            CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
                             CB_UMI_Complex  ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
+                            CB_samTagOut    ... output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
+                            SmartSeq        ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)
 
 soloCBwhitelist             -
-    string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with
+    string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.
+                            None            ... no whitelist: all cell barcodes are allowed
 
 soloCBstart                 1
     int>0: cell barcode start base
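
Reviewer note on the Smart-seq deduplication rules in the hunk above: the difference between `Exact` and `NoDedup` can be illustrated with a toy counter. This is a sketch of the idea only, not STAR's implementation. The function `count_reads` and the tuple layout are hypothetical, and the sketch ignores strand and the soft-clip extension that the `SmartSeq` option description mentions.

```python
from collections import defaultdict


def count_reads(alignments, modes=("Exact", "NoDedup")):
    """Count reads per (cell, gene) under each requested mode.

    alignments: iterable of (cell_id, gene, start, end) tuples.
    Returns {(cell_id, gene): (count for modes[0], count for modes[1], ...)}.
    """
    total = defaultdict(int)   # NoDedup: every read counts
    spans = defaultdict(set)   # Exact: identical start/end collapse to one
    for cell_id, gene, start, end in alignments:
        key = (cell_id, gene)
        total[key] += 1
        spans[key].add((start, end))

    counts = {}
    for key in total:
        per_mode = {"Exact": len(spans[key]), "NoDedup": total[key]}
        counts[key] = tuple(per_mode[mode] for mode in modes)
    return counts


# Hypothetical alignments: cellA has two reads with identical coordinates.
alignments = [
    ("cellA", "GeneX", 100, 250),
    ("cellA", "GeneX", 100, 250),
    ("cellA", "GeneX", 120, 270),
    ("cellB", "GeneX", 100, 250),
]
counts = count_reads(alignments)
print(counts[("cellA", "GeneX")])  # (2, 3): Exact column, then NoDedup column
```

Requesting both modes yields one value per mode for each cell/gene pair, which mirrors the two columns the documentation describes for `--soloUMIdedup Exact NoDedup`.
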
@@ -243,9 +275,9 @@ soloBarcodeReadLength       1
 soloCBposition              -
     strings(s)              position of Cell Barcode(s) on the barcode read.
                             Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
-                            Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
-                            start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
-                            start(end)Distance is the distance from the CB start(end) to the Anchor base
+                            Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
+                            start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
+                            start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base
                             Strings for different barcodes are separated by spaces.
                             Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
                             --soloCBposition  0_0_2_-1  3_1_3_8
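
Reviewer note on the `--soloCBposition` format in the hunk above: the `startAnchor_startPosition_endAnchor_endPosition` string can be decoded as sketched below. This is an illustration, not STAR's code. It assumes that both positions are inclusive, that the "read end" and "adapter end" anchors refer to the last base of the read and of the adapter, and that the adapter location is already known. The names `parse_cb_position` and `extract_barcode` and the toy sequences are hypothetical.

```python
from typing import NamedTuple

# Anchor codes as listed in the parameter description.
ANCHORS = {0: "read start", 1: "read end", 2: "adapter start", 3: "adapter end"}


class CBPosition(NamedTuple):
    start_anchor: int
    start_position: int
    end_anchor: int
    end_position: int


def parse_cb_position(spec):
    """Parse one barcode spec such as '0_0_2_-1'."""
    fields = spec.split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 underscore-separated fields: {spec!r}")
    pos = CBPosition(*(int(field) for field in fields))
    if pos.start_anchor not in ANCHORS or pos.end_anchor not in ANCHORS:
        raise ValueError(f"anchor must be one of {sorted(ANCHORS)}: {spec!r}")
    return pos


def extract_barcode(read, pos, adapter_start, adapter_end):
    """Return the barcode bases, treating both ends as inclusive (assumption)."""
    anchor_base = {0: 0, 1: len(read) - 1, 2: adapter_start, 3: adapter_end}
    first = anchor_base[pos.start_anchor] + pos.start_position
    last = anchor_base[pos.end_anchor] + pos.end_position
    if first < 0 or last >= len(read) or first > last:
        raise ValueError("barcode falls outside the read")
    return read[first:last + 1]


# Toy read: 10-base barcode, 6-base placeholder adapter, 8-base barcode, tail.
cb1, adapter, cb2, tail = "ACGTACGTAA", "GGGGGG", "TTGGCCAA", "CCCCCC"
read = cb1 + adapter + cb2 + tail
adapter_start = len(cb1)                     # index of first adapter base
adapter_end = adapter_start + len(adapter) - 1  # index of last adapter base

for spec in "0_0_2_-1 3_1_3_8".split():
    pos = parse_cb_position(spec)
    print(spec, extract_barcode(read, pos, adapter_start, adapter_end))
```

Under these assumptions, `0_0_2_-1` selects everything from the read start up to the base before the adapter, and `3_1_3_8` selects the 8 bases immediately after the adapter.
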
@@ -281,13 +313,13 @@ soloFeatures                Gene
                             Gene            ... genes: reads match the gene transcript
                             SJ              ... splice junctions: reported in SJ.out.tab
                             GeneFull        ... full genes: count all reads overlapping genes' exons and introns
-                            Transcript3p   ... quantification of transcript for 3' protocols
 
 soloUMIdedup                1MM_All
     string(s):              type of UMI deduplication (collapsing) algorithm
                             1MM_All             ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
                             1MM_Directional     ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
                             Exact               ... only exactly matching UMIs are collapsed
+                            NoDedup             ... no deduplication of UMIs, count all reads. Allowed for --soloType SmartSeq
 
 soloUMIfiltering            -
     string(s)               type of UMI filtering
@@ -300,8 +332,10 @@ soloOutFileNames            Solo.out/          features.tsv barcodes.tsv
 
 soloCellFilter              CellRanger2.2 3000 0.99 10
     string(s):              cell filtering type and parameters
-                            CellRanger2.2   ... simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
-                            TopCells        ... only report top cells by UMI count, followed by the excat number of cells
+                            CellRanger2.2   ... simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
+                            TopCells        ... only report top cells by UMI count, followed by the exact number of cells
                             None            ... do not output filtered cells
 
+soloOutFormatFeaturesGeneField3 "Gene Expression"
+        string(s):                              field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.
 ```
diff --git a/extras/doc-latex/STARmanual.tex b/extras/doc-latex/STARmanual.tex
@@ -179,10 +179,7 @@ \subsection{Basic options.}
 
 \end{itemize}
 
-\subsection{Advanced options.}
-There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}. 
-
-\subsubsection{Mapping multiple files in one run.}
+\subsection{Mapping multiple files in one run.}
 Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that distinct read groups can be used in \opt{outSAMattrRGline} command to keep track of reads from different files. For single-end reads use a comma separated list (no spaces around commas), e.g.:
 
 \opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq} 
@@ -199,7 +196,7 @@ \subsubsection{Mapping multiple files in one run.}
 Note that this list is separated by commas surrounded by spaces (unlike \opt{readFilesIn} list).
 
 Another option for mapping multiple reads files, especially convenient for a very large number of files, is to create a file manifest and supply it in \opt{readFilesManifest} \optv{/path/to/manifest.tsv}.
-The manifest file should contain 3 tab-separated columns, paired-end reads: 
+The manifest file should contain 3 tab-separated columns. For paired-end reads: 
 
 \ofilen{read1-file-name $tab$ read2-file-name $tab$ read-group-line}
 
@@ -211,6 +208,9 @@ \subsubsection{Mapping multiple files in one run.}
 If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it.
 If read-group-line starts with ID:, it can contain several fields separated by $tab$, and all the fields will be copied verbatim into SAM @RG header line.
 
+\subsection{Advanced options.}
+There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}. 
+
 \subsubsection{Using annotations at the mapping stage.}
 Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or without another set of annotations/junctions. In the latter case the new junctions will be added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on the fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into \optvr{\_STARgenome} directory inside the current run directory.
 
@@ -476,9 +476,6 @@ \section{Output in transcript coordinates.}
 
 Note that STAR first aligns reads to the entire genome, and only then searches for concordance between alignments and transcripts. This approach offers certain advantages compared to the alignment to transcriptome only, by not forcing the alignments to annotated transcripts. Note that \opt{outFilterMultimapNmax} filter only applies to genomic alignments. If an alignment passes this filter, it is converted to all possible transcriptomic alignments and all of them are output.
 
-
-
-
 By default, the output satisfies RSEM requirements: soft-clipping or indels are not allowed. Use \opt{quantTranscriptomeBan} \optv{Singleend} to allow insertions, deletions and soft-clips in the transcriptomic alignments, which can be used by some expression quantification software (e.g. eXpress). 
 
 \section{Counting number of reads per gene.}

diff --git a/extras/doc-latex/parametersDefault.tex b/extras/doc-latex/parametersDefault.tex
@@ -926,4 +926,7 @@
   \optOpt{TopCells}   \optOptLine{only report top cells by UMI count, followed by the exact number of cells}
   \optOpt{None}   \optOptLine{do not output filtered cells}
 \end{optOptTable}
+\optName{soloOutFormatFeaturesGeneField3}
+  \optValue{"Gene Expression"}
+  \optLine{string(s):				field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.} 
 \end{optTable}
diff --git a/extras/docker/Dockerfile b/extras/docker/Dockerfile
@@ -2,7 +2,7 @@ FROM debian:stretch-slim
 
 MAINTAINER [email protected]
 
-ARG STAR_VERSION=2.7.4a
+ARG STAR_VERSION=2.7.5a
 
 ENV PACKAGES gcc g++ make wget zlib1g-dev unzip