Read for 2.7.5a
alexdobin committed Jun 16, 2020
1 parent 502dd0c commit e67d668
Showing 12 changed files with 88 additions and 27 deletions.
9 changes: 7 additions & 2 deletions CHANGES.md
@@ -1,13 +1,18 @@
STAR 2.7.5a 2020/06/16
======================
**Major new features:**
* Implemented STARsolo quantification for Smart-seq with --soloType SmartSeq option.
* Implemented --readFilesManifest option to input a list of input read files.

**Minor features and bug fixes:**
* Change in STARsolo SJ output behavior: junctions are output even if reads do not match genes.
* Fixed a bug with solo SJ output for large genomes.
* N characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq).
* N-characters in --soloAdapterSequence are not counted as mismatches, allowing for multiple adapters (e.g. ddSeq).
* SJ.out.tab is sym-linked as features.tsv for Solo SJ output.
* Issue #882: 3rd field is now optional in Solo Gene features.tsv with --soloOutFormatFeaturesGeneField3.
* Issue #883: Patch for FreeBSD in SharedMemory and Makefile improvements.
* Issue #902: Fixed seg-fault for STARsolo CB/UB SAM attributes output with --soloFeatures GeneFull --outSAMunmapped Within options.
* Issue #934: Fixed a problem with annotated junctions that was casuing very rare seg-faults.
* Issue #934: Fixed a problem with annotated junctions that was causing very rare seg-faults.
* Issue #936: Throw an error if an empty whitelist is provided to STARsolo.

STAR 2.7.4a 2020/06/01
6 changes: 3 additions & 3 deletions README.md
@@ -35,9 +35,9 @@ Download the latest [release from](https://github.com/alexdobin/STAR/releases) a

```bash
# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.7.4a.tar.gz
tar -xzf 2.7.4a.tar.gz
cd STAR-2.7.4a
wget https://github.com/alexdobin/STAR/archive/2.7.5a.tar.gz
tar -xzf 2.7.5a.tar.gz
cd STAR-2.7.5a

# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git
24 changes: 23 additions & 1 deletion RELEASEnotes.md
@@ -1,5 +1,27 @@
STAR 2.7.5a 2020/06/16
======================
**Major new features:
~ support for Plate-based (Smart-seq) scRNA-seq
~ manifest file to list the input reads FASTQ files**

* Typical STAR command for mapping and quantification of plate-based (Smart-seq) scRNA-seq will look like:
```
--soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded
```
For detailed description, see [Plate-based (Smart-seq) scRNA-seq](docs/STARsolo.md#plate-based-Smart-seq-scRNA-seq)

* The convenient way to list a large number of reads FASTQ files and their IDs is to create a file manifest and supply it in `--readFilesManifest /path/to/manifest.tsv`. The manifest file should contain 3 tab-separated columns. For paired-end reads:
```
Read1-file-name \t Read2-file-name \t File-id
```
For single-end reads, the 2nd column should contain the dash - :
```
Read1-file-name \t - \t File-id
```
File-id can be any string without spaces. File-id will be added as ReadGroup tag (*RG:Z:*) for each read in the SAM/BAM output. If File-id starts with *ID:*, it can contain several fields separated by tab, and all the fields will be copied verbatim into SAM *@RG* header line.
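
The manifest layout described above can be generated with a few lines of Python. This is a minimal sketch, not part of STAR: the helper name `manifest_line` and the example file names are hypothetical, and the whitespace check is only applied to plain File-ids (a File-id starting with `ID:` is passed through unchanged, on the assumption that it already holds tab-separated `@RG` fields).

```python
def manifest_line(read1, file_id, read2=None):
    """Return one manifest line: Read1 <tab> Read2-or-dash <tab> File-id."""
    # Single-end reads carry a literal dash in the 2nd column.
    mate = read2 if read2 is not None else "-"
    # A plain File-id must not contain spaces; an "ID:" line is not checked
    # here because it may hold several tab-separated fields.
    if not file_id.startswith("ID:") and any(ch.isspace() for ch in file_id):
        raise ValueError("File-id must not contain whitespace: %r" % file_id)
    return "\t".join([read1, mate, file_id])


def write_manifest(path, entries):
    """Write (read1, read2_or_None, file_id) tuples to a manifest file."""
    with open(path, "w") as handle:
        for read1, read2, file_id in entries:
            handle.write(manifest_line(read1, file_id, read2) + "\n")
```

For example, `write_manifest("manifest.tsv", [("a_1.fq", "a_2.fq", "cellA"), ("b.fq", None, "cellB")])` writes one paired-end and one single-end entry; the resulting path is then passed to `--readFilesManifest`.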


STAR 2.7.3a 2020/06/01
STAR 2.7.4a 2020/06/01
======================
This release fixes multiple bugs and issues.
The biggest issue fixed was a seg-fault for small genomes, which previously required scaling down `--genomeSAindexNbases`. Such scaling is still recommended but is no longer required.
Binary file modified bin/Linux_x86_64/STAR
Binary file not shown.
Binary file modified bin/Linux_x86_64/STARlong
Binary file not shown.
Binary file modified bin/Linux_x86_64_static/STAR
Binary file not shown.
Binary file modified bin/Linux_x86_64_static/STARlong
Binary file not shown.
Binary file modified doc/STARmanual.pdf
Binary file not shown.
58 changes: 46 additions & 12 deletions docs/STARsolo.md
@@ -1,6 +1,11 @@
STARsolo: mapping, demultiplexing and quantification for single cell RNA-seq
**STARsolo**: mapping, demultiplexing and quantification for single cell RNA-seq
=================================================================================

Major updates in STAR 2.7.5a (2020/06/16)
---------------------------------------
* [**Smart-seq scRNA-seq process:**](#plate-based-Smart-seq-scRNA-seq)
* STARsolo now supports the plate-based (a.k.a. Smart-seq) scRNA-seq technologies.

Major updates in STAR 2.7.3a (Oct 8 2019)
-----------------------------------------
* **Output enhancements:**
@@ -169,9 +174,9 @@ Basic cell filtering
* Recent versions of CellRanger switched to more advanced filtering done with the EmptyDrop tool developed by [Lun et al](https://doi.org/10.1186/s13059-019-1662-y). To obtain filtered counts similar to recent CellRanger versions, we need to run this tool on **raw** STARsolo output
------------------
---------------------------------------------------
Quantification of different transcriptomic features
-----------------------
---------------------------------------------------
* In addition to the gene counts (default), STARsolo can calculate counts for other transcriptomic features:
* pre-mRNA counts, useful for single-nucleus RNA-seq. This counts all reads that overlap gene loci, i.e. includes both exonic and intronic reads:
```
@@ -209,19 +214,46 @@ BAM tags
--outSAMtype BAM SortedByCoordinate
```
--------------------------------
Different scRNA-seq technologies
--------------------------------
### Plate-based (Smart-seq) scRNA-seq
Plate-based (Smart-seq) scRNA-seq technologies produce separate FASTQ files for each cell. Cell barcodes are not incorporated in the read sequences, and there are no UMIs. A typical STAR command for mapping and quantification of these files will look like:
```
--soloType SmartSeq --readFilesManifest /path/to/manifest.tsv --soloUMIdedup Exact --soloStrand Unstranded
```
* STARsolo `--soloType SmartSeq` option produces cell/gene (and other [features](#quantification-of-different-transcriptomic-features))
count matrices, using rules similar to the droplet-based technologies. The differences are (i) individual cells correspond to different FASTQ files, there are no Cell Barcode sequences, and "Cell IDs" have to be provided as input; (ii) there are no UMI sequences, but reads can be deduplicated if they have identical start/end coordinates.
* The convenient way to list all the FASTQ files and Cell IDs is to create a file manifest and supply it in `--readFilesManifest /path/to/manifest.tsv`. The manifest file should contain 3 tab-separated columns. For paired-end reads:
```
Read1-file-name \t Read2-file-name \t Cell-id
```
For single-end reads, the 2nd column should contain the dash - :
```
Read1-file-name \t - \t Cell-id
```
Cell-id can be any string without spaces. Cell-id will be added as ReadGroup tag (*RG:Z:*) for each read in the SAM/BAM output. If Cell-id starts with *ID:*, it can contain several fields separated by tab, and all the fields will be copied verbatim into SAM *@RG* header line.
* Deduplication based on read start/end coordinates can be done with `--soloUMIdedup Exact` option. To avoid deduplication (e.g. for single-end reads) use `--soloUMIdedup NoDedup`. Both deduplication options can be used together `--soloUMIdedup Exact NoDedup` and will produce two columns in the *matrix.mtx* output.
* Common Smart-seq protocols are unstranded and thus will require the `--soloStrand Unstranded` option. If your protocol is stranded, you can choose the proper `--soloStrand Forward` (default) or `--soloStrand Reverse` option.
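
The difference between `--soloUMIdedup Exact` and `--soloUMIdedup NoDedup` for Smart-seq data can be illustrated with a small counting sketch. This is a simplified model rather than STAR's implementation: the `(cell_id, gene, start, end)` tuple format and the function name `count_smartseq` are hypothetical, and the sketch ignores soft-clip extension, strand, and multi-mapping reads.

```python
from collections import defaultdict


def count_smartseq(alignments):
    """Count reads per (cell, gene) with and without coordinate deduplication.

    alignments: iterable of (cell_id, gene, start, end) tuples.
    Returns {(cell_id, gene): (exact_count, nodedup_count)}.
    """
    all_reads = defaultdict(int)      # NoDedup: every read is counted
    unique_coords = defaultdict(set)  # Exact: identical start/end collapse
    for cell_id, gene, start, end in alignments:
        key = (cell_id, gene)
        all_reads[key] += 1
        unique_coords[key].add((start, end))
    return {key: (len(unique_coords[key]), all_reads[key]) for key in all_reads}
```

Two reads from the same cell and gene with identical start and end coordinates contribute one count under the "Exact" rule and two counts under "NoDedup", which mirrors the two columns produced when both options are given together.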
-------------------------------------------------------------
------------------------------------------------------
--------------------------------------------------
For completenes, all parameters that control STARsolo output are listed again below with defaults and short descriptions:
All parameters that control STARsolo output are listed again below with defaults and short descriptions:
---------------------------------------
```
soloType None
string(s): type of single-cell RNA-seq
CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium
CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
CB_UMI_Complex ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
CB_samTagOut ... output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
SmartSeq ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)

soloCBwhitelist -
string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with
string(s): file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.
None ... no whitelist: all cell barcodes are allowed

soloCBstart 1
int>0: cell barcode start base
@@ -243,9 +275,9 @@ soloBarcodeReadLength 1
soloCBposition -
string(s): position of Cell Barcode(s) on the barcode read.
Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Distance is the distance from the CB start(end) to the Anchor base
Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base
Strings for different barcodes are separated by spaces.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8
@@ -281,13 +313,13 @@ soloFeatures Gene
Gene ... genes: reads match the gene transcript
SJ ... splice junctions: reported in SJ.out.tab
GeneFull ... full genes: count all reads overlapping genes' exons and introns
Transcript3p ... quantification of transcript for 3' protocols

soloUMIdedup 1MM_All
string(s): type of UMI deduplication (collapsing) algorithm
1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
1MM_Directional ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
Exact ... only exactly matching UMIs are collapsed
NoDedup ... no deduplication of UMIs, count all reads. Allowed for --soloType SmartSeq

soloUMIfiltering -
string(s) type of UMI filtering
@@ -300,8 +332,10 @@ soloOutFileNames Solo.out/ features.tsv barcodes.tsv

soloCellFilter CellRanger2.2 3000 0.99 10
string(s): cell filtering type and parameters
CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells ... only report top cells by UMI count, followed by the excat number of cells
CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells ... only report top cells by UMI count, followed by the exact number of cells
None ... do not output filtered cells

soloOutFormatFeaturesGeneField3 "Gene Expression"
string(s): field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.
```
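
The `--soloCBposition` format listed above (`startAnchor_startPosition_endAnchor_endPosition`, one string per barcode, separated by spaces) can be parsed as follows. This is a sketch for illustration only: the names `CBPosition`, `parse_cb_position` and `parse_cb_positions` are hypothetical, and the code only splits and validates the fields without resolving them to read coordinates.

```python
from typing import NamedTuple

# Anchor codes as given in the parameter description.
ANCHORS = {0: "read start", 1: "read end", 2: "adapter start", 3: "adapter end"}


class CBPosition(NamedTuple):
    start_anchor: int
    start_position: int
    end_anchor: int
    end_position: int


def parse_cb_position(spec):
    """Parse one 'startAnchor_startPosition_endAnchor_endPosition' string."""
    fields = spec.split("_")
    if len(fields) != 4:
        raise ValueError("expected 4 underscore-separated fields: %r" % spec)
    # Positions may be negative (e.g. -1), so int() is applied to each field.
    start_anchor, start_pos, end_anchor, end_pos = (int(f) for f in fields)
    if start_anchor not in ANCHORS or end_anchor not in ANCHORS:
        raise ValueError("anchor must be one of 0, 1, 2, 3: %r" % spec)
    return CBPosition(start_anchor, start_pos, end_anchor, end_pos)


def parse_cb_positions(argument):
    """Parse the full option value, one string per barcode."""
    return [parse_cb_position(spec) for spec in argument.split()]
```

Applied to the inDrop example value `0_0_2_-1 3_1_3_8`, the first barcode runs from position 0 relative to the read start to position -1 relative to the adapter start, and the second from position 1 to position 8 relative to the adapter end.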
13 changes: 5 additions & 8 deletions extras/doc-latex/STARmanual.tex
@@ -179,10 +179,7 @@ \subsection{Basic options.}

\end{itemize}

\subsection{Advanced options.}
There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}.

\subsubsection{Mapping multiple files in one run.}
\subsection{Mapping multiple files in one run.}
Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that distinct read groups can be used in \opt{outSAMattrRGline} command to keep track of reads from different files. For single-end reads use a comma separated list (no spaces around commas), e.g.:

\opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq}
@@ -199,7 +196,7 @@ \subsubsection{Mapping multiple files in one run.}
Note that this list is separated by commas surrounded by spaces (unlike \opt{readFilesIn} list).

Another option for mapping multiple reads files, especially convenient for a very large number of files, is to create a file manifest and supply it in \opt{readFilesManifest} \optv{/path/to/manifest.tsv}.
The manifest file should contain 3 tab-separated columns, paired-end reads:
The manifest file should contain 3 tab-separated columns. For paired-end reads:

\ofilen{read1-file-name $tab$ read2-file-name $tab$ read-group-line}

@@ -211,6 +208,9 @@ \subsubsection{Mapping multiple files in one run.}
If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it.
If read-group-line starts with ID:, it can contain several fields separated by $tab$, and all the fields will be copied verbatim into SAM @RG header line.

\subsection{Advanced options.}
There are many advanced options that control STAR mapping behavior. All options are briefly described in the Section \sechyperref{Description_of_all_options}.

\subsubsection{Using annotations at the mapping stage.}
Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or without another set of annotations/junctions. In the latter case the new junctions will be added to the old ones. STAR will insert the junctions into genome indices on the fly before mapping, which takes 1~2 minutes. The on-the-fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into the \optvr{\_STARgenome} directory inside the current run directory.

@@ -476,9 +476,6 @@ \section{Output in transcript coordinates.}

Note that STAR first aligns reads to the entire genome, and only then searches for concordance between alignments and transcripts. This approach offers certain advantages compared to alignment to the transcriptome only, by not forcing the alignments to annotated transcripts. Note that the \opt{outFilterMultimapNmax} filter only applies to genomic alignments. If an alignment passes this filter, it is converted to all possible transcriptomic alignments and all of them are output.




By default, the output satisfies RSEM requirements: soft-clipping or indels are not allowed. Use \opt{quantTranscriptomeBan} \optv{Singleend} to allow insertions, deletions and soft-clips in the transcriptomic alignments, which can be used by some expression quantification software (e.g. eXpress).

\section{Counting number of reads per gene.}
3 changes: 3 additions & 0 deletions extras/doc-latex/parametersDefault.tex
@@ -926,4 +926,7 @@
\optOpt{TopCells} \optOptLine{only report top cells by UMI count, followed by the exact number of cells}
\optOpt{None} \optOptLine{do not output filtered cells}
\end{optOptTable}
\optName{soloOutFormatFeaturesGeneField3}
\optValue{"Gene Expression"}
\optLine{string(s): field 3 in the Gene features.tsv file. If "-", then no 3rd field is output.}
\end{optTable}
2 changes: 1 addition & 1 deletion extras/docker/Dockerfile
@@ -2,7 +2,7 @@ FROM debian:stretch-slim

MAINTAINER [email protected]

ARG STAR_VERSION=2.7.4a
ARG STAR_VERSION=2.7.5a

ENV PACKAGES gcc g++ make wget zlib1g-dev unzip
