Remove genes.tsv.gz from mtx format #424
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
36 | ||
37 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -459,7 +459,7 @@ class MetadataSchemaName(Enum): | |
MatrixFormat.MTX.value: """ | ||
<h2>HCA Matrix Service MTX Output</h2> | ||
<p>The mtx-formatted output from the matrix service is a zip archive that contains | ||
- three files:</p> | |
+ four files:</p> | |
<table class="table table-striped table-bordered"> | ||
<thead> | ||
<tr> | ||
|
@@ -477,11 +477,18 @@ class MetadataSchemaName(Enum): | |
<td>Cell metadata</td> | ||
</tr> | ||
<tr> | ||
- <td><directory_name>/genes.tsv.gz</td> | |
+ <td><directory_name>/features.tsv.gz</td> | |
<td>Gene (or transcript) metadata</td> | |
</tr> | |
+ <tr> | |
+ <td><directory_name>/barcodes.tsv.gz</td> | |
+ <td>Cell barcodes</td> | |
+ </tr> | |
</tbody> | |
</table> | |
+ <p>For 10x experiments, this format adheres to the Cell Ranger | |
+ <a href="https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices"> | |
+ feature-barcode matrix</a> specification.</p> | |
|
||
<h3><code>matrix.mtx.gz</code></h3> | ||
<p>This file contains expression values in the | ||
|
@@ -494,8 +501,8 @@ class MetadataSchemaName(Enum): | |
<p>The expression values are meant to be a "raw" count, so for SmartSeq2 experiments, this | ||
is the <code>expected_count</code> field from | ||
<a href="http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html#output">RSEM | ||
- output</a>. For 10x experiments analyzed with Cell Ranger, this is read from the | |
- <code>matrix.mtx</code> file that Cell Ranger produces as its filtered feature-barcode matrix.</p> | |
+ output</a>. For 10x experiments analyzed with Optimus, this is read from the | |
+ <a href="https://zarr.readthedocs.io/en/stable">zarr</a> array produced by the pipeline.</p> | |
|
||
<h3><code>cells.tsv.gz</code></h3> | ||
<p>Each row of the cell metadata table represents a cell, and each column is a different metadata | ||
|
@@ -504,10 +511,14 @@ class MetadataSchemaName(Enum): | |
fields, <code>genes_detected</code> for example, are calculated during secondary analysis. | ||
Full descriptions of those fields are forthcoming.</p> | ||
|
||
- <h3><code>genes.tsv.gz</code></h3> | |
+ <h3><code>features.tsv.gz</code></h3> | |
<p>The gene metadata contains basic information about the genes in the count matrix. | ||
Each row is a gene, and each row corresponds to the same row in the expression mtx file. | ||
Note that <code>featurename</code> is not unique.</p> | ||
|
||
+ <h3><code>barcodes.tsv.gz</code></h3> | |
+ <p>A list of cell barcodes corresponding to the columns found in matrix.mtx.gz. | |
+ Note that barcodes may not be unique.</p> | |
@mckinsel What do you recommend we use instead to ensure uniqueness across projects? Will storing something other than barcodes in this file cause confusion?
||
""" | ||
} | ||
|
||
|
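The documented layout (matrix rows line up with `features.tsv.gz`, matrix columns line up with `barcodes.tsv.gz`, and barcodes may repeat) can be checked with a small sketch like the one below. It is not part of this PR. The function names `summarize_mtx_archive` and `build_demo_archive` are hypothetical, the demo data is invented, and the code assumes that `features.tsv.gz` and `barcodes.tsv.gz` have no header row and that `matrix.mtx.gz` uses the Matrix Market coordinate layout, where the first non-comment line holds the row, column, and non-zero counts.

```python
import collections
import gzip
import io
import zipfile


def summarize_mtx_archive(zip_bytes, directory_name):
    """Return matrix dimensions, label counts, and repeated barcodes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        def read_lines(member):
            raw = archive.read(f"{directory_name}/{member}")
            return gzip.decompress(raw).decode().splitlines()

        matrix_lines = read_lines("matrix.mtx.gz")
        # Assumption: one feature or barcode per line, no header row.
        features = read_lines("features.tsv.gz")
        barcodes = read_lines("barcodes.tsv.gz")

    # Matrix Market: lines starting with '%' are the banner or comments;
    # the first remaining line is "rows cols nonzeros".
    size_line = next(line for line in matrix_lines if not line.startswith("%"))
    n_rows, n_cols, _ = (int(value) for value in size_line.split())

    counts = collections.Counter(barcodes)
    return {
        "matrix_shape": (n_rows, n_cols),
        "n_features": len(features),
        "n_barcodes": len(barcodes),
        "duplicate_barcodes": [b for b, n in counts.items() if n > 1],
    }


def build_demo_archive(directory_name="demo.mtx"):
    """Build a tiny in-memory archive with made-up contents for illustration."""
    members = {
        "matrix.mtx.gz": "%%MatrixMarket matrix coordinate integer general\n"
                         "2 3 2\n1 1 5\n2 3 7\n",
        "features.tsv.gz": "ENSG01\tGeneA\nENSG02\tGeneB\n",
        "barcodes.tsv.gz": "AAAC\nAAAG\nAAAC\n",
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in members.items():
            archive.writestr(f"{directory_name}/{name}", gzip.compress(text.encode()))
    return buffer.getvalue()
```

A caller could compare `matrix_shape` against `(n_features, n_barcodes)` to confirm alignment, and inspect `duplicate_barcodes` to see whether the uniqueness concern raised in review applies to a given archive.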
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -209,11 +209,10 @@ def test_mtx(self, mock_upload_method): | |
# Check the components of the zip file | ||
members = mtx_output.namelist() | ||
self.assertIn("test.mtx/matrix.mtx.gz", members) | ||
- self.assertIn("test.mtx/genes.tsv.gz", members) | |
self.assertIn("test.mtx/cells.tsv.gz", members) | ||
self.assertIn("test.mtx/features.tsv.gz", members) | ||
self.assertIn("test.mtx/barcodes.tsv.gz", members) | ||
- self.assertEqual(len(members), 5) | |
+ self.assertEqual(len(members), 4) | |
|
||
# Read in the cell and gene tables. We need both for mtx files | ||
# since the mtx itself is just numbers and indices. | ||
|
@@ -223,8 +222,19 @@ def test_mtx(self, mock_upload_method): | |
mtx_cells[row["cellkey"]] = row | ||
|
||
mtx_genes = collections.OrderedDict() | ||
- for row in csv.DictReader(io.StringIO(gzip.GzipFile(fileobj=io.BytesIO( | |
- mtx_output.read("test.mtx/genes.tsv.gz"))).read().decode()), delimiter='\t'): | |
+ for row in csv.DictReader( | |
+ io.StringIO(gzip.GzipFile(fileobj=io.BytesIO( | |
+ mtx_output.read("test.mtx/features.tsv.gz"))).read().decode()), | |
+ delimiter='\t', | |
+ fieldnames=["featurekey", | |
Required since
||
+ "featurename", | |
+ "featuretype", | |
+ "featuretype_10x", | |
+ "chromosome", | |
+ "featurestart", | |
+ "featureend", | |
+ "isgene", | |
+ "genus_species"]): | |
mtx_genes[row["featurekey"]] = row | ||
|
||
# Read the expression values. This is supposed to be aligned with | ||
|
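The updated test passes explicit `fieldnames` to `csv.DictReader`. When `fieldnames` is omitted, `DictReader` takes the first row of the file as the header, so a file without a header row would lose its first record. The sketch below shows the same read as a standalone helper. It is not code from this PR: `read_features` and `FEATURE_FIELDS` are hypothetical names, and the sketch assumes (as the explicit field list in the diff suggests) that `features.tsv.gz` has no header row.

```python
import csv
import gzip
import io

# Column order copied from the fieldnames list in the updated test.
FEATURE_FIELDS = [
    "featurekey", "featurename", "featuretype", "featuretype_10x",
    "chromosome", "featurestart", "featureend", "isgene", "genus_species",
]


def read_features(gz_bytes, fieldnames=FEATURE_FIELDS):
    """Parse a gzipped, tab-separated feature table keyed by featurekey.

    Assumption: the file has no header row, so column names are supplied
    explicitly instead of being read from the first line.
    """
    text = gzip.decompress(gz_bytes).decode()
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", fieldnames=fieldnames)
    # Plain dicts keep insertion order, so rows stay in file order.
    return {row["featurekey"]: row for row in reader}
```

Keying by `featurekey` rather than `featurename` follows the documentation's note that `featurename` is not unique.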
Does this content need to be updated somewhere else too?