[builder] schema 4.0 #872

Merged
merged 18 commits into from
Dec 21, 2023
33 changes: 26 additions & 7 deletions docs/cellxgene_census_schema.md
@@ -1,8 +1,8 @@
# CZ CELLxGENE Discover Census Schema

**Version**: 1.2.0
**Version**: 1.3.0

**Last edited**: Sept, 2023.
**Last edited**: December, 2023.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [BCP 14](https://tools.ietf.org/html/bcp14), [RFC2119](https://www.rfc-editor.org/rfc/rfc2119.txt), and [RFC8174](https://www.rfc-editor.org/rfc/rfc8174.txt) when, and only when, they appear in all capitals, as shown here.

@@ -339,7 +339,7 @@ An example of this `SOMADataFrame` is shown below:
<tbody>
<tr>
<td>census_schema_version</td>
<td>1.2.0</td>
<td>1.3.0</td>
</tr>
<tr>
<td>census_build_date</td>
@@ -381,10 +381,15 @@ All datasets used to build the Census MUST be included in a table modeled as a `
</tr>
</thead>
<tbody>
<tr>
<td>citation</td>
<td>string</td>
<td>As defined in the CELLxGENE schema.</td>
</tr>
<tr>
<td>collection_id</td>
<td>string</td>
<td rowspan="5">As defined in CELLxGENE Discover <a href="https://api.cellxgene.cziscience.com/curation/ui/">data schema</a> (see &quot;Schemas&quot; section for field definitions).</td>
<td rowspan="6">As defined in CELLxGENE Discover <a href="https://api.cellxgene.cziscience.com/curation/ui/">data schema</a> (see &quot;Schemas&quot; section for field definitions).</td>
</tr>
<tr>
<td>collection_name</td>
@@ -719,7 +724,9 @@ Per the CELLxGENE dataset schema, [all RNA assays MUST include UMI or read count
This is an experimental data artifact - it may be removed at any time.

A library-size normalized layer, containing a normalized variant of the count (raw) matrix.
For a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
For Smart-Seq assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
as `normalized[i,j] = (X[i,j] / var[j].feature_length) / sum(X[i, ] / var.feature_length)`.
For all other assays, for a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
as `normalized[i,j] = X[i,j] / sum(X[i, ])`.
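
The two definitions above can be illustrated with a minimal pure-Python sketch for a single cell (one row of the raw counts matrix). The function name `library_size_normalize` and its arguments are hypothetical and are not part of the Census builder. The sketch assumes that every `feature_length` is positive and that an all-zero row stays all zeros.

```python
from typing import List, Optional, Sequence


def library_size_normalize(
    row: Sequence[float],
    feature_length: Optional[Sequence[int]] = None,
) -> List[float]:
    """Normalize one row X[i, ] of the raw counts matrix.

    With feature_length (the Smart-Seq case), each count is divided by its
    gene length before the row is rescaled. Without it (all other assays),
    the raw counts are rescaled directly.
    """
    if feature_length is None:
        scaled = [float(count) for count in row]
    else:
        # Assumption: every length is > 0, otherwise this division fails.
        scaled = [count / length for count, length in zip(row, feature_length)]

    total = sum(scaled)
    if total == 0:
        # Assumption: an empty cell is returned as zeros instead of dividing by zero.
        return [0.0] * len(scaled)
    return [value / total for value in scaled]


# Non-Smart-Seq assay: plain division by the row sum.
print(library_size_normalize([1, 3]))
# Smart-Seq assay: counts are length-corrected first.
print(library_size_normalize([2, 4], feature_length=[1, 2]))
```

In both cases the normalized values of a non-empty row sum to 1, which is what makes the layer comparable across cells with different sequencing depth.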

#### Feature metadata – `census_obj["census_data"][organism].ms["RNA"].var` – `SOMADataFrame`
@@ -752,7 +759,7 @@ The following columns MUST be included:
<tr>
<td>feature_length</td>
<td>int</td>
<td>Gene length in base pairs derived from the <a href="https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.1.0/schema.md#required-gene-annotations">gene reference files from the CELLxGENE dataset schema</a>.</td>
<td>As defined in CELLxGENE dataset schema.</td>
</tr>
<tr>
<td>nnz</td>
@@ -838,7 +845,7 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
</tr>
<tr>
<td>assay_ontology_term_id</td>
<td colspan="2" rowspan="17">As defined in CELLxGENE dataset schema</td>
<td colspan="2" rowspan="19">As defined in CELLxGENE dataset schema</td>
</tr>
<tr>
<td>assay</td>
@@ -867,6 +874,9 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
<tr>
<td>is_primary_data</td>
</tr>
<tr>
<td>observation_joinid</td>
</tr>
<tr>
<td>self_reported_ethnicity_ontology_term_id</td>
</tr>
@@ -888,6 +898,9 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
<tr>
<td>tissue</td>
</tr>
<tr>
<td>tissue_type</td>
</tr>
<tr>
<td>nnz</td>
<td>int64</td>
@@ -918,6 +931,12 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:

## Changelog

### Version 1.3.0

* Update to require [CELLxGENE schema version 4.0.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md)
* Adds `citation` to "Census table of CELLxGENE Discover datasets – `census_obj["census_info"]["datasets"]`"
* Adds `observation_joinid` and `tissue_type` to `obs` dataframe

### Version 1.2.0

* Update to require [CELLxGENE schema version 3.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.1.0/schema.md)
24 changes: 12 additions & 12 deletions tools/cellxgene_census_builder/pyproject.toml
@@ -26,26 +26,26 @@ classifiers = [
"Programming Language :: Python :: 3.11",
]
dependencies= [
"typing_extensions==4.8.0",
"pyarrow==13.0.0",
"pandas[performance]==2.0.3",
"anndata==0.9",
"typing_extensions==4.9.0",
"pyarrow==14.0.1",
"pandas[performance]==2.1.4",
"anndata==0.10.3",
"numpy==1.23.5",
# IMPORTANT: consider TileDB format compat before advancing this version. It is important that
# IMPORTANT: the tiledbsoma version lags the version used in the cellxgene-census package.
"tiledbsoma==1.4.4",
"cellxgene-census==1.6.0",
"scipy==1.10.1", # cellxgene-census==1.5.1 forces scipy<1.11
"fsspec==2023.9.2",
"s3fs==2023.9.2",
"tiledbsoma==1.6.1",
"cellxgene-census==1.9.1",
"scipy==1.11.4",
"fsspec==2023.12.2",
"s3fs==2023.12.2",
"requests==2.31.0",
"aiohttp==3.9.0",
"aiohttp==3.9.1",
"Cython", # required by owlready2
"wheel", # required by owlready2
"owlready2==0.44",
"gitpython==3.1.37",
"gitpython==3.1.40",
"attrs==23.1.0",
"psutil==5.9.5",
"psutil==5.9.6",
"pyyaml==6.0.1",
"numba==0.56.4",
]
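
The `IMPORTANT` comment above asks that the builder's `tiledbsoma` pin not run ahead of the version used by the `cellxgene-census` package. A rough way to check this mechanically is sketched below. The helper names are hypothetical, and the version that `cellxgene-census` itself uses is not stated in this diff, so it is passed in as a parameter. The comparison handles plain dotted numeric versions only; a real check would more likely use a proper version parser.

```python
from typing import Dict, List, Tuple


def parse_pins(dependencies: List[str]) -> Dict[str, str]:
    """Map package name -> pinned version for every 'name==version' entry."""
    pins: Dict[str, str] = {}
    for requirement in dependencies:
        if "==" not in requirement:
            continue  # unpinned entry such as "Cython" or "wheel"
        name, version = requirement.split("==", 1)
        # Drop extras such as "[performance]" from the package name.
        pins[name.split("[", 1)[0].strip()] = version.strip()
    return pins


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.6.1' into (1, 6, 1). Assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def builder_lags_or_matches(builder_version: str, package_version: str) -> bool:
    """True if the builder's pin is not newer than the package's version."""
    return version_tuple(builder_version) <= version_tuple(package_version)


pins = parse_pins(["pandas[performance]==2.1.4", "tiledbsoma==1.6.1", "Cython"])
# "1.6.1" on the right is a placeholder for whatever cellxgene-census uses.
print(builder_lags_or_matches(pins["tiledbsoma"], "1.6.1"))
```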
Original file line number Diff line number Diff line change
@@ -42,7 +42,7 @@ def open_anndata(
# These are schema versions this code is known to work with. This is a
# sanity check, which would be better implemented via a unit test at
# some point in the future.
assert CXG_SCHEMA_VERSION in ["3.1.0", "3.0.0"]
assert CXG_SCHEMA_VERSION in ["4.0.0"]

if h5ad.schema_version == "":
h5ad.schema_version = get_cellxgene_schema_version(ad)
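
The code comment in this hunk notes that the schema-version sanity check would be better implemented as a unit test. One possible shape for such a test is sketched here. `SUPPORTED_CXG_SCHEMA_VERSIONS` and `is_supported_schema_version` are hypothetical names, and the literal `"4.0.0"` stands in for the builder's `CXG_SCHEMA_VERSION` constant.

```python
from typing import Tuple

# Hypothetical constant: the schema versions the builder code is known to work with.
SUPPORTED_CXG_SCHEMA_VERSIONS: Tuple[str, ...] = ("4.0.0",)


def is_supported_schema_version(
    version: str,
    supported: Tuple[str, ...] = SUPPORTED_CXG_SCHEMA_VERSIONS,
) -> bool:
    """Return True if the given CELLxGENE schema version is explicitly supported."""
    return version in supported


def test_builder_schema_version_is_supported() -> None:
    # Stand-in for CXG_SCHEMA_VERSION imported from the builder's globals.
    cxg_schema_version = "4.0.0"
    assert is_supported_schema_version(cxg_schema_version)
    # Versions accepted before this change should no longer pass.
    assert not is_supported_schema_version("3.1.0")
    assert not is_supported_schema_version("3.0.0")
```

Moving the check into a test would keep `open_anndata` free of runtime assertions while still failing the test run if the schema version is bumped without review.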
@@ -80,6 +80,7 @@ def open_anndata(
# TODO - these should be looked up in the ontology
raw_var["feature_name"] = "unknown"
raw_var["feature_reference"] = "unknown"
raw_var["feature_length"] = 0
var = pd.concat([ad.var, raw_var])
else:
var = ad.raw.var
@@ -96,7 +97,7 @@ def open_anndata(
not isinstance(X, (sparse.csr_matrix, sparse.csc_matrix)) or X.has_canonical_format
), f"Found H5AD with non-canonical X matrix in {path}"

ad = anndata.AnnData(X=X if need_X else None, obs=ad.obs, var=var, raw=None, uns=ad.uns, dtype=np.float32)
ad = anndata.AnnData(X=X if need_X else None, obs=ad.obs, var=var, raw=None, uns=ad.uns)
assert not need_X or ad.X.shape == (len(ad.obs), len(ad.var))

# TODO: In principle, we could look up missing feature_name, but for now, just assert they exist
@@ -154,7 +155,7 @@ def _filter(ad: anndata.AnnData, need_X: Optional[bool] = True) -> anndata.AnnDa
assert ad.raw is None

# This discards all other ancillary state, eg, obsm/varm/....
ad = anndata.AnnData(X=X, obs=obs, var=var, dtype=np.float32)
ad = anndata.AnnData(X=X, obs=obs, var=var)

assert (
X is None or isinstance(X, np.ndarray) or X.has_canonical_format
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@
import pyarrow as pa
import tiledbsoma as soma

from .globals import CENSUS_DATASETS_COLUMNS, CENSUS_DATASETS_NAME
from .globals import CENSUS_DATASETS_NAME, CENSUS_DATASETS_TABLE_SPEC

T = TypeVar("T", bound="Dataset")

@@ -25,6 +25,7 @@ class Dataset:

# Optional - as reported by REST API
dataset_title: str = "" # CELLxGENE dataset title
citation: str = "" # CELLxGENE citation
collection_id: str = "" # CELLxGENE collection id
collection_name: str = "" # CELLxGENE collection name
collection_doi: str = "" # CELLxGENE collection doi
@@ -69,14 +70,14 @@ def create_dataset_manifest(info_collection: soma.Collection, datasets: List[Dat
"""
logging.info("Creating dataset_manifest")
manifest_df = Dataset.to_dataframe(datasets)
manifest_df = manifest_df[CENSUS_DATASETS_COLUMNS + ["soma_joinid"]]
manifest_df = manifest_df[list(CENSUS_DATASETS_TABLE_SPEC.field_names())]
if len(manifest_df) == 0:
return

schema = CENSUS_DATASETS_TABLE_SPEC.to_arrow_schema(manifest_df)

# write to a SOMA dataframe
with info_collection.add_new_dataframe(
CENSUS_DATASETS_NAME,
schema=pa.Schema.from_pandas(manifest_df, preserve_index=False),
index_column_names=["soma_joinid"],
CENSUS_DATASETS_NAME, schema=schema, index_column_names=["soma_joinid"]
) as manifest:
manifest.write(pa.Table.from_pandas(manifest_df, preserve_index=False))