[builder] schema 4.0 #872

bkmartinjr · 2023-12-03T18:20:23Z

Schema 4.0.0 support in both builder and Census schema. Changes:

Bump Census version to 1.3.0
Update CxG H5AD schema version number to 4.0.0. Fixes #749: Schema: Update dataset (H5AD) schema version from 3.1.0 to 4.0.0 in Census schema. Fixes #750: Builder: Update dataset (H5AD) schema version from 3.1.0 to 4.0.0 in Census builder.
Add tissue_type to obs. Fixes #747: Add obs["tissue_type] to schema. Fixes #748: Builder must implement obs["tissue_type"].
Add observation_joinid to obs. Fixes #610: Add observation_joinid to schema. Fixes #611: Builder must implement observation_joinid.
Add citation to datasets table. Fixes #559: Add citation to schema. Fixes #560: The Census builder must add citation to Census data.
Use H5AD for gene feature_length values. Fixes #8: Modify gene_length description in schema when those are available in h5ads. Fixes #558: The cell census builder must not add gene_lengths when those are available in h5ads.
Update dependency pins to more recent versions of upstream deps
Remove (now) obsolete AnnData/Arrow bug work-around

codecov · 2023-12-03T19:30:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7cca895) 86.81% compared to head (409b86e) 86.76%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #872      +/-   ##
==========================================
- Coverage   86.81%   86.76%   -0.05%     
==========================================
  Files          72       72              
  Lines        5255     5228      -27     
==========================================
- Hits         4562     4536      -26     
+ Misses        693      692       -1

Flag	Coverage Δ
unittests	`86.76% <100.00%> (-0.05%)`	⬇️
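
The percentages in the report follow directly from the Hits and Misses rows. A minimal sketch of that arithmetic, assuming coverage is computed as hits / (hits + misses); the helper name `coverage_pct` is illustrative and not part of Codecov:

```python
def coverage_pct(hits: int, misses: int) -> float:
    """Line coverage as a percentage of all coverable lines."""
    return 100.0 * hits / (hits + misses)


# Numbers taken from the coverage diff above.
base = coverage_pct(hits=4562, misses=693)  # base (7cca895): 5255 lines
head = coverage_pct(hits=4536, misses=692)  # head (409b86e): 5228 lines
delta = head - base

print(f"{base:.2f}% -> {head:.2f}% ({delta:+.2f})")
# prints "86.81% -> 86.76% (-0.05)"
```

The small drop comes from removing 27 lines, 26 of which were covered, rather than from new untested code.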

ebezzi

Few typos/nits, but LGTM otherwise. Thanks!

ebezzi · 2023-12-19T18:56:23Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/experiment_builder.py

@@ -370,7 +335,7 @@ def populate_presence_matrix(self, datasets: List[Dataset]) -> None:
            max_dataset_joinid = max(d.soma_joinid for d in datasets)

            # LIL is fast way to create spmatrix
-            pm = sparse.lil_array((max_dataset_joinid + 1, self.n_var), dtype=bool)
+            pm = sparse.lil_matrix((max_dataset_joinid + 1, self.n_var), dtype=bool)


Are those aliases? The doc pages for both look the same.

bkmartinjr · Dec 19, 2023

They are almost aliases, but not quite: they have slightly different semantics, and the _array variant is preferred for new code. See the huge note at the top of the docs here.

The difference doesn't matter for this use case, so let's stick with the modern API.

atolopko-czi · Dec 20, 2023

> The difference doesn't matter for this use case, so let's stick with the modern API.

But this change is going back to the older _matrix, no?

bkmartinjr · Dec 20, 2023

Yes, correct: for consistency with the rest of the builder, which uses _matrix. The difference is inconsequential in this case (see doc link). Historically, single-cell code has always used _matrix because community use (e.g., AnnData) predates the arrival of _array.
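
To illustrate the point made in this thread, here is a small sketch (not the builder's code) that fills a boolean presence-style matrix through both `lil_array` and `lil_matrix` and then shows one place where the two families differ, the `*` operator. The helper `build_presence` and the sample coordinates are assumptions made for the example.

```python
import numpy as np
from scipy import sparse


def build_presence(container, n_rows, n_cols, hits):
    """Fill a boolean LIL container with True at each (row, col), return CSR."""
    pm = container((n_rows, n_cols), dtype=bool)
    for row, col in hits:
        pm[row, col] = True
    return pm.tocsr()


# Hypothetical (dataset_joinid, var_joinid) pairs, for illustration only.
hits = [(0, 1), (2, 0), (2, 3)]
as_array = build_presence(sparse.lil_array, 3, 4, hits)
as_matrix = build_presence(sparse.lil_matrix, 3, 4, hits)

# For "set some cells to True", both containers hold the same data.
same_pattern = np.array_equal(as_array.toarray(), as_matrix.toarray())

# The semantics diverge for `*`: element-wise for *_array,
# matrix multiplication for *_matrix.
dense = np.array([[1, 2], [3, 4]])
elementwise = (sparse.csr_array(dense) * sparse.csr_array(dense)).toarray()
matmul = (sparse.csr_matrix(dense) * sparse.csr_matrix(dense)).toarray()

print(same_pattern)
print(elementwise.tolist())
print(matmul.tolist())
```

Because the presence matrix is only populated by assignment, the operator difference never comes into play there, which matches the "inconsequential" assessment above.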

ebezzi · 2023-12-19T18:59:10Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/globals.py

@@ -134,44 +137,29 @@
    "tissue_ontology_term_id",
    "tissue_general",
    "tissue_general_ontology_term_id",
+    "tissue_type",
+]
+_NonRepeatitiveStringObs = [


Typo: this should be _NonRepetitiveStringObs and the other variable _RepetitiveStringLabelObs (the spelling isn't consistent).

bkmartinjr · Dec 19, 2023

This code all goes away in #896, which is stacked on top of this PR. So I'd propose to ignore this issue in favor of the changes in that PR. LMK if you are not OK with that.

ebezzi · Dec 19, 2023

Totally OK. 🚀

pablo-gar

Spot-checking the data build provided by @bkmartinjr did not reveal any noticeable errors:

All fields were added correctly.
data_type matches expectation for one dataset.
feature_length matches previous builds.
citation was properly added.
observation_joinid: no checks beyond existence.
Schema version number is correctly updated in census_info.

See notebook here https://colab.research.google.com/drive/1tX_M1BW9ai_4joXOXlAmuthJVCUdIsVR
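
The "all fields were added" part of a spot check like this can be sketched in a few lines. This is an illustrative sketch only, not the contents of the linked notebook: the helper `missing_columns` and the sample column lists are assumptions, and only the field names tissue_type, observation_joinid, and citation come from this PR.

```python
# Fields this PR adds, per the PR description.
NEW_OBS_COLUMNS = {"tissue_type", "observation_joinid"}
NEW_DATASETS_COLUMNS = {"citation"}


def missing_columns(actual, expected):
    """Return the expected column names that are absent from `actual`."""
    return sorted(set(expected) - set(actual))


# Illustrative column listings; in practice these would be read from a build.
obs_columns = ["soma_joinid", "dataset_id", "tissue", "tissue_type", "observation_joinid"]
datasets_columns = ["soma_joinid", "dataset_id", "dataset_title"]

print(missing_columns(obs_columns, NEW_OBS_COLUMNS))
# prints []
print(missing_columns(datasets_columns, NEW_DATASETS_COLUMNS))
# prints ['citation']
```

An empty list means every expected field is present; any names returned point at fields the build failed to add.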

Schema doc changes look good!

atolopko-czi

LGTM.

Though inconsequential, I'm not clear if we really want scipy *_matrix or *_array types.

atolopko-czi · 2023-12-20T14:07:35Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/experiment_builder.py

@@ -370,7 +335,7 @@ def populate_presence_matrix(self, datasets: List[Dataset]) -> None:
            max_dataset_joinid = max(d.soma_joinid for d in datasets)

            # LIL is fast way to create spmatrix
-            pm = sparse.lil_array((max_dataset_joinid + 1, self.n_var), dtype=bool)
+            pm = sparse.lil_matrix((max_dataset_joinid + 1, self.n_var), dtype=bool)


> The difference doesn't matter for this use case, so let's stick with the modern API.

But this change is going back to the older _matrix, no?

bkmartinjr · 2023-12-20T15:29:13Z

> Though inconsequential, I'm not clear if we really want scipy *_matrix or *_array types

Our entire code base (for legacy reasons) uses _matrix, so we might as well stick with it until we make a change en masse.

* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback
* [builder] port to use enums in schema (#896)
* first pass at using enum types
* add better error logging for file size assertion
* add feature flag for dict schema fields
* update a few dependencies
* remove debugging print
* update comment
* bump compression level
* pr feedback
* fix typos in comments
* add schema_util tests and fix a bug found by those tests
* lint

bkmartinjr added 2 commits December 3, 2023 18:13

schema 4

cb25b96

update dep pins

58c775f

bkmartinjr added the 4.0-dataset-schema label Dec 3, 2023

bkmartinjr self-assigned this Dec 3, 2023

bkmartinjr requested review from pablo-gar and ebezzi December 3, 2023 19:11

bkmartinjr added 13 commits December 4, 2023 15:53

AnnData version update allows for compat code cleanup

d7bf939

fix bug in feature_length

8f55095

Merge branch 'main' into bkmartinjr/schema-four

fed161f

Merge branch 'main' into bkmartinjr/schema-four

a0ff3eb

Merge branch 'main' into bkmartinjr/schema-four

2324310

bump tiledbsoma dependency to latest

0939b33

Merge branch 'main' into bkmartinjr/schema-four

91394f0

bump schema version

d866def

Merge branch 'main' into bkmartinjr/schema-four

0724e8d

update census schema version

e8ed935

Merge branch 'main' into bkmartinjr/schema-four

ba45ac4

more dependency updates

6b0a4c8

update to use production REST API

d9ae302

bkmartinjr marked this pull request as ready for review December 18, 2023 19:55

bkmartinjr requested a review from atolopko-czi December 18, 2023 19:56

Merge branch 'main' into bkmartinjr/schema-four

fab787c

ebezzi approved these changes Dec 19, 2023

pablo-gar approved these changes Dec 19, 2023

Merge branch 'main' into bkmartinjr/schema-four

409b86e

atolopko-czi approved these changes Dec 20, 2023

bkmartinjr merged commit a53e34b into main Dec 21, 2023
14 checks passed

bkmartinjr deleted the bkmartinjr/schema-four branch December 21, 2023 01:02
