Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[builder] schema 4.0 #872

Merged
merged 18 commits into from
Dec 21, 2023
Merged

[builder] schema 4.0 #872

merged 18 commits into from
Dec 21, 2023

Conversation

bkmartinjr
Copy link
Contributor

@bkmartinjr bkmartinjr commented Dec 3, 2023

Schema 4.0.0 support in both builder and Census schema. Changes:

Copy link

codecov bot commented Dec 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7cca895) 86.81% compared to head (409b86e) 86.76%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #872      +/-   ##
==========================================
- Coverage   86.81%   86.76%   -0.05%     
==========================================
  Files          72       72              
  Lines        5255     5228      -27     
==========================================
- Hits         4562     4536      -26     
+ Misses        693      692       -1     
Flag Coverage Δ
unittests 86.76% <100.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bkmartinjr bkmartinjr marked this pull request as ready for review December 18, 2023 19:55
Copy link
Member

@ebezzi ebezzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few typos/nits, but LGTM otherwise. Thanks!

@@ -370,7 +335,7 @@ def populate_presence_matrix(self, datasets: List[Dataset]) -> None:
max_dataset_joinid = max(d.soma_joinid for d in datasets)

# LIL is fast way to create spmatrix
pm = sparse.lil_array((max_dataset_joinid + 1, self.n_var), dtype=bool)
pm = sparse.lil_matrix((max_dataset_joinid + 1, self.n_var), dtype=bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those aliases? The docsite of both methods seems the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are almost aliases, but not quite - they have slightly different semantics, and the _array variant is preferred for new code. See the huge note at the top of the docs here.

The difference doesn't matter for this use case so lets stick with the modern API

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference doesn't matter for this use case so lets stick with the modern API

but this change is going back to the older _matrix, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, correct - for consistency with the rest of the builder which uses _matrix. The difference is inconsequential in this case - see doc link. Historically single cell always uses matrix because community use (e.g., AnnData) predates the arrival of _array

@@ -134,44 +137,29 @@
"tissue_ontology_term_id",
"tissue_general",
"tissue_general_ontology_term_id",
"tissue_type",
]
_NonRepeatitiveStringObs = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: this should be _NonRepetitiveStringObs and the other variable _RepetitiveStringLabelObs (the spelling isn't consistent).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code all goes away in #896, which is stacked on top of this PR. So I'd propose to ignore this issue in favor of the changes in that PR. LMK if you are not OK with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally OK. 🚀

Copy link
Contributor

@pablo-gar pablo-gar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data spot checking of data build provided by @bkmartinjr did not yield any noticeable errors:

  • All fields were added correctly.
  • data_type matches expectation for one dataset
  • feature_length matches previous builds
  • citation properly added
  • observation_joinid no further checks other than existance
  • schema version number correctly updated in census_info

See notebook here https://colab.research.google.com/drive/1tX_M1BW9ai_4joXOXlAmuthJVCUdIsVR


Schema doc changes look good!

Copy link
Collaborator

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Though inconsequential, I'm not clear if we really want scipy *_matrix or *_array types.

@@ -370,7 +335,7 @@ def populate_presence_matrix(self, datasets: List[Dataset]) -> None:
max_dataset_joinid = max(d.soma_joinid for d in datasets)

# LIL is fast way to create spmatrix
pm = sparse.lil_array((max_dataset_joinid + 1, self.n_var), dtype=bool)
pm = sparse.lil_matrix((max_dataset_joinid + 1, self.n_var), dtype=bool)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference doesn't matter for this use case so lets stick with the modern API

but this change is going back to the older _matrix, no?

@bkmartinjr
Copy link
Contributor Author

Though inconsequential, I'm not clear if we really want scipy *_matrix or *_array types

our entire code base (for legacy reasons) uses _matrix, so we might as well stick with it until we make a change en masse

* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays

* fix int32 overflow in sparse matrix code

* add check for tiledb issue 1969

* bump dependency versions

* work around SOMA bug temporarily

* pr feedback

* [builder] port to use enums in schema (#896)

* first pass at using enum types

* add better error logging for file size assertion

* add feature flag for dict schema fields

* update a few dependencies

* remove debugging print

* update comment

* bump compression level

* pr feedback

* fix typos in comments

* add schema_util tests and fix a bug found by those tests

* lint
@bkmartinjr bkmartinjr merged commit a53e34b into main Dec 21, 2023
14 checks passed
@bkmartinjr bkmartinjr deleted the bkmartinjr/schema-four branch December 21, 2023 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment