[builder] schema 4.0 #872
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #872 +/- ##
==========================================
- Coverage 86.81% 86.76% -0.05%
==========================================
Files 72 72
Lines 5255 5228 -27
==========================================
- Hits 4562 4536 -26
+ Misses 693 692 -1
Few typos/nits, but LGTM otherwise. Thanks!
@@ -370,7 +335,7 @@ def populate_presence_matrix(self, datasets: List[Dataset]) -> None:
         max_dataset_joinid = max(d.soma_joinid for d in datasets)

         # LIL is fast way to create spmatrix
-        pm = sparse.lil_array((max_dataset_joinid + 1, self.n_var), dtype=bool)
+        pm = sparse.lil_matrix((max_dataset_joinid + 1, self.n_var), dtype=bool)
Are those aliases? The docsite of both methods seems the same.
They are almost aliases, but not quite: they have slightly different semantics, and the _array variant is preferred for new code. See the huge note at the top of the docs here.
The difference doesn't matter for this use case, so let's stick with the modern API.
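For anyone following along, a minimal sketch of the best-known semantic difference, assuming SciPy 1.8 or later (where the sparse array classes exist). The variable names are illustrative and not taken from the builder. The `*` operator is a matrix product on the `_matrix` classes and element-wise on the `_array` classes, while `@` is a matrix product on both:

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 2], [3, 4]])

# Same data, two flavors (CSR is used here because it supports arithmetic well).
as_matrix = sparse.csr_matrix(dense)
as_array = sparse.csr_array(dense)

# `*` on the spmatrix classes is matrix multiplication ...
star_on_matrix = (as_matrix * as_matrix).toarray()
# ... while `*` on the sparse array classes is element-wise multiplication.
star_on_array = (as_array * as_array).toarray()
# `@` is matrix multiplication for both flavors.
matmul_on_array = (as_array @ as_array).toarray()
```

Code that only allocates the container and assigns entries, as in the hunk above, does not touch any of these operators, which is why the choice is inconsequential there.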
The difference doesn't matter for this use case so lets stick with the modern API
but this change is going back to the older _matrix, no?
Yes, correct: for consistency with the rest of the builder, which uses _matrix. The difference is inconsequential in this case (see doc link). Historically, single-cell code has always used matrix because community use (e.g., AnnData) predates the arrival of _array.
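A rough sketch of the pattern under discussion, for context: allocate a boolean LIL matrix sized by the largest `soma_joinid`, fill it incrementally, and convert once at the end. Only `soma_joinid`, `n_var`, and the `lil_matrix` call come from the diff; the `Dataset` stand-in, its `present_var_idx` field, the function name, and the final CSR conversion are assumptions made for illustration, not the builder's actual code.

```python
from dataclasses import dataclass
from typing import List

from scipy import sparse


@dataclass
class Dataset:
    # Hypothetical stand-in for the builder's Dataset class. Only soma_joinid
    # appears in the diff; present_var_idx is invented for this sketch.
    soma_joinid: int
    present_var_idx: List[int]


def build_presence_matrix(datasets: List[Dataset], n_var: int) -> sparse.csr_matrix:
    max_dataset_joinid = max(d.soma_joinid for d in datasets)
    # LIL supports cheap incremental assignment while the matrix is being built.
    pm = sparse.lil_matrix((max_dataset_joinid + 1, n_var), dtype=bool)
    for d in datasets:
        for j in d.present_var_idx:
            pm[d.soma_joinid, j] = True
    # Assumption: convert once at the end to a format better suited to storage.
    return pm.tocsr()


datasets = [Dataset(0, [0, 2]), Dataset(2, [1])]
presence = build_presence_matrix(datasets, n_var=3)
```

Swapping `lil_matrix` for `lil_array` in this sketch would produce the same stored values, since nothing here relies on operator semantics.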
@@ -134,44 +137,29 @@
     "tissue_ontology_term_id",
     "tissue_general",
     "tissue_general_ontology_term_id",
     "tissue_type",
 ]
 _NonRepeatitiveStringObs = [
Typo: this should be _NonRepetitiveStringObs and the other variable _RepetitiveStringLabelObs (the spelling isn't consistent).
This code all goes away in #896, which is stacked on top of this PR. So I'd propose to ignore this issue in favor of the changes in that PR. LMK if you are not OK with that.
Totally OK. 🚀
Spot checking of the data build provided by @bkmartinjr did not yield any noticeable errors:
- All fields were added correctly.
- data_type matches expectation for one dataset
- feature_length matches previous builds
- citation properly added
- observation_joinid no further checks other than existence
- schema version number correctly updated in census_info
See notebook here https://colab.research.google.com/drive/1tX_M1BW9ai_4joXOXlAmuthJVCUdIsVR
Schema doc changes look good!
LGTM.
Though inconsequential, I'm not clear if we really want scipy *_matrix or *_array types.
The difference doesn't matter for this use case so lets stick with the modern API
but this change is going back to the older _matrix, no?
our entire code base (for legacy reasons) uses
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback
* [builder] port to use enums in schema (#896)
* first pass at using enum types
* add better error logging for file size assertion
* add feature flag for dict schema fields
* update a few dependencies
* remove debugging print
* update comment
* bump compression level
* pr feedback
* fix typos in comments
* add schema_util tests and fix a bug found by those tests
* lint
Schema 4.0.0 support in both builder and Census schema. Changes:
- 4.0.0. Fixes Schema: Update dataset (H5AD) schema version from 3.1.0 to 4.0.0 in Census schema #749 Fixes Builder: Update dataset (H5AD) schema version from 3.1.0 to 4.0.0 in Census builder #750
- obs["tissue_type] to schema #747 Fixes Builder must implement obs["tissue_type"] #748
- observation_joinid to schema #610 Fixes Builder must implement observation_joinid #611
- citation to schema #559 Fixes The Census builder must add citation to Census data #560
- feature_length values. Fixes Modify gene_length description in schema when those are available in h5ads #8 Fixes The cell census builder must not add gene_lengths when those are available in h5ads #558