[builder] port to use enums in schema #896
Conversation
Codecov Report

@@           Coverage Diff            @@
##   bkmartinjr/norm-layer    #896    +/-  ##
=========================================================
- Coverage     86.49%    86.38%    -0.12%
=========================================================
  Files            72        73        +1
  Lines          5311      5412      +101
=========================================================
+ Hits           4594      4675       +81
- Misses          717       737       +20

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
@@ -63,7 +63,11 @@ def _copy_file(n: int, dataset: Dataset, asset_dir: str, N: int) -> str:
             raise last_error

     # verify file size is as expected, if we know the size a priori
-    assert (dataset.asset_h5ad_filesize == -1) or (dataset.asset_h5ad_filesize == os.path.getsize(dataset_path))
+    assert (dataset.asset_h5ad_filesize == -1) or (
added better message for assert. Change is unrelated to this PR - tripped over this during my testing, which overlapped with the schema four migration, which in turn triggered this assert to fail.
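The diff above is truncated after the opening parenthesis of the reworked assertion. A minimal, self-contained sketch of what "a better message for the assert" might look like, assuming the check stays the same (`-1` as the unknown-size sentinel); the `Dataset` stub and message wording here are illustrative, not the builder's actual code:

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Dataset:
    """Minimal stand-in for the builder's Dataset class (illustrative only)."""

    asset_h5ad_filesize: int  # -1 means the expected size is unknown


def check_file_size(dataset: Dataset, dataset_path: str) -> None:
    # verify file size is as expected, if we know the size a priori
    actual = os.path.getsize(dataset_path)
    assert (dataset.asset_h5ad_filesize == -1) or (dataset.asset_h5ad_filesize == actual), (
        f"Unexpected file size for {dataset_path}: "
        f"expected {dataset.asset_h5ad_filesize}, found {actual}"
    )


# The sentinel (-1) always passes; a mismatch now reports both sizes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0123456789")
    path = f.name
check_file_size(Dataset(asset_h5ad_filesize=-1), path)
check_file_size(Dataset(asset_h5ad_filesize=10), path)
os.unlink(path)
```

The benefit is purely diagnostic: the boolean logic is unchanged, but a failure during a build now says which file and which two sizes disagreed.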
@attrs.define(frozen=True, kw_only=True, slots=True)
class FieldSpec:
Even if this is not intended as a public class, adding some docstrings could help better understand some of the methods (and especially parameters) to first-time readers.
done
@@ -94,6 +95,7 @@ def accumulate_summary_counts(current: pd.DataFrame, obs_df: pd.DataFrame) -> pd
         columns="is_primary_data",
         index=["organism", "ontology_term_id", "label"],
         fill_value=0,
+        aggfunc="sum",
Curious why this wasn't specified before?
The aggfunc is a no-op no matter what it is specified as, because each value is unique (i.e., no aggregation occurs). However, the default aggfunc ('mean') causes the int values to be cast to float. As the dataframe column is defined as int, this is no good. The 'sum' aggfunc is specified because summing ints leaves them as ints, bypassing the cast.
This was a latent bug I only found when the new TableSpec/FieldSpec class actually checked that the resulting schema was as expected. Hence the fix.
and I'll add a comment to this effect
Thanks! This explains it.
tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/validate_soma.py
LGTM. I like the {Field,Table}Spec design. One minor concern is that there's a non-trivial amount of code involved in generating the arrow table schemas. Is it worth having a basic test or two to verify the translation of a TableSpec to a schema? Or do we have enough checks in the Census validator?
        return pa.schema(pa_fields)

    def field_names(self) -> Sequence[str]:
        """Return field names for this TableSpec as a seuqence of string"""
"""Return field names for this TableSpec as a seuqence of string""" | |
"""Return field names for this TableSpec as a sequence of string""" |
fixed
    * Benchmarking X slicing (using lung demo notebook) used to tune X[raw]. Read / query performance did not
      benefit from higher Zstd compression beyond level=5, so the level was not increased further (and level=5
      is still reasonable for writes)
    """
I appreciated these notes, but also understand if they're now sufficiently outdated to not bother keeping (and updating).
""" | ||
Return True if this FieldSpec is equivalent to the Arrow `other_type`. | ||
For convenience in comparing with types inferred from Pandas DataFrames, | ||
where strings and other Arrow non-primtives are stored as objects, allow a |
Suggested change:
-    where strings and other Arrow non-primtives are stored as objects, allow a
+    where strings and other Arrow non-primitives are stored as objects, allow a
fixed
good call - added (and fixed a bug immediately revealed by said tests!)
Squashed commit messages:

* schema 4
* update dep pins
* AnnData version update allows for compat code cleanup
* fix bug in feature_length
* bump tiledbsoma dependency to latest
* bump schema version
* update census schema version
* more dependency updates
* update to use production REST API
* [builder] normalized layer improvements (#884)
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback
* [builder] port to use enums in schema (#896)
* first pass at using enum types
* add better error logging for file size assertion
* add feature flag for dict schema fields
* update a few dependencies
* remove debugging print
* update comment
* bump compression level
* pr feedback
* fix typos in comments
* add schema_util tests and fix a bug found by those tests
* lint
Add support for Arrow dictionary / TileDB enums in the Census schema. Encode as dict/enum all DataFrame columns where it is useful to do so (e.g., obs.cell_type).
Current status: the schema changes are implemented, but hidden behind a feature flag until several upstream bugs are resolved. These upstream issues are tracked in #604.
Changes in this PR:
With the feature flag set to False, the schema is unchanged; with it set to True, the affected DataFrame columns use Arrow dictionary types. The flag is currently set to False.