Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[builder] port to use enums in schema #896

Merged
merged 19 commits into from
Dec 21, 2023

Conversation

bkmartinjr
Copy link
Contributor

@bkmartinjr bkmartinjr commented Dec 15, 2023

Add support for Arrow dictionary / TileDB enums in the Census schema. Encode as dict/enum all DataFrame columns where it is useful to do so (e.g., obs.cell_type).

Current status: the schema changes are is implemented, but hidden behind a feature flag to allow additional bugs to be resolved upstream, including:

These upstream issues are tracked in #604.

Changes in this PR:

  • introduced an abstraction for specifying DataFrame/Table schema
  • add feature flag to enable/disable dict/enum. when false, uses base primitive, when true, turns it into a dict
  • modifications to validation and other code to use the new Table schema spec

With the feature flag set to false, the schema is unchanged. With it set to true, it will use Arrow dicts. Currently set to False.

Copy link

codecov bot commented Dec 16, 2023

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (c8fc4bc) 86.49% compared to head (f5a0a17) 86.38%.

❗ Current head f5a0a17 differs from pull request most recent head e7db9b0. Consider uploading reports for the commit e7db9b0 to get more accurate results

Files Patch % Lines
...cellxgene_census_builder/build_soma/schema_util.py 80.45% 17 Missing ⚠️
...llxgene_census_builder/build_soma/source_assets.py 0.00% 3 Missing ⚠️
...llxgene_census_builder/build_soma/validate_soma.py 93.75% 1 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                    @@
##           bkmartinjr/norm-layer     #896      +/-   ##
=========================================================
- Coverage                  86.49%   86.38%   -0.12%     
=========================================================
  Files                         72       73       +1     
  Lines                       5311     5412     +101     
=========================================================
+ Hits                        4594     4675      +81     
- Misses                       717      737      +20     
Flag Coverage Δ
unittests 86.38% <86.00%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -63,7 +63,11 @@ def _copy_file(n: int, dataset: Dataset, asset_dir: str, N: int) -> str:
raise last_error

# verify file size is as expected, if we know the size a priori
assert (dataset.asset_h5ad_filesize == -1) or (dataset.asset_h5ad_filesize == os.path.getsize(dataset_path))
assert (dataset.asset_h5ad_filesize == -1) or (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added better message for assert. Change is unrelated to this PR - tripped over this during my testing, which overlapped with the schema four migration, which in turn triggered this assert to fail.

@bkmartinjr bkmartinjr marked this pull request as ready for review December 18, 2023 23:47
@bkmartinjr bkmartinjr mentioned this pull request Dec 19, 2023


@attrs.define(frozen=True, kw_only=True, slots=True)
class FieldSpec:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if this is not intended as a public class, adding some docstrings could help better understand some of the methods (and especially parameters) to first-time readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -94,6 +95,7 @@ def accumulate_summary_counts(current: pd.DataFrame, obs_df: pd.DataFrame) -> pd
columns="is_primary_data",
index=["organism", "ontology_term_id", "label"],
fill_value=0,
aggfunc="sum",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this wasn't specified before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the aggfunc is a noop no matter what it is specified as, because each value is unique (i.e., no aggregation occurs). However, the default aggfunc ('mean') causes the int values to be cast to a float. As the dataframe column is defined as int, this is no good. The sum aggfunc is specified because Python sum will leave ints as ints, bypassing the cast.

This was a latent bug I only found when the new TableSpec/FieldSpec class actually checked that the resulting schema was as expected. Hence the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I'll add a comment to this effect

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This explains it.

Copy link
Collaborator

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like the {Field,Table}Spec design. One minor concern is that there's a non-trivial amount of code involved in generating the arrow table schemas. Is it worth having a basic test or two to verify the translation of a TableSpec to a schema? Or do we have enough checks in the Census validator?

return pa.schema(pa_fields)

def field_names(self) -> Sequence[str]:
"""Return field names for this TableSpec as a seuqence of string"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return field names for this TableSpec as a seuqence of string"""
"""Return field names for this TableSpec as a sequence of string"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* Benchmarking X slicing (using lung demo notebook) used to tune X[raw]. Read / query performance did not benefit from
higher Zstd compression beyond level=5, so the level was not increased further (and level=5 is still reasonable for
writes)
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciated these notes, but also understand if they're now sufficiently outdated to not bother keeping (and updating).

"""
Return True if this FieldSpec is equivalent to the Arrow `other_type`.
For convenience in comparing with types inferred from Pandas DataFrames,
where strings and other Arrow non-primtives are stored as objects, allow a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
where strings and other Arrow non-primtives are stored as objects, allow a
where strings and other Arrow non-primitives are stored as objects, allow a

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bkmartinjr
Copy link
Contributor Author

worth having a basic test or two

good call - added (and fixed a bug immediately revealed by said tests!)

@bkmartinjr bkmartinjr merged commit 0d9ba30 into bkmartinjr/norm-layer Dec 21, 2023
12 checks passed
@bkmartinjr bkmartinjr deleted the bkmartinjr/use-enums branch December 21, 2023 01:00
bkmartinjr pushed a commit that referenced this pull request Dec 21, 2023
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays

* fix int32 overflow in sparse matrix code

* add check for tiledb issue 1969

* bump dependency versions

* work around SOMA bug temporarily

* pr feedback

* [builder] port to use enums in schema (#896)

* first pass at using enum types

* add better error logging for file size assertion

* add feature flag for dict schema fields

* update a few dependencies

* remove debugging print

* update comment

* bump compression level

* pr feedback

* fix typos in comments

* add schema_util tests and fix a bug found by those tests

* lint
bkmartinjr pushed a commit that referenced this pull request Dec 21, 2023
* schema 4

* update dep pins

* AnnData version update allows for compat code cleanup

* fix bug in feature_length

* bump tiledbsoma dependency to latest

* bump schema version

* update census schema version

* more dependency updates

* update to use production REST API

* [builder] normalized layer improvements (#884)

* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays

* fix int32 overflow in sparse matrix code

* add check for tiledb issue 1969

* bump dependency versions

* work around SOMA bug temporarily

* pr feedback

* [builder] port to use enums in schema (#896)

* first pass at using enum types

* add better error logging for file size assertion

* add feature flag for dict schema fields

* update a few dependencies

* remove debugging print

* update comment

* bump compression level

* pr feedback

* fix typos in comments

* add schema_util tests and fix a bug found by those tests

* lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants