[builder] port to use enums in schema #896

bkmartinjr · 2023-12-15T18:53:53Z

Add support for Arrow dictionary / TileDB enums in the Census schema. Encode as dict/enum all DataFrame columns where it is useful to do so (e.g., obs.cell_type).

Current status: the schema changes are implemented, but hidden behind a feature flag to allow additional bugs to be resolved upstream, including:

[python/r] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration single-cell-data/TileDB-SOMA#1988
[feature request] Query conditions have unexpected behavior with enum attributes TileDB-Inc/TileDB-Py#1880
[bug] Enumeration attribute incorrectly writes Pandas categorical column TileDB-Inc/TileDB-Py#1879

These upstream issues are tracked in #604.

Changes in this PR:

introduced an abstraction for specifying DataFrame/Table schema
added a feature flag to enable/disable dict/enum encoding. When false, a field uses its base primitive type; when true, the field becomes an Arrow dictionary
modified validation and other code to use the new Table schema spec

With the feature flag set to false, the schema is unchanged. With it set to true, it will use Arrow dicts. Currently set to False.

codecov · 2023-12-16T02:58:24Z

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (c8fc4bc) 86.49% compared to head (f5a0a17) 86.38%.

❗ Current head f5a0a17 differs from pull request most recent head e7db9b0. Consider uploading reports for the commit e7db9b0 to get more accurate results

Files	Patch %	Lines
...cellxgene_census_builder/build_soma/schema_util.py	80.45%	17 Missing ⚠️
...llxgene_census_builder/build_soma/source_assets.py	0.00%	3 Missing ⚠️
...llxgene_census_builder/build_soma/validate_soma.py	93.75%	1 Missing ⚠️

Additional details and impacted files

@@                    Coverage Diff                    @@
##           bkmartinjr/norm-layer     #896      +/-   ##
=========================================================
- Coverage                  86.49%   86.38%   -0.12%     
=========================================================
  Files                         72       73       +1     
  Lines                       5311     5412     +101     
=========================================================
+ Hits                        4594     4675      +81     
- Misses                       717      737      +20

Flag	Coverage Δ
unittests	`86.38% <86.00%> (-0.12%)`	⬇️

bkmartinjr · 2023-12-18T23:06:25Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/source_assets.py

@@ -63,7 +63,11 @@ def _copy_file(n: int, dataset: Dataset, asset_dir: str, N: int) -> str:
        raise last_error

    # verify file size is as expected, if we know the size a priori
-    assert (dataset.asset_h5ad_filesize == -1) or (dataset.asset_h5ad_filesize == os.path.getsize(dataset_path))
+    assert (dataset.asset_h5ad_filesize == -1) or (


Added a better message for the assert. The change is unrelated to this PR: I tripped over it during my testing, which overlapped with the schema 4 migration, which in turn caused this assert to fail.

ebezzi · 2023-12-20T18:44:03Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/schema_util.py

+
+
+@attrs.define(frozen=True, kw_only=True, slots=True)
+class FieldSpec:


Even if this is not intended as a public class, adding some docstrings could help first-time readers better understand some of the methods (and especially their parameters).

ebezzi · 2023-12-20T18:49:41Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/summary_cell_counts.py

@@ -94,6 +95,7 @@ def accumulate_summary_counts(current: pd.DataFrame, obs_df: pd.DataFrame) -> pd
                columns="is_primary_data",
                index=["organism", "ontology_term_id", "label"],
                fill_value=0,
+                aggfunc="sum",


Curious why this wasn't specified before?

The aggfunc is a no-op no matter what it is set to, because each value is unique (i.e., no aggregation occurs). However, the default aggfunc ('mean') causes the int values to be cast to float. As the dataframe column is defined as int, this is no good. The sum aggfunc is specified because sum leaves ints as ints, avoiding the cast.

This was a latent bug I only found when the new TableSpec/FieldSpec class actually checked that the resulting schema was as expected. Hence the fix.

and I'll add a comment to this effect

Thanks! This explains it.
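
For readers following the thread, the effect can be explored with a small toy frame. This is a sketch under assumptions: the value column name `n_cells` and the sample rows are invented for the example, and the dtype produced by the default `mean` may vary with the pandas version, so the snippet prints it rather than asserting it.

```python
import pandas as pd

# Toy data; "n_cells" and the row values are assumptions for this example.
obs_counts = pd.DataFrame(
    {
        "organism": ["organism_a", "organism_a", "organism_a"],
        "ontology_term_id": ["term:1", "term:1", "term:2"],
        "label": ["label_1", "label_1", "label_2"],
        "is_primary_data": [True, False, True],
        "n_cells": [10, 4, 7],
    }
)


def pivot_counts(df: pd.DataFrame, aggfunc: str) -> pd.DataFrame:
    """Pivot counts by is_primary_data, mirroring the call shape in the diff."""
    return df.pivot_table(
        values="n_cells",
        columns="is_primary_data",
        index=["organism", "ontology_term_id", "label"],
        fill_value=0,
        aggfunc=aggfunc,
    )


# Each (index, column) cell holds a single value, so no real aggregation happens;
# only the result dtype can differ between the two calls.
summed = pivot_counts(obs_counts, "sum")
averaged = pivot_counts(obs_counts, "mean")

print(summed.dtypes)    # integer columns with aggfunc="sum"
print(averaged.dtypes)  # inspect this on your pandas version
```

Checking the dtypes explicitly, as the new TableSpec/FieldSpec validation does, is what surfaces this kind of silent cast.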

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/validate_soma.py

atolopko-czi

LGTM. I like the {Field,Table}Spec design. One minor concern is that there's a non-trivial amount of code involved in generating the arrow table schemas. Is it worth having a basic test or two to verify the translation of a TableSpec to a schema? Or do we have enough checks in the Census validator?

atolopko-czi · 2023-12-20T22:28:04Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/schema_util.py

+        return pa.schema(pa_fields)
+
+    def field_names(self) -> Sequence[str]:
+        """Return field names for this TableSpec as a seuqence of string"""


Suggested change

-        """Return field names for this TableSpec as a seuqence of string"""
+        """Return field names for this TableSpec as a sequence of string"""

atolopko-czi · 2023-12-20T22:32:45Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/globals.py

-* Benchmarking X slicing (using lung demo notebook) used to tune X[raw]. Read / query performance did not benefit from
-  higher Zstd compression beyond level=5, so the level was not increased further (and level=5 is still reasonable for
-  writes)
-"""


I appreciated these notes, but also understand if they're now sufficiently outdated to not bother keeping (and updating).

ebezzi · 2023-12-20T22:45:03Z

tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/schema_util.py

+        """
+        Return True if this FieldSpec is equivalent to the Arrow `other_type`.
+        For convenience in comparing with types inferred from Pandas DataFrames,
+        where strings and other Arrow non-primtives are stored as objects, allow a


Suggested change

-        where strings and other Arrow non-primtives are stored as objects, allow a
+        where strings and other Arrow non-primitives are stored as objects, allow a

bkmartinjr · 2023-12-21T00:07:29Z

> worth having a basic test or two

good call - added (and fixed a bug immediately revealed by said tests!)

* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback
* [builder] port to use enums in schema (#896)
* first pass at using enum types
* add better error logging for file size assertion
* add feature flag for dict schema fields
* update a few dependencies
* remove debugging print
* update comment
* bump compression level
* pr feedback
* fix typos in comments
* add schema_util tests and fix a bug found by those tests
* lint

* schema 4
* update dep pins
* AnnData version update allows for compat code cleanup
* fix bug in feature_length
* bump tiledbsoma dependency to latest
* bump schema version
* update census schema version
* more dependency updates
* update to use production REST API
* [builder] normalized layer improvements (#884)
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback
* [builder] port to use enums in schema (#896)
* first pass at using enum types
* add better error logging for file size assertion
* add feature flag for dict schema fields
* update a few dependencies
* remove debugging print
* update comment
* bump compression level
* pr feedback
* fix typos in comments
* add schema_util tests and fix a bug found by those tests
* lint

bkmartinjr added 5 commits December 15, 2023 15:45

first pass at using enum types

f5fa99d

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

b88801b

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

e882cbf

add better error logging for file size assertion

9204f95

add feature flag for dict schema fields

498af3e

bkmartinjr added 8 commits December 16, 2023 17:11

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

8ece6f1

update a few dependencies

929d1da

remove debugging print

5df02c3

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

5fb6d38

update comment

807431f

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

e5b45c7

bump compression level

de47e73

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

64dbaf6

bkmartinjr commented Dec 18, 2023

bkmartinjr requested review from atolopko-czi, ebezzi and pablo-gar December 18, 2023 23:28

bkmartinjr marked this pull request as ready for review December 18, 2023 23:47

bkmartinjr mentioned this pull request Dec 19, 2023

[builder] schema 4.0 #872

Merged

bkmartinjr added 2 commits December 19, 2023 22:22

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

991e136

Merge branch 'bkmartinjr/norm-layer' into bkmartinjr/use-enums

98cddc5

bkmartinjr mentioned this pull request Dec 20, 2023

add enumerated/categorical support #604

Closed

ebezzi reviewed Dec 20, 2023

pr feedback

f5a0a17

atolopko-czi approved these changes Dec 20, 2023

ebezzi reviewed Dec 20, 2023

ebezzi approved these changes Dec 20, 2023

bkmartinjr added 2 commits December 20, 2023 22:49

fix typos in comments

be0c036

add schema_util tests and fix a bug found by those tests

bb298a6

lint

e7db9b0

bkmartinjr merged commit 0d9ba30 into bkmartinjr/norm-layer Dec 21, 2023
12 checks passed

bkmartinjr deleted the bkmartinjr/use-enums branch December 21, 2023 01:00
