[python/r] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration #1988

Closed
bkmartinjr opened this issue Dec 15, 2023 · 10 comments
Labels: blocks-1.9, bug (Something isn't working)

Comments

@bkmartinjr
Member

bkmartinjr commented Dec 15, 2023

I have an empty dataframe containing dictionary/enum attributes. When a value filter / query condition is applied to it, it triggers an internal Arrow error. It should return an empty result. Everything works fine for non-dictionary attributes, so it appears that value filters do not always work correctly with dict/enum attributes.

Note that the tiledb package also has questionable behavior here, raising an exception if the value filter tests for a value not in the enumeration. So the Arrow error is likely unique to the libtiledbsoma codepath, but both behaviors make the combination of filters and enums problematic.

What I think should happen: the value filter should have identical behavior (i.e., results) for a column of type "T" and a column of type "enum-of-T", where T is string, int, etc. (e.g., a query against a "dict of strings" column should behave the same as a query against a string column).
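
To make the expected semantics concrete, here is a minimal pure-Python sketch. It is not tiledbsoma or Arrow code: `DictColumn`, `filter_plain`, and `filter_dict` are hypothetical names invented for illustration, and the dictionary is assumed to hold unique values. It only models the rule that a sought value missing from the enumeration should match zero rows rather than raise.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DictColumn:
    """Toy dictionary-encoded column (hypothetical, for illustration only).

    `indices[i]` is the position in `dictionary` of the value stored in row i.
    The dictionary is assumed to contain unique values.
    """

    dictionary: List[str]
    indices: List[int]

    def decode(self) -> List[str]:
        """Expand the column into plain values, one per row."""
        return [self.dictionary[i] for i in self.indices]


def filter_plain(values: Sequence[str], sought: str) -> List[int]:
    """Row positions where a plain string column equals `sought`."""
    return [row for row, value in enumerate(values) if value == sought]


def filter_dict(col: DictColumn, sought: str) -> List[int]:
    """Row positions where a dictionary-encoded column equals `sought`.

    A value absent from the dictionary matches no rows; it is not an error.
    """
    if sought not in col.dictionary:
        return []
    code = col.dictionary.index(sought)
    return [row for row, index in enumerate(col.indices) if index == code]


col = DictColumn(dictionary=["ds-a", "ds-b"], indices=[0, 1, 1, 0])

# Same answer whether the column is dictionary-encoded or plain.
assert filter_dict(col, "ds-b") == filter_plain(col.decode(), "ds-b")
# A value outside the enumeration yields an empty result in both cases.
assert filter_dict(col, "foobar") == filter_plain(col.decode(), "foobar") == []
```

The point of the sketch is the early return: the "not in enumeration" case is an ordinary empty match, so the dict and non-dict paths stay interchangeable from the caller's point of view.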

<late edit>
The empty dataframe is unrelated. It fails in exactly the same way for non-empty arrays. I'll add an example of that below.
</late edit>

The schema (abbreviated for ease of reading):

In [102]: obs.schema
Out[102]: 
soma_joinid: int64
dataset_id: dictionary<values=string, indices=int8, ordered=0>
is_primary_data: bool
observation_joinid: large_string
# lots of other columns removed for brevity

Reading the entire thing works correctly (output abbreviated):

In [103]: obs.read().concat()
Out[103]: 
pyarrow.Table
soma_joinid: int64
dataset_id: dictionary<values=string, indices=int8, ordered=0>
is_primary_data: bool
observation_joinid: large_string
----
soma_joinid: [[]]
dataset_id: [  -- dictionary:
[]  -- indices:
[]]
assay: [  -- dictionary:
[]  -- indices:
[]]
...

Read with a value filter on a string attribute works fine (output abbreviated):

In [104]: obs.read(value_filter="observation_joinid == 'foobar'").concat()
Out[104]: 
pyarrow.Table
soma_joinid: int64
dataset_id: dictionary<values=string, indices=int8, ordered=0>
is_primary_data: bool
observation_joinid: large_string
----
soma_joinid: [[]]
dataset_id: [  -- dictionary:
[]  -- indices:
[]]
...

Reading with a value filter on a dict column fails an internal Arrow validation check:

In [105]: obs.read(value_filter="""dataset_id == 'foobar'""").concat()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[105], line 1
----> 1 obs.read(value_filter="""dataset_id == 'foobar'""").concat()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:72, in TableReadIter.concat(self)
     70 def concat(self) -> pa.Table:
     71     """Concatenate remainder of iterator, and return as a single `Arrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`_"""
---> 72     return pa.concat_tables(self)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:5233, in pyarrow.lib.concat_tables()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:68, in TableReadIter.__next__(self)
     67 def __next__(self) -> pa.Table:
---> 68     return next(self._reader)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:454, in _arrow_table_reader(sr)
    452 def _arrow_table_reader(sr: clib.SOMAArray) -> Iterator[pa.Table]:
    453     """Private. Simple Table iterator on any Array"""
--> 454     tbl = sr.read_next()
    455     while tbl is not None:
    456         yield tbl

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3986, in pyarrow.lib.Table.from_arrays()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3266, in pyarrow.lib.Table.validate()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named dataset_id expected length 2097152 but got length 16777216
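
A quick arithmetic check on the two lengths in the error message may help triage. Both are exact powers of two and differ by a factor of 8. The thread does not establish the cause; a factor of 8 would be consistent with, for example, a byte count being used where an element count was expected, but that is only a guess.

```python
expected = 2_097_152   # length Arrow expected for column `dataset_id`
actual = 16_777_216    # length Arrow actually received

# Both values are exact powers of two.
assert expected == 2 ** 21
assert actual == 2 ** 24

ratio = actual // expected
print(ratio)  # 8
```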

Using the latest tiledb produces different (and also arguably incorrect) behavior:

In [108]: A = tiledb.open("tmp/census/2023-12-15/soma/census_data/mus_musculus/obs")

In [109]: A.query(use_arrow=True).df[:]
Out[109]: 
Empty DataFrame
Columns: [soma_joinid, dataset_id, assay, assay_ontology_term_id, cell_type, cell_type_ontology_term_id, development_stage, development_stage_ontology_term_id, disease, disease_ontology_term_id, donor_id, is_primary_data, observation_joinid, self_reported_ethnicity, self_reported_ethnicity_ontology_term_id, sex, sex_ontology_term_id, suspension_type, tissue, tissue_ontology_term_id, tissue_type, tissue_general, tissue_general_ontology_term_id, raw_sum, nnz, raw_mean_nnz, raw_variance_nnz, n_measured_vars]
Index: []

In [110]: A.query(cond="dataset_id == 'foobar'", use_arrow=True).df[:]
---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
Cell In[110], line 1
----> 1 A.query(cond="dataset_id == 'foobar'", use_arrow=True).df[:]

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledb/multirange_indexing.py:256, in _BaseIndexer.__getitem__(self, idx)
    254     self.subarray = Subarray(self.array)
    255     self._set_ranges(idx)
--> 256 return self if self.return_incomplete else self._run_query()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledb/multirange_indexing.py:399, in DataFrameIndexer._run_query(self)
    396 import pyarrow
    398 if self.pyquery is not None:
--> 399     self.pyquery.submit()
    401 if self.pyquery is None:
    402     df = pandas.DataFrame(self._empty_results)

TileDBError: TileDB internal: Enumeration value not found for field 'dataset_id'

Package version info:

tiledbsoma.__version__        1.6.1
TileDB-Py tiledb.version()    (0, 24, 0)
TileDB core version           2.18.2
libtiledbsoma version()       libtiledb=2.18.2
python version                3.10.12.final.0
OS version                    Linux 6.2.0-1017-aws

I can make the problematic empty dataframe available if helpful.


The empty/non-empty state of the array is unrelated. Here is an example on a non-empty dataframe with the same schema, failing in the same way:

In [6]: obs.count
Out[6]: 31470

In [7]: obs.read(value_filter="""dataset_id == 'foobar'""").concat()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 obs.read(value_filter="""dataset_id == 'foobar'""").concat()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:72, in TableReadIter.concat(self)
     70 def concat(self) -> pa.Table:
     71     """Concatenate remainder of iterator, and return as a single `Arrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`_"""
---> 72     return pa.concat_tables(self)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:5233, in pyarrow.lib.concat_tables()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:68, in TableReadIter.__next__(self)
     67 def __next__(self) -> pa.Table:
---> 68     return next(self._reader)

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledbsoma/_read_iters.py:454, in _arrow_table_reader(sr)
    452 def _arrow_table_reader(sr: clib.SOMAArray) -> Iterator[pa.Table]:
    453     """Private. Simple Table iterator on any Array"""
--> 454     tbl = sr.read_next()
    455     while tbl is not None:
    456         yield tbl

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3986, in pyarrow.lib.Table.from_arrays()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/table.pxi:3266, in pyarrow.lib.Table.validate()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named dataset_id expected length 2097152 but got length 16777216

@bkmartinjr bkmartinjr changed the title [python] DataFrame: value filter on enum/dict column generates internal error when array is empty [python] DataFrame: value filter on enum/dict column generates internal error Dec 15, 2023
@bkmartinjr
Member Author

Also see TileDB-Inc/TileDB-Py#1880 which is a related ease-of-use issue for our use case. For many of our dataframe columns, where we want to use enums, it would be far easier to use if the value filter equality ops (==, in [...], etc) worked on enums/dicts.
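
Until equality ops handle this natively, one caller-side workaround is to check the sought value against the enumeration before issuing the query. The sketch below is a generic guard, not part of tiledbsoma or TileDB-Py: `guarded_enum_query` and its parameters are hypothetical, and it assumes the caller can already obtain the column's enumeration values and wrap the real read in a zero-argument callable.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def guarded_enum_query(
    sought: str,
    enum_values: Iterable[str],
    run_query: Callable[[], List[T]],
) -> List[T]:
    """Run `run_query` only if `sought` is a known enumeration value.

    Hypothetical helper: returns an empty result instead of letting the
    underlying query fail on a value that is not in the enumeration.
    """
    if sought not in set(enum_values):
        return []
    return run_query()


# Stand-in for the real read call; records whether it was invoked.
calls: List[str] = []


def fake_read() -> List[str]:
    calls.append("called")
    return ["row-0"]


assert guarded_enum_query("foobar", ["ds-a", "ds-b"], fake_read) == []
assert calls == []  # the underlying query was never issued
assert guarded_enum_query("ds-a", ["ds-a", "ds-b"], fake_read) == ["row-0"]
```

This keeps the workaround outside the query engine, at the cost of an extra lookup of the enumeration values per query.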

@johnkerl johnkerl added the bug Something isn't working label Dec 15, 2023
@johnkerl
Member

Needs triaging for R as well

@johnkerl
Member

johnkerl commented Jan 8, 2024

[sc-38450]

@johnkerl
Member

johnkerl commented Jan 8, 2024

@eddelbuettel this needs triaging for R as well please

@johnkerl johnkerl changed the title [python] DataFrame: value filter on enum/dict column generates internal error [python] DataFrame: value filter on enum/dict column generates internal error when value not in enumeration Jan 16, 2024
@johnkerl johnkerl changed the title [python] DataFrame: value filter on enum/dict column generates internal error when value not in enumeration [python] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration Jan 16, 2024
@johnkerl johnkerl assigned ryan-williams and unassigned nguyenv Mar 21, 2024
ryan-williams added a commit that referenced this issue Mar 22, 2024
* typeguard nit

missed in #1960

* factor common fixtures into conftest.py

* factor test_update_dataframes fixture

* `verify_obs_var` helper, more `test_update_dataframes` factoring

* test_experiment_query.py: verify #1988

* `s/h5ad_file/h5ad_path/g`, factor `HERE`s
github-actions bot pushed a commit that referenced this issue Mar 22, 2024
johnkerl pushed a commit that referenced this issue Mar 22, 2024

Co-authored-by: Ryan Williams <[email protected]>
@ryan-williams
Member

#2299 added this test, which verifies the issue no longer exists (as of TileDB 2.21.0).

Not sure if there is independent verification that still needs to happen in R…

@johnkerl
Member

@ryan-williams there is independent verification in R. I'll do that. This PR is for Python and that's fine.

@johnkerl
Member

I am blocked on the R side. Questions in Slack.

@johnkerl
Member

See also #2311 for tracking toward 1.9

@johnkerl johnkerl changed the title [python] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration [python/r] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration Mar 25, 2024
@johnkerl johnkerl assigned mojaveazure and unassigned eddelbuettel Mar 25, 2024
@johnkerl
Member

> I am blocked on the R side. Questions in Slack.

@mojaveazure has set me up! :)

@johnkerl
Member

Closed with #2308 and #2316
