[c++] Fix display of Arrow schema for enum of bytes #2306

Closed
johnkerl opened this issue Mar 22, 2024 · 1 comment
Assignees: johnkerl
Labels: blocks-1.9, bug

johnkerl commented Mar 22, 2024

Issue and/or context: As found by @bkmartinjr in Slack.

Here is a repro script to write a SOMADataFrame with attributes of various non-enum and enum types:

cat ./2305.py

#!/usr/bin/env python

import os
import shutil

import numpy as np
import pandas as pd
import pyarrow as pa
import tiledbsoma as soma

def main():
    fname = "./test_dataframe"
    if os.path.exists(fname):
        shutil.rmtree(fname)

    pandas_df = pd.DataFrame(
        {
            "soma_joinid": pd.Series([0, 1, 2, 3, 4, 5], dtype=np.int64),
            "int_cat": pd.Series([10, 20, 10, 20, 20, 20], dtype="category"),
            "int": pd.Series([10, 20, 10, 20, 20, 20]),
            "str_cat": pd.Series(["A", "B", "A", "B", "B", "B"], dtype="category"),
            "str": pd.Series(["A", "B", "A", "B", "B", "B"]),
            "byte_cat": pd.Series([b"A", b"B", b"A", b"B", b"B", b"B"], dtype="category"),
            "byte": pd.Series([b"A", b"B", b"A", b"B", b"B", b"B"]),
        },
    )

    print("** Original Pandas schema")
    print(pandas_df.dtypes)
    for c in pandas_df:
        print(f"{c}: {repr(pandas_df[c].dtype)}")

    schema = pa.Schema.from_pandas(pandas_df, preserve_index=False)
    print("-----")

    print("** Arrow schema, derived from Pandas")
    print(schema)
    print("-----")

    print("** Arrow Table derived from pandas")
    print(pa.Table.from_pandas(pandas_df, preserve_index=False))
    print("-----")

    with soma.DataFrame.create(fname, schema=schema) as soma_dataframe:
        tbl = pa.Table.from_pandas(pandas_df, preserve_index=False)
        soma_dataframe.write(tbl)

    with soma.open(fname) as soma_dataframe:
        print("** Created TileDB Array schema")
        print(soma_dataframe.schema)
        df = soma_dataframe.read().concat().to_pandas()
        for c in df:
            print(f"{c}: {repr(df[c].dtype)}, {repr(pandas_df[c].dtype)}")
            if df[c].dtype == 'category':
                print(f"Categories dtype: {repr(df[c].cat.categories.dtype)}, {repr(pandas_df[c].cat.categories.dtype)}")

            assert df[c].dtype == pandas_df[c].dtype
            if df[c].dtype == 'category':
                assert df[c].cat.categories.dtype == pandas_df[c].cat.categories.dtype

        print(df)


if __name__ == "__main__":
    main()

Here is how it reads back from TileDB-Py:

import tiledb
A = tiledb.open("test_dataframe")
print(A.schema)

for i in range(A.schema.nattr):
    attr = A.schema.attr(i)
    try:
        index_type = attr.dtype
        value_type = A.enum(attr.name).dtype
        print(f"enum name={attr.name} index_type={index_type.name} value_type={value_type.name}")
    except tiledb.cc.TileDBError:
        pass  # not an enum

Output from TileDB-Py:

ArraySchema(
  domain=Domain(*[
    Dim(name='soma_joinid', domain=(0, 2147483646), tile=2048, dtype='int64', filters=FilterList([ZstdFilter(level=3), ])),
  ]),
  attrs=[
    Attr(name='int_cat', dtype='int8', var=False, nullable=False, enum_label='int_cat', filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='int', dtype='int64', var=False, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='str_cat', dtype='int8', var=False, nullable=False, enum_label='str_cat', filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='str', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='byte_cat', dtype='int8', var=False, nullable=False, enum_label='byte_cat', filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='byte', dtype='|S0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

enum name=int_cat index_type=int8 value_type=int64
enum name=str_cat index_type=int8 value_type=str32
enum name=byte_cat index_type=int8 value_type=bytes

Note that TileDB-Py correctly says byte_cat has value_type=bytes.

Here is a repro using TileDB-SOMA to print the Arrow schema:

import tiledbsoma as soma
sdf = soma.open('test_dataframe')
print(sdf.schema)

Output before #2305:

soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=string, indices=int8, ordered=0> not null
byte: large_string not null
johnkerl self-assigned this Mar 22, 2024
johnkerl added the bug and blocks-1.9 labels Mar 22, 2024
johnkerl changed the title from "[c++]" to "[c++] Fix display of Arrow schema for enum of bytes" Mar 22, 2024

johnkerl commented

See also #2311 for tracking toward 1.9
