A practical example (from @kkraus14 in gh-38), for a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of ['gold' < 'silver' < 'bronze']:
categorical column: {
    mask_buffer: [119],  # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0],  # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}
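To make the layout above concrete, here is a minimal sketch that decodes those buffers back into values. The function name `decode_categorical` and its parameter names are hypothetical, not part of the protocol. It assumes an Arrow-style validity bitmap (least-significant bit first, 1 means valid) and UTF-8 bytes in the child string column.

```python
from typing import List, Optional


def decode_categorical(
    mask: List[int],
    codes: List[int],
    offsets: List[int],
    chars: List[int],
) -> List[Optional[str]]:
    """Hypothetical helper: rebuild values from the buffers shown above."""
    # Slice the child string column into category labels using the offsets.
    raw = bytes(chars)
    categories = [
        raw[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]
    values: List[Optional[str]] = []
    for i, code in enumerate(codes):
        # Assumption: bit i of the mask (LSB first within each byte) is 1 when valid.
        is_valid = (mask[i // 8] >> (i % 8)) & 1
        # The code stored at a null position (127 here) is never looked up.
        values.append(categories[code] if is_valid else None)
    return values


values = decode_categorical(
    mask=[119],
    codes=[0, 2, 1, 127, 2, 1, 0],
    offsets=[0, 4, 10, 16],
    chars=[103, 111, 108, 100, 115, 105, 108, 118, 101, 114,
           98, 114, 111, 110, 122, 101],
)
print(values)
```

Under these assumptions the fourth element comes back as `None`, because bit 3 of `01110111` is 0, and the other elements map through the codes to 'gold', 'silver' and 'bronze'.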
Also see https://arrow.apache.org/docs/python/data.html#dictionary-arrays for what PyArrow does - it matches the current exchange protocol more closely than the Arrow C Data Interface. E.g., it uses an actual Python dictionary for the mapping of values to categories.
Vaex
EDIT: Vaex's API was done pre Arrow integration, and will change to match Arrow in the future.
Add get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead. Note that child columns are also needed for variable-length strings.
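As a rough illustration of this change, the sketch below stores the categories in a child column that is reached through `get_children()`. The class `SketchColumn` and its fields are hypothetical stand-ins, not the actual classes from gh-38. Buffers are modelled as plain Python lists for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchColumn:
    """Hypothetical stand-in for a protocol column; not the gh-38 class."""
    dtype: str
    data_buffer: List[int]
    mask_buffer: Optional[List[int]] = None
    offsets_buffer: Optional[List[int]] = None
    children: List["SketchColumn"] = field(default_factory=list)

    def get_children(self) -> List["SketchColumn"]:
        # Return a copy so callers cannot mutate the column's own list.
        return list(self.children)


# The categories live in a variable-length string child column.
categories = SketchColumn(
    dtype="string",
    data_buffer=list(b"goldsilverbronze"),
    offsets_buffer=[0, 4, 10, 16],
)

# The parent column holds only the integer codes and the validity mask.
column = SketchColumn(
    dtype="categorical",
    data_buffer=[0, 2, 1, 127, 2, 1, 0],
    mask_buffer=[119],
    children=[categories],
)

(child,) = column.get_children()
```

The same `get_children()` hook would also serve a plain variable-length string column, which is why the two needs are mentioned together.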
To discuss:
If dtype is the logical dtype for the column, where to store how to interpret the actual data buffer? Right now this is done not in a static attribute but by returning the dtype along with the buffer when accessing it:
def get_data_buffer(self) -> Tuple[_PandasBuffer, _Dtype]:
    """
    Return the buffer containing the data.
    """
    _k = _DtypeKind
    if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
        buffer = _PandasBuffer(self._col.to_numpy())
        dtype = self.dtype
    elif self.dtype[0] == _k.CATEGORICAL:
        codes = self._col.values.codes
        buffer = _PandasBuffer(codes)
        dtype = self._dtype_from_pandasdtype(codes.dtype)
    else:
        raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")
    return buffer, dtype
What goes in the data buffer on the column? The category-encoded data makes sense, because the buffer needs to be the same size as the column (number of elements), otherwise it would be inconsistent with other dtypes.
What happens when the data is strings?
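To make the data-buffer questions concrete, here is a minimal sketch of one possible encoding. It is not a decision made in this issue. The helper names `encode_codes` and `encode_strings` are hypothetical. The sketch assumes an Arrow-style validity bitmap and offsets layout, and uses -1 as the code for nulls, which is the pandas convention (the Arrow example at the top leaves that value undefined).

```python
from typing import List, Optional, Sequence, Tuple


def encode_codes(
    values: Sequence[Optional[str]],
    categories: Sequence[str],
    null_code: int = -1,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: one integer code per element, plus a validity bitmap."""
    index = {cat: i for i, cat in enumerate(categories)}
    codes: List[int] = []
    mask = [0] * ((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is None:
            codes.append(null_code)  # placeholder; the mask marks it as null
        else:
            codes.append(index[value])
            mask[i // 8] |= 1 << (i % 8)  # set bit i, LSB first
    return codes, mask


def encode_strings(strings: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: pack strings into an offsets buffer and a byte buffer."""
    offsets = [0]
    data: List[int] = []
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return offsets, data


column = ["gold", "bronze", "silver", None, "bronze", "silver", "gold"]
categories = ["gold", "silver", "bronze"]

codes, mask = encode_codes(column, categories)
offsets, data = encode_strings(categories)

# The codes buffer has exactly one entry per column element.
assert len(codes) == len(column)
```

For strings, the byte buffer is generally not the same length as the column, so under this layout it is the offsets buffer (number of elements plus one) that tracks the element count.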
Categorical dtypes
xref gh-26 for some discussion on categorical dtypes.
What it looks like in different libraries
Pandas
The dtype is called category there. See the pandas.Categorical docs.
Apache Arrow
The dtype is called "dictionary-encoded" in Arrow - so a dataframe with a categorical dtype is called a "dictionary-encoded array" there.
See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions for details.
Other libraries
Exchange protocol
This is the current form in gh-38 for the Pandas implementation of the exchange protocol:
Changes needed & discussion points
What we already determined needs changing: