💫 Categorical dtypes #5

Open
3 tasks done
AlenkaF opened this issue Aug 12, 2021 · 3 comments


AlenkaF commented Aug 12, 2021

Adapt the Dataframe protocol implementation for Vaex to work with categorical columns.

  • Research how Vaex handles categorical dtypes
  • Use the knowledge to adapt the interchange protocol
  • Add a roundtrip test

AlenkaF commented Aug 12, 2021

In Vaex there are two ways of constructing a categorical column:

  1. using the categorize() method on a Vaex dataframe
  2. converting an Arrow Dictionary array to a Vaex dataframe

For now the implementation works for the first option and raises an error for the second.
There are also two ways to use the categorize method:

  • applying list of labels
  • using min and max values of the data (in this case values in the dataframe are the labels)

See more in the research Notebook.
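
The two ways of using categorize described above can be pictured with a small plain-Python sketch. This is not Vaex code: `codes_and_labels` and `decode` are hypothetical helpers, and the sketch assumes integer data where, with explicit labels, each value is already an index into the label list.

```python
def codes_and_labels(values, labels=None):
    """Return (codes, labels) for a list of integer values.

    Hypothetical helper for illustration only; not part of Vaex.
    """
    if labels is not None:
        # Mode 1: an explicit list of labels is applied.
        # Assumption: every value is a valid index into `labels`.
        for v in values:
            if not 0 <= v < len(labels):
                raise ValueError(f"value {v} has no matching label")
        return list(values), list(labels)

    # Mode 2: labels come from the min..max range of the data,
    # so the values in the data are themselves the labels.
    lo, hi = min(values), max(values)
    labels = list(range(lo, hi + 1))
    codes = [v - lo for v in values]
    return codes, labels


def decode(codes, labels):
    """Map integer codes back to their labels."""
    return [labels[c] for c in codes]
```

In the second mode, decoding the codes gives back the original values, which is what "values in the dataframe are the labels" means here.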

There is also some confusion about the categorical columns generated by the categorize method. I created an example Notebook, sent it to the Vaex team (Maarten and Jovan) for comments, and posted a PR. The PR will probably be resolved by Maarten's work, and I will then have to adapt expressions.codes in the protocol implementation.

  • post a PR to Vaex with a test that errors when converting the Arrow Dictionary dtype
  • resolve the error for Arrow Dictionary
  • implement expressions.codes from Maarten's PR


AlenkaF commented Aug 24, 2021

The only thing left to do is to implement expressions.codes from Maarten's PR. Otherwise the protocol now works for categorical dtypes.


AlenkaF commented Sep 2, 2021

One question on categorical dtypes came up while working on categorical metadata (see #10 (comment)):

What also needed to be done is to define the dtype for categorical columns separately (in Vaex, the dtype of a categorical column is the dtype of the data itself). I am not sure if the default is correct:

(_DtypeKind.CATEGORICAL, 64, 'u', '=')

# Categorical
# If it is internal, kind is categorical (23)
# If it is external (call from_dataframe) must give data dtype
if self._col.df.is_category(self._col):
    return (_DtypeKind.CATEGORICAL, 64, 'u', '=') # what should be the default??

Ralf pointed me to related open issues in the general dataframe-api repository:

data-apis/dataframe-api#49 (comment)
data-apis/dataframe-api#46 (comment)
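
For the dtype question above, one possible default is to let the tuple describe the integer codes instead of a fixed 64-bit 'u' (in Arrow format strings 'u' denotes a UTF-8 string). This is only a sketch of that option, not something decided in the linked discussions. `DtypeKind` below is a local stand-in for the protocol's `_DtypeKind`, and `categorical_dtype` is a hypothetical helper that assumes the bit width and signedness of the codes are known.

```python
import enum


class DtypeKind(enum.IntEnum):
    """Local stand-in for the protocol's _DtypeKind enum."""
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# Arrow C Data Interface format strings for integer types,
# keyed by (signed, bit width).
ARROW_INT_FORMAT = {
    (True, 8): "c", (True, 16): "s", (True, 32): "i", (True, 64): "l",
    (False, 8): "C", (False, 16): "S", (False, 32): "I", (False, 64): "L",
}


def categorical_dtype(codes_bitwidth, signed=True):
    """Build a (kind, bit width, format string, endianness) tuple.

    Hypothetical helper: the bit width and format string describe
    the integer codes of the categorical column; '=' means native
    byte order.
    """
    try:
        fmt = ARROW_INT_FORMAT[(signed, codes_bitwidth)]
    except KeyError:
        raise ValueError(f"unsupported codes width: {codes_bitwidth}")
    return (DtypeKind.CATEGORICAL, codes_bitwidth, fmt, "=")
```

With this approach the kind stays categorical (23) while the remaining fields follow whatever integer type the codes actually use.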
