Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft for the Dataframe interchange protocol #1509

Merged
merged 40 commits into from
Oct 13, 2021

Conversation

AlenkaF
Copy link
Contributor

@AlenkaF AlenkaF commented Aug 12, 2021

This is the first draft of the Dataframe protocol implementation for Vaex.
It works for

  • numerical and boolean dtypes
  • categorical columns constructed with categorize()

It is erroring with Arrow Dictionary when trying to convert dtype arrow_type.to_pandas_dtype(), line 305 in _dtype_from_vaexdtype. Do you have any suggestions?

What is still on my todo list:

cc @maartenbreddels @JovanVeljanoski

@AlenkaF AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from 8c52ceb to 48c34f6 Compare August 12, 2021 13:44
@AlenkaF AlenkaF marked this pull request as draft August 13, 2021 11:50
Copy link
Member

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, I gave it a first pass, hope that helps.

packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
tests/dataframe_protocol_test.py Outdated Show resolved Hide resolved
Comment on lines 402 to 408
if self._col.values[0] in self.labels:
for i in self._col.values:
codes[np.where(codes==i)] = np.where(self.labels == i) # if values are same as labels
else:
codes = self._col.values # values are already codes for the labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think is needed. Could you explain what you are trying to do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this will not be needed when I will be able to use expressions.codes.

But for now, if I take this example:

df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

and use .values function on year and weekday column I get two different types of output. In the first case it is a list of labels and in the other case I get the codes.

array([2012, 2015, 2019])
array([0, 4, 6])

To get the codes for the year column I calculate them as in line 402-404.

For now I can't seem to find another way.

packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
@AlenkaF AlenkaF force-pushed the dataframe-protocol branch from 48c34f6 to 954c4bb Compare August 16, 2021 12:21
@AlenkaF AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from 44fc157 to f74a168 Compare August 31, 2021 08:17
# If it is internal, kind must be categorical (23)
# If it is external (call from_dataframe), dtype must give type of the data
if self._col.df.is_category(self._col):
return (_DtypeKind.CATEGORICAL, 64, 'u', '=') # what should be the default??
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have to be revised. See first two items in data-apis/dataframe-api#46 (comment)


# Join the chunks into tuple for now
df_new = vaex.concat(dataframe)
df_new._buffers = _buffers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nested list may be hard to read. Could be changed to a list of dictionaries with column names as keys if that would be better.

@AlenkaF AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from c06da90 to 14fa9a4 Compare September 7, 2021 06:59
@AlenkaF
Copy link
Contributor Author

AlenkaF commented Sep 7, 2021

@maartenbreddels @JovanVeljanoski don't know if it will stay green =) nevertheless I think this is ready for a review.

@AlenkaF AlenkaF marked this pull request as ready for review September 7, 2021 10:26
Convert an int, uint, float or bool column to an arrow array
"""
if col.offset != 0:
raise NotImplementedError("column.offset > 0 not handled yet")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is offset used in Vaex?

maartenbreddels added a commit to maartenbreddels/plotly.py that referenced this pull request Sep 16, 2021
This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html

Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)
maartenbreddels added a commit to maartenbreddels/plotly.py that referenced this pull request Sep 16, 2021
This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html

Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)
Copy link
Member

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! my biggest comment/change was the change of the describe_null, which can be 3 options (no null, np.ma or arrow). I hope you agree with that.
Would love to see string support!

PS: Please keep the line endings as line feeds.

packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
packages/vaex-core/vaex/dataframe_protocol.py Show resolved Hide resolved
if self.dtype[0] == _k.BOOL and isinstance(self._col.values, (pa.Array, pa.ChunkedArray)):
buffer = _VaexBuffer(np.array(self._col.tolist(), dtype=bool))
else:
buffer = _VaexBuffer(self._col.to_numpy())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could make a copy. We can use .values/.evaluate() but then we'd have to support arrow and numpy arrays, which should be fine, maybe the following works:

Suggested change
buffer = _VaexBuffer(self._col.to_numpy())
buffer = _VaexBuffer(np.asarray(self._col.evaluate(), dtype=bool))

# If arrow array is boolean .to_numpy changes values for some reason
# For that reason data is transferred to numpy through .tolist
if self.dtype[0] == _k.BOOL and isinstance(self._col.values, (pa.Array, pa.ChunkedArray)):
buffer = _VaexBuffer(np.array(self._col.tolist(), dtype=bool))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why that happens, I'll see if a test fails because of that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should restate: the values changed in a strange way when transferring data with buffer_to_ndarray in the case of a bool arrow array if .to_numpy was used.

But I will try to use previous suggestion (.values/.evaluate()) for arrow and numpy and I will delete the comment if it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the same strange change of values for evaluate() also. It happens in the case of arrow arrays (int and bool) with missing values.

df = vaex.from_arrays(
    arrow_int_m = pa.array([0, 1, 2, None, 0], mask=np.array([0, 0, 0, 1, 1], dtype=bool)),
    arrow_float_m = pa.array([0.5, 1.5, 2.5, None, 0.5], mask=np.array([0, 0, 0, 1, 0], dtype=bool)),
    arrow_bool_m = pa.array([True, False, True, None, True], mask=np.array([0, 0, 1, 1, 0], dtype=bool))
)

col = df.__dataframe__().get_column_by_name('arrow_int_m')
b, d = col.get_buffers()["data"]
buffer_to_ndarray(b, d)

output:

array([                  0, 4607182418800017408, 4611686018427387904,
         -2251799813685248,   -2251799813685248], dtype=int64)

I can commit with the error so you can have a look?

packages/vaex-core/vaex/dataframe_protocol.py Show resolved Hide resolved
packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
Comment on lines 681 to 686
size = self.num_rows()
i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
iterator = []
for i1, i2, chunk in i:
iterator.append(_VaexColumn(self._df[i1:i2]))
return iterator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I misunderstood, but I think we can simply do

Suggested change
size = self.num_rows()
i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
iterator = []
for i1, i2, chunk in i:
iterator.append(_VaexColumn(self._df[i1:i2]))
return iterator
yield self

Similar to https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/pandas_implementation.py#L766

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this part thinking one could read a dataframe (that is not chunked) in chunks specifying the n_chunks (item 12 from protocol-design-requirements).

But there are some errors left, sorry about that. It should be:

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["_VaexDataFrame"]:
        """
        Return an iterator yielding the chunks.
        TODO: details on ``n_chunks``
        """
        if n_chunks == None:
            size = self.num_rows()
            n_chunks = self.num_chunks()
            i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
            iterator = []
            for i1, i2, chunk in i:
                iterator.append(_VaexDataFrame(self._df[i1:i2]))
            return iterator
        elif self.num_chunks() == 1:
            size = self.num_rows()
            i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
            iterator = []
            for i1, i2, chunk in i:
                iterator.append(_VaexDataFrame(self._df[i1:i2]))
            return iterator
        else:
            raise ValueError("Dataframe is already chunked.")

packages/vaex-core/vaex/dataframe_protocol.py Show resolved Hide resolved
@AlenkaF AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from 00cab17 to ba8cef8 Compare September 17, 2021 12:40
@maartenbreddels
Copy link
Member

Awesome to see the strings coming in, getting close!
I pushed some a change in how to consume the buffers for strings, which is more efficient (see vaex/arrow/convert.py for some more code examples), and I think we should also use that method to produce the buffers (so that my new mem copy test does not fail).

@AlenkaF AlenkaF force-pushed the dataframe-protocol branch from 1133384 to a81f4f5 Compare October 6, 2021 06:03
@AlenkaF
Copy link
Contributor Author

AlenkaF commented Oct 6, 2021

@maartenbreddels I added the method suggested to produce the buffers for string dtpe. Also added the mask handling in the convert_string_column(). I think this is ready for another round of review. Thanks!

@maartenbreddels maartenbreddels merged commit d5410f8 into vaexio:master Oct 13, 2021
@maartenbreddels
Copy link
Member

@AlenkaF many many things for your work, this is actually already released in vaex-core 4.6.0a3

@rgommers
Copy link
Contributor

This is really great to see - thanks a lot @AlenkaF and @maartenbreddels!

xdssio pushed a commit to xdssio/vaex that referenced this pull request Dec 31, 2021
* 🐛 handle offset for categories

* Draft for the Dataframe interchange protocol

* Adding test for virtual column plus typo.

* Roundtrip test change plus some corrections in functions parameters

* Apply suggestions from code review

* Dtype for arrow dict plus use of arrow dict in convert_categorical_column

* Add missing value handling

* Added chunk handling and tests

* Corrected usage of metadata for categories

* Applying changes from general dataframe protocol

* Delete copy error

* Change sentinel value handling in convert_categorical_column

* Add select_columns() and test

* Update to _get_data_buffer() for Arrow Dictionary

* Minor commenting changes

* Correct typo error

* Add _VaexBuffer test

* Add tests and correction for _VaexColumn

* Added tests for _VaexDataFrame

* Added more tests and one correction for format_str

* format to LF and black

* support passing in allow_copy

* correct descibe_null for arrow and numpy

* correct _get_validity_buffer to match describe_null

* correct describe_null, convert_categorical_column and test_categorical_ordinal for categorical dtypes

* Apply suggestions from code review

* correct get_chunks for _VaexDataFrame

* Replace return with yield in get_chunks

* Check for LF and run black with -l 220

* Black with line length 220

* Add string dtype support

* Add Arrow Dict check to describe_categorical

* avoid copying data for strings

* small fix

* also test sliced dataframe

* test that we do not copy data

* Apply string no-mem copy suggestions

* fix and test get_chunks

* use future ordinal encoding feature

* make test work with dict encoded

Co-authored-by: Maarten A. Breddels <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
maartenbreddels added a commit to maartenbreddels/plotly.py that referenced this pull request Sep 30, 2022
This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html

Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)
anmyachev pushed a commit to anmyachev/plotly.py that referenced this pull request Jun 12, 2023
This allows plotly express to take in any dataframe that supports
the dataframe protocol, see:
https://data-apis.org/blog/dataframe_protocol_rfc/
https://data-apis.org/dataframe-protocol/latest/index.html

Test includes an example with vaex, which should work with
vaexio/vaex#1509
(not yet released)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants