Interchange dataframe protocol #9071
…al without missing values
…on the dataframe object
Can one of the admins verify this patch?
Hi @iskode, thanks for the contribution! We're aware of the Data API consortium and are on board with eventually moving in that direction. However, I don't think that the DataFrame standard is quite stable enough yet for us to be adding support (unlike the array API, which has been released). @shwina is the RAPIDS representative in that group, though, and can probably speak to this in more detail.
Ok to test
Hi @iskode - to unblock CI, could you please resolve the conflicts on this PR when you have a chance? Specifically
…cp.asarray' to enforce zero-copy
It's OK now, I think.
Great work @iskode! Thanks for working through the mypy bits, I apologize for not getting back to you sooner. The primary change I'd like to see is removing the aliases of _k = _DtypeKind in favor of using _DtypeKind directly. I marked some but not all instances of that pattern.
Otherwise we're down to pretty minor comments. I would be happy enough with the current state, so I'm approving. I'll let @shwina merge when ready.
Co-authored-by: Bradley Dice <[email protected]>
rerun tests
Java unit tests are not passing; it looks like a regression in the Java code. Between the two builds, the number of unit test failures went down from 58 to 55, so recent changes made 3 more tests pass. Is it possible to mute the Java tests to see whether the dataframe protocol implementation passes all the other tests and gpuCI?
Don't worry about Java tests. There are lots of other changes in flight right now that are causing problems there, but none of them are related to you. If everything other than the Java tests passes, then I think we're good to go.
@shwina the PR has gone through a few rounds of review since you approved, so I'll let you do the honors and merge when you're ready.
I'm going to go ahead and merge this. Thank you @iskode for all your hard work on this PR and for your patience with multiple rounds of reviews! Fantastic job! For anyone wanting to follow up on this work, here are a couple of suggestions for the future:
@gpucibot merge
Thank you so much for your investment (in particular @shwina and the rest of the team) in assisting me during this process. It has been a great pleasure and a rewarding adventure to work with you. I've learned many things along the way. The next steps are very interesting and well justified.
This PR is a basic implementation of the interchange dataframe protocol for cudf.
As is well known, there are many dataframe libraries out there, and one library's weakness is often handled by another. To work across these libraries, we currently rely on pandas, through methods like from_pandas and to_pandas. This is a poor design, because every library has to maintain an additional dependency on pandas and its peculiarities.
This protocol provides a high-level API that dataframe libraries must implement to allow communication between them.
Thus, we get rid of the tight coupling with pandas and depend only on the protocol API, where each library is free to choose its own implementation details.
To illustrate:
df_obj = cudf_dataframe.__dataframe__()
Here, df_obj can be consumed by any library implementing the protocol.
df = cudf.from_dataframe(any_supported_dataframe)
Here, we create a cudf dataframe from any dataframe object supporting the protocol.
So far, the implementation supports the following dtypes:
uint8, int, float, bool and categorical. string support is on the way.
Additionally, we support dataframes from CPU devices, such as pandas dataframes. However, this is not testable here because pandas has not yet adopted the protocol. We've tested it locally with a monkey-patched pandas implementation of the protocol.
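The producer/consumer split described above can be sketched in plain Python. This is a toy model written for illustration only: every class here (ToyColumn, ToyInterchangeFrame, ProducerFrame, ConsumerFrame) and the module-level from_dataframe function are hypothetical, the to_list helper is not part of the protocol (the real protocol exchanges typed buffers rather than Python lists), and none of this is the cudf implementation. Only the `__dataframe__` entry point and the general shape of the accessor methods are taken from the protocol.

```python
from typing import Any, Dict, List


class ToyColumn:
    """Hypothetical column wrapper standing in for a protocol column."""

    def __init__(self, values: List[Any]) -> None:
        self._values = list(values)

    def to_list(self) -> List[Any]:
        # Illustrative shortcut; a real consumer would read typed buffers.
        return list(self._values)


class ToyInterchangeFrame:
    """Hypothetical interchange object returned by __dataframe__()."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def num_rows(self) -> int:
        return len(next(iter(self._data.values()), []))

    def column_names(self) -> List[str]:
        return list(self._data)

    def get_column_by_name(self, name: str) -> ToyColumn:
        return ToyColumn(self._data[name])


class ProducerFrame:
    """Stand-in for any library's dataframe that implements the protocol."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self._data = data

    def __dataframe__(self) -> ToyInterchangeFrame:
        return ToyInterchangeFrame(self._data)


class ConsumerFrame:
    """Stand-in for the consuming library's own dataframe type."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self.columns = columns


def from_dataframe(obj: Any) -> ConsumerFrame:
    """Build a ConsumerFrame using only the interchange API of obj."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not implement __dataframe__")
    xchg = obj.__dataframe__()
    return ConsumerFrame(
        {name: xchg.get_column_by_name(name).to_list() for name in xchg.column_names()}
    )
```

The consumer never imports or inspects the producer's concrete type; it only calls `__dataframe__()` and the accessor methods, which is the decoupling from pandas that the description argues for.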