Allow to reconstruct a library-specific DataFrame object from an interchange object #85
Comments
Thanks for the summary @jorisvandenbossche. @thomasjpfan might also be interested in this discussion.
That seems like a very useful thing to support indeed.
I'm inclined to go this route, for a couple of reasons:
The signature that this constructor method should have is not 100% obvious though. Maybe the input dataframe has properties that need preserving (e.g. a
And a separate question: given that the shape of the returned dataframe may be different from the input shape, and column names etc. may be different, does the user (scikit-learn here) need any functionality to construct new dataframe interchange objects? Or are we expecting them to reinvent the wheel there?
The scikit-learn "transformer" use case would only need a standard way to call However other parts of scikit-learn would benefit from a standard API:
Note that if a standard Implementing
@ogrisel I'm trying to interpret the "subdataframe/view" part here, but I'm not 100% sure what you mean. The current protocol has a
That is worth considering for the protocol perhaps. A purely positional
@ogrisel do note that, as there are now very few libraries actually supporting the protocol, you can rather easily introspect which library the interchange object belongs to, and then import the correct
Something like (untested):
def _get_from_df(xchg_obj):
    lib = xchg_obj.__class__.__module__.split('.')[0]
    if lib == 'modin':
        from modin.utils import from_dataframe
    elif lib == 'pandas':
        from pandas.api.interchange import from_dataframe
    # add more branches for vaex and cudf
    else:
        raise RuntimeError(f'Unknown library: {lib}')
    return from_dataframe
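The same introspection idea could also be written as a lookup table with lazy imports, so that supporting another library means adding one entry rather than another branch. This is only a minimal sketch: the names `_FROM_DATAFRAME_LOCATIONS` and `get_from_dataframe` are hypothetical, the modin location is copied from the untested snippet above, and nothing here has been checked against modin, vaex or cudf.

```python
import importlib

# Hypothetical table: top-level package of the interchange object mapped to
# the module that is assumed to expose a public ``from_dataframe``.
# The pandas entry assumes pandas >= 1.5; the modin entry is taken as-is
# from the untested snippet in this thread.
_FROM_DATAFRAME_LOCATIONS = {
    "pandas": "pandas.api.interchange",
    "modin": "modin.utils",
}


def get_from_dataframe(xchg_obj, locations=_FROM_DATAFRAME_LOCATIONS):
    """Return the ``from_dataframe`` of the library that created ``xchg_obj``."""
    # The first component of the class's module path identifies the library.
    lib = type(xchg_obj).__module__.split(".")[0]
    try:
        module_name = locations[lib]
    except KeyError:
        raise RuntimeError(f"Unknown library: {lib}") from None
    # Import lazily so that only the library actually in use gets loaded.
    module = importlib.import_module(module_name)
    return module.from_dataframe
```

A consumer would then call `get_from_dataframe(xchg)(result)` to rebuild a native dataframe. The table still has to be maintained by every consumer, which is the limitation the rest of this thread is trying to remove.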
This indeed would be enough with the addition of a standard way to rebuild the dataframe object with a public
Folks on the call last Thursday seemed to be happy with this idea, as long as this is a constructor that can be retrieved directly from the dataframe object in the interchange protocol - that would make it easier to reconstruct a dataframe from the correct library. In the absence of it, users are probably more likely to grab the Pandas
xref gh-42 for the signature of this constructor.
From the discussions at EuroScipy with scikit-learn developers (cc @ogrisel), the following use case came to mind: assume you have a method that transforms your data, a workflow could be:
from_dataframe from the input object's library)
This last step is currently not possible, because you don't (want to) know each possible library that implements __dataframe__ and where its from_dataframe lives.
This is very much related to a possible "namespace" like the array api uses (cfr #79).
With that this could look like:
But we could also think about (shorter-term) alternatives directly tied to the interchange protocol object. For example, we could have a class method or attribute that points to the from_dataframe method of the library that created the object.
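To make the shorter-term alternative concrete, here is a toy sketch of an interchange object that carries a pointer to its library's constructor. Everything in it is hypothetical: `ToyDataFrame`, `ToyInterchangeFrame`, `toy_from_dataframe` and `roundtrip` are stand-ins, and the `from_dataframe` attribute on the interchange object is the proposal under discussion, not part of the current protocol.

```python
class ToyDataFrame:
    """Stand-in for a library's native dataframe (hypothetical)."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        return ToyInterchangeFrame(self.columns)


def toy_from_dataframe(obj, allow_copy=True):
    """Stand-in for the library's public ``from_dataframe`` function."""
    xchg = obj.__dataframe__(allow_copy=allow_copy)
    return ToyDataFrame(xchg._columns)


class ToyInterchangeFrame:
    """Stand-in for the library's interchange object (hypothetical)."""

    # Proposed addition (an assumption, not in the current protocol):
    # an attribute pointing at the constructor of the producing library.
    from_dataframe = staticmethod(toy_from_dataframe)

    def __init__(self, columns):
        self._columns = dict(columns)

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        return self

    def column_names(self):
        return list(self._columns)


def roundtrip(df_like):
    """Consumer code that never imports or names the producing library."""
    xchg = df_like.__dataframe__()
    # The constructor is looked up on the interchange object itself.
    return xchg.from_dataframe(xchg)


rebuilt = roundtrip(ToyDataFrame({"a": [1, 2], "b": [3.0, 4.0]}))
```

With such an attribute, the consumer needs no per-library lookup. The sketch only covers reconstruction of unchanged data; how a consumer would build a new interchange object holding transformed data is the separate open question raised in the comments.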