-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a prototype of the dataframe interchange protocol #38
Conversation
Related to requirements in gh-35. TBD (to be discussed) comments and design decisions at the top of the file indicate topics for closer review/discussion.
Design decisions | ||
---------------- | ||
|
||
**1. Use a separate column abstraction in addition to a dataframe interface.** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is an interchange protocol I do not see value in having a column abstraction. If we want a way to treat a dataframe as an array we should make that explicit rather than through a column interface. The Modin implementation does not support a "column" abstraction, rather a dataframe with 1 column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Devin, this is a good discussion to have. As the note 7 lines below says, this is not user-facing API. It seems a useful abstraction to in the end get down to a description of memory. There are multiple reasonable ways to model a dataframe though. What this design does is use 5 concepts in 3 classes:
- A
Buffer
class. A "buffer" is a contiguous block of memory - this is the only thing that actually maps to a 1-D array in a sense that it could be converted to NumPy, CuPy, et al. - A
Column
class. A "column" has a name and a single dtype. It can consist of multiple "chunks". A single chunk of a column (which may be the whole column ifnum_chunks == 1
) is modeled as again aColumn
instance, and contains 1 data buffer and (optionally) one "mask" for missing data. - A
DataFrame
class. A "data frame" is an ordered collection of columns. It has a single device, and all its rows are the same length. It can consist of multiple "chunks". A single chunk of a data frame is modeled as again aDataFrame
instance. - A "mask" concept. A mask of a single-chunk column is a buffer.
- A "chunk" concept. A chunk is a sub-dividing element that can be applied to a data frame or a column.
These I think are all the actual abstractions we use when talking about data frames. It's fairly explicit and easy to understand. An alternative approach would be to do something like what Arrow did - its "buffer" is the same as in this design, but it doesn't use any of the other concepts - its "array" (which is a little like Column
here and not at all like the "array" we use in the array API standard) and "children" are more ways of combining buffers than that they directly map to data frame concepts. Its approach seems to be to use the minimal number of concepts to describe memory layout of a data frame, and leave it up to its users to map that to their actual data frame concepts.
Taking into account our previous discussions of this topic as well as the early prototypes sketches out by @wesm and @kkraus14, I believe we did want something that looked more like a minimal dataframe API with explicitly named methods aimed at library developers. Given that, I think this is at least in the right direction. The main question regarding these concepts that I wasn't sure about is: should a column be allowed to have multiple buffers yes or no (in a single chunk)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the note 7 lines below says, this is not user-facing API
Yes, I understood this. A better wording of my statement would have been "I do not see value in having a column abstraction in an interchange protocol."
I guess I do not understand the rationales here and how that maps to the interchange protocol.
- This is how it works in R, Julia and Apache Arrow.
- Semantically most existing applications and users treat a column similar to a 1-D array
- We should be able to connect a column to the array data interchange mechanism(s)
Here you have given examples, and make a point about applications and users, along with the array API. This does not seem to be about the interchange protocol.
Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?
I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have missed a week or two scrambling toward paper/talk deadlines 😄. Perhaps I would have pushed back (or at least forced a more directed discussion) on some things that were decided if I were there.
Now is a good time to push:) Nothing is set in stone.
I do not see value in having a column abstraction in an interchange protocol."
So what would you do with the methods that are now on Column
but not on DataFrame
? Attach them to DataFrame
and make them raise an exception unless it's a single-column dataframe?
I guess I do not understand the rationales here and how that maps to the interchange protocol.
Good point. Those rationales were a summary of opinions on wesm/dataframe-protocol#1, I didn't put enough thought into whether they did or did not apply here.
Maybe I misunderstood the purpose of this PR based on your comment. It seems you intend to extend this interchange protocol into the developer API we have been talking about. Is that correct?
In our last meeting we discussed that. There's no such intention to extend currently I'd say, although it's a possibility. The mood was more "let's get this interchange protocol done, and then see based on experience and adoption what the right approach is for a more complete API".
Personally I think a developer API may have some similarities, but will not be a superset of this interchange protocol. For example, I'm not convinced it should have a column abstraction, and certainly not memory-oriented APIs like get_data_buffer
or get_mask
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I am one of the guys who push for getting things out, I think we need to be careful when adding new abstractions. I am for a minimally viable protocol in the first version, and I second that adding a Column and even a Buffer seems to follow the path of the current affairs without critically challenging the need in the first version. This is especially dangerous as it has a tendency to become another version to please everyone, especially Pandas' accustomed user base. IMHO we need to keep the balance of usefulness and adoption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can call it Array
instead of Column
, then we don't have a Column abstraction any more ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub does not have the range of emoji's needed to accurately reflect my initial reaction to that:)
This is better for implementations that may rely on, for example, lazy | ||
computation. | ||
|
||
**3. No row names. If a library uses row names, use a regular column for them.** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with this. If two libraries support row labels it would generally be better for both libraries if there was a way to convert between row labels. In general, row labels are metadata and moving metadata into the data can be costly and require copying. I think this places too much of a constraint on dataframe implementations and assumes that moving metadata into data is cheap, which is certainly not the case in Modin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on this, this was my interpretation of the linked discussion. Happy to change, let's wait for a few more opinions though - @kkraus14, @maartenbreddels, @TomAugspurger and @datapythonista all seemed to lean towards "no row labels".
Maybe my adding "use a regular column for them" isn't correct though - perhaps should simply have been "is not supported".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly voiced my opinion on row names in that they should either be required and part of the spec or entirely not part of the spec. Making them optional just ends up with pushing everyone to implement them which is what we've seen with everyone following the Pandas API.
In the early days of cuDF we tried to take the stance that we won't implement row names (indexes), eventually enough people pushed back that for Pandas compatibility it's needed that we added a single column Index which then eventually succumbed to the same argument for eventually supporting MultiIndexes. Generally all of the arguments were always just for Pandas compatibility / minimizing changes to existing code as opposed to better implementations or better code paths in any way.
+1 to just saying "is not supported"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kkraus14 I don't agree with this idea:
Generally all of the arguments were always just for Pandas compatibility / minimizing changes to existing code as opposed to better implementations or better code paths in any way.
All of the discussions against row labels ignore the users and questions about why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep seeded opinions on user centered designs vs performance centered designs. Sometimes those views are not compatible.
I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.
I have a concern that narrowing this down into the least common denominator will result in something that is not particularly useful. I mentioned this in the live meetings but throwing specific exceptions or warnings should absolutely be a part of the spec and can help with this question about optional implementation details.
If there is no support for labels then the only two options are to (1) implicitly move the metadata into the data, which as I mentioned might be expensive, or (2) drop the row labels altogether, which might contain relevant data. Without row labels, there is no operator to explicitly move row labels into the data (reset_index
in pandas), so I don't see any other way to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the discussions against row labels ignore the users and questions about why they exist in the first place. Yes, it is easier to have an implementation without them, I agree. I would argue that it is not about "legacy compatibility" of changing existing code. It has become a part of the mental model of (millions of) users, and row names were a part of S and R dataframes long before pandas. I have never been a fan of the idea that users should change their mental models of a system to fit what the library authors found easy to implement or optimize. Maybe this is not the place to discuss the validity of row labels as a concept, because it likely boils down to deep seeded opinions on user centered designs vs performance centered designs. Sometimes those views are not compatible.
95+% of the use cases we saw were people not actually using indexes in any meaningful way. They were constantly using reset_index
to get back to RangeIndex
objects and then using __getitem__
calls to access row numbers instead of using iloc
. Similarly, with MultiIndex we almost always saw just because an operation like groupby
or join
return a MultiIndex by default and instead of changing the code to use an as_index=False
option or something similar, they'd just add a reset_index
immediately afterwards.
I do not think making row labels optional will force everyone to implement them. This is a developer API, so if something is optional, the developers using it will understand that and handle things appropriately. If this were an end user API things might be different.
If row labels are optional then presumably Pandas will support it and developers will end up continuing to use it unnecessarily and inefficiently, which will then push other libraries to implement it unnecessarily from wanting to be able to integrate.
For what it's worth, on the cuDF side we now somewhat nicely handle the cases of indexing / multi-indexing, but I am pushing to avoid new DataFrame libraries from having to go through the same pain that we did unless they explicitly want developers to use indexing features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, absent the argument is any discussion on the user's mental model.
I'm not sure that's the argument. My experience is the same as @kkraus14's with code I got from less experienced colleagues in the past: they just use the defaults (which can be as_index=True
), then get to "this stuff is weird, how do I get rid of it", then hit upon .reset_index()
via a StackOverflow post. There may be some advanced users who make good use of row labels, but I suspect it's a small-ish minority. For the rest, row labels do not seem to be part of their mental model.
Would be good to get some pandas maintainer opinions on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I am a bit a biased pandas maintainer, as I would like to see pandas get rid of default row indices; "default" to be clear, still have the functionality if you explicitly ask for it)
I think regarding data exchange, the index can be passed through just like another column? In which case it might a question whether we want to make it possible to give an indication that a certain column(s) can be used as index (eg a describe_index()
that returns the names of index columns, and someone not interested in that can just ignore that method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revisiting this, passing the row labels as a separate buffer (or a "column buffer") would work for me as long as the labels would not need to be moved into the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, that's helpful @devin-petersohn. I think it could be combined with a metadata
escape hatch perhaps, that would also be separate from the regular data, and that we discussed last time as being desirable to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update on this, since we had multiple discussions on this: in the end we settled on going with metadata
(added in gh-43), and decided that libraries could store anything there (it's a Dict[str, Any]
attribute) so indeed things like row labels or what the index column is can be stored there and it doesn't need to be moved into the set of regular columns. This seemed to work for everyone.
protocol/dataframe_protocol.py
Outdated
|
||
def __array_interface__(self): | ||
""" | ||
TBD: implement or not? Will work for all dtypes except bit masks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is just a buffer we don't need to worry about bit masks at this level. That being said, what would the behavior of a GPU-memory buffer be here? I'd push that since we're making a new protocol we should make it device agnostic from the beginning to prevent people from using things that aren't device agnostic, but I'm obviously biased 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the whole design is definitely meant to be device-agnostic.
__array_interface__
itself doesn't have a device concept, but I think by design it's similar to DLPack and the shape
, typestr
, offset
and version
of https://numpy.org/doc/stable/reference/arrays.interface.html will work just fine on GPUs?
That said, that is a bit of an abuse of the protocol and the only reason to add it would be to save people that don't want to implement DLPack support some effort (__array_interface__
is the only pure-Python memory description protocol).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we shoved a GPU pointer into __array_interface__
and then a library like numpy tried to read it it would likely segfault or otherwise explode in a not great way. That's the whole reason we implemented __cuda_array_interface__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we shoved a GPU pointer into
__array_interface__
and then a library like numpy tried to read it it would likely segfault or otherwise explode in a not great way.
That'd be consumer library error. Since there's DataFrame.device
, the consuming dataframe library must not be calling NumPy unless the device is CPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is an array protocol where presumably there's plenty of array libraries that implement this that don't know to check DataFrame.device
. If this was something like __column_interface__
then I could imagine it, but this is an existing protocol so we can't add extra pieces to it unless we push those pieces into the actual protocol from my perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving:) That was my initial thinking @jorisvandenbossche. __array_interface__
should work, especially if device
moves to the buffer level. Could be implemented as a property that raises unless device == cpu
.
(or
__array__
, or__arrow_array__
).
Those are a less good idea, because they rely on NumPy or Arrow being present, which we cannot force array libraries to depend on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that will be easier to interface with the interchange protocol than manually converting the buffers.
That doesn't sound pleasant. The idea would be not to do manual memory management, but implement __dlpack__
in NumPy (NEP will be ready soon), and then Pandas et al. can simply call np.from_dlpack(df)
if they want to turn the buffer into a NumPy array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for me a big unclear question is how people are going to use this. For example, I assume I am a dataframe library that uses pyarrow under the hood, and I get some object using this protocol. With the current interface, I would still need to manually use the buffers to construct arrow arrays?
Those are a less good idea, because they rely on NumPy or Arrow being present, which we cannot force array libraries to depend on.
It only "forces" the consumer to have it, I think? (and if they already use it, it is not forcing anybody, but just convenient IMO)
It's true that you need a lazy import of numpy/arrow, of course. __array_interface__
is also fine for me, the end result is the same for the user (np.asarray(buffer)
working).
We could discuss a similar __arrow_array_interface__
if that would be more acceptable than __arrow_array__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only "forces" the consumer to have it, I think?
No, the semantics of __array__()
are: it must return a numpy array. The producer has to do the work here. What you're saying is true for __array_interface__
, not __array__
.
__arrow_array__
works like __array__
: https://arrow.apache.org/docs/python/extending_types.html#controlling-conversion-to-pyarrow-array-with-the-arrow-array-protocol
We could discuss a similar
__arrow_array_interface__
if
If it's at the buffer level, that would be identical to __array_interface__
- it's just a description of memory.
I think for me a big unclear question is how people are going to use this. For example, I assume I am a dataframe library that uses pyarrow under the hood, and I get some object using this protocol. With the current interface, I would still need to manually use the buffers to construct arrow arrays?
Yep that's a great question. Right now you need to convert data buffer by buffer. I imagine Pandas would have a from_dataframe
function that would:
- iterate over columns
- per column, check dtype and number of buffers
- obtain column data:
a. if a single buffer, docol = np.from_dlpack(buffer)
b. if multiple buffers/chunks, then instantiate a numpy array of the correct dtype for the column, e.g. withnp.empty
, (or an Arrow array if you have dtypes that are Arrow-backed?), and copy the buffers into the array - call a
pd.DataFrame
constructor with all the column arrays and column names/dtypes.
(add an iteration over chunks if num_chunks > 1
)
There's other checks to be made:
- raise if
device
isn't CPU - if there's bit masks and you want a numpy array, need to either figure out how to do that with
np.frombuffer
(better), or be lazy and call__dataframe__
again withnan_as_null = True
.
It will be instructive to write this implementation, shouldn't be that hard for the simpler dtypes/cases. I could perhaps have a go at that over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most interesting topic here is actually not "shall we have __array__
or not, __array_interface__
or not", but "how do you effectively iterate over all chunks/columns/buffers with as little data copying as possible".
protocol/dataframe_protocol.py
Outdated
Kind : | ||
|
||
- 0 : signed integer | ||
- 1 : unsigned integer | ||
- 2 : IEEE floating point | ||
- 20 : boolean | ||
- 21 : string (UTF-8) | ||
- 22 : datetime | ||
- 23 : categorical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both Kind
and Format string
. Seems like duplicate information except Kind
is less flexible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Format strings are quite annoying to parse, so for anything besides datetime I think library authors would have a preference for using kind
+ bit_width
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the answer be for us to expand Kind
in order to sufficiently handle all of the cases that format string
handles then? The way I read this spec is everyone will have to produce the format strings regardless as of now so they'll have to deal with the annoyance regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, however producing format string is much easier than parsing - I think it's a minor annoyance.
I don't see how we'd cover all the datetime formats including time zones without a format string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use 16 kind
integers for the 16 formats in https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings, then a separate string field for timezone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about supporting things like Decimal or Fixed Size Lists in the future where we need to specify some semi arbitrary attributes alongside the dtype? If we do then we should make sure that whatever specification we decide on here can eventually support those gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should at least keep the possibility open to support more data types like decimal and nested types (which probably means we need format strings?)
- "mapping" : dict, Python-level only (e.g. ``{int: str}``). | ||
None if not a dictionary-style categorical. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the actual dictionary mapping? I.e. if I had a column of ['gold', 'bronze', 'silver', 'gold', 'silver', 'bronze']
with a dictionary of ['gold' < 'silver' < 'bronze']
would this mapping
key be: {0: 'gold', 1: 'silver', 2: 'bronze'}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a HUGE -1 on this. We'd want our dictionary to just be another column which is on the device for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you put a dictionary in a single column? It'd be a dataframe of two columns, index
and value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold']
with categories of ['gold' < 'silver' < 'bronze']
using Arrow formatting:
categorical column: {
mask_buffer: [119], # 01110111 in binary
data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
children: [
string column: {
mask_buffer: None,
offsets_buffer: [0, 4, 10, 16],
data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
}
]
}
In the case of cuDF all of these buffers on the GPU and getting a host mapping of {0: 'gold', 1: 'silver', 2: 'bronze'}
would be extremely expensive for any sufficiently large dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A note on the above example: children
is actually named dictionary
in ArrowSchema
(a separate struct in the schema rather than being one of the child arrays; history for that decision a little unclear).
protocol/dataframe_protocol.py
Outdated
# # NOTE: not needed unless one considers nested dtypes | ||
# def get_children(self) -> Iterable[Column]: | ||
# """ | ||
# Children columns underneath the column, each object in this iterator | ||
# must adhere to the column specification | ||
# """ | ||
# pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we handling strings without this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing some context for this question. NumPy-like fixed length strings would be handled as a single buffer; for variable-length strings it's unclear to me how you would handle them even with get_children
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example) and then a Column of integer offsets into the bytes. So for example, say you had a column of strings of: ['joe', null, 'bob', '']
then under the hood the column would look like:
string column: {
data_buffer: None,
mask_buffer: [13], # 00001101 in binary
children: [
characters int8 column: {
data_buffer: [106, 111, 101, 98, 111, 98],
mask_buffer: None
},
offsets int32 column: {
data_buffer: [0, 3, 3, 6, 6],
mask_buffer: None
}
]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also how we handle strings in cuDF and I suspect that generally people will want variable length strings as opposed to fixed length strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For variable-length strings the way Arrow handles them is by having a Column of bytes (int8 / uint8 for example)
It has multiple ways of doing this? There's exactly zero specification of this in https://arrow.apache.org/docs/format/CDataInterface.html, and I can't find a separate description either. All I see is it uses format strings 'u'
and 'U'
for regular and large utf-8 strings.
Would you expect all libraries to handle multiple flavors of variable-length strings? So if the type specifiers say "utf-8 string", then there's some self-describing format for how that's laid out with multiple child buffers, each of which can have a different dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a specialization on
Variable Binary
https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout.
:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.
So Arrow doesn't necessarily use child Columns in the case of Strings, but 3 buffers:
validity
,offsets
, anddata
.
Thanks for the explanation. I understand now. Okay, that indeed requires a change from what I have in this PR currently. There's three options:
- add children columns
- add multiple buffers per column (actually we have
data
andmask
, so just addingoffsets
would be enough). - do both (1) and (2) - perhaps handy to future-proof it for nested dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be in favor of option 3 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:( It's like they want people to not understand them. The writing is so terse - there's zero connection to strings in the docs there.
You are welcome to open an issue about that ;)
(but to be clear, you are totally right. I think the explanation of variable size binary is decent (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout), but if you don't know that string is just variable size binary where the bytes represent UTF8, that doesn't help much ..)
Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are welcome to open an issue about that ;)
I know:) Maybe more thinking about a whole docs section with examples and mapping dataframe concepts to Arrow concepts.
Also in favor of option 3 (I would personally just follow Arrow's layout of multiple buffers + children arrays)
A worry may be the amount of complexity that a consumer is expected to parse. Arbitrary many buffers and nested child arrays is hugely complex; right now there seems to be only the need for:
- <=3 buffers (
data
,offsets
,mask
) - one child level for categoricals or strings.
So it might make sense to do that the same way as Arrow, but restrict it to that - everything else is either producer error or an unsupported dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take the 3 buffer approach for string columns we actually don't need a child level for strings as it would be represented directly by the 3 buffers.
If we don't take the 3 buffer approach for string columns and instead go with the children approach, then we'd need 2 child levels to handle a categorical column who's categories are a string column.
A column object, with only the methods and properties required by the | ||
interchange protocol defined. | ||
|
||
A column can contain one or more chunks. Each chunk can contain either one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.
Eg what does get_data_buffer
return for a Column with multiple chunks?
(in Arrow there is the concept of Array and ChunkedArray (and this represents a column of a DataFrame), where the second is just a vector of same-typed arrays)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eg what does
get_data_buffer
return for a Column with multiple chunks?
That should raise.
I find it a bit confusing that we use the same object for a collection of chunks as the chunk itself.
It's possible to have a separate object like ChunkedColumn
, but then for symmetry it should also have ChunkedDataFrame
probably.
That'll be an interesting discussion to have today. @devin-petersohn preferred fewer concepts (no Column
). I think the two extremes here are:
- Only
DataFrame
andBuffer
- All of
DataFrame
,ChunkedDataFrame
,Column
,ChunkedColumn
, andBuffer
.
In the end fewer separate classes just means more methods/properties on classes that will raise when they're called in a situation that's not appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DataFrame could always be chunked, so DataFrame -> ChunkedColumn (with potentially only 1 chunk), to avoid the additional class. But always having to first access the single chunk (if you don't work with multiple chunks) also has its disadvantages of course).
""" | ||
|
||
|
||
class Buffer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that didn't occur to me in the discussion we just had: if we get rid of this Buffer
class and change it to a plain dict
, then we cannot attach __dlpack__
to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could still be attached to the Column chunk, in which case it would only work if the column is backed by a single buffer. But I suppose that is limiting the use of dlpack too much? (because that would mean there is no easy way to get a separate buffer as a (numpy) array)
# either make a copy or hold on to a reference of the column or | ||
# buffer! (not done yet, this is pretty awful ...) | ||
x = np.ctypeslib.as_array(data_pointer, | ||
shape=(_buffer.bufsize // (bitwidth//8),)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ugliness here is the main problem with passing around pointers in a Python-only protocol. If anyone has better ideas, I'd love to hear them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In __cuda_array_interface__
we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.
I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In
__cuda_array_interface__
we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.
Yes that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just store a reference to the whole exchange dataframe object (buffers can end up elsewhere outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution.
__cuda_array_interface__
is directly attached to the object you need to hold on to, which is not the case for this Buffer
.
I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.
Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__
, and the problem goes away. The dtypes that DLPack does not support are more of an issue.
I added an implementation of the protocol for Pandas, to the point where it round-trips for columns with regular numpy dtypes. Hopefully that makes things more concrete. Most of the API feels okay, just the buffer object doesn't. |
2. A `Column` class. A *column* has a name and a single dtype. It can consist | ||
of multiple *chunks*. A single chunk of a column (which may be the whole | ||
column if ``num_chunks == 1``) is modeled as again a `Column` instance, and | ||
contains 1 data *buffer* and (optionally) one *mask* for missing data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saying a Column contains "1 data buffer" is a bit ambiguous. For a Column of strings is the characters buffer or the offsets buffer the data buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't update this bit yet, sorry - we decided already we needed Arrow-like children to properly support strings and categoricals.
# either make a copy or hold on to a reference of the column or | ||
# buffer! (not done yet, this is pretty awful ...) | ||
x = np.ctypeslib.as_array(data_pointer, | ||
shape=(_buffer.bufsize // (bitwidth//8),)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In __cuda_array_interface__
we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.
I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.
This shows the simple design doesn't fully work (see the FIXMEs in the diff). Instead, the `children` concept is needed. That way the categorical encoded data values can be returned as a child Column rather than a Buffer, and hence there's the necessary Column.dtype to interpret the buffer backing the column.
else: | ||
raise NotImplementedError(f"Data type {self._col.dtype} not handled yet") | ||
|
||
return buffer, dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pointing this out: the method to get at a buffer must also return the dtype needed to interpret the buffer. Because when you have multiple buffers on a column (we decided to have buffer
, mask
, offset
) you cannot look at Column.dtype
- that may be categorical or string, which doesn't help in interpreting the buffer.
This was slightly counterintuitive, but it was the thing necessary to make a categorical dtype work (see roundtrip test at end of this file).
Note: variable-length strings not included yet, that'll require children
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be an argument to use a child column instead of a buffer in this situation. What is the data buffer for a Categorical Column, the codes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it seemed natural to make it the codes (nb. Pandas calls it codes
, Arrow "index"), and have the mapping separately. However the issue is that there should be something on the column buffer, even with child columns. And the dtype
only says "categorical", which is unrelated to the buffer itself.
I investigated how Arrow does it:
A data type is described using a format string. The format string only encodes information about the top-level type; for nested type, child types are described separately.
...
Dictionary-encoded types do not have a specific format string. Instead, the format string of the base array represents the dictionary index type, and the value type can be read from the dependent dictionary array
....
For dictionary-encoded arrays, the ArrowSchema.format string encodes the index type. The dictionary value type can be read from the ArrowSchema.dictionary structure.
The same holds for ArrowArray structure: while the parent structure points to the index data, the ArrowArray.dictionary points to the dictionary values array.
It works, but conceptually I dislike this approach. The format string is the equivalent to our Column.dtype
. If that says int64
(or l
format string for Arrow), then you cannot tell if it's just a column of integers or categoricals without looking at other attributes. That makes things unnecessarily complex. In this way, you need a separate dictionary
property, which should be checked for being null before using the dtype:
struct ArrowSchema {
// Array type description
const char* format;
const char* name;
const char* metadata;
int64_t flags;
int64_t n_children;
struct ArrowSchema** children;
struct ArrowSchema* dictionary;
This feels much more Pythonic:
>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")
>>> df.dtypes
A int64
B category <-- there is an actual categorical dtype here
dtype: object
So my impression is that it's best to have both child columns and always returning a buffer together with its dtype so you can parse the buffer. That makes sense to me, because a buffer without any dtype info is completely useless - so why look up that dtype info somewhere else rather than pass it along together?
The only (minor) reason not to have the dtype on the buffer directly is to make it possible to have one buffer used in different columns with different interpretations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some discussion in gh-26 about categorical dtypes, which links to apache/arrow#4316 for the rationale for why Arrow does it this way (seems related to the limitations of the C++ API).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: variable-length strings not included yet, that'll require children.
In Arrow variable-length strings actually use an offsets and data buffer, not a child array (https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)
(the data buffer is not an array in typical sense like a numpy array)
However the issue is that there should be something on the column buffer, even with child columns. And the dtype only says "categorical", which is unrelated to the buffer itself.
I think that's the main difference between how pandas and arrow handle it: whether the codes/indices dtype (int8, int16, etc) is part of the column dtype or not.
If it's part of the column dtype, then you know how to interpret the buffer, and the buffer doesn't need its own dtype to describe it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's part of the column dtype
I can imagine it being part of the dtype, the question is how. Possibilities that come to mind are:
- in the format string,
- extending with a 5th field that's normally None except for compound types like this
- making
.dtype
have some kind of nesting to describe both the logical dtype and the buffer dtypes.
We had a lot of discussion in various issues and calls. To get an overview, I went through all minutes and other content, and made a list of TODOs and questions for finishing up this PR (plus resolve review comments on this PR):
|
As discussed yesterday, we'll merge this in its current state, open a tracking issue (done in gh-46), and address the remaining features in separate follow-up PRs (several of which are already open). I went through the review comments again, and the remaining unresolved ones contain a lot of interesting discussion so I'll leave them unresolved. |
Related to requirements in gh-35.
TBD (to be discussed) comments and design decisions at the top of the file indicate topics for closer review/discussion.