-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2310: implementation status #34
PARQUET-2310: implementation status #34
Conversation
@alippai Perhaps you'd like to start filling in the table for Rust? |
Thanks @pitrou for the comments. Yes, I’ll address them and fill the table best to my knowledge and online resources |
I am very interested in helping this post along. Is there any interest in pursuing this PR or would it be better to make a new one?
I can certainly fill the one in for Rust |
I’ll open a new draft later this week |
FYI #53 is a related conversation. Once that PR merges perhaps there will be a more natural location for this chart / location |
9ad826b
to
6b26e2e
Compare
6b26e2e
to
2db6877
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alippai
I have a few suggestions
- Start the initial PR with a smaller subset of featres (maybe with read/write physical and logical types and compression)
- Consider distinguishing read and read/write support for the different features (as I think some implementations can read but not create certain encodings or other features)
- Define terms: define the different implementations (e.g. link to what the "Java" implementation means) and define what is required to have the implementation in these charts this list. Perhaps use the list on https://parquet.apache.org/docs/overview/
Smaller Subset
The rationale for a smaller subset is that it will be easier to get consensus. For example, we won't have to try and precisely describe what "modular encryption support" means
"Support"
I think it would also help to define what is required of an implementation to have a "check" in the corresponding feature row.
I think there is an eventual goal to have an automated integration test (e.g. @pitrou 's suggestion in the mailing list https://lists.apache.org/thread/xjg1zrhll5ztqr4n5lv92f6tfwvfb8rm)
However, before that perhaps we could define something less formal, like can read the values of certain files in https://github.com/apache/parquet-testing (e.g. alltypes_plain.parquet
Totally agree, thanks for the guide. What do you think about the non-Apache or other projects? (Duckdb, fastparquet, impala, cudf) |
IMHO, any currently maintained open source implementation of Parquet deserves mentioning there. But that also requires involvement from their respective maintainers (we shouldn't expect us Parquet maintainers to make sure the information is up to date). |
Echoing what @pitrou said, I suggest we add a note somewhere that "this table includes any open source currently maintained implementation of parquet whose maintainers have helped fill it out. If you wish to add a new implementation to this table, please open a PR to do so" It might also be worth adding a column with some sort of placeholder ( |
I like both! |
I worry about this becoming a little bit of a pedantic discussion. My 2 cents: I think a reasonable approach is to let maintainers decide on checking the box or not. Maybe we can have a ternary value. If a maintainer feels it is fully supported it is a check. If they think reasonable people might be confused it gets '-' or check without footnote to explain the exception. If it is completely not supported it gets an 'X'. We can always adjust criteria as we get feedback. |
I agree that it would be best to start with a relatively lax, low barrier to entry for self reporting Over time we can add more stringency (ideally with automated checks) if/when that would add value |
Agreed with that. I would be happy to contribute on cudf's behalf! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @alippai and everyone else who has contributed to this PR.
It is my opinion that we should try and address any outstanding comments that are straightforward, but then merge this PR even if there are other more substantial ones unresolved (like guidance https://github.com/apache/parquet-site/pull/34/files#r1635005006).
We can then iteratively add content over several more PRs and I think some of the comments will be easier to address as we start trying to fill out the chart for implementations.
I worry that we will run out of ambition if we try to complete this page on the first try (the fact that this PR has been open for more than a year is evidence if telling).
I would be happy to help write a pre-amble and/or try to fill out this table for the Rust implementation, for example, as a follow on PR
This is so exciting to see. I rendered this page locally and it looks good.
I agree with this sentiment. @alippai do you have bandwidth do another pass? I can merge after another pass (or do so now if you don't have bandwidth for an update). |
Yes, the input is great and I’ll give it a go tomorrow |
Please mark PRs that are ready for review as such. I have not given this as a review as it showed up as Draft in my notification. I will do this now in the next 2h. |
I've learned a ton and I see that many of us saw a few surprises. Looks like the page is needed indeed :) Following the advice a few of you raised I removed any controversial or too detailed items, we'll handle it in a next PR. Most of the comments are addressed and if we are happy with the changes, I'll:
|
@alippai thanks for updates! I agree with @alamb and @emkornfield that it's the right time to merge this and let the various implementations take a crack at filling in their columns. Questions that arise while doing that can drive the next iteration. |
Actually @mhaseeb123 will be handling cuDF. I'm just an interested bystander 😅. |
I would be happy to contribute on cuDF's behalf. I am just not fully sure if we are changing column names from language names (C++, Python etc) to implementation names (Arrow, cuDF etc) as @pitrou suggested or filling the rows with implementation names that support a feature in that language? |
It seems there are no objections. Let me do one more round tomorrow with the update mentioned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this PR merged!
I started sketching out what a parquet integration suite might look like here: apache/arrow-rs#5956
Once this PR is merged, i will work on honing the integration suite and use that to fill out this table
--- | ||
### Physical types | ||
|
||
| Data type | C++ | Java | Go | Rust | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- let's rename the columns later (as we fill out the details) with some name that can be mapped to the implementation (e.g.
parquet-cpp,
parquet-java`, etc)
title: "Implementation status" | ||
linkTitle: "Implementation status" | ||
weight: 8 | ||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a legend:
✅ supported
❌ not supported
[blank] no data
The main goal being to clarify the difference between missing information and not supported feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed addition (targeting this PR) in alippai#1
I tried to address @julienledem 's comments in alippai#1 |
Add introduction paragraph and legend
Hi @alamb, thank you for the all work and addressing comments. I saw you added a legend and some description in alippai#1 but I don't see cuDF in there and also am not sure how to exactly fill the table for cuDF. This is confusing especially as cuDF provides C++, Python and (limited? @etseidl) Java Parquet interfaces with GPU-accelerated backends. Please do note that I am in favor of merging the PR and making things live atm and thinking about this later in a separate PR as well. |
@alippai were you planning on adding anything else here or should we merge? |
My suggestion is to add a column/columns for cuDF in a future PR when you have information to fill out. I think the intent of these tables is to describe the parquet implementations (so it lists parquet-cpp, not its python bindings in pyarrow, for example). Perhaps the same model would apply to the parquet reader in cuDF In terms of how to fill out the table, we are discussing various option for "what does it mean to support a feature" on apache/parquet-format#441 |
@alippai thank you for getting this sarted, I'm going to merge now to let others try to fill it out. |
Moved from: apache/arrow#36027
I only can agree with @westonpace:
The original goal was not copying https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features, but quite orthogonal to this: showing which high level APIs are available for the end-user.
What I mean by this is that based on the document I should be able to know if I can manipulate Parquet files without decoding some values or do I have to convert it to actual objects / Apache Arrow table first. This was the info what is missing from all the docs and I'd like maintain something which reflects the public API functionalities.
Many of you were surprised by including xy metadata, what I was referring to was the availability of the https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.metadata or https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.metadata
Similar happens with the Column Index. It's available internally, but if it's not used when adding filters to the parquet reader, that's useful info for the Apache Arrow and Parquet developers, but not for the library users.
This is a limitation of the pyarrow API and this might be temporary or by design, but making this transparent was my intention.
It's absolutely OK if parquet-mr doesn't have a dataset API using the parquet features and it's OK for pyarrow not exposing the raw objects / byte arrays (ever). Different ecosystems have different abstractions and there doesn't have to be a "correct" or "standard" one. This is actually a nice thing - DataFusion, pyarrow, Acero, Iceberg etc have all different communities complementing each other.
Let me know if this is too vague or doesn't belong to the standard docs (or help me achieving a concise summary reflecting the above)