Standardisation of cv_terms in parquet. #79
Comments
I'm pretty sure unions aren't actually supported in Parquet itself (see apache/parquet-format#316 and apache/parquet-format#44). This was something I tried out a year ago when I started initial experiments with the format. PyArrow will blithely let you build them in memory and then crash and burn when you go to write them out. Union support is blocked on what the appropriate behavior should be for query engines, and it has been stalled for a long, long time. That leaves the multi-lane nullable direction, which looks good on disk but will be unpleasant in memory. Another alternative is having a parameter list for each logical value type, e.g. int-params, float-params, string-params, bool-params, null-params, which is better for storage in every way but has terrible ergonomics.
If we specify the fields in
cv_terms currently stores key-value pairs where the key is clearly a string, but the value could be a string, boolean, double, etc. Right now everything is written as string:string, which means that doubles and ints have to be written to Parquet as strings.
I have read a bit about the best representation in terms of performance, speed, and compression for something like:
This means that in the schema we can have a UNION of null, float, int, etc. @zprobot, can we evaluate whether that is better? @lazear, do you have an opinion on this? Here is how it should look in pyarrow:
I think another suggestion could be: