
Standardisation of cv_terms in parquet. #79

Closed
ypriverol opened this issue Oct 21, 2024 · 2 comments · Fixed by #81
ypriverol commented Oct 21, 2024

cv_terms currently stores key-value pairs where the key is clearly a string, but the value can be a string, boolean, double, etc. Right now everything is written as string:string, which means doubles and ints have to be serialised into the parquet file as strings.

I have read a bit about the best representation in terms of performance, speed and compression, and something like this looks promising:

message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}

This means that in the schema we can have a UNION of null, float, int, etc. @zprobot can you evaluate whether that is better? @lazear do you have an opinion on this?

Here is how it could look in pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

# Define the schema
field_schema = pa.list_(pa.struct([
    ('key', pa.string()),
    ('value', pa.union([
        pa.field('string_value', pa.string()),
        pa.field('double_value', pa.float64()),
        pa.field('int_value', pa.int64())
    ], mode='dense'))  # pa.union requires a mode ('dense' or 'sparse')
]))

# Create the full schema
schema = pa.schema([
    ('my_field', field_schema)
])

# Example data
data = [
    {'my_field': [
        {'key': 'centroid', 'value': 'yes'},
        {'key': 'ibaq_value', 'value': 49.0},
        {'key': 'consensus_support', 'value': 4},
        {'key': 'software', 'value': 'maxquant'}
    ]}
]

# Create a Table
table = pa.Table.from_pylist(data, schema=schema)

# Write to Parquet
pq.write_table(table, 'example.parquet')

# Read from Parquet
read_table = pq.read_table('example.parquet')
print(read_table.schema)
print(read_table.to_pylist())

mobiusklein commented Oct 21, 2024

I'm pretty sure unions aren't actually supported in Parquet itself (see apache/parquet-format#316 | apache/parquet-format#44). This was something I tried out a year ago when I started initial experiments with the format. PyArrow will blithely let you build them in memory and then crash and burn when you go to write them out. They're blocked on what the appropriate behavior should be for query engines and have been stalled for a long, long time.

That leaves the multi-lane nullable direction, which looks good on disk but will be unpleasant in memory. Another alternative is having a parameter list for each logical value type, e.g. int-params, float-params, string-params, bool-params, null-params, which is better storage in every way but has terrible ergonomics.

zprobot commented Oct 22, 2024

If we specify the fields in cv_params explicitly, we can take this form:

import pyarrow as pa

my_schema = pa.schema([
    pa.field("cv_params", type=pa.struct([
        ('centroid', pa.string()),
        ('consensus_support', pa.float32()),
        ('ibaq_value', pa.float32()),
        ('software', pa.string())
    ]))
])
data = [
    {
        "centroid": 'yes',
        'ibaq_value': 46.9,
        "consensus_support": 6.28,
        "software": "maxquant"
    },
    {
        "centroid": "yes",
        'ibaq_value': 46.9,
        "consensus_support": 6.28,
        "software": "maxquant"
    }
]

my_data = [
    # cast to the declared struct type so inferred float64 values
    # match the float32 fields in the schema
    pa.array(data, type=my_schema.field("cv_params").type)
]
my_table = pa.table(my_data, schema=my_schema)
