
Standardisation of cv_terms in parquet. #79

Closed
ypriverol opened this issue Oct 21, 2024 · 2 comments · Fixed by #81
ypriverol commented Oct 21, 2024

cv_terms currently stores key-value pairs where the key is clearly a string, but the value can be a string, boolean, double, etc. Right now everything is written as string:string, which means doubles and ints have to be serialised into the parquet file as strings.

I have read a bit about the best representation in terms of performance, speed and compression, and something like this looks promising:

message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}

This means that in the schema we can have a UNION of null, float, int, etc. @zprobot can you evaluate whether that is better? @lazear do you have an opinion on this?

Here is how it could look in pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

# Define the schema
field_schema = pa.list_(pa.struct([
    ('key', pa.string()),
    ('value', pa.union([
        pa.field('string_value', pa.string()),
        pa.field('double_value', pa.float64()),
        pa.field('int_value', pa.int64())
    ], mode='dense'))  # pa.union requires a mode ('dense' or 'sparse')
]))

# Create the full schema
schema = pa.schema([
    ('my_field', field_schema)
])

# Example data
data = [
    {'my_field': [
        {'key': 'centroid', 'value': 'yes'},
        {'key': 'ibaq_value', 'value': 49.0},
        {'key': 'consensus_support', 'value': 4},
        {'key': 'software', 'value': 'maxquant'}
    ]}
]

# Create a Table
table = pa.Table.from_pylist(data, schema=schema)

# Write to Parquet
pq.write_table(table, 'example.parquet')

# Read from Parquet
read_table = pq.read_table('example.parquet')
print(read_table.schema)
print(read_table.to_pylist())

mobiusklein commented Oct 21, 2024

I'm pretty sure unions aren't actually supported in Parquet itself (see apache/parquet-format#316 | apache/parquet-format#44). This was something I tried out a year ago when I started initial experiments with the format. PyArrow will blithely let you build them in memory and then crash and burn when you go to write them out. They're blocked on what the appropriate behavior should be for query engines and have been stalled for a long, long time.

That leaves the multi-lane nullable direction, which looks good on disk but will be unpleasant in memory. Another alternative is having a parameter list for each logical value type, e.g. int-params, float-params, string-params, bool-params, null-params, which is better storage in every way but has terrible ergonomics.

zprobot commented Oct 22, 2024

If we specify the fields in cv_params explicitly, we can take this form:

import pyarrow as pa

my_schema = pa.schema([
    pa.field("cv_params", type=pa.struct([
        ('centroid', pa.string()),
        ('consensus_support', pa.float32()),
        ('ibaq_value', pa.float32()),
        ('software', pa.string())
    ]))
])
data = [
    {
        "centroid": 'yes',
        'ibaq_value': 46.9,
        "consensus_support": 6.28,
        "software": "maxquant"
    },
    {
        "centroid": "yes",
        'ibaq_value': 46.9,
        "consensus_support": 6.28,
        "software": "maxquant"
    }
]

my_data = [
    # cast to the declared struct type so inferred float64 values
    # match the float32 fields in the schema
    pa.array(data, type=my_schema.field("cv_params").type)
]
my_table = pa.table(my_data, schema=my_schema)
