Use compatible column name to set Parquet bloom filter #11799

Merged (4 commits) on Jan 10, 2025

Conversation

huaxingao (Contributor):

When writing a Parquet file, if a column name contains special characters, e.g. -, Iceberg converts it to a compatible format. However, the bloom filter is still set using the original column name, which results in an invalid bloom filter. This pull request resolves the issue by setting the bloom filter with the compatible column name instead of the original one.
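
A minimal sketch of the mismatch, for illustration only (the column name is taken from the test touched by this PR; the actual mechanism for deriving the compatible name is discussed in the review below):

    import org.apache.iceberg.avro.AvroSchemaUtil;

    // Sketch: the bloom filter must be keyed by the sanitized name that is actually
    // written to the Parquet file, not by the original Iceberg column name.
    public class BloomFilterNameMismatch {
      public static void main(String[] args) {
        String icebergName = "int-decimal"; // name as declared in the Iceberg schema / table property
        String parquetName = AvroSchemaUtil.makeCompatibleName(icebergName); // name written to Parquet

        // Registering the filter under icebergName targets a column that does not exist in the
        // file, which is why the resulting bloom filter is invalid; parquetName is the key
        // Parquet will recognize.
        System.out.println(icebergName + " -> " + parquetName);
      }
    }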

@huaxingao (Contributor Author):

cc @szehon-ho

.columnBloomFilterEnabled()
.forEach(
(colPath, isEnabled) -> {
Types.NestedField fieldId = schema.caseInsensitiveFindField(colPath);
Contributor:

[doubt] Does case sensitivity matter? Can this:

write.parquet.bloom-filter-enabled.column.CoL1

be applied to Parquet files whose schema contains col1?

If not, should we explicitly lowercase the column names after deriving the configs?

Contributor Author:

Case sensitivity matters. I changed it to findField. Thanks!
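
A tiny sketch of the difference, using a made-up one-column schema:

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    public class FieldLookupExample {
      public static void main(String[] args) {
        Schema schema = new Schema(Types.NestedField.optional(1, "col1", Types.StringType.get()));

        // A config key written as "CoL1" only resolves with the case-insensitive lookup:
        System.out.println(schema.caseInsensitiveFindField("CoL1")); // matches col1
        System.out.println(schema.findField("CoL1"));                // null: case-sensitive, no match
      }
    }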


String parquetColumnPath = fieldIdToParquetPath.get(fieldId.fieldId());
if (parquetColumnPath == null) {
LOG.warn("Skipping bloom filter config for missing field: {}", fieldId);
Contributor:

Should we update this message to say something like

Suggested change
LOG.warn("Skipping bloom filter config for missing field: {}", fieldId);
LOG.warn("Skipping bloom filter config for field: {} due to missing parquetColumnPath for fieldId: {}", colPath, fieldId);

This mostly comes from the two log lines above being nearly identical, though one logs the columnPath and the other logs the fieldId.

Contributor Author:

Fixed. Thanks!

context
.columnBloomFilterEnabled()
.forEach(
(colPath, isEnabled) -> {
Contributor:

[question] Do we need to do anything when isEnabled is false? Or can Parquet proactively decide whether a column should have a bloom filter, with isEnabled == false serving as an explicit deny?

Contributor Author:

If isEnabled is true, Iceberg will call withBloomFilterEnabled(String columnPath, boolean enabled). If isEnabled is false, we don't need to do anything.
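
For context, a small sketch of the builder-side behavior (column names are placeholders): parquet-java leaves bloom filters off by default, so doing nothing when isEnabled is false has the same effect as an explicit deny.

    import org.apache.parquet.column.ParquetProperties;

    public class BloomFilterDefaultSketch {
      public static void main(String[] args) {
        ParquetProperties props =
            ParquetProperties.builder()
                .withBloomFilterEnabled("some_col", true) // explicit enable -> bloom filter written
                // no call for "other_col" -> no bloom filter, same effect as isEnabled == false
                .build();
        System.out.println(props);
      }
    }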

.columnBloomFilterEnabled()
.forEach(
(colPath, isEnabled) -> {
Types.NestedField fieldId = schema.findField(colPath);
Contributor:

Can we call this field instead?

Suggested change
Types.NestedField fieldId = schema.findField(colPath);
Types.NestedField field = schema.findField(colPath);

Contributor Author:

Updated. Thanks!

@singhpk234 (Contributor) left a comment:

LGTM, Thank you @huaxingao !

@@ -109,7 +109,7 @@ public class TestBloomRowGroupFilter {
optional(22, "timestamp", Types.TimestampType.withoutZone()),
optional(23, "timestamptz", Types.TimestampType.withZone()),
optional(24, "binary", Types.BinaryType.get()),
optional(25, "int_decimal", Types.DecimalType.of(8, 2)),
optional(25, "int-decimal", Types.DecimalType.of(8, 2)),
Member:

Minor quibble here, let's not re-use this field and instead add a new field like we did with "non_bloom", "struct_not_null" etc ...
So something like

optional(28, "incompatible-name", StringType.get())

Contributor Author:

Changed. Thanks

.shouldRead(parquetSchema, rowGroupMetadata, bloomStore);
assertThat(shouldRead).as("Should not read: decimal outside range").isFalse();
}

@Test
- public void testLongDeciamlEq() {
+ public void testLongDecimalEq() {
Member:

Let's keep this fix!

Context context,
MessageType parquetSchema,
BiConsumer<String, Boolean> withBloomFilterEnabled,
BiConsumer<String, Double> withBloomFilterFPP) {
Member:

Is it possible to use the same name-compat function we use in the Parquet writer to get the compatible column names? It just seems a bit odd to look up the paths when we know how to generate them?

Member:

For example, couldn't we keep the old layout of the code and just use

public static String makeCompatibleName(String name) {

        for (Map.Entry<String, String> entry : columnBloomFilterEnabled.entrySet()) {
          String colPath =  makeCompatibleName(entry.getKey());
          String bloomEnabled = entry.getValue();
          parquetWriteBuilder.withBloomFilterEnabled(colPath, Boolean.parseBoolean(bloomEnabled));
        }

Member:

Feel free to tell me if this is a bad idea; I think your approach is fine as well, since it does seem to be a better validation.

Contributor Author:

It's simpler to use makeCompatibleName. I have made the changes. Thanks!

@@ -193,6 +195,7 @@ public void createInputFile() throws IOException {

// build struct field schema
org.apache.avro.Schema structSchema = AvroSchemaUtil.convert(UNDERSCORE_STRUCT_FIELD_TYPE);
+ String compatibleFieldName = AvroSchemaUtil.makeCompatibleName("_incompatible-name");
Member:

This is a personal nit of mine, but I don't like when tests use the same implementation as the prod code to get expected values. Could we just hard-wire in the converted name?
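
Sketched below; note that the hard-wired literal is my assumption, inferred from the hex-escape pattern mentioned later in this thread ('.' becomes _x2E, and '-' is 0x2D):

    // What the test currently does: derive the expectation from the same helper the prod code uses.
    String derived = AvroSchemaUtil.makeCompatibleName("_incompatible-name");

    // What the reviewer suggests: hard-wire the expectation so the test still catches a change
    // in the helper. The literal below is assumed, not taken from the PR.
    String hardWired = "_incompatible_x2Dname";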

@RussellSpitzer (Member):

Looks like tests are not passing?

TestBloomRowGroupFilter > testStructFieldEq() FAILED
    org.opentest4j.AssertionFailedError: [Should not read: value outside range] 
    Expecting value to be false but was true
        at app//org.apache.iceberg.parquet.TestBloomRowGroupFilter.testStructFieldEq(TestBloomRowGroupFilter.java:954)

@huaxingao (Contributor Author):

@RussellSpitzer

Looks like tests are not passing?

I looked at the failed test again. The reason it failed is that the bloom filter is set on a nested field of a struct type, struct_not_null._int_field. When we use:

String colPath = makeCompatibleName(entry.getKey());

makeCompatibleName changes struct_not_null._int_field to struct_not_null_x2E_int_field, which we actually don't want. If the entry contains a period, we could check if it is a field of a complex type and only apply makeCompatibleName to the field name. However, I feel it's probably simpler to use my original approach: get the fieldId of the entry, and then get the corresponding Parquet path for that fieldId.
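
A sketch of that field-id-based lookup, assuming the Parquet schema carries the Iceberg field ids (the helper names here are mine, not necessarily the PR's):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.parquet.schema.GroupType;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    class ParquetPathSketch {
      // Map each Iceberg field id to the dotted Parquet column path, so nested fields like
      // struct_not_null._int_field resolve to the sanitized path actually written to the file.
      static Map<Integer, String> fieldIdToParquetPath(MessageType parquetSchema) {
        Map<Integer, String> paths = new HashMap<>();
        collectPaths(parquetSchema, "", paths);
        return paths;
      }

      private static void collectPaths(GroupType group, String prefix, Map<Integer, String> paths) {
        for (Type field : group.getFields()) {
          String path = prefix.isEmpty() ? field.getName() : prefix + "." + field.getName();
          if (field.getId() != null) {
            paths.put(field.getId().intValue(), path);
          }
          if (!field.isPrimitive()) {
            collectPaths(field.asGroupType(), path, paths);
          }
        }
      }
    }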

@huaxingao (Contributor Author):

I looked at the code again. For the newly added method:

    private <T> void setBloomFilterConfig(
        Context context,
        MessageType parquetSchema,
        BiConsumer<String, Boolean> withBloomFilterEnabled,
        BiConsumer<String, Double> withBloomFilterFPP) 

The reason I want to pass the setters as functions is that the callers are different.

One caller is

        setBloomFilterConfig(
            context, type, propsBuilder::withBloomFilterEnabled, propsBuilder::withBloomFilterFPP);

propsBuilder is a ParquetProperties.Builder and it calls https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L637

Another caller is

        setBloomFilterConfig(
            context,
            type,
            parquetWriteBuilder::withBloomFilterEnabled,
            parquetWriteBuilder::withBloomFilterFPP);

parquetWriteBuilder is a ParquetWriter.Builder and it calls https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L810
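
Putting the quoted pieces together, a hedged reconstruction of how such a helper could be wired (the merged code may differ in details; fieldIdToParquetPath is the map sketched earlier):

    import java.util.Map;
    import java.util.function.BiConsumer;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class BloomFilterConfigSketch {
      private static final Logger LOG = LoggerFactory.getLogger(BloomFilterConfigSketch.class);

      static void setBloomFilterConfig(
          Schema schema,
          Map<String, String> columnBloomFilterEnabled, // config column name -> "true"/"false"
          Map<Integer, String> fieldIdToParquetPath,    // built from the Parquet MessageType
          BiConsumer<String, Boolean> withBloomFilterEnabled) {

        columnBloomFilterEnabled.forEach(
            (colPath, isEnabled) -> {
              Types.NestedField field = schema.findField(colPath);
              if (field == null) {
                LOG.warn("Skipping bloom filter config for missing field: {}", colPath);
                return;
              }

              String parquetColumnPath = fieldIdToParquetPath.get(field.fieldId());
              if (parquetColumnPath == null) {
                LOG.warn(
                    "Skipping bloom filter config for field: {} (no Parquet column path for field id {})",
                    colPath,
                    field.fieldId());
                return;
              }

              // Either builder works here, which is the point of taking a BiConsumer:
              // propsBuilder::withBloomFilterEnabled or parquetWriteBuilder::withBloomFilterEnabled.
              withBloomFilterEnabled.accept(parquetColumnPath, Boolean.valueOf(isEnabled));
            });
      }
    }

This keeps the ParquetProperties.Builder and ParquetWriter.Builder callers on the same code path.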

@RussellSpitzer merged commit 3dbb5cc into apache:main on Jan 10, 2025
49 checks passed
@RussellSpitzer (Member):

Thanks @huaxingao for the PR and @singhpk234 for the review!

@huaxingao (Contributor Author):

Thanks a lot! @RussellSpitzer @singhpk234

@huaxingao deleted the bloomfilter branch January 10, 2025 16:58