Use compatible column name to set Parquet bloom filter #11799

Merged (4 commits) on Jan 10, 2025

Conversation

huaxingao (Contributor):

When writing a Parquet file, if a column name contains special characters, e.g. -, Iceberg converts it to a compatible format. However, the bloom filter is still set using the original column name, which results in an invalid bloom filter. This pull request resolves the issue by setting the bloom filter with the compatible column name instead of the original one.
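
A minimal sketch of the mismatch, for illustration only (the column name is taken from the test touched by this PR; the actual mechanism for deriving the compatible name is discussed in the review below):

    import org.apache.iceberg.avro.AvroSchemaUtil;

    // Sketch: the bloom filter must be keyed by the sanitized name that is actually
    // written to the Parquet file, not by the original Iceberg column name.
    public class BloomFilterNameMismatch {
      public static void main(String[] args) {
        String icebergName = "int-decimal"; // name as declared in the Iceberg schema / table property
        String parquetName = AvroSchemaUtil.makeCompatibleName(icebergName); // name written to Parquet

        // Registering the filter under icebergName targets a column that does not exist in the
        // file, which is why the resulting bloom filter is invalid; parquetName is the key
        // Parquet will recognize.
        System.out.println(icebergName + " -> " + parquetName);
      }
    }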

@huaxingao (Contributor Author):

cc @szehon-ho

.columnBloomFilterEnabled()
.forEach(
(colPath, isEnabled) -> {
Types.NestedField fieldId = schema.caseInsensitiveFindField(colPath);
Contributor:

[doubt] Does case sensitivity matter? Can this:

write.parquet.bloom-filter-enabled.column.CoL1

be applied to Parquet files whose schema contains col1?

If not, should we explicitly lowercase the column names after deriving the configs?

Contributor Author:

Case sensitivity matters. I changed it to findField. Thanks!
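
A tiny sketch of the difference, using a made-up one-column schema:

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    public class FieldLookupExample {
      public static void main(String[] args) {
        Schema schema = new Schema(Types.NestedField.optional(1, "col1", Types.StringType.get()));

        // A config key written as "CoL1" only resolves with the case-insensitive lookup:
        System.out.println(schema.caseInsensitiveFindField("CoL1")); // matches col1
        System.out.println(schema.findField("CoL1"));                // null: case-sensitive, no match
      }
    }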


String parquetColumnPath = fieldIdToParquetPath.get(fieldId.fieldId());
if (parquetColumnPath == null) {
LOG.warn("Skipping bloom filter config for missing field: {}", fieldId);
Contributor:

Should we update this message to say something like

Suggested change
LOG.warn("Skipping bloom filter config for missing field: {}", fieldId);
LOG.warn("Skipping bloom filter config for field: {} due to missing parquetColumnPath for fieldId: {}", colPath, fieldId);

This mostly comes from the two log lines above being nearly identical, though one logs the columnPath and the other logs the fieldId.

Contributor Author:

Fixed. Thanks!

context
.columnBloomFilterEnabled()
.forEach(
(colPath, isEnabled) -> {
Contributor:

[question] Do we need to do anything when isEnabled is false? Or can Parquet proactively decide whether a column should have a bloom filter, with isEnabled == false serving as an explicit deny?

Contributor Author:

If isEnabled is true, Iceberg will call withBloomFilterEnabled(String columnPath, boolean enabled). If isEnabled is false, we don't need to do anything.
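
For context, a small sketch of the builder-side behavior (column names are placeholders): parquet-java leaves bloom filters off by default, so doing nothing when isEnabled is false has the same effect as an explicit deny.

    import org.apache.parquet.column.ParquetProperties;

    public class BloomFilterDefaultSketch {
      public static void main(String[] args) {
        ParquetProperties props =
            ParquetProperties.builder()
                .withBloomFilterEnabled("some_col", true) // explicit enable -> bloom filter written
                // no call for "other_col" -> no bloom filter, same effect as isEnabled == false
                .build();
        System.out.println(props);
      }
    }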

.columnBloomFilterEnabled()
.forEach(
(colPath, isEnabled) -> {
Types.NestedField fieldId = schema.findField(colPath);
Contributor:

Can we call this field instead?

Suggested change
Types.NestedField fieldId = schema.findField(colPath);
Types.NestedField field = schema.findField(colPath);

Contributor Author:

Updated. Thanks!

@singhpk234 (Contributor) left a comment:

LGTM, Thank you @huaxingao !

@@ -109,7 +109,7 @@ public class TestBloomRowGroupFilter {
optional(22, "timestamp", Types.TimestampType.withoutZone()),
optional(23, "timestamptz", Types.TimestampType.withZone()),
optional(24, "binary", Types.BinaryType.get()),
optional(25, "int_decimal", Types.DecimalType.of(8, 2)),
optional(25, "int-decimal", Types.DecimalType.of(8, 2)),
Member:

Minor quibble here, let's not re-use this field and instead add a new field like we did with "non_bloom", "struct_not_null" etc ...
So something like

optional(28, "incompatible-name", StringType.get())

Contributor Author:

Changed. Thanks

.shouldRead(parquetSchema, rowGroupMetadata, bloomStore);
assertThat(shouldRead).as("Should not read: decimal outside range").isFalse();
}

@Test
- public void testLongDeciamlEq() {
+ public void testLongDecimalEq() {
Member:

Let's keep this fix!

Context context,
MessageType parquetSchema,
BiConsumer<String, Boolean> withBloomFilterEnabled,
BiConsumer<String, Double> withBloomFilterFPP) {
Member:

Is it possible to use the same name-compat function we use in the Parquet writer to get the compatible column names? It just seems a bit odd to look up the paths when we know how to generate them?

Member:

For example, couldn't we keep the old layout of the code and just use

public static String makeCompatibleName(String name) {

        for (Map.Entry<String, String> entry : columnBloomFilterEnabled.entrySet()) {
          String colPath =  makeCompatibleName(entry.getKey());
          String bloomEnabled = entry.getValue();
          parquetWriteBuilder.withBloomFilterEnabled(colPath, Boolean.parseBoolean(bloomEnabled));
        }

Member:

Feel free to tell me if this is a bad idea; I think your approach is fine as well, since it does seem to be a better validation.

Contributor Author:

It's simpler to use makeCompatibleName. I have made the changes. Thanks!

@@ -193,6 +195,7 @@ public void createInputFile() throws IOException {

// build struct field schema
org.apache.avro.Schema structSchema = AvroSchemaUtil.convert(UNDERSCORE_STRUCT_FIELD_TYPE);
+ String compatibleFieldName = AvroSchemaUtil.makeCompatibleName("_incompatible-name");
Member:

This is a personal nit of mine, but I don't like when tests use the same implementation as the prod code to get expected values. Could we just hard-wire in the converted name?
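
Sketched below; note that the hard-wired literal is my assumption, inferred from the hex-escape pattern mentioned later in this thread ('.' becomes _x2E, and '-' is 0x2D):

    // What the test currently does: derive the expectation from the same helper the prod code uses.
    String derived = AvroSchemaUtil.makeCompatibleName("_incompatible-name");

    // What the reviewer suggests: hard-wire the expectation so the test still catches a change
    // in the helper. The literal below is assumed, not taken from the PR.
    String hardWired = "_incompatible_x2Dname";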

@RussellSpitzer (Member):

Looks like tests are not passing?

TestBloomRowGroupFilter > testStructFieldEq() FAILED
    org.opentest4j.AssertionFailedError: [Should not read: value outside range] 
    Expecting value to be false but was true
        at app//org.apache.iceberg.parquet.TestBloomRowGroupFilter.testStructFieldEq(TestBloomRowGroupFilter.java:954)

@huaxingao (Contributor Author):

@RussellSpitzer

Looks like tests are not passing?

I looked at the failed test again. The reason it failed is that the bloom filter is set on a nested field of a struct type, struct_not_null._int_field. When we use:

String colPath = makeCompatibleName(entry.getKey());

makeCompatibleName changes struct_not_null._int_field to struct_not_null_x2E_int_field, which we actually don't want. If the entry contains a period, we could check if it is a field of a complex type and only apply makeCompatibleName to the field name. However, I feel it's probably simpler to use my original approach: get the fieldId of the entry, and then get the corresponding Parquet path for that fieldId.
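
A sketch of that field-id-based lookup, assuming the Parquet schema carries the Iceberg field ids (the helper names here are mine, not necessarily the PR's):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.parquet.schema.GroupType;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    class ParquetPathSketch {
      // Map each Iceberg field id to the dotted Parquet column path, so nested fields like
      // struct_not_null._int_field resolve to the sanitized path actually written to the file.
      static Map<Integer, String> fieldIdToParquetPath(MessageType parquetSchema) {
        Map<Integer, String> paths = new HashMap<>();
        collectPaths(parquetSchema, "", paths);
        return paths;
      }

      private static void collectPaths(GroupType group, String prefix, Map<Integer, String> paths) {
        for (Type field : group.getFields()) {
          String path = prefix.isEmpty() ? field.getName() : prefix + "." + field.getName();
          if (field.getId() != null) {
            paths.put(field.getId().intValue(), path);
          }
          if (!field.isPrimitive()) {
            collectPaths(field.asGroupType(), path, paths);
          }
        }
      }
    }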

@huaxingao (Contributor Author):

I looked at the code again. For the newly added method:

    private <T> void setBloomFilterConfig(
        Context context,
        MessageType parquetSchema,
        BiConsumer<String, Boolean> withBloomFilterEnabled,
        BiConsumer<String, Double> withBloomFilterFPP) 

The reason I want to pass the setters as functions is that the callers are different.

One caller is

        setBloomFilterConfig(
            context, type, propsBuilder::withBloomFilterEnabled, propsBuilder::withBloomFilterFPP);

propsBuilder is a ParquetProperties.Builder and it calls https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L637

Another caller is

        setBloomFilterConfig(
            context,
            type,
            parquetWriteBuilder::withBloomFilterEnabled,
            parquetWriteBuilder::withBloomFilterFPP);

parquetWriteBuilder is a ParquetWriter.Builder and it calls https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L810
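
Putting the quoted pieces together, a hedged reconstruction of how such a helper could be wired (the merged code may differ in details; fieldIdToParquetPath is the map sketched earlier):

    import java.util.Map;
    import java.util.function.BiConsumer;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class BloomFilterConfigSketch {
      private static final Logger LOG = LoggerFactory.getLogger(BloomFilterConfigSketch.class);

      static void setBloomFilterConfig(
          Schema schema,
          Map<String, String> columnBloomFilterEnabled, // config column name -> "true"/"false"
          Map<Integer, String> fieldIdToParquetPath,    // built from the Parquet MessageType
          BiConsumer<String, Boolean> withBloomFilterEnabled) {

        columnBloomFilterEnabled.forEach(
            (colPath, isEnabled) -> {
              Types.NestedField field = schema.findField(colPath);
              if (field == null) {
                LOG.warn("Skipping bloom filter config for missing field: {}", colPath);
                return;
              }

              String parquetColumnPath = fieldIdToParquetPath.get(field.fieldId());
              if (parquetColumnPath == null) {
                LOG.warn(
                    "Skipping bloom filter config for field: {} (no Parquet column path for field id {})",
                    colPath,
                    field.fieldId());
                return;
              }

              // Either builder works here, which is the point of taking a BiConsumer:
              // propsBuilder::withBloomFilterEnabled or parquetWriteBuilder::withBloomFilterEnabled.
              withBloomFilterEnabled.accept(parquetColumnPath, Boolean.valueOf(isEnabled));
            });
      }
    }

This keeps the ParquetProperties.Builder and ParquetWriter.Builder callers on the same code path.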

@RussellSpitzer merged commit 3dbb5cc into apache:main on Jan 10, 2025
49 checks passed
@RussellSpitzer (Member):

Thanks @huaxingao for the PR and @singhpk234 for the review!

@huaxingao (Contributor Author):

Thanks a lot! @RussellSpitzer @singhpk234

@huaxingao deleted the bloomfilter branch January 10, 2025 16:58