Avro: Add internal writer #11919

ajantha-bhat · 2025-01-06T11:02:58Z

Follow-up to #11108.

ajantha-bhat · 2025-01-06T11:04:10Z

core/src/main/java/org/apache/iceberg/avro/InternalReader.java

@@ -205,7 +205,6 @@ public ValueReader<?> primitive(Pair<Integer, Type> partner, Schema primitive) {
        case STRING:
          return ValueReaders.strings();
        case FIXED:


As per Types.java, fixed values should also be written and read as plain ByteBuffers.

Refer to the test case in TestInternalWriter.

I agree with this change.

Another check that this is correct: ByteBuffer is the representation used by FixedLiteral, which must match the type of internal objects.

ajantha-bhat · 2025-01-06T11:11:22Z

core/src/test/java/org/apache/iceberg/avro/TestInternalWriter.java

+    try (AvroIterable<StructLike> reader =
+        Avro.read(file.toInputFile())
+            .project(schema)
+            .createResolvingReader(InternalReader::create)


This gives a GenericRecord; I couldn't find a way to get a pure StructLike.
I cannot use .setCustomType to get a StructProjection, can I? It is a read-only class.

It's okay that this produces GenericRecord that is a StructLike. There's no need to avoid it.

rdblue · 2025-01-07T17:47:52Z

core/src/main/java/org/apache/iceberg/avro/BaseAvroSchemaVisitor.java

+import org.apache.avro.Schema;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+abstract class BaseAvroSchemaVisitor extends AvroSchemaVisitor<ValueWriter<?>> {


It's true that this is a visitor, but I think it's a best practice to use the class name to convey the purpose of this particular visitor.

In this case, it's building a writer tree so I think a name like BaseWriteBuilder would be more appropriate (based on what was copied here). You were right to use Base to indicate that it is an abstract base class that is shared across object models. I might revise that suggestion when I read through more to see if this is more specific, too.

Another reason is that the name BaseAvroSchemaVisitor is misleading. This class builds a writer tree and has a specific purpose. It wouldn't be used as a base for a visitor that has a different purpose, but the name implies that it could be.

rdblue · 2025-01-07T19:49:00Z

core/src/main/java/org/apache/iceberg/avro/InternalWriter.java

+    return writer.metrics();
+  }
+
+  private static class GenericAvroSchemaVisitor extends BaseAvroSchemaVisitor {


This doesn't use Avro generics, so I think a better name is InternalWriteBuilder or just WriteBuilder since it is a private inner class. This is probably just a copy/paste error that wasn't corrected.

rdblue · 2025-01-07T19:51:58Z

core/src/main/java/org/apache/iceberg/avro/ValueWriters.java

@@ -484,4 +489,16 @@ protected Object get(IndexedRecord struct, int pos) {
      return struct.get(pos);
    }
  }
+
+  private static class StructLikeWriter extends StructWriter<StructLike> {
+    @SuppressWarnings("unchecked")


What is the unchecked cast?

rdblue · 2025-01-07T19:58:33Z

core/src/main/java/org/apache/iceberg/avro/InternalWriter.java

+
+    @Override
+    protected ValueWriter<?> fixedWriter(int size) {
+      return ValueWriters.byteBuffers();


This isn't correct for the write path. In the read path, we want to broadly accept values and let validation happen later, so reading directly as ByteBuffer without checking the fixed length is okay. In the write path, however, we do need to validate that the ByteBuffer being written has the expected fixed size, which ValueWriters.byteBuffers() (ByteBufferWriter) doesn't do. FixedWriter and GenericFixedWriter, by contrast, do check:

    @Override
    public void write(byte[] bytes, Encoder encoder) throws IOException {
      Preconditions.checkArgument(
          bytes.length == length,
          "Cannot write byte array of length %s as fixed[%s]",
          bytes.length,
          length);
      encoder.writeFixed(bytes);
    }

I think this needs a new ByteBufferWriter for fixed buffers:

    public static ValueWriter<ByteBuffer> fixedBuffers(int length) {
      return new FixedByteBufferWriter(length);
    }

    private static class FixedByteBufferWriter implements ValueWriter<ByteBuffer> {
      private final int length;

      private FixedByteBufferWriter(int length) {
        this.length = length;
      }

      @Override
      public void write(ByteBuffer bytes, Encoder encoder) throws IOException {
        Preconditions.checkArgument(
            bytes.remaining() == length,
            "Cannot write byte buffer of length %s as fixed[%s]",
            bytes.remaining(),
            length);
        encoder.writeBytes(bytes);
      }
    }
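The length check requested in this comment can be tried out without Avro or Iceberg on the classpath. The following is a minimal JDK-only sketch, not the code from the PR: SketchWriter, FixedBufferSketchWriter, and FixedBufferSketchDemo are hypothetical names, and a ByteArrayOutputStream stands in for the Avro Encoder. It only illustrates validating ByteBuffer.remaining() against the declared fixed length before writing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical stand-in for a value writer; this is not the Iceberg or Avro API.
interface SketchWriter<T> {
  void write(T value, ByteArrayOutputStream out);
}

// Writes a ByteBuffer only if it holds exactly the declared number of bytes.
final class FixedBufferSketchWriter implements SketchWriter<ByteBuffer> {
  private final int length;

  FixedBufferSketchWriter(int length) {
    this.length = length;
  }

  @Override
  public void write(ByteBuffer bytes, ByteArrayOutputStream out) {
    // remaining() is the number of bytes between position and limit,
    // which is exactly what would be written
    if (bytes.remaining() != length) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot write byte buffer of length %s as fixed[%s]", bytes.remaining(), length));
    }

    // Read through a duplicate so the caller's position is left untouched
    ByteBuffer copy = bytes.duplicate();
    byte[] data = new byte[copy.remaining()];
    copy.get(data);
    out.write(data, 0, data.length);
  }
}

final class FixedBufferSketchDemo {
  public static void main(String[] args) {
    SketchWriter<ByteBuffer> writer = new FixedBufferSketchWriter(4);
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    // A buffer of the declared size is accepted
    writer.write(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}), out);
    System.out.println("bytes written: " + out.size());

    // A buffer of any other size is rejected before anything is written
    try {
      writer.write(ByteBuffer.wrap(new byte[] {1, 2, 3}), out);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Checking remaining() rather than capacity() matters here because a ByteBuffer may be a view over a larger backing array.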

rdblue · 2025-01-07T20:16:14Z

core/src/test/java/org/apache/iceberg/avro/TestInternalWriter.java

+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestInternalWriter {


I don't think that this one test is enough validation for the internal writer. Other object models are tested using the DataTest set of tests and a writeAndValidate implementation that is basically what you have here.

To add that, you'll need to create a random data generator for internal data (like RandomAvroData and RandomGenericData) and a helper to validate the records after they are read (like DataTestHelpers.assertEquals).
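The three pieces named in this comment (a seeded data generator, a write/read round trip, and an equality helper) fit together roughly as follows. This is a JDK-only sketch under stated assumptions: RoundTripSketch, Row, and assertRowsEqual are hypothetical names, and DataOutputStream stands in for the Avro file writer, so none of it is the Iceberg test code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RoundTripSketch {
  // Hypothetical two-field row; stands in for a generated internal record.
  static final class Row {
    final long id;
    final String data;

    Row(long id, String data) {
      this.id = id;
      this.data = data;
    }
  }

  // Plays the role of the random data generator, driven by a caller-supplied seed.
  static List<Row> generate(int count, long seed) {
    Random random = new Random(seed);
    List<Row> rows = new ArrayList<>(count);
    for (int i = 0; i < count; i += 1) {
      rows.add(new Row(random.nextLong(), "row-" + random.nextInt(1000)));
    }

    return rows;
  }

  // Serializes the rows: a count followed by each row's fields in order.
  static byte[] write(List<Row> rows) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(rows.size());
      for (Row row : rows) {
        out.writeLong(row.id);
        out.writeUTF(row.data);
      }
    }

    return bytes.toByteArray();
  }

  // Reads the rows back in the same order they were written.
  static List<Row> read(byte[] data) throws IOException {
    List<Row> rows = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int count = in.readInt();
      for (int i = 0; i < count; i += 1) {
        rows.add(new Row(in.readLong(), in.readUTF()));
      }
    }

    return rows;
  }

  // Plays the role of the validation helper: compares rows field by field.
  static void assertRowsEqual(List<Row> expected, List<Row> actual) {
    if (expected.size() != actual.size()) {
      throw new AssertionError("Row count mismatch");
    }

    for (int i = 0; i < expected.size(); i += 1) {
      Row e = expected.get(i);
      Row a = actual.get(i);
      if (e.id != a.id || !e.data.equals(a.data)) {
        throw new AssertionError("Row " + i + " differs");
      }
    }
  }

  // Generate, write, read back, and validate in one call.
  static void writeAndValidate(int count, long seed) {
    List<Row> expected = generate(count, seed);
    try {
      assertRowsEqual(expected, read(write(expected)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test class would then call something like RoundTripSketch.writeAndValidate(100, 7L) once per schema shape it wants to cover.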

rdblue · 2025-01-07T20:18:22Z

core/src/test/java/org/apache/iceberg/avro/TestInternalWriter.java

+        13, Literal.of("12345678901234567890.1234567890").to(Types.DecimalType.of(38, 10)).value());
+
+    StructProjection structProjection = StructProjection.create(schema, schema);
+    StructProjection row = structProjection.wrap(record);


No need for a projection. Using GenericRecord exercises the StructLike write path.
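Why a wrapper adds nothing here can be shown with a small sketch: a writer that reads fields only through positional accessors treats every implementation of the interface the same way. The code below is JDK-only and hypothetical. PositionalStruct is modeled on the size() and get(pos, javaClass) accessors of Iceberg's StructLike but is not that interface, and ArrayRecord and PositionalStructWriter are illustration names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface modeled on positional field access; not Iceberg's StructLike.
interface PositionalStruct {
  int size();

  <T> T get(int pos, Class<T> javaClass);
}

// A simple array-backed record, analogous in spirit to a generic record.
final class ArrayRecord implements PositionalStruct {
  private final Object[] values;

  ArrayRecord(Object... values) {
    this.values = values.clone();
  }

  @Override
  public int size() {
    return values.length;
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    // Class.cast performs a checked cast, so no @SuppressWarnings is needed
    return javaClass.cast(values[pos]);
  }
}

final class PositionalStructWriter {
  // Collects each field by position; any PositionalStruct implementation works.
  static List<Object> write(PositionalStruct struct) {
    List<Object> out = new ArrayList<>(struct.size());
    for (int pos = 0; pos < struct.size(); pos += 1) {
      out.add(struct.get(pos, Object.class));
    }

    return out;
  }
}
```

Because the writer depends only on the interface, passing the record directly exercises the same code path that a projection wrapper would.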

rdblue · 2025-01-07T20:20:05Z

core/src/test/java/org/apache/iceberg/avro/TestInternalWriter.java

+    DataFile dataFile = dataWriter.toDataFile();
+
+    assertThat(dataFile.format()).as("Format should be Avro").isEqualTo(FileFormat.AVRO);
+    assertThat(dataFile.content()).as("Should be data file").isEqualTo(FileContent.DATA);


These assertions are testing the writer produced by Avro.writeData(...).build() rather than the object model. I don't think there's any need for them.

rdblue · 2025-01-07T20:22:10Z

core/src/test/java/org/apache/iceberg/avro/TestInternalWriter.java

+            .build()) {
+      writtenRecords = Lists.newArrayList(reader);
+    }
+    assertThat(writtenRecords).hasSize(1);


The style in the project is to leave an empty line after control flow blocks for readability.

ajantha-bhat · 2025-01-08T14:44:22Z

@rdblue: Thanks a lot for the review. I have addressed all the comments. PR is ready.

rdblue · 2025-01-08T17:58:49Z

core/src/test/java/org/apache/iceberg/avro/TestInternalAvro.java

+public class TestInternalAvro extends AvroDataTest {
+  @Override
+  protected void writeAndValidate(Schema schema) throws IOException {
+    List<StructLike> expected = RandomAvroData.generateStructLike(schema, 100, 0L);


It would be better to use a new random seed.

rdblue · 2025-01-09T18:08:33Z

api/src/test/java/org/apache/iceberg/util/RandomUtil.java

@@ -228,4 +235,54 @@ private static BigInteger randomUnscaled(int precision, Random random) {

    return new BigInteger(sb.toString());
  }
+
+  public static List<Object> generateList(
+      Random random, Types.ListType list, Supplier<Object> elementResult) {


Nit: the supplier names no longer make much sense in this context, but this is minor.

rdblue

I approved too soon and noticed the random seed problem. Please fix and then we can merge. Thanks, @ajantha-bhat!

rdblue · 2025-01-09T18:11:15Z

core/src/test/java/org/apache/iceberg/avro/TestInternalAvro.java

+  @Override
+  protected void writeAndValidate(Schema schema) throws IOException {
+    List<StructLike> expected =
+        RandomInternalData.generate(schema, 100, System.currentTimeMillis());


Tests should not randomly generate the random seed. We want tests to be predictable. We normally choose a number and hard-code it.

I see. Since the comment asked for a new "random" seed, I went with a randomly generated one.

"Tests should not randomly generate the random seed"

Is that because we can't tell which seed was used if the test fails? Logging the seed might solve that problem. For now, I have hard-coded it.
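The predictability that a hard-coded seed provides can be illustrated with plain java.util.Random. This is a minimal JDK-only sketch; SeededDataSketch, its generate method, and the seed value 42 are hypothetical and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SeededDataSketch {
  // Hypothetical generator: returns `count` pseudo-random longs derived from `seed`.
  static List<Long> generate(int count, long seed) {
    Random random = new Random(seed);
    List<Long> values = new ArrayList<>(count);
    for (int i = 0; i < count; i += 1) {
      values.add(random.nextLong());
    }

    return values;
  }

  public static void main(String[] args) {
    long seed = 42L; // arbitrary hard-coded seed, chosen only for illustration

    // Two generators created with the same seed produce identical sequences,
    // so a failing test sees the same data on every run
    System.out.println(generate(3, seed).equals(generate(3, seed)));
  }
}
```

With System.currentTimeMillis() as the seed, each run would generate different data, so a failure seen once might not reproduce on the next run.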

ajantha-bhat · 2025-01-10T02:00:16Z

Rebased because I hit a flaky Flink test (#11833).

rdblue · 2025-01-10T18:39:35Z

Would have been nice to fix the nit from the last review, but it isn't a blocker.

Thanks, @ajantha-bhat! I'll merge.

ajantha-bhat marked this pull request as draft January 6, 2025 11:03

github-actions bot added the core label Jan 6, 2025

ajantha-bhat commented Jan 6, 2025

rdblue reviewed Jan 7, 2025

core/src/main/java/org/apache/iceberg/avro/BaseAvroSchemaVisitor.java Outdated

rdblue reviewed Jan 7, 2025

core/src/test/java/org/apache/iceberg/avro/TestInternalWriter.java Outdated

rdblue reviewed Jan 7, 2025

ajantha-bhat force-pushed the avro_internal branch from 3b95d75 to 51e760f Compare January 8, 2025 13:15

ajantha-bhat marked this pull request as ready for review January 8, 2025 14:41

rdblue reviewed Jan 8, 2025

core/src/test/java/org/apache/iceberg/avro/RandomAvroData.java Outdated

rdblue reviewed Jan 8, 2025

core/src/test/java/org/apache/iceberg/avro/AvroTestHelpers.java Outdated

rdblue reviewed Jan 8, 2025

core/src/test/java/org/apache/iceberg/avro/AvroTestHelpers.java Outdated

rdblue reviewed Jan 8, 2025

core/src/test/java/org/apache/iceberg/avro/AvroTestHelpers.java Outdated

github-actions bot added the API label Jan 9, 2025

ajantha-bhat mentioned this pull request Jan 9, 2025

Data: Add partition stats writer and reader #11216

Open

rdblue reviewed Jan 9, 2025

rdblue approved these changes Jan 9, 2025

rdblue requested changes Jan 9, 2025

ajantha-bhat added 3 commits January 10, 2025 07:29

Avro: Add internal writer

bb3a4c5

Address comments

75864f9

Address test comments

a1a1c40

Update random seed

2f4d8df

ajantha-bhat force-pushed the avro_internal branch from 9135a03 to 2f4d8df Compare January 10, 2025 01:59

rdblue approved these changes Jan 10, 2025

rdblue merged commit a100e6a into apache:main Jan 10, 2025
50 checks passed
