Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avro: Add internal writer #11919

Merged
merged 4 commits into from
Jan 10, 2025
Merged

Avro: Add internal writer #11919

merged 4 commits into from
Jan 10, 2025

Conversation

ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Jan 6, 2025

Follow up for #11108

@ajantha-bhat ajantha-bhat marked this pull request as draft January 6, 2025 11:03
@github-actions github-actions bot added the core label Jan 6, 2025
@@ -205,7 +205,6 @@ public ValueReader<?> primitive(Pair<Integer, Type> partner, Schema primitive) {
case STRING:
return ValueReaders.strings();
case FIXED:
Copy link
Member Author

@ajantha-bhat ajantha-bhat Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per Types.java, fixed also should be written and read as plain ByteBuffers.

Refer the testcase in TestInternalWriter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this change.

Another check that this is correct is that ByteBuffer is used as the representation used by FixedLiteral, which must match the type of internal objects.

try (AvroIterable<StructLike> reader =
Avro.read(file.toInputFile())
.project(schema)
.createResolvingReader(InternalReader::create)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives GenericRecord, couldn't find a way to get the pure StructLike.
I cannot use .setCustomType to get StructProjection? It is read only class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay that this produces GenericRecord that is a StructLike. There's no need to avoid it.

import org.apache.avro.Schema;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

abstract class BaseAvroSchemaVisitor extends AvroSchemaVisitor<ValueWriter<?>> {
Copy link
Contributor

@rdblue rdblue Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that this is a visitor, but I think it's a best practice to use the class name to convey the purpose of this particular visitor.

In this case, it's building a writer tree so I think a name like BaseWriteBuilder would be more appropriate (based on what was copied here). You were right to use Base to indicate that it is an abstract base class that is shared across object models. I might revise that suggestion when I read through more to see if this is more specific, too.

Another reason is that the name BaseAvroSchemaVisitor is misleading. This class builds a writer tree and has a specific purpose. It wouldn't be used as a base for a visitor that has a different purpose, but the name implies that it could be.

return writer.metrics();
}

private static class GenericAvroSchemaVisitor extends BaseAvroSchemaVisitor {
Copy link
Contributor

@rdblue rdblue Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't use Avro generics, so I think a better name is InternalWriteBuilder or just WriteBuilder since it is a private inner class. This is probably just a copy/paste error that wasn't corrected.

@@ -484,4 +489,16 @@ protected Object get(IndexedRecord struct, int pos) {
return struct.get(pos);
}
}

private static class StructLikeWriter extends StructWriter<StructLike> {
@SuppressWarnings("unchecked")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the unchecked cast?


@Override
protected ValueWriter<?> fixedWriter(int size) {
return ValueWriters.byteBuffers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct for the write path. In the read path, we want to broadly accept values and let validation happen later, so reading directly as ByteBuffer and not checking the fixed length is okay. However in the write path we do need to validate that the ByteBuffer being written is the expected fixed size, which ValueWriters.byteBuffers() (ByteBufferWriter) doesn't do. However, FixedWriter and GenericFixedWriter do check:

    @Override
    public void write(byte[] bytes, Encoder encoder) throws IOException {
      Preconditions.checkArgument(
          bytes.length == length,
          "Cannot write byte array of length %s as fixed[%s]",
          bytes.length,
          length);
      encoder.writeFixed(bytes);
    }

I think this needs a new ByteBufferWriter for fixed buffers:

  public static ValueWriter<ByteBuffer> fixedBuffers(int length) {
    return new FixedByteBufferWriter(length);
  }

  private static class FixedByteBufferWriter implements ValueWriter<ByteBuffer> {
    private final int length;

    private FixedByteBufferWriter(int length) {
      this.length = length;
    }

    @Override
    public void write(ByteBuffer bytes, Encoder encoder) throws IOException {
      Preconditions.checkArgument(
          bytes.remaining() == length,
          "Cannot write byte buffer of length %s as fixed[%s]",
          bytes.remaining(),
          length);
      encoder.writeBytes(bytes);
    }
  }

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class TestInternalWriter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this one test is enough validation for the internal writer. Other object models are testing using the DataTest set of tests and a writeAndValidate implementation that is basically what you have here.

To add that, you'll need to create a random data generator for internal data (like RandomAvroData and RandomGenericData) and a helper to validate the records after they are read (like DataTestHelpers.assertEquals).

13, Literal.of("12345678901234567890.1234567890").to(Types.DecimalType.of(38, 10)).value());

StructProjection structProjection = StructProjection.create(schema, schema);
StructProjection row = structProjection.wrap(record);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a projection. Using GenericRecord exercises the StructLike write path.

DataFile dataFile = dataWriter.toDataFile();

assertThat(dataFile.format()).as("Format should be Avro").isEqualTo(FileFormat.AVRO);
assertThat(dataFile.content()).as("Should be data file").isEqualTo(FileContent.DATA);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertions are testing the writer produced by Avro.writeData(...).build() rather than the object model. I don't think there's any need for them.

.build()) {
writtenRecords = Lists.newArrayList(reader);
}
assertThat(writtenRecords).hasSize(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style in the project is to leave an empty newline after control flow blocks for readability.

@ajantha-bhat ajantha-bhat marked this pull request as ready for review January 8, 2025 14:41
@ajantha-bhat
Copy link
Member Author

@rdblue: Thanks a lot for the review. I have addressed all the comments. PR is ready.

public class TestInternalAvro extends AvroDataTest {
@Override
protected void writeAndValidate(Schema schema) throws IOException {
List<StructLike> expected = RandomAvroData.generateStructLike(schema, 100, 0L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use a new random seed.

@@ -228,4 +235,54 @@ private static BigInteger randomUnscaled(int precision, Random random) {

return new BigInteger(sb.toString());
}

public static List<Object> generateList(
Random random, Types.ListType list, Supplier<Object> elementResult) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the supplier names no longer make much sense in this context, but this is minor.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approved too soon and noticed the random seed problem. Please fix and then we can merge. Thanks, @ajantha-bhat!

@Override
protected void writeAndValidate(Schema schema) throws IOException {
List<StructLike> expected =
RandomInternalData.generate(schema, 100, System.currentTimeMillis());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests should not randomly generate the random seed. We want tests to be predictable. We normally choose a number and hard-code it.

Copy link
Member Author

@ajantha-bhat ajantha-bhat Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Since the comment was to use "random" seed. I went with random.

Tests should not randomly generate the random seed

Is it because if the test fails, we can't know what seed was used? Maybe logging the seed can help to solve this problem. As of now I hardcoded it.

@ajantha-bhat
Copy link
Member Author

ajantha-bhat commented Jan 10, 2025

Rebased as I hit the flaky test (Flink)
#11833

@rdblue
Copy link
Contributor

rdblue commented Jan 10, 2025

Would have been nice to fix the nit from the last review, but it isn't a blocker.

Thanks, @ajantha-bhat! I'll merge.

@rdblue rdblue merged commit a100e6a into apache:main Jan 10, 2025
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants