EVF Row Set
This is a practical guide, so let's immediately start using the EVF in its simplest form: the Row Set framework used to create, read and compare record batches in unit tests. The row set framework makes a number of simplifying assumptions to make tests as simple as possible:
- When creating a batch, we define the schema up front.
- No projection, null columns, or type conversion is needed.
- Batches are "small"; the framework does not enforce memory limits.
While the row set framework is specific to tests, the column accessor mechanism is used throughout the EVF. The easiest way to understand the column accessors is to start with the row set framework. (The EVF uses the term "accessor" to mean either a vector reader or a vector writer.)
In this example, we refer to this documentation and the ExampleTest class.
This page focuses on the overall process. See the next page for the details for each vector type.
The examples here also appear in the `ExampleOperatorTest` file.
Since the row set framework is typically used in a test, our example will work in that context. Find a handy spot within Drill to define your temporary test file. (Unfortunately, Drill is not designed to allow you to create these files as a separate project outside of Drill.) You can create the file in the test package if you like.
public class ExampleTest extends SubOperatorTest {
  @Test
  public void rowSetExample() {
  }
}
The `SubOperatorTest` base class takes care of configuring and launching an in-memory version of Drill so that you can focus on the specific test case at hand.
Next, define your schema using the `SchemaBuilder` class. Careful, there are two such classes in Drill: you want the one in the `metadata` package. Let's define a simple schema with two columns: a non-nullable `Int` and a nullable `Varchar`.
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.common.types.TypeProtos.MinorType;
...
@Test
public void rowSetExample() {
  final TupleMetadata schema = new SchemaBuilder()
      .add("id", MinorType.INT)
      .addNullable("name", MinorType.VARCHAR)
      .buildSchema();
}
Some things to note:
- The schema builder allows a fluent notation which is very handy in tests. Production code is never this easy since the schema is not known at compile time.
- The `add()` methods typically add a column with no options, which is a non-nullable (AKA Required) column.
- The `addNullable()` methods add a nullable (Optional) column.
- Drill defines two classes called `MinorType`. Use the import shown above to get the correct one.
- The result of `buildSchema()` is a `TupleSchema`, an implementation of the `TupleMetadata` interface.
- The schema builder can also build a `BatchSchema` by calling `build()`. `BatchSchema` is used by the `VectorContainer` class, but `TupleMetadata` holds a more complete set of metadata, and can properly define extended types that `BatchSchema` cannot.
The `TupleMetadata` class describes both a record and (as we'll see later) a Drill "Map" (really a struct). Each tuple is made up of columns, defined by the `ColumnMetadata` interface. `ColumnMetadata` provides a rich set of information about each column. Combined, the metadata classes drive much of the EVF, as we'll see.
Now that you are familiar with the schema classes, we'll leave it as an exercise for the reader to explore them and learn all that they have to offer.
The next step is to create a record batch using the schema. In tests, the easy way to do this is with the `RowSetBuilder` class:
final RowSet rowSet = new RowSetBuilder(fixture.allocator(), schema)
    .addRow(1, "kiwi")
    .addRow(2, "watermelon")
    .build();
Things to notice here:
- The `RowSet` interface provides an easy-to-use wrapper around the actual record batch.
- For more advanced tests, you may need to use one of the subclasses of `RowSet`.
- The record batch itself is available via the `foo()` method.
- The `RowSetBuilder` class provides a fluent way to create, populate, and return a row set.
- The `addRow()` method takes a list of Java objects. The code uses the type of the Java object to figure out which `set` method to call. (We'll discuss those methods shortly.)
- If you want to create a row with a single column, use `addSingleCol()` instead. Otherwise, Java sometimes gets confused about the type of the single argument.
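To make the type-based dispatch concrete, here is a minimal, self-contained sketch. It is not Drill's implementation: `ToyColumnWriter`, `setObject()` and `countValues()` are hypothetical names invented for illustration. It shows how a method can inspect each Java object's runtime type to choose a typed `set` method, and one plausible reason a single argument to a varargs method can be ambiguous.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these classes are hypothetical, not Drill's.
public class TypeDispatchSketch {

  /** Hypothetical stand-in for a column writer; it logs which typed setter ran. */
  public static class ToyColumnWriter {
    public final List<String> calls = new ArrayList<>();

    public void setInt(int value) {
      calls.add("setInt(" + value + ")");
    }

    public void setString(String value) {
      // Convenience path: convert the Java string to UTF-8 bytes, then store bytes.
      setBytes(value.getBytes(StandardCharsets.UTF_8));
    }

    public void setBytes(byte[] value) {
      calls.add("setBytes(" + value.length + " bytes)");
    }

    public void setNull() {
      calls.add("setNull()");
    }

    /** Picks a typed setter based on the runtime type of the Java object. */
    public void setObject(Object value) {
      if (value == null) {
        setNull();
      } else if (value instanceof Integer) {
        setInt((Integer) value);
      } else if (value instanceof String) {
        setString((String) value);
      } else if (value instanceof byte[]) {
        setBytes((byte[]) value);
      } else {
        throw new IllegalArgumentException(
            "Unsupported type: " + value.getClass().getName());
      }
    }
  }

  /** Varargs method in the style of addRow(): returns how many values Java passed. */
  public static int countValues(Object... values) {
    return values.length;
  }

  public static void main(String[] args) {
    ToyColumnWriter writer = new ToyColumnWriter();
    writer.setObject(1);       // autoboxed to Integer, dispatched to setInt()
    writer.setObject("kiwi");  // dispatched to setString(), then setBytes()
    writer.setObject(null);    // dispatched to setNull()
    System.out.println(writer.calls);

    // A single array argument can be read two ways: as the list of values,
    // or as one array-valued column. The casts select each reading explicitly.
    String[] oneArrayValue = {"a", "b"};
    System.out.println(countValues((Object[]) oneArrayValue)); // spread into 2 values
    System.out.println(countValues((Object) oneArrayValue));   // passed as 1 value
  }
}
```

Without either cast, Java treats the array as the full argument list, which is the kind of surprise a dedicated single-column method sidesteps.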
The above technique is often all you need when writing tests to verify some operation. (You will write such unit tests for your present work, right? I thought so.)
If you are creating an operator that works in production code, you won't know the data at compile time. Instead, you must work with each column one-by-one using the column writer classes.
DirectRowSet rs = DirectRowSet.fromSchema(fixture.allocator(), schema);
RowSetWriter writer = rs.writer();
writer.scalar("id").setInt(1);
writer.scalar("name").setString("kiwi");
writer.save();
...
final SingleRowSet rowSet = writer.done();
Some things to note:
- Here we saw a number of row set subclasses. `DirectRowSet` holds a writeable row set.
- `SingleRowSet` holds a readable row set which may or may not have a single-batch (SV2) selection vector. In our case, it has no selection vector.
- The `RowSetWriter` is a kind of `TupleWriter` that provides extra methods to work with entire rows, such as the `save()` method that says that the row is complete. (`TupleWriter` is also used to write to Map vectors.)
- The row set writer is always ready to write a row, so there is no "start row" method here. (Note that there is such a method in the result set loader, as we'll see later.)
- If you omit the call to `save()`, the row set writer will happily overwrite any existing value in the current row. This is done deliberately to handle advanced use cases.
- The `scalar(name)` method looks up a `ColumnWriter` by name.
- The returned column writer has many different `set` methods. We use `setInt()` and `setString()` here.
- The `setString()` method is a convenience method: it converts a Java string into the byte array required by the vector. If you already have a byte array, you can call the `setBytes()` method instead.
- Every scalar writer supports all the `set` methods. This avoids the need for casting to the correct writer type. Also, as we'll see later, it allows automatic type conversions when configured to do so.
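The role of `save()` is easiest to see in a toy model. The sketch below is self-contained and does not use Drill: `ToyRowWriter` is a hypothetical class that assumes a writer keeps a current row position, that setters write into that position, and that `save()` simply advances it. Under those assumptions, omitting `save()` means the next set of values overwrites the same row.

```java
// Illustrative sketch only: ToyRowWriter is hypothetical, not Drill's RowSetWriter.
public class SaveSemanticsSketch {

  public static class ToyRowWriter {
    private final int[] ids;
    private final String[] names;
    private int rowIndex = 0;  // position of the row currently being written

    public ToyRowWriter(int capacity) {
      ids = new int[capacity];
      names = new String[capacity];
    }

    // Setters always write into the current row position.
    public void setId(int value) { ids[rowIndex] = value; }
    public void setName(String value) { names[rowIndex] = value; }

    /** Marks the current row complete and advances to the next position. */
    public void save() { rowIndex++; }

    /** Number of rows that have been saved. */
    public int rowCount() { return rowIndex; }

    public String row(int index) { return ids[index] + ", " + names[index]; }
  }

  public static void main(String[] args) {
    ToyRowWriter writer = new ToyRowWriter(4);

    writer.setId(1);
    writer.setName("kiwi");
    // No save() here, so the writer is still positioned on row 0.

    writer.setId(2);
    writer.setName("watermelon");  // overwrites "kiwi" in row 0
    writer.save();

    System.out.println(writer.rowCount());
    System.out.println(writer.row(0));
  }
}
```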
The above used the "get by name" methods to simplify the code. You'll want to optimize production code. You can do so by referencing columns by position (as defined by the schema):
writer.scalar(0).setInt(1);
writer.scalar(1).setString("kiwi");
Or, you can cache the column writers:
RowSetWriter writer = rs.writer();
ScalarWriter idWriter = writer.scalar("id");
ScalarWriter nameWriter = writer.scalar("name");
idWriter.setInt(1);
nameWriter.setString("kiwi");
writer.save();
...
Note that the `set()` methods themselves are heavily optimized: they do the absolute minimum work to write your value into the underlying value vector. This consists of a couple of checks (for empty slots and to detect when the vector is full). Using the column writers has been shown to be at least as efficient as using the value vector `Mutator` classes (and, for non-nullable and array values, much faster).
Now that you have a record batch, the next step is to do something with it. The simplest thing you can do (in a test) is to print the record batch so you can see what you have:
rowSet.print();
Output:
#: id, name
0: 1, "kiwi"
1: 2, "watermelon"
Once you are done with the row set, you must clear it to release the memory held by value vectors:
rowSet.clear();
You can also verify vectors. Suppose we want to verify that the two forms of writing to vectors above produce the same record batch:
RowSet expected = // Build using RowSetBuilder
RowSet actual = // Build using column writers
RowSetUtilities.verify(expected, actual);
The above takes the first argument as the "expected" value and the second as the "actual", then compares the schemas and values. This is how we use the row set framework to verify the result of some operation on a record batch (including the result of an entire query). No need to clear the row sets: the `verify()` function clears both batches for us.
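As a rough picture of what "compares the schemas and values" means, here is a self-contained sketch that does not use Drill. `ToyRowSet` and this `verify()` are hypothetical stand-ins: the schema is reduced to a list of column names, and the memory cleanup that the real utility performs is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a toy comparison, not Drill's RowSetUtilities.
public class VerifySketch {

  /** Hypothetical in-memory batch: column names plus rows of values. */
  public static class ToyRowSet {
    public final List<String> columnNames;
    public final List<Object[]> rows;

    public ToyRowSet(List<String> columnNames, List<Object[]> rows) {
      this.columnNames = columnNames;
      this.rows = rows;
    }
  }

  /** Compares schema first, then row count, then each row's values. */
  public static void verify(ToyRowSet expected, ToyRowSet actual) {
    if (!expected.columnNames.equals(actual.columnNames)) {
      throw new AssertionError("Schemas differ: expected "
          + expected.columnNames + " but was " + actual.columnNames);
    }
    if (expected.rows.size() != actual.rows.size()) {
      throw new AssertionError("Row counts differ: expected "
          + expected.rows.size() + " but was " + actual.rows.size());
    }
    for (int i = 0; i < expected.rows.size(); i++) {
      if (!Arrays.equals(expected.rows.get(i), actual.rows.get(i))) {
        throw new AssertionError("Row " + i + " differs: expected "
            + Arrays.toString(expected.rows.get(i))
            + " but was " + Arrays.toString(actual.rows.get(i)));
      }
    }
  }

  public static void main(String[] args) {
    List<Object[]> expectedRows = new ArrayList<>();
    expectedRows.add(new Object[] {1, "kiwi"});
    expectedRows.add(new Object[] {2, "watermelon"});

    List<Object[]> actualRows = new ArrayList<>();
    actualRows.add(new Object[] {1, "kiwi"});
    actualRows.add(new Object[] {2, "watermelon"});

    // Passes silently because schemas and values match.
    verify(new ToyRowSet(Arrays.asList("id", "name"), expectedRows),
           new ToyRowSet(Arrays.asList("id", "name"), actualRows));
    System.out.println("row sets match");
  }
}
```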
If we want to work with individual values, we can use the column readers, which work much like the column writers. Let's assume we've created a `print()` method that will print a value.
final RowSetReader reader = rowSet.reader();
while (reader.next()) {
  print(reader.scalar("id").getInt());
  print(reader.scalar("name").getString());
}
Notes:
- The `RowSetReader` is a specialized `TupleReader` that iterates over records in a batch by calling the `next()` method.
- The reader starts positioned before the first record, so you must call `next()` to move to the first record.
- Access to column readers works very much like the writer example. You can cache the column readers for performance, or access them by column index.
- For reading, you call `get()` methods of the type appropriate for your column.
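The "positioned before the first record" convention can be sketched without Drill. In the self-contained example below, `ToyRowReader` is a hypothetical class that assumes the reader tracks a row index starting at -1. Each call to `next()` either advances to a valid row and returns `true`, or returns `false` at the end of the batch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: ToyRowReader is hypothetical, not Drill's RowSetReader.
public class ReaderCursorSketch {

  public static class ToyRowReader {
    private final int[] ids;
    private final String[] names;
    private int rowIndex = -1;  // positioned before the first row

    public ToyRowReader(int[] ids, String[] names) {
      this.ids = ids;
      this.names = names;
    }

    /** Advances to the next row; returns false when no rows remain. */
    public boolean next() {
      if (rowIndex + 1 >= ids.length) {
        return false;
      }
      rowIndex++;
      return true;
    }

    // Getters read from the current row, so next() must be called first.
    public int getId() { return ids[rowIndex]; }
    public String getName() { return names[rowIndex]; }
  }

  /** Reads every row using the same loop shape as the example above. */
  public static List<String> readAll(ToyRowReader reader) {
    List<String> out = new ArrayList<>();
    while (reader.next()) {
      out.add(reader.getId() + ", " + reader.getName());
    }
    return out;
  }

  public static void main(String[] args) {
    ToyRowReader reader = new ToyRowReader(
        new int[] {1, 2}, new String[] {"kiwi", "watermelon"});
    System.out.println(readAll(reader));
  }
}
```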
When reading values, Drill can work with a single batch or multiple batches using one of three kinds of indirection:
- Direct: Read values from the record batch in order.
- Single-batch indirection: Uses an "SV2" selection vector to reorder records within a batch, such as the result of sorting the batch.
- Multiple-batch indirection: Uses an "SV4" selection vector to reorder records across multiple batches, again perhaps as the result of a multi-batch sort.
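The three modes above can be sketched in a few lines of self-contained Java. This is a simplified model, not Drill's selection vector classes: the method names are hypothetical, a column is a plain array, an "SV2" is an `int[]` of row indexes, and an "SV4" entry is modeled as a `{batch, row}` pair rather than Drill's actual encoding.

```java
import java.util.Arrays;

// Illustrative sketch only: plain arrays stand in for vectors and selection vectors.
public class SelectionVectorSketch {

  /** Direct: read values in storage order. */
  public static String[] readDirect(String[] column) {
    return column.clone();
  }

  /** Single-batch indirection: sv2[i] is the storage index of logical row i. */
  public static String[] readWithSv2(String[] column, int[] sv2) {
    String[] out = new String[sv2.length];
    for (int i = 0; i < sv2.length; i++) {
      out[i] = column[sv2[i]];
    }
    return out;
  }

  /** Multiple-batch indirection: sv4[i] = {batch index, row index in that batch}. */
  public static String[] readWithSv4(String[][] batches, int[][] sv4) {
    String[] out = new String[sv4.length];
    for (int i = 0; i < sv4.length; i++) {
      out[i] = batches[sv4[i][0]][sv4[i][1]];
    }
    return out;
  }

  public static void main(String[] args) {
    String[] batch = {"kiwi", "watermelon", "apple"};

    // The data stays where it is; only the selection vector is "sorted".
    int[] sv2 = {2, 0, 1};
    System.out.println(Arrays.toString(readDirect(batch)));
    System.out.println(Arrays.toString(readWithSv2(batch, sv2)));

    String[][] batches = {{"kiwi", "watermelon"}, {"apple", "pear"}};
    int[][] sv4 = {{1, 0}, {0, 0}, {1, 1}, {0, 1}};
    System.out.println(Arrays.toString(readWithSv4(batches, sv4)));
  }
}
```

The point of the indirection is that a sort or filter only has to rewrite the small selection vector, leaving the underlying values untouched.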
Here is how single-batch indirection works (courtesy of the Drill documentation):
We've not discussed the use of offset vectors since these are not used in the scan operator or in record readers.