-
Notifications
You must be signed in to change notification settings - Fork 1
Code Generation
Drill relies heavily on code generation for the data-specific functionality in each operator. For example, the FilterBatch (implements the SQL WHERE clause) generates code for the actual filter condition. Code generation is necessary because Drill works with a large number of data types (over 120 different value vector implementations) and many kinds of expressions. The alternative, an interpreter, would be expensive to maintain (given the large number of value vector types) and slower to execute.
Drill has two major forms of code generation:
- The FreeMarker-based code generation done during the build process, and
- The
JCode
-based code generation done at execution time.
This writeup focuses on the second form.
(Code generation seems to follow a pattern established by Hive? Need to research.)
Drill developers prepare for code generation by:
At runtime, code generation is done as follows:
- Each operator has a method to generate the query-specific operator implementation class.
- The code generation (CG) method creates an instance of
CodeGenerator
using both global and query-specific settings. - The
CodeGenerator
gathers information about the new class including its name with is:org.apache.drill.exec.test.generated.CodeCompiler<TemplateName>Gen<serial-no>
where serial-no is a number assigned to each new generated class from a specific template. - Creates one or more
ClassGenerator
"that implement outer and inner classes associated with a particular runtime generated instance." - The operator (batch) uses the
ClassGenerator
to build up the operator-specific bits of the class using information specific to this operator instance (specific columns, expressions and so on.) - Code generation is heavily assisted by pre-defined templates and the
HolderContainer
class that represents a value vectorHolder
class. - The
DrillbitContext
provides aCodeCompiler
utility that generates and compiles the code. - The
CodeCompiler
has an associated [Google Guava 'LocalCache](https://github.com/google/guava/wiki/CachesExplained) that holds the compiled classes.
LocalCache` looks up the code, and will call a "loader" object to create the compiled class if it is not found in the cache. - If the code does not yet exist, the code is generated as follows.
- Invokes
QueryClassLoader.getClassByteCode
to compile the code. - Selects the configured compiler (Janino or JDK).
- The generated code is compiled into byte codes.
- The compiled code is merged with template code. (Yes, the byte code from the generated class is combined, at the byte code level, with the template class to create a combined class for which there never was any source code. See this blog post and this paper for information.)
-
CodeGenerator<T>
: generate the Java source code required to complete the implementation of an abstract template.T
is the interface that results from compiling and merging the runtime code that is generated. -
TemplateClassDefinition<T>
: Identifies the interface which the code will implement, along with the template code used to (what?). Also provides a serial number for classes generated from this template. The template is created from the interface that the generated code is to implement. -
SignatureHolder
: (Seems to hold information about the class to be generated: methods and so on.) -
CodeGeneratorMethod
: (Describes one method in the generated code: name, return value, arguments, etc.) -
FunctionImplementationRegistry
: -
OptionManager
: A Javascript-like, prototype-based "stack" of options set from various sources (system, query, etc.) (Used to customize code?) -
MappingSet
: -
GeneratorMapping
: Maps the four standard "conceptual" methods (setup, eval, reset, cleanup) to the names of the actual method (if any) in the generated code. Many method use a standard set of mappings defined inClassGenerator<T>
:DEFAULT_SCALAR_MAP
andDEFAULT_CONSTANT_MAP
. -
JCodeModel
: "CodeModel is a Java library for code generators; it provides a way to generate Java programs." -
FragmentContext
: Provides thegetImplementationClass( )
method which uses theDrillbitContext
to find the compiler module that generates and compiles the code. -
CodeCompiler
: Mechanism to generate and compile code. Holds the code cache (with a limit of 1000 entries.) -
ClassGenerator
: Generates an a query-specific class from a template.
The code cache is a mapping from CodeGenerator
to generated class. The code generator, however, holds onto code with a unique class name. To allow the cache to work, during code generation, CodeGenerator
replaces the specific class name with "GenericGenerated
", so that the generic version can be used as a key for the code cache.
CodeGenerator
computes its hash code based on two attributes: the definition and the generic code. Similarly, the equals( )
method compares these same two attributes.
The code cache is stored in a Google Guava LocalCache
.
LocalCache
has three very useful features:
- It is concurrent. If a new object must be added to the cache, it locks that cache entry, builds the item, and unlocks. This ensures that if two threads ask for the same value, the first one builds it and the second one waits for completion.
- It has "build if absent" semantics. (See
CodeCompiler.Loader
.) - It will expire old entries after a period of time or when the cache size goes above some threshold. The code cache holds up to 1000 entries.
Other notes:
-
LocalCache
provides the means to gather statistics for the hit rate, etc. Drill does not currently use this feature. -
LocalCache
provides a callback to remove a class. It would seem that a class must reside in aClassLoader
. Wouldn't we need a removal method to remove an expired class from the class loader? - (Note: this is probably wrong) It turns out the this class does not call the standard
Object.hash()
method, instead callingSystem.identityHashCode()
to get an identity hash code for the object itself. This means that Drill never actually reuses classes and instead generates a new class for each operator. This is because Drill does not set theCacheBuilder.keyEquivalence()
object, so thatLocalCache
An unusual aspect of Drill's code generation system is the use of the ASM package to merge Java classes. Byte code manipulation appeared to grow out of the Aspect Oriented Programming community, and appears to be used by projects such as HBase. A good place to start is this tutorial.
The code transformer attempts to do scalar replacement (which also seems to be a feature of the JVM itself.) More information is available in this paper.
The SingleBatchSorter
interface, template and template definition:
public interface SingleBatchSorter {
public void setup(FragmentContext context, SelectionVector2 vector2, VectorAccessible incoming)
throws SchemaChangeException;
public void sort(SelectionVector2 vector2);
public static TemplateClassDefinition<SingleBatchSorter> TEMPLATE_DEFINITION =
new TemplateClassDefinition<SingleBatchSorter>(SingleBatchSorter.class,
SingleBatchSorterTemplate.class);
}
public abstract class SingleBatchSorterTemplate implements SingleBatchSorter, IndexedSortable{
...
public abstract void doSetup(@Named("context") FragmentContext context,
@Named("incoming") VectorAccessible incoming,
@Named("outgoing") RecordBatch outgoing);
public abstract int doEval(@Named("leftIndex") char leftIndex,
@Named("rightIndex") char rightIndex);
}
The following code in ExternalSortBatch
generates the specialized class:
public SingleBatchSorter createNewSorter(FragmentContext context, VectorAccessible batch)
throws ClassTransformationException, IOException, SchemaChangeException{
CodeGenerator<SingleBatchSorter> cg = CodeGenerator.get(SingleBatchSorter.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
ClassGenerator<SingleBatchSorter> g = cg.getRoot();
generateComparisons(g, batch);
return context.getImplementationClass(cg);
}
Here, generateComparisons()
builds up the specific methods needed to do the comparisons needed to implement the sort.
Drill implements a code cache. The cache holds generated classes. The cache is indexed by the entire body of the generated code. This means that each operator must generate source code, then check if that body of source has already been compiled.
The reason for this approach is that each code generator has a very wide interface: the resulting code is dependent not just on properties of the template and query; but also on properties for the fragment, the query and globally. Thus, the only reliable, known key is the generated source itself rather than the (large number) of parameters used to generate the source.