Batch Handling Upgrades

Paul Rogers edited this page Jan 11, 2018 · 10 revisions

The batch handling framework consists of several layers that together let Drill control the size of each record batch. Controlling batch size, in turn, allows Drill to implement effective memory management and admission control.

The material here starts with concepts, then provides a tour of the various components. Each component is heavily commented, so after reading this material you should be able to get the remaining details from the code itself.

  1. Row set loader. Concept of overflow. Column states. Vector states. Overflow processing. Vector allocation. Vector cache and multi-reader model.

  2. Operator framework. Split of concerns. Protocol adapter. Schema change detection.

  3. Projection framework. Concepts. Project lists. Null columns. Implicit columns. Assembling the output batch. Column information in projection list. Recursive projection in maps. Schema smoothing and persistence.

  4. Mock reader. CSV reader. Easy format plugin. Concept of Parquet support.

  5. JSON concepts. JSON issues. Revised JSON parser. JSON semantics. Open issues. Possible opportunities.

  6. Future opportunities. Code generation. Plugin APIs. Reader retrofits. Fixed-size buffers.
