2.
Basic Process
[Diagram: a query enters any Drillbit; each node runs a Drillbit with a Distributed Cache on top of DFS/HBase, coordinated via Zookeeper]
1. Query comes to any Drillbit
2. Drillbit generates execution plan based on affinity
3. Fragments are farmed to individual nodes
4. Data is returned to driving node
4.
Query States
SQL: What we want to do (analyst friendly)
Logical Plan: What we want to do (language agnostic, computer friendly)
Physical Plan: How we want to do it (the best way we can tell)
Execution Plan (fragments): Where we want to do it
6.
Logical Plan: API/Format using JSON
Designed to be as easy as possible for language implementers to utilize
– Sugared syntax such as sequence meta-operator
Don’t constrain ourselves to a SQL-specific paradigm – support complex data type
operators such as collapse and expand as well
Allow late typing
sequence: [
{ op: scan, storageengine: m7, selection: {table: sales}}
{ op: project, projections: [
{ref: name, expr: cf1.name},
{ref: sales, expr: cf1.sales}]}
{ op: segment, ref: by_name, exprs: [name]}
{ op: collapsingaggregate, target: by_name, carryovers: [name],
aggregations: [{ref: total_sales, expr: sum(sales)}]}
{ op: order, ordering: [{order: desc, expr: total_sales}]}
{ op: store, storageengine: screen}
]
7.
Physical Plan
Insert points of parallelization where the optimizer thinks they are necessary
– If we thought that the cardinality of name would be high, we might use an alternative of
sort > range-merge-exchange > streaming aggregate > sort > range-merge-exchange
instead of the simpler hash-random-exchange > sorting-hash-aggregate.
Pick the right version of each operator
– For example, here we’ve picked the sorting hash aggregate. Since a hash aggregate is
already a blocking operator, doing the sort simultaneously allows us to avoid
materializing an intermediate state
Apply projection and other push-down rules into capable operators
– Note that the projection is gone, applied directly to the m7scan operator.
{ @id: 1, pop: m7scan, cluster: def, table: sales, cols: [cf1.name, cf1.sales]}
{ @id: 2, op: hash-random-exchange, input: 1, expr: 1}
{ @id: 3, op: sorting-hash-aggregate, input: 2,
  grouping: 1, aggr:[sum(2)], carry: [1], sort: ~aggr[0]
}
{ @id: 4, op: screen, input: 3}
8.
Execution Plan
Break plan into major fragments
Determine quantity of parallelization for each task based on
estimated costs as well as maximum parallelization for each
fragment (file size for now)
Collect endpoint affinity for each HasAffinity operator
Assign particular nodes based on affinity, load and topology
Generate minor versions of each fragment for individual execution
FragmentId:
Major = portion of dataflow
Minor = a particular version of that execution (1 or more)
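To make the major/minor split concrete, here is a minimal Java sketch of how a fragment identifier could be modeled and fanned out; the class and method names are illustrative, not Drill's actual implementation.

// Hypothetical sketch: a fragment id combines the query, the major fragment
// (a portion of the dataflow) and the minor fragment (one parallel instance).
import java.util.ArrayList;
import java.util.List;

final class FragmentId {
  final String queryId;  // query this fragment belongs to
  final int major;       // which portion of the dataflow
  final int minor;       // which parallel instance of that portion

  FragmentId(String queryId, int major, int minor) {
    this.queryId = queryId;
    this.major = major;
    this.minor = minor;
  }

  // Fan a major fragment out into `width` minor fragments for execution.
  static List<FragmentId> parallelize(String queryId, int major, int width) {
    List<FragmentId> minors = new ArrayList<>();
    for (int i = 0; i < width; i++) {
      minors.add(new FragmentId(queryId, major, i));
    }
    return minors;
  }
}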
9.
Execution Plan, cont’d
Each execution plan has:
One root fragment (runs on driving node)
Leaf fragments (first tasks to run)
Intermediate fragments (won’t start until
they receive data from their children)
In the case where the query output is
routed to storage, the root operator will
often receive metadata to present rather
than data
[Diagram: fragment tree with the Root at the top, Intermediate fragments in the middle, and Leaf fragments at the bottom]
10.
Example Fragments
Leaf Fragment 1
{
  pop : "hash-partition-sender",
  @id : 1,
  child : {
    pop : "mock-scan",
    @id : 2,
    url : "http://apache.org",
    entries : [ {
      id : 1,
      records : 4000
    } ]
  },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ]
}
Leaf Fragment 2
{
  pop : "hash-partition-sender",
  @id : 1,
  child : {
    pop : "mock-scan",
    @id : 2,
    url : "http://apache.org",
    entries : [ {
      id : 1,
      records : 4000
    }, {
      id : 2,
      records : 4000
    } ]
  },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ]
}
Root Fragment
{
  pop : "screen",
  @id : 1,
  child : {
    pop : "random-receiver",
    @id : 2,
    providingEndpoints : [ "Cglsb2NhbGhvc3QY0gk=" ]
  }
}
Intermediate Fragment
{
  pop : "single-sender",
  @id : 1,
  child : {
    pop : "mock-store",
    @id : 2,
    child : {
      pop : "filter",
      @id : 3,
      child : {
        pop : "random-receiver",
        @id : 4,
        providingEndpoints : [ "Cglsb2NhbGhvc3QYqRI=",
                               "Cglsb2NhbGhvc3QY0gk=" ]
      },
      expr : " ('b') > (5) "
    }
  },
  destinations : [ "Cglsb2NhbGhvc3QYqRI=" ]
}
12.
SQL Parser
Leverage Optiq
Add support for “any” type
Add support for nested and repeated[] references
Add transformation rules to convert from SQL AST to Logical plan
syntax
13.
Optimizer
Convert Logical to Physical
Very much TBD
Likely leverage Optiq
Hardest problem in system, especially given lack of statistics
Probably not parallel
14.
Execution Planner
Each scan operator provides a maximum width of parallelization
based on the number of read entries (similar to splits)
Decision of parallelization width is based on a simple disk-size cost
(see the sketch after this list)
Affinity orders the location of fragment assignment
Storage, Scan and Exchange operators are informed of the actual
endpoint assignments so they can re-decide their entries (splits)
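A minimal sketch of the two decisions above, width and affinity ordering, assuming illustrative names rather than the actual planner API:

// Width is bounded by the number of read entries (splits), a simple
// size-based cost, and a hard cap; endpoints with more local data come first.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SimpleParallelizer {
  static int decideWidth(int readEntries, long estimatedBytes, long bytesPerTask, int maxWidth) {
    int costWidth = (int) Math.max(1L, estimatedBytes / Math.max(1L, bytesPerTask));
    return Math.min(maxWidth, Math.min(readEntries, costWidth));
  }

  // Order candidate endpoints so those holding the most local bytes are assigned first.
  static List<String> orderByAffinity(Map<String, Long> localBytesPerEndpoint) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(localBytesPerEndpoint.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
    List<String> ordered = new ArrayList<>();
    for (Map.Entry<String, Long> e : entries) {
      ordered.add(e.getKey());
    }
    return ordered;
  }
}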
16.
Execution Engine
Single JVM per Drillbit
Small heap space for object management
Small set of network event threads to manage socket operations
Callbacks for each message sent
Messages contain header and collection of native byte buffers
Designed to minimize copies and ser/de costs
Query setup and fragment runners are managed via processing
queues & thread pools
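As a rough sketch of the queue and thread-pool arrangement (illustrative, not the actual classes), fragment runners can simply be queued onto a shared pool:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FragmentExecutorPool {
  // One shared pool per Drillbit JVM; submitted fragment runners queue up here.
  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  void submit(Runnable fragmentRunner) {
    pool.execute(fragmentRunner);  // queued, then executed on a worker thread
  }

  void shutdown() {
    pool.shutdown();
  }
}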
17.
Data
Records are broken into batches
Batches contain a schema and a collection of fields
Each field has a particular type (e.g. smallint)
Fields (a.k.a. columns) are stored in ValueVectors
ValueVectors are façades over byte buffers (see the sketch after this list).
The in-memory structure of each ValueVector is well defined and
language agnostic
ValueVectors are defined based on the width and nature of the underlying data
– RepeatMap, Fixed1, Fixed2, Fixed4, Fixed8, Fixed12, Fixed16, Bit, FixedLen, VarLen1, VarLen2, VarLen4
There are three sub value vector types
– Optional (nullable), required or repeated
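A minimal sketch of the fixed-width case, with illustrative names (not Drill's actual ValueVector classes): value i of a 4-byte field lives at byte offset i * 4 in a direct buffer.

import java.nio.ByteBuffer;

// Facade over a byte buffer holding 4-byte values (e.g. int columns).
final class Fixed4Vector {
  private final ByteBuffer data;
  private final int valueCount;

  Fixed4Vector(int valueCount) {
    this.valueCount = valueCount;
    this.data = ByteBuffer.allocateDirect(valueCount * 4);  // off-heap style storage
  }

  void set(int index, int value) {
    data.putInt(index * 4, value);
  }

  int get(int index) {
    return data.getInt(index * 4);
  }

  int getValueCount() {
    return valueCount;
  }
}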
18.
Execution Paradigm
We will have a large number of operators
Each operator works on a batch of records at a time
A loose goal is that batches are roughly the size of a single core’s L2 cache
Each batch of records carries a schema
An operator is responsible for reconfiguring itself if a new schema arrives (or rejecting
the record batch if the schema is disallowed)
Most operators are the combination of a set of static operations along with the
evaluation of query specific expressions
Runtime compiled operators are the combination of a pre-compiled template and a
runtime compiled set of expressions
Exchange operators are converted into Senders and Receivers when the execution plan is
materialized
Each operator must support consumption of a SelectionVector, a partial
materialization of a filter
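A simplified illustration of the SelectionVector idea, using made-up types: a filter records the indices of qualifying records instead of copying them, and downstream operators read the batch through that indirection.

// 2-byte indices into the underlying record batch (simplified sketch).
final class SelectionVector {
  private final char[] offsets;
  private int count;

  SelectionVector(int capacity) {
    this.offsets = new char[capacity];
  }

  void add(int recordIndex) {
    offsets[count++] = (char) recordIndex;
  }

  int size() {
    return count;
  }

  int getIndex(int i) {
    return offsets[i];
  }
}

final class FilterSketch {
  // Filter an int column without materializing a new batch: only record indices survive.
  static SelectionVector greaterThan(int[] column, int threshold) {
    SelectionVector sv = new SelectionVector(column.length);
    for (int i = 0; i < column.length; i++) {
      if (column[i] > threshold) {
        sv.add(i);
      }
    }
    return sv;
  }
}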
19.
Storage Engine
Input and output are done through storage engines
– (and the specialized screen storage operator)
A storage engine is responsible for providing metadata and statistics about
the data
A storage engine exposes a set of optimizer (plan rewrite) rules to support
things such as predicate pushdown
A storage engine provides one or more storage engine specific scan
operators that can support affinity exposure and task splitting
– These are generated based on a StorageEngine specific configuration
The primary interfaces are RecordReader and RecordWriter (see the sketch after this list).
RecordReaders are responsible for
– Converting stored data into Drill canonical ValueVector format a batch at a time
– Providing schema for each record batch
Our initial storage engines will be for DFS and HBase
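The reader contract might look roughly like the following; these interfaces are a sketch of the description above, not Drill's exact API.

// A reader is set up against an output sink, then fills one batch of vectors at a time.
interface RecordReader {
  // Prepare the reader and declare the vectors (columns) it will populate.
  void setup(OutputMutator output) throws Exception;

  // Fill the declared vectors with the next batch; return the record count,
  // or 0 once the underlying data is exhausted.
  int next();

  void cleanup();
}

// Sink through which a reader registers the value vectors it fills each batch.
interface OutputMutator {
  void addField(String name, java.nio.ByteBuffer vector);
}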
20.
Messages
Foreman drives query
Foreman saves intermediate fragments to distributed cache
Foreman sends leaf fragments directly to execution nodes
Executing fragments push record batches to their fragment’s destination
nodes
When a destination node first receives data for a new query, it retrieves
its appropriate fragment from the distributed cache, sets up the required
framework, then waits until its start criterion is met:
– A fragment is evaluated for the number of different sending streams that are
required before the query can actually be scheduled, based on each exchange’s
"supportsOutOfOrder" capability.
– When the IncomingBatchHandler recognizes that its start criterion has been
reached, it begins execution (see the sketch after this list)
– In the meantime, the destination node will buffer (potentially to disk)
Fragment status messages are pushed back to foreman directly from
individual nodes
A single failure status causes the foreman to cancel all other parts of query
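A minimal sketch of that start criterion, with illustrative names: the handler counts distinct sending streams and signals readiness once the required number has been seen.

import java.util.HashSet;
import java.util.Set;

final class IncomingStartTracker {
  private final int requiredStreams;              // derived from the fragment's receivers
  private final Set<Integer> seenSenders = new HashSet<>();

  IncomingStartTracker(int requiredStreams) {
    this.requiredStreams = requiredStreams;
  }

  // Record an arriving batch from a sender; true once the fragment can be scheduled.
  synchronized boolean batchArrived(int sendingMinorFragmentId) {
    seenSenders.add(sendingMinorFragmentId);
    return seenSenders.size() >= requiredStreams;
  }
}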
21.
Scheduling
Plan is to leverage the concepts inside Sparrow
Reality is that receiver-side buffering and pre-assigned execution
locations mean that this is very much up in the air right now
22.
Operation/Configuration
Drillbit is a single JVM
Extension is done by building to an API and generating a JAR file
that includes a drill-module.conf file with information about where
that module needs to be inserted
All configuration is done via a JSON-like configuration metaphor
that supports complex types (see the sketch after this list)
Node discovery/service registry is done through Zookeeper
Metrics are collected utilizing the Yammer metrics module
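A hedged sketch of reading that configuration with the Typesafe Config (HOCON) library listed on the Technologies slide; the keys shown are made up for illustration.

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class DrillbitConfigSketch {
  public static void main(String[] args) {
    Config config = ConfigFactory.load();                      // merges reference and application configs
    String zkConnect = config.getString("drill.zk.connect");   // hypothetical key
    int rpcPort = config.getInt("drill.rpc.port");             // hypothetical key
    System.out.println("zk=" + zkConnect + " rpc port=" + rpcPort);
  }
}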
23.
User Interfaces
Drill provides DrillClient
– Encapsulates endpoint discovery
– Supports logical and physical plan submission, query cancellation, query
status
– Supports streaming return results
Drill will provide a JDBC driver which converts JDBC calls into DrillClient
communication (see the sketch after this list).
– Currently SQL parsing is done client side
• Artifact of the current state of Optiq
• Need to slim up the JDBC driver and push stuff remotely
In time, will add REST proxy for DrillClient
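A hedged sketch of querying through the planned JDBC driver using only the standard java.sql API; the connection URL format and table name are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical URL pointing at the Zookeeper quorum used for endpoint discovery.
    String url = "jdbc:drill:zk=localhost:2181";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT name, SUM(sales) AS total_sales FROM sales GROUP BY name ORDER BY total_sales DESC")) {
      while (rs.next()) {
        System.out.println(rs.getString("name") + "\t" + rs.getObject("total_sales"));
      }
    }
  }
}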
24.
Technologies
Jackson for JSON SerDe of metadata
Typesafe HOCON for configuration and module management
Netty4 as core RPC engine, protobuf for communication
Vanilla Java, Larray and Netty ByteBuf for off-heap large data structures
Hazelcast for distributed cache
Curator on top of Zookeeper for service registry
Optiq for SQL parsing and cost optimization
Parquet (probably) as ‘native’ format
Janino for expression compilation
ASM for ByteCode manipulation
Yammer Metrics for metrics
Guava extensively
Carrot HPPC for primitive collections