2.
Basic Process
[Diagram: a query enters any Drillbit; each node runs a Drillbit with a Distributed Cache on top of DFS/HBase, coordinated via Zookeeper]
1. Query comes to any Drillbit
2. Drillbit generates execution plan based on affinity
3. Fragments are farmed to individual nodes
4. Data is returned to driving node
4.
Query States
SQL: What we want to do (analyst friendly)
Logical Plan: What we want to do (language agnostic, computer friendly)
Physical Plan: How we want to do it (the best way we can tell)
Execution Plan (fragments): Where we want to do it
6.
Logical Plan: API/Format using JSON
Designed to be as easy as possible for language implementers to utilize
– Sugared syntax such as sequence meta-operator
Don’t constrain ourselves to a SQL-specific paradigm – support complex data type
operators such as collapse and expand as well
Allow late typing
sequence: [
{ op: scan, storageengine: m7, selection: {table: sales}}
{ op: project, projections: [
{ref: name, expr: cf1.name},
{ref: sales, expr: cf1.sales}]}
{ op: segment, ref: by_name, exprs: [name]}
{ op: collapsingaggregate, target: by_name, carryovers: [name],
aggregations: [{ref: total_sales, expr: sum(sales)}]}
{ op: order, ordering: [{order: desc, expr: total_sales}]}
{ op: store, storageengine: screen}
]
7.
Physical Plan
Insert points of parallelization where the optimizer thinks they are necessary
– If we thought that the cardinality of name would be high, we might use an alternative of
sort > range-merge-exchange > streaming aggregate > sort > range-merge-exchange
instead of the simpler hash-random-exchange > sorting-hash-aggregate.
Pick the right version of each operator
– For example, here we’ve picked the sorting hash aggregate. Since a hash aggregate is
already a blocking operator, doing the sort simultaneously allows us to avoid
materializing an intermediate state
Apply projection and other push-down rules into capable operators
– Note that the projection is gone, applied directly to the m7scan operator.
{ @id: 1, pop: m7scan, cluster: def, table: sales, cols: [cf1.name, cf1.sales]}
{ @id: 2, op: hash-random-exchange, input: 1, expr: 1}
{ @id: 3, op: sorting-hash-aggregate, input: 2,
  grouping: 1, aggr:[sum(2)], carry: [1], sort: ~aggr[0]
}
{ @id: 4, op: screen, input: 3}
8.
Execution Plan
Break plan into major fragments
Determine quantity of parallelization for each task based on
estimated costs as well as maximum parallelization for each
fragment (file size for now)
Collect endpoint affinity for each HasAffinity operator
Assign particular nodes based on affinity, load and topology
Generate minor versions of each fragment for individual execution
FragmentId:
Major = portion of dataflow
Minor = a particular version of that execution (1 or more)
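To make the major/minor split concrete, here is a minimal Java sketch of how a fragment identifier could be modeled and fanned out; the class and method names are illustrative, not Drill's actual implementation.

// Hypothetical sketch: a fragment id combines the query, the major fragment
// (a portion of the dataflow) and the minor fragment (one parallel instance).
import java.util.ArrayList;
import java.util.List;

final class FragmentId {
  final String queryId;  // query this fragment belongs to
  final int major;       // which portion of the dataflow
  final int minor;       // which parallel instance of that portion

  FragmentId(String queryId, int major, int minor) {
    this.queryId = queryId;
    this.major = major;
    this.minor = minor;
  }

  // Fan a major fragment out into `width` minor fragments for execution.
  static List<FragmentId> parallelize(String queryId, int major, int width) {
    List<FragmentId> minors = new ArrayList<>();
    for (int i = 0; i < width; i++) {
      minors.add(new FragmentId(queryId, major, i));
    }
    return minors;
  }
}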
9.
Execution Plan, cont’d
Each execution plan has:
One root fragment (runs on driving node)
Leaf fragments (first tasks to run)
Intermediate fragments (won’t start until
they receive data from their children)
In the case where the query output is
routed to storage, the root operator will
often receive metadata to present rather
than data
[Diagram: fragment tree with the Root at the top, Intermediate fragments in the middle, and Leaf fragments at the bottom]
10.
Example Fragments
Leaf Fragment 1
{
  pop : "hash-partition-sender",
  @id : 1,
  child : {
    pop : "mock-scan",
    @id : 2,
    url : "http://apache.org",
    entries : [ {
      id : 1,
      records : 4000
    } ]
  },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ]
}
Leaf Fragment 2
{
  pop : "hash-partition-sender",
  @id : 1,
  child : {
    pop : "mock-scan",
    @id : 2,
    url : "http://apache.org",
    entries : [ {
      id : 1,
      records : 4000
    }, {
      id : 2,
      records : 4000
    } ]
  },
  destinations : [ "Cglsb2NhbGhvc3QY0gk=" ]
}
Root Fragment
{
  pop : "screen",
  @id : 1,
  child : {
    pop : "random-receiver",
    @id : 2,
    providingEndpoints : [ "Cglsb2NhbGhvc3QY0gk=" ]
  }
}
Intermediate Fragment
{
  pop : "single-sender",
  @id : 1,
  child : {
    pop : "mock-store",
    @id : 2,
    child : {
      pop : "filter",
      @id : 3,
      child : {
        pop : "random-receiver",
        @id : 4,
        providingEndpoints : [ "Cglsb2NhbGhvc3QYqRI=",
                               "Cglsb2NhbGhvc3QY0gk=" ]
      },
      expr : " ('b') > (5) "
    }
  },
  destinations : [ "Cglsb2NhbGhvc3QYqRI=" ]
}
12.
SQL Parser
Leverage Optiq
Add support for “any” type
Add support for nested and repeated[] references
Add transformation rules to convert from SQL AST to Logical plan
syntax
13.
Optimizer
Convert Logical to Physical
Very much TBD
Likely leverage Optiq
Hardest problem in system, especially given lack of statistics
Probably not parallel
14.
Execution Planner
Each scan operator provides a maximum width of parallelization
based on the number of read entries (similar to splits)
Decision of parallelization width is based on a simple disk-size cost
(see the sketch after this list)
Affinity orders the location of fragment assignment
Storage, Scan and Exchange operators are informed of the actual
endpoint assignments so they can re-decide their entries (splits)
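A minimal sketch of the two decisions above, width and affinity ordering, assuming illustrative names rather than the actual planner API:

// Width is bounded by the number of read entries (splits), a simple
// size-based cost, and a hard cap; endpoints with more local data come first.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SimpleParallelizer {
  static int decideWidth(int readEntries, long estimatedBytes, long bytesPerTask, int maxWidth) {
    int costWidth = (int) Math.max(1L, estimatedBytes / Math.max(1L, bytesPerTask));
    return Math.min(maxWidth, Math.min(readEntries, costWidth));
  }

  // Order candidate endpoints so those holding the most local bytes are assigned first.
  static List<String> orderByAffinity(Map<String, Long> localBytesPerEndpoint) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(localBytesPerEndpoint.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
    List<String> ordered = new ArrayList<>();
    for (Map.Entry<String, Long> e : entries) {
      ordered.add(e.getKey());
    }
    return ordered;
  }
}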
16.
Execution Engine
Single JVM per Drillbit
Small heap space for object management
Small set of network event threads to manage socket operations
Callbacks for each message sent
Messages contain header and collection of native byte buffers
Designed to minimize copies and ser/de costs
Query setup and fragment runners are managed via processing
queues & thread pools
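As a rough sketch of the queue and thread-pool arrangement (illustrative, not the actual classes), fragment runners can simply be queued onto a shared pool:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FragmentExecutorPool {
  // One shared pool per Drillbit JVM; submitted fragment runners queue up here.
  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  void submit(Runnable fragmentRunner) {
    pool.execute(fragmentRunner);  // queued, then executed on a worker thread
  }

  void shutdown() {
    pool.shutdown();
  }
}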
17.
Data
Records are broken into batches
Batches contain a schema and a collection of fields
Each field has a particular type (e.g. smallint)
Fields (a.k.a. columns) are stored in ValueVectors
ValueVectors are façades over byte buffers (see the sketch after this list).
The in-memory structure of each ValueVector is well defined and
language agnostic
ValueVectors are defined based on the width and nature of the underlying data
– RepeatMap, Fixed1, Fixed2, Fixed4, Fixed8, Fixed12, Fixed16, Bit, FixedLen, VarLen1, VarLen2, VarLen4
There are three sub value vector types
– Optional (nullable), required or repeated
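A minimal sketch of the fixed-width case, with illustrative names (not Drill's actual ValueVector classes): value i of a 4-byte field lives at byte offset i * 4 in a direct buffer.

import java.nio.ByteBuffer;

// Facade over a byte buffer holding 4-byte values (e.g. int columns).
final class Fixed4Vector {
  private final ByteBuffer data;
  private final int valueCount;

  Fixed4Vector(int valueCount) {
    this.valueCount = valueCount;
    this.data = ByteBuffer.allocateDirect(valueCount * 4);  // off-heap style storage
  }

  void set(int index, int value) {
    data.putInt(index * 4, value);
  }

  int get(int index) {
    return data.getInt(index * 4);
  }

  int getValueCount() {
    return valueCount;
  }
}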
18.
Execution Paradigm
We will have a large number of operators
Each operator works on a batch of records at a time
A loose goal is that batches are roughly the size of a single core’s L2 cache
Each batch of records carries a schema
An operator is responsible for reconfiguring itself if a new schema arrives (or rejecting
the record batch if the schema is disallowed)
Most operators are the combination of a set of static operations along with the
evaluation of query specific expressions
Runtime compiled operators are the combination of a pre-compiled template and a
runtime compiled set of expressions
Exchange operators are converted into Senders and Receivers when the execution plan is
materialized
Each operator must support consumption of a SelectionVector, a partial
materialization of a filter
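A simplified illustration of the SelectionVector idea, using made-up types: a filter records the indices of qualifying records instead of copying them, and downstream operators read the batch through that indirection.

// 2-byte indices into the underlying record batch (simplified sketch).
final class SelectionVector {
  private final char[] offsets;
  private int count;

  SelectionVector(int capacity) {
    this.offsets = new char[capacity];
  }

  void add(int recordIndex) {
    offsets[count++] = (char) recordIndex;
  }

  int size() {
    return count;
  }

  int getIndex(int i) {
    return offsets[i];
  }
}

final class FilterSketch {
  // Filter an int column without materializing a new batch: only record indices survive.
  static SelectionVector greaterThan(int[] column, int threshold) {
    SelectionVector sv = new SelectionVector(column.length);
    for (int i = 0; i < column.length; i++) {
      if (column[i] > threshold) {
        sv.add(i);
      }
    }
    return sv;
  }
}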
19.
Storage Engine
Input and output are done through storage engines
– (and the specialized screen storage operator)
A storage engine is responsible for providing metadata and statistics about
the data
A storage engine exposes a set of optimizer (plan rewrite) rules to support
things such as predicate pushdown
A storage engine provides one or more storage engine specific scan
operators that can support affinity exposure and task splitting
– These are generated based on a StorageEngine specific configuration
The primary interfaces are RecordReader and RecordWriter (see the sketch after this list).
RecordReaders are responsible for
– Converting stored data into Drill canonical ValueVector format a batch at a time
– Providing schema for each record batch
Our initial storage engines will be for DFS and HBase
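The reader contract might look roughly like the following; these interfaces are a sketch of the description above, not Drill's exact API.

// A reader is set up against an output sink, then fills one batch of vectors at a time.
interface RecordReader {
  // Prepare the reader and declare the vectors (columns) it will populate.
  void setup(OutputMutator output) throws Exception;

  // Fill the declared vectors with the next batch; return the record count,
  // or 0 once the underlying data is exhausted.
  int next();

  void cleanup();
}

// Sink through which a reader registers the value vectors it fills each batch.
interface OutputMutator {
  void addField(String name, java.nio.ByteBuffer vector);
}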
20.
Messages
Foreman drives query
Foreman saves intermediate fragments to distributed cache
Foreman sends leaf fragments directly to execution nodes
Executing fragments push record batches to their fragment’s destination
nodes
When a destination node first receives data for a new query, it retrieves
its appropriate fragment from the distributed cache, sets up the required
framework, then waits until its start criterion is met:
– A fragment is evaluated for the number of different sending streams that are
required before the query can actually be scheduled, based on each exchange’s
"supportsOutOfOrder" capability.
– When the IncomingBatchHandler recognizes that its start criterion has been
reached, it begins execution (see the sketch after this list)
– In the meantime, the destination node will buffer (potentially to disk)
Fragment status messages are pushed back to foreman directly from
individual nodes
A single failure status causes the foreman to cancel all other parts of query
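A minimal sketch of that start criterion, with illustrative names: the handler counts distinct sending streams and signals readiness once the required number has been seen.

import java.util.HashSet;
import java.util.Set;

final class IncomingStartTracker {
  private final int requiredStreams;              // derived from the fragment's receivers
  private final Set<Integer> seenSenders = new HashSet<>();

  IncomingStartTracker(int requiredStreams) {
    this.requiredStreams = requiredStreams;
  }

  // Record an arriving batch from a sender; true once the fragment can be scheduled.
  synchronized boolean batchArrived(int sendingMinorFragmentId) {
    seenSenders.add(sendingMinorFragmentId);
    return seenSenders.size() >= requiredStreams;
  }
}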
21.
Scheduling
Plan is to leverage the concepts inside Sparrow
Reality is that receiver-side buffering and pre-assigned execution
locations mean that this is very much up in the air right now
22.
Operation/Configuration
Drillbit is a single JVM
Extension is done by building to an API and generating a JAR file
that includes a drill-module.conf file with information about where
that module needs to be inserted
All configuration is done via a JSON-like configuration metaphor
that supports complex types (see the sketch after this list)
Node discovery/service registry is done through Zookeeper
Metrics are collected utilizing the Yammer metrics module
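A hedged sketch of reading that configuration with the Typesafe Config (HOCON) library listed on the Technologies slide; the keys shown are made up for illustration.

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class DrillbitConfigSketch {
  public static void main(String[] args) {
    Config config = ConfigFactory.load();                      // merges reference and application configs
    String zkConnect = config.getString("drill.zk.connect");   // hypothetical key
    int rpcPort = config.getInt("drill.rpc.port");             // hypothetical key
    System.out.println("zk=" + zkConnect + " rpc port=" + rpcPort);
  }
}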
23.
User Interfaces
Drill provides DrillClient
– Encapsulates endpoint discovery
– Supports logical and physical plan submission, query cancellation, query
status
– Supports streaming return results
Drill will provide a JDBC driver which converts JDBC calls into DrillClient
communication (see the sketch after this list).
– Currently SQL parsing is done client side
• Artifact of the current state of Optiq
• Need to slim up the JDBC driver and push stuff remotely
In time, will add REST proxy for DrillClient
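A hedged sketch of querying through the planned JDBC driver using only the standard java.sql API; the connection URL format and table name are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical URL pointing at the Zookeeper quorum used for endpoint discovery.
    String url = "jdbc:drill:zk=localhost:2181";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT name, SUM(sales) AS total_sales FROM sales GROUP BY name ORDER BY total_sales DESC")) {
      while (rs.next()) {
        System.out.println(rs.getString("name") + "\t" + rs.getObject("total_sales"));
      }
    }
  }
}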
24.
Technologies
Jackson for JSON SerDe of metadata
Typesafe HOCON for configuration and module management
Netty4 as core RPC engine, protobuf for communication
Vanilla Java, Larray and Netty ByteBuf for off-heap large data structures
Hazelcast for distributed cache
Curator on top of Zookeeper for service registry
Optiq for SQL parsing and cost optimization
Parquet (probably) as ‘native’ format
Janino for expression compilation
ASM for ByteCode manipulation
Yammer Metrics for metrics
Guava extensively
Carrot HPPC for primitive collections