Hug france-2012-12-04

Apache Drill

©MapR Technologies - Confidential 1

My Background

 Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– Big data since before big

 Open source
– since the dark ages before the internet
– Mahout, Zookeeper, Drill
– bought the beer at first HUG

 MapR
 Founding member of Apache Drill


MapR Technologies

 The open enterprise-grade distribution for Hadoop
– Easy, dependable and fast
– Open source with standards-based extensions

 MapR is deployed at thousands of companies
– From small Internet startups to the world’s largest enterprises

 MapR customers analyze massive amounts of data:
– Hundreds of billions of events daily
– 90% of the world’s Internet population monthly
– $1 trillion in retail purchases annually

 MapR has partnered with Google to provide Hadoop on Google Compute Engine


Agenda

 What?
– What exactly does Drill do?
 Why?
– Why do we need Apache Drill?
 Who?
– Who is doing this?
 How?
– How does Drill work inside?
 Conclusion
– How can you help?
– Where can you find out more?


Apache Drill Overview

 Drill overview
– Low latency interactive queries
– Standard ANSI SQL support

 Open-Source
– Hundreds involved across the US and Europe
– Community consensus on API, functionality

 PMC expects first version late this quarter
– Several components already developed


Big Data Processing – Hadoop

                      Batch processing
Query runtime         Minutes to hours
Data volume           TBs to PBs
Programming model     MapReduce
Users                 Developers
Google project        MapReduce
Open source project   Hadoop MapReduce


Big Data Processing – Hadoop and Storm

                      Batch processing    Stream processing
Query runtime         Minutes to hours    Never-ending
Data volume           TBs to PBs          Continuous stream
Programming model     MapReduce           DAG (pre-programmed)
Users                 Developers          Developers
Open source project   Hadoop MapReduce    Storm or Apache S4


Big Data Processing – The missing part

                      Batch processing    Interactive analysis      Stream processing
Query runtime         Minutes to hours                              Never-ending
Data volume           TBs to PBs                                    Continuous stream
Programming model     MapReduce                                     DAG (pre-programmed)
Users                 Developers                                    Developers
Open source project   Hadoop MapReduce                              Storm and S4


Big Data Processing – The missing part

                      Batch processing    Interactive analysis      Stream processing
Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
Data volume           TBs to PBs          GBs to PBs                Continuous stream
Programming model     MapReduce           Queries (ad hoc)          DAG (pre-programmed)
Users                 Developers          Analysts and developers   Developers
Open source project   Hadoop MapReduce                              Storm and S4


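Big Data Processing

                      Batch processing    Interactive analysis      Stream processing
Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
Data volume           TBs to PBs          GBs to PBs                Continuous stream
Programming model     MapReduce           Queries (ad hoc)          DAG (pre-programmed)
Users                 Developers          Analysts and developers   Developers
Google project        MapReduce           Dremel
Open source project   Hadoop MapReduce                              Storm and S4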

Introducing Apache Drill

Latency Matters

 Ad-hoc analysis with interactive tools

 Real-time dashboards

 Event/trend detection and analysis
– Network intrusions
– Fraud
– Failures


Nested Query Languages

 DrQL
– SQL-like query language for nested data
– Compatible with Google BigQuery/Dremel
• BigQuery applications should work with Drill
– Designed to support efficient column-based processing
• No record assembly during query processing

 Mongo Query Language
– {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

 Other languages/programming models can plug in
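
As a rough illustration of what the Mongo-style query on this slide asks for, here is a minimal Python sketch that evaluates it over in-memory documents. It is not Drill or MongoDB code: the function name `run_mongo_style` and the sample documents are hypothetical, and only plain equality matching and single-direction ordering are handled.

```python
def run_mongo_style(query, docs):
    """Apply {$query: {...}, $orderby: {...}} to a list of dicts (equality only)."""
    criteria = query.get("$query", {})
    matched = [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]
    # Sort by the last key first so that the first key ends up most significant.
    for field, direction in reversed(list(query.get("$orderby", {}).items())):
        matched.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matched


# Hypothetical documents; the query is the one shown on the slide.
docs = [
    {"x": 3, "y": "abc", "note": "first"},
    {"x": 4, "y": "abc", "note": "wrong x"},
    {"x": 3, "y": "xyz", "note": "wrong y"},
    {"x": 3, "y": "abc", "note": "second"},
]
query = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_mongo_style(query, docs)
```

A pluggable front end would translate such a query into the same logical operators (scan, filter, sort) that a DrQL query produces.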


Nested Data Model

 The data model in Dremel is Protocol Buffers
– Nested
– Schema
 Apache Drill is designed to support multiple data models
– Schema: Protocol Buffers, Apache Avro, …
– Schema-less: JSON, BSON, …
 Flat records are supported as a special case of nested data
– CSV, TSV, …

Avro IDL

enum Gender {
  MALE, FEMALE
}

record User {
  string name;
  Gender gender;
  long followers;
}

JSON

{
  "name": "Srivas",
  "gender": "Male",
  "followers": 100
}
{
  "name": "Raina",
  "gender": "Female",
  "followers": 200,
  "zip": "94305"
}
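
To make the schema-less case concrete, the following minimal Python sketch splits the two JSON records from this slide into per-field columns. It is only an illustration, not Drill's storage format: the helper `field_paths` and the `(row, value)` column layout are assumptions made for this example.

```python
import json


def field_paths(record, prefix=""):
    """Yield (dotted path, value) pairs for every leaf value in a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from field_paths(value, path)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    yield from field_paths(item, path)
                else:
                    yield path, item
        else:
            yield path, value


records = [
    json.loads('{"name": "Srivas", "gender": "Male", "followers": 100}'),
    json.loads('{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'),
]

# One column per field path; each entry remembers which row it came from,
# so a field that is missing from a record simply has no entry for that row.
columns = {}
for row, record in enumerate(records):
    for path, value in field_paths(record):
        columns.setdefault(path, []).append((row, value))
```

Flat records such as these produce one-level paths, while nested records would produce dotted paths like `Name.Language.Code`; the same code handles both, which is the sense in which flat data is a special case.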

Extensibility

 Nested query languages
– Pluggable model
– DrQL
– Mongo Query Language
– Cascading

 Distributed execution engine
– Extensible model (e.g., Dryad)
– Low-latency
– Fault tolerant

 Nested data formats
– Pluggable model
– Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
– Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)

 Scalable data sources
– Pluggable model
– Hadoop
– HBase


Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)


Apache Drill


Architecture

 Only the execution engine knows the physical attributes of the cluster
– # nodes, hardware, file locations, …

 Public interfaces enable extensibility
– Developers can build parsers for new query languages
– Developers can provide an execution plan directly

 Each level of the plan has a human-readable representation
– Facilitates debugging and unit testing


Execution Engine Layers

 Drill execution engine has two layers
– Operator layer is serialization-aware
• Processes individual records
– Execution layer is not serialization-aware
• Processes batches of records (blobs)
• Responsible for communication, dependencies and fault tolerance
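
The split between the two layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Drill's engine: newline-delimited JSON stands in for the serialization format, and the names `filter_operator` and `run_pipeline` are hypothetical. Communication, dependency tracking and fault tolerance are left out.

```python
import json


def filter_operator(batch: bytes, predicate) -> bytes:
    """Operator layer: understands the serialization and works record by record."""
    records = [json.loads(line) for line in batch.splitlines()]
    kept = [r for r in records if predicate(r)]
    return "\n".join(json.dumps(r) for r in kept).encode()


def run_pipeline(batches, operators):
    """Execution layer: moves opaque blobs between operators without decoding them."""
    for blob in batches:
        for op in operators:
            blob = op(blob)
        yield blob


# Two hypothetical batches of serialized records.
batches = [b'{"x": 1}\n{"x": 5}', b'{"x": 7}']
keep_large = lambda blob: filter_operator(blob, lambda r: r["x"] > 2)
out = list(run_pipeline(batches, [keep_large]))
```

Only `filter_operator` ever parses a record; `run_pipeline` could ship the same blobs across a network unchanged.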


DrQL Example

SELECT DocId AS Id,
       COUNT(Name.Language.Code) WITHIN Name AS Cnt,
       Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http')
  AND DocId < 20;

* Example from the Dremel paper
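
The nested semantics of this query are easier to see when spelled out procedurally. The Python sketch below is a simplified emulation, not DrQL itself: the sample record is modelled on record r1 from the Dremel paper (the slide does not show the data), and `run_query` is a hypothetical helper that hard-codes this one query.

```python
import re

# Sample record modelled on r1 in the Dremel paper; an assumption for this sketch.
r1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}


def run_query(records):
    """Emulate the slide's query: filter on DocId and Name.Url, aggregate per Name."""
    out = []
    for rec in records:
        if not rec["DocId"] < 20:
            continue
        names = []
        for name in rec.get("Name", []):
            url = name.get("Url")
            # WHERE REGEXP(Name.Url, '^http'): a Name without a Url is dropped.
            if url is None or not re.search(r"^http", url):
                continue
            codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
            # COUNT(...) WITHIN Name: one count per Name, not per record.
            names.append({"Cnt": len(codes), "Str": [f"{url},{c}" for c in codes]})
        out.append({"Id": rec["DocId"], "Name": names})
    return out


result = run_query([r1, {"DocId": 20, "Name": [{"Url": "http://C"}]}])
```

The third `Name` in the sample has no `Url`, so it is filtered out, while the second survives with a count of zero; this is the behaviour that `WITHIN Name` scoping is meant to capture.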

Query Components

 Query components:
– SELECT
– FROM
– WHERE
– GROUP BY
– HAVING
– (JOIN)

 Key logical operators:
– Scan
– Filter
– Aggregate
– (Join)


Logical Plan

scan-json "table-1"

filter exp1

flatten

aggregate exp2


Logical Plan Syntax
{op: "sequence",
 do: [
   {op: "scan",
    source: "table-1.json",
    selection: "*"
   },
   {op: "filter",
    expr: <expr>
   },
   {op: "flatten",
    expr: <expr>,
    drop: "false"
   },
   {op: "aggregate",
    type: repeat,
    keys: [<name>,...],
    aggregations: [
      {ref: <name>, expr: <aggexpr> },...
    ]
   }
 ]
}
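
A plan of this shape can be interpreted step by step. The Python sketch below runs a `sequence` plan over in-memory tables; it is an assumption-laden illustration rather than Drill's evaluator. Python callables stand in for `<expr>` and `<aggexpr>`, `flatten` is assumed to emit one row per element of a repeated field, and the `selection`, `drop` and `type` attributes are ignored. The function name `run_sequence` and the sample table are hypothetical.

```python
from collections import defaultdict


def run_sequence(plan, tables):
    """Run a {op: "sequence", do: [...]} plan over in-memory tables."""
    rows = []
    for step in plan["do"]:
        op = step["op"]
        if op == "scan":
            rows = list(tables[step["source"]])
        elif op == "filter":
            rows = [r for r in rows if step["expr"](r)]
        elif op == "flatten":
            # Assumed semantics: one output row per element of the repeated field.
            field = step["expr"]
            rows = [{**r, field: item} for r in rows for item in r[field]]
        elif op == "aggregate":
            groups = defaultdict(list)
            for r in rows:
                groups[tuple(r[k] for k in step["keys"])].append(r)
            rows = [
                {**dict(zip(step["keys"], key)),
                 **{a["ref"]: a["expr"](members) for a in step["aggregations"]}}
                for key, members in groups.items()
            ]
        else:
            raise ValueError(f"unknown operator: {op}")
    return rows


tables = {"table-1.json": [
    {"user": "a", "tags": ["x", "y"], "active": True},
    {"user": "b", "tags": ["x"], "active": True},
    {"user": "c", "tags": ["z"], "active": False},
]}
plan = {"op": "sequence", "do": [
    {"op": "scan", "source": "table-1.json", "selection": "*"},
    {"op": "filter", "expr": lambda r: r["active"]},
    {"op": "flatten", "expr": "tags"},
    {"op": "aggregate", "keys": ["tags"],
     "aggregations": [{"ref": "cnt", "expr": len}]},
]}
result = run_sequence(plan, tables)
```

Because each step only consumes the previous step's rows, a `sequence` needs no explicit wiring between operators.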

Representing a DAG

[Diagram: the output of node 18 feeds node 19, "aggregate exp2"]

{ @id: 19, op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>,...],
  aggregations: [
    {ref: <name>, expr: <aggexpr> },...
  ]
}
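
Once operators carry an `@id` and name their `input`, the plan is a graph rather than a list, and an engine has to order the nodes before running them. The sketch below shows one way to do that in Python with a depth-first topological sort. The function name `execution_order` is hypothetical, and nodes 17 and 18 are invented for the example; only node 19 comes from the slide.

```python
def execution_order(nodes):
    """Return node ids so that every node appears after all of its inputs."""
    by_id = {n["@id"]: n for n in nodes}
    order, seen = [], set()

    def visit(node_id, stack=()):
        if node_id in seen:
            return
        if node_id in stack:
            raise ValueError(f"cycle through node {node_id}")
        inputs = by_id[node_id].get("input", [])
        if not isinstance(inputs, list):
            inputs = [inputs]  # a single input may be given as a bare id
        for upstream in inputs:
            visit(upstream, stack + (node_id,))
        seen.add(node_id)
        order.append(node_id)

    for node_id in by_id:
        visit(node_id)
    return order


nodes = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "flatten", "input": 17},   # hypothetical upstream node
    {"@id": 17, "op": "scan"},                   # hypothetical source node
]
order = execution_order(nodes)
```

The nodes may be listed in any order in the plan; the ids alone determine the data flow.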


Multiple Inputs

[Diagram: nodes 23 and 24, each grouped on "id", feed cogroup node 25]

{ @id: 25, op: "cogroup",
  groupings: [
    {ref: 23, expr: "id"}, {ref: 24, expr: "id"}
  ]
}
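
For readers unfamiliar with cogroup, the Python sketch below shows the usual meaning of the operation: rows from both inputs are collected under their shared key, without being joined. The semantics are assumed to follow the Pig-style cogroup, which the slide does not confirm, and the function name and sample inputs are hypothetical.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group two inputs by key; each key maps to (left rows, right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return dict(groups)


# Hypothetical stand-ins for the outputs of nodes 23 and 24.
users = [{"id": 1, "name": "Srivas"}, {"id": 2, "name": "Raina"}]
clicks = [
    {"id": 1, "url": "http://A"},
    {"id": 1, "url": "http://B"},
    {"id": 3, "url": "http://C"},
]
grouped = cogroup(users, clicks, "id", "id")
```

Keys present on only one side keep an empty list for the other, so inner and outer joins can both be derived from the cogroup output.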


Scan Operators

• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported

• Fields and predicates can be pushed down into the scan operator

• Scan operators may have adaptive side-effects (database cracking)
  – Produce ColumnIO from RecordIO
  – Google PowerDrill stores materialized expressions with the data

                         Scan with schema                          Scan without schema
Operator output          Protocol Buffers                          JSON-like (MessagePack)
Supported data formats   ColumnIO (column-based protobuf/Dremel)   JSON
                         RecordIO (row-based protobuf)             HBase
                         CSV
SELECT … FROM …          ColumnIO(proto URI, data URI)             Json(data URI)
                         RecordIO(proto URI, data URI)             HBase(table name)
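
Pushing fields and predicates into the scan means the scan operator discards unwanted rows and columns itself instead of handing everything to later operators. The Python sketch below illustrates the idea for a schema-less JSON scan. The function `scan_json` and its parameters are hypothetical, not a Drill interface.

```python
import json


def scan_json(lines, fields=None, predicate=None):
    """Yield records from JSON lines, applying a pushed-down predicate and field list."""
    for line in lines:
        record = json.loads(line)
        if predicate is not None and not predicate(record):
            continue  # row rejected inside the scan
        if fields is not None:
            # Keep only requested fields; absent fields are skipped (schema-less data).
            record = {f: record[f] for f in fields if f in record}
        yield record


lines = [
    '{"name": "Srivas", "gender": "Male", "followers": 100}',
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}',
]
rows = list(scan_json(lines, fields=["name", "zip"],
                      predicate=lambda r: r["followers"] > 100))
```

A column-based format can go further than this sketch, since it can avoid reading the unwanted columns from disk at all.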



Hadoop Integration

 Hadoop data sources
– Hadoop FileSystem API (HDFS/MapR-FS)
– HBase
 Hadoop data formats
– Apache Avro
– RCFile
 MapReduce-based tools to create column-based formats
 Table registry in HCatalog
 Run long-running services in YARN


Get Involved!

 Download (almost) these slides
– http://www.mapr.com/company/events/bay-area-hug/9-19-2012

 Join the project
– drill-dev-subscribe@incubator.apache.org
– #apachedrill

 Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– ted.dunning@maprtech.com
– @ted_dunning

 Join MapR
– jobs@mapr.com

