2. Who am I?
http://www.mapr.com/company/events/hadoop-dc-11-29-12
• Keys Botzum
• kbotzum@maprtech.com
• Senior Principal Technologist, MapR Technologies
• MapR Federal and Eastern Region
3. MapR Technologies
• The open enterprise-grade distribution for Hadoop
– Easy, dependable and fast
– Open source with standards-based extensions
• MapR is recognized as a technology leader
– Both Amazon and Google selected MapR as their Hadoop partner
5. Latency Matters
• Ad-hoc analysis with interactive tools
• Real-time dashboards
• Event/trend detection and analysis
– Network intrusion analysis on the fly
– Fraud
– Failure detection and analysis
6. Big Data Processing
                    | Batch processing | Interactive analysis    | Stream processing
Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
Programming model   | MapReduce        | Queries                 | DAG
Users               | Developers       | Analysts and developers | Developers
Google project      | MapReduce        | Dremel                  | -
Open source project | Hadoop MapReduce | -                       | Storm and S4
Introducing Apache Drill…
7. Google Dremel
• Interactive analysis of large-scale datasets
– Trillion records at interactive speeds
– Complementary to MapReduce
– Used by thousands of Google employees
– Paper published at VLDB 2010
• Model
– Nested data model with schema
• Most data at Google is stored/transferred in Protocol Buffers
• Normalization (to relational) is prohibitive
– SQL-like query language with nested data support
• Implementation
– Column-based storage and processing
– In-situ data access (GFS and Bigtable)
– Tree architecture as in Web search (and databases)
8. Innovations
• MapReduce
– Highly parallel algorithms running on commodity systems can deliver real value at reasonable cost
– Scalable I/O and compute trump efficiency on today's commodity hardware
– With many datasets, schemas and indexes are limiting
– Flexibility is more important than efficiency
– An easy, scalable, fault tolerant execution framework is key for large clusters
• Dremel
– Columnar storage provides significant performance benefits at scale
– Columnar storage with nesting preserves structure and can be very efficient
– Avoiding final record assembly as long as possible improves efficiency
– Optimizing for the query use case can avoid the full generality of MR and thus significantly reduce latency. E.g., no need to start JVMs, just push compact queries to running agents.
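The columnar-storage point can be illustrated with a minimal sketch. It assumes flat records and a hypothetical helper named to_columns; it is not Dremel's actual encoding, which also handles nested and repeated fields.

```python
from collections import defaultdict

# Illustrative sample rows; the field names are assumptions for this sketch.
rows = [
    {"name": "Tomer", "gender": "Male", "followers": 100},
    {"name": "Maya", "gender": "Female", "followers": 200},
]


def to_columns(records):
    """Pivot a list of flat records into one list of values per field."""
    columns = defaultdict(list)
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return dict(columns)


columns = to_columns(rows)

# A query that needs one field reads only that column;
# full records are never reassembled.
total_followers = sum(columns["followers"])
```

Because the aggregate touches only the followers column, the other columns would not need to be read from disk in a real column store.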
9. Apache Drill
• Borrows heavily from Dremel, PowerDrill, and others
– Open source Apache project
– Highly extensible and pluggable
10. Nested Data Model
• The data model in Dremel is Protocol Buffers
– Nested
– Schema
• Apache Drill is designed to support multiple data models
– Schema: Protocol Buffers, Apache Avro, …
– Schema-less: JSON, BSON, …
• Flat records are supported as a special case of nested data
– CSV, TSV, …
Avro IDL:
  enum Gender {
    MALE, FEMALE
  }
  record User {
    string name;
    Gender gender;
    long followers;
  }
JSON:
  {
    "name": "Tomer",
    "gender": "Male",
    "followers": 100
  }
  {
    "name": "Maya",
    "gender": "Female",
    "followers": 200,
    "zip": "94305"
  }
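As a rough illustration of the schema-less case, the sketch below reads the two JSON records from this slide with Python's standard library and treats a CSV row as a flat record. The variable names and the one-line CSV sample are assumptions made for the example, not part of Drill.

```python
import csv
import io
import json

json_lines = """{"name": "Tomer", "gender": "Male", "followers": 100}
{"name": "Maya", "gender": "Female", "followers": 200, "zip": "94305"}"""

records = [json.loads(line) for line in json_lines.splitlines()]

# Without a schema, the set of fields is discovered from the data itself.
fields_seen = sorted({field for record in records for field in record})

# A field missing from a record is simply absent, so readers need a default.
zips = [record.get("zip") for record in records]

# A flat CSV row fits the same model: a record with no nested or repeated fields.
# Note that CSV values arrive as strings.
csv_text = "name,gender,followers\nTomer,Male,100\n"
flat_records = list(csv.DictReader(io.StringIO(csv_text)))
```

The second record carries a zip field that the first lacks, which is exactly what a fixed schema such as the Avro record would have to declare up front.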
11. Nested Query Languages
• DrQL
– SQL-like query language for nested data
– Compatible with Google BigQuery/Dremel
• BigQuery applications should work with Drill
– Designed to support efficient column-based processing
• No record assembly during query processing
• Other languages/programming models can plug in
– Mongo Query Language
• {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
– Hive
– Pig
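To make the Mongo Query Language bullet concrete, here is a small sketch that evaluates that style of query over in-memory documents. It supports only equality predicates and $orderby, and run_mongo_style and the sample documents are hypothetical names invented for this example; it says nothing about how Drill would implement the plug-in.

```python
def run_mongo_style(query, documents):
    """Apply equality predicates from $query, then sort by the $orderby keys."""
    predicates = query.get("$query", {})
    matched = [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in predicates.items())
    ]
    # Apply sort keys in reverse order so the first key is the most significant.
    # Assumes every matched document contains each sort field.
    for field, direction in reversed(list(query.get("$orderby", {}).items())):
        matched.sort(key=lambda doc: doc[field], reverse=(direction == -1))
    return matched


documents = [
    {"id": 1, "x": 3, "y": "abc"},
    {"id": 2, "x": 3, "y": "zzz"},
    {"id": 3, "x": 4, "y": "abc"},
    {"id": 4, "x": 3, "y": "abc"},
]
query = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_mongo_style(query, documents)
```

The same filter-then-sort steps could be expressed in any front-end language, which is the point of keeping the query language pluggable.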
12. DrQL Example
Input record:
  DocId: 10
  Links
    Forward: 20
    Forward: 40
    Forward: 60
  Name
    Language
      Code: 'en-us'
      Country: 'us'
    Language
      Code: 'en'
    Url: 'http://A'
  Name
    Url: 'http://B'
  Name
    Language
      Code: 'en-gb'
      Country: 'gb'
Query:
  SELECT DocId AS Id,
    COUNT(Name.Language.Code) WITHIN Name AS Cnt,
    Name.Url + ',' + Name.Language.Code AS Str
  FROM t
  WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
Output:
  Id: 10
  Name
    Cnt: 2
    Language
      Str: 'http://A,en-us'
      Str: 'http://A,en'
  Name
    Cnt: 0
* Example from the Dremel paper
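The semantics of this query can be traced with a plain Python sketch over the same record. This is an illustration only: run_query is a hypothetical helper, the record is modeled as nested dicts and lists, and no columnar processing is involved.

```python
import re

record = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}


def run_query(doc):
    """Evaluate the example query against one nested record."""
    if not doc["DocId"] < 20:
        return None
    names = []
    for name in doc.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): a Name without a matching Url is dropped.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        names.append({
            "Cnt": len(codes),  # COUNT(Name.Language.Code) WITHIN Name
            "Str": [url + "," + code for code in codes],
        })
    return {"Id": doc["DocId"], "Name": names}


result = run_query(record)
```

The third Name has no Url, so the WHERE clause removes it; the second Name has a Url but no Language, so it survives with a count of 0.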
14. Extensibility
• Nested query languages
– Pluggable model
– DrQL
– Mongo Query Language
– Cascading
• Distributed execution engine
– Extensible model (e.g., Dryad)
– Low-latency
– Fault tolerant
• Nested data formats
– Pluggable model
– Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
– Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
• Scalable data sources
– Pluggable model
– Hadoop (HDFS, HBase)
– Perhaps MongoDB, Cassandra, etc.
15. Architecture
• Only the execution engine knows the physical attributes of the cluster
– # nodes, hardware, file locations, …
• Public interfaces enable extensibility
– Developers can build parsers for new query languages
– Developers can provide an execution plan directly
• Each level of the plan has a human readable representation
– Facilitates debugging and unit testing
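One way to picture a plan level with a human-readable representation is a small tree whose text form can be compared in a unit test. The PlanNode class and its explain method below are hypothetical names for this sketch and do not reflect Drill's actual plan format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """One operator in a plan tree; children are the operator's inputs."""
    op: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)

    def explain(self, depth=0):
        """Render the plan as indented text, one operator per line."""
        line = "  " * depth + self.op + (f"({self.detail})" if self.detail else "")
        return "\n".join([line] + [child.explain(depth + 1) for child in self.children])


plan = PlanNode("Aggregate", "COUNT(*)", [
    PlanNode("Filter", "DocId < 20", [
        PlanNode("Scan", "t"),
    ]),
])
text = plan.explain()
```

Because the rendering is deterministic text, a test can assert on it directly, which is the debugging and unit-testing benefit the slide mentions.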
17. Query Components
• Query components:
– SELECT
– FROM
– WHERE
– GROUP BY
– HAVING
– (JOIN)
• Key logical operators:
– Scan
– Filter
– Aggregate
– (Join)
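The three key logical operators can be sketched as composable Python functions. The names scan, filter_op, and aggregate, and the sample users table (including the record for "Ana"), are assumptions made for illustration.

```python
from collections import Counter


def scan(table):
    """Scan: yield records one at a time from a source (FROM)."""
    yield from table


def filter_op(records, predicate):
    """Filter: keep only records that satisfy the predicate (WHERE)."""
    return (record for record in records if predicate(record))


def aggregate(records, key, value):
    """Aggregate: sum `value` per distinct `key` (GROUP BY with SUM)."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


users = [
    {"name": "Tomer", "gender": "Male", "followers": 100},
    {"name": "Maya", "gender": "Female", "followers": 200},
    {"name": "Ana", "gender": "Female", "followers": 50},
]

# SELECT gender, SUM(followers) FROM users WHERE followers >= 100 GROUP BY gender
totals = aggregate(
    filter_op(scan(users), lambda r: r["followers"] >= 100),
    "gender",
    "followers",
)
```

Each operator consumes the output of the one below it, so records stream through the pipeline without an intermediate table being built.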
18. Scan Operators
• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported
• Fields and predicates can be pushed down into the scan operator
• Scan operators may have adaptive side-effects (database cracking)
• Produce ColumnIO from RecordIO
• Google PowerDrill stores materialized expressions with the data
                       | Scan with schema                              | Scan without schema
Operator output        | Protocol Buffers                              | JSON-like (MessagePack)
Supported data formats | ColumnIO (column-based protobuf/Dremel)       | JSON
                       | RecordIO (row-based protobuf)                 | HBase
                       | CSV                                           |
SELECT … FROM …        | ColumnIO(proto URI, data URI)                 | Json(data URI)
                       | RecordIO(proto URI, data URI)                 | HBase(table name)
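Pushing fields and predicates down into a scan operator can be sketched as follows. The csv_scan function and its parameters are hypothetical; the sketch only shows the idea that rejected rows and unneeded fields never leave the scan.

```python
import csv
import io


def csv_scan(text, fields=None, predicate=None):
    """Scan CSV text, applying the pushed-down predicate and projection per row."""
    for row in csv.DictReader(io.StringIO(text)):
        if predicate is not None and not predicate(row):
            continue  # rejected inside the scan; later operators never see it
        if fields is None:
            yield dict(row)
        else:
            yield {name: row[name] for name in fields}


data = "name,gender,followers\nTomer,Male,100\nMaya,Female,200\n"

# Equivalent of: SELECT name FROM data WHERE followers > 150
rows = list(csv_scan(
    data,
    fields=["name"],
    predicate=lambda r: int(r["followers"]) > 150,
))
```

A format-specific scan could go further, for example by skipping whole columns or blocks, but the interface to the rest of the plan stays the same.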
19. Execution Engine Layers
• Drill execution engine has two layers
– Operator layer is serialization-aware
• Processes individual records
– Execution layer is not serialization-aware
• Processes batches of records (blobs)
• Responsible for communication, dependencies and fault tolerance
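The split between the two layers can be sketched like this: the operator decodes records, while the execution layer only moves opaque blobs. JSON stands in for the real serialization format, and every function name here is an assumption made for the example.

```python
import json


def encode_batch(records):
    """Serialize a list of records into one opaque blob."""
    return json.dumps(records).encode("utf-8")


def decode_batch(blob):
    """Deserialize a blob back into a list of records."""
    return json.loads(blob.decode("utf-8"))


def filter_operator(blob, predicate):
    """Operator layer: serialization-aware, works record by record."""
    return encode_batch([r for r in decode_batch(blob) if predicate(r)])


def execute(batches, operator):
    """Execution layer: passes blobs to an operator without decoding them."""
    return [operator(blob) for blob in batches]


batches = [encode_batch([{"x": 1}, {"x": 5}]), encode_batch([{"x": 7}])]
output = execute(batches, lambda blob: filter_operator(blob, lambda r: r["x"] > 3))
decoded = [decode_batch(blob) for blob in output]
```

Since execute never inspects a blob, the same scheduling, communication, and recovery logic would work regardless of the data format the operators use.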
20. Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
• Secure (authentication, authorization, and auditing)
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
21. Hadoop Integration
• Hadoop data sources
– Hadoop FileSystem API (HDFS/MapR-FS)
– HBase
• Hadoop data formats
– Apache Avro
– RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN
23. Get Involved!
• Download these slides
– http://www.mapr.com/company/events/hadoop-dc-11-29-12
• Apache Drill Project Information
– http://www.mapr.com/drill
– http://incubator.apache.org/drill
– Join the mailing list and help: drill-dev-subscribe@incubator.apache.org
• Join MapR
– jobs@mapr.com
Editor's notes
Drill: Remove schema requirement. In-situ for real since we'll support multiple formats. Note: MR needed for big joins, so to speak.
Drill: Will support nested. No schema required.
Protocol Buffers are the conceptual data model. Will support multiple data models. Will have to define a way to explain data format (filtering, fields, etc.). Schema-less will have a perf penalty. HBase will be one format.
Likely to support these. Could add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon query. Pig as well. Pluggability: data format, query language. Something in 6-9 months, alpha quality. Community driven, I can't speak for the project. MapR-FS gives better chunk size control. NFS support may make small test drivers easier. Unified namespace will allow multi-cluster access. Might even have a Drill component that autoformats data. Read-only model.
Example query that Drill should support. Need to talk more here about what Dremel does.
Load data into Drill (optional). Could just use as is in "row" format. Multiple query languages. Pluggability very important.
Note: we have an already partially built execution engine.
Initially we'll support join in the simple cases like Dremel, but our end goal is full join support.
Be prepared for Apache questions. Committer vs. committee vs. contributor. If can't answer a question, ask them to answer and contribute. References to paper and such at end.