Introduction to Apache Drill - interactive query and analysis at scale
  • Hive: compile to MR; Aster: external tables in MPP; Oracle/MySQL: export MR results to RDBMS; Drill, Impala, CitusDB: real-time
  • Suppose a marketing analyst is experimenting with ways to target user segments for the next campaign. The analyst needs access to web logs stored in Hadoop, user profiles stored in MongoDB, and transaction data stored in a conventional database.
  • Re ad-hoc: You might not know ahead of time what queries you will want to make. You may need to react to changing circumstances.
  • Two innovations: handling nested data in columnar style (column-striped representation) and query push-down
  • The source query is parsed and transformed to produce the logical plan. Typically, the logical plan lives in memory in the form of Java objects, but it also has a textual form. The logical plan is then transformed and optimized into the physical plan. The physical plan represents the actual structure of the computation as it is carried out by the system. One of the most important things the optimizer does is introduce parallel computation (another: using columnar data to improve processing speed).
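
The step from logical to physical plan described in the last note can be illustrated with a minimal Python sketch. It is not Drill's actual planner or plan format: the plan layout, the field names (`field`, `greater_than`, `partition`, `steps`) and the helper functions are all hypothetical, chosen only to show an "optimizer" introducing parallelism by copying the logical steps once per data partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical logical plan: states what to compute, not how.
LOGICAL_PLAN = [
    {"op": "scan", "source": "logs"},
    {"op": "filter", "field": "x", "greater_than": 3},
]


def to_physical_plan(logical_plan, partitions):
    """Toy 'optimizer': emit one fragment of the logical steps per partition."""
    return [
        {"partition": p, "steps": list(logical_plan)}
        for p in range(partitions)
    ]


def run_fragment(fragment, data):
    """Execute one fragment against its own partition of the data."""
    rows = []
    for step in fragment["steps"]:
        if step["op"] == "scan":
            rows = data[step["source"]][fragment["partition"]]
        elif step["op"] == "filter":
            rows = [r for r in rows if r[step["field"]] > step["greater_than"]]
    return rows


def execute(physical_plan, data):
    """Run all fragments concurrently and concatenate their results in order."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda f: run_fragment(f, data), physical_plan))
    return [row for part in parts for row in part]


# Made-up data: the "logs" source is already split into two partitions.
DATA = {"logs": [[{"x": 1}, {"x": 5}], [{"x": 4}, {"x": 2}]]}
PHYSICAL_PLAN = to_physical_plan(LOGICAL_PLAN, partitions=2)
RESULT = execute(PHYSICAL_PLAN, DATA)
```

The logical plan stays the same regardless of how many partitions exist; only the physical plan changes, which is the separation the note describes.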

Presentation Transcript

  • Introduction to Apache Drill - interactive query and analysis at scale
    Michael Hausenblas, MapR EMEA
    2013-02-22, HUG Munich
  • About Michael
    • Background in large-scale data integration
    • Chief Data Engineer EMEA, MapR
    • Apache Drill contributor
  • Workloads
    • Batch processing (MapReduce)
    • Light-weight OLTP (HBase, Cassandra)
    • Stream processing (Storm, S4)
    • Search (Solr, Elasticsearch)
    • Interactive analysis
  • Interactive Query at scale Impala low-latency
  • Use Case
    • Jane, a marketing analyst
    • Determine target segments
    • Data from different sources
  • Today’s Solutions
    • RDBMS-focused
        – ETL data from MongoDB/Hadoop
        – Query with SQL
    • MapReduce-focused
        – ETL from RDBMS/MongoDB
        – Use Hive
  • Requirements
    • Support for different data sources
    • Support for different query interfaces
    • Low-latency/real-time
    • Ad-hoc queries
    • Scalable and fast
    • Reliable
  • Google’s Dremel
    http://research.google.com/pubs/pub36632.html
  • Apache Drill Overview
    • Inspired by Google Dremel
    • Standard SQL2003 support
    • … other QLs (DSL, etc.) possible
    • Pluggable data sources
    • Support for nested data (JSON, etc.)
    • Schema is optional
    • Community driven, open, hundreds involved
  • Apache Drill Overview
  • High-level Architecture
  • How does it work?
    • Drillbits per node, maximize data locality
    • Co-ordination, query planning, optimization, scheduling, execution are distributed
    • Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
        – Source query: SQL 2003, DrQL, MongoQL, DSL
        – Logical plan: query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" }, …
        – Physical plan: topology
        – Execution: Scanner API
  • How does it work?
  • Key Features
    • Full SQL
    • Nested data
    • Optional schema
    • Extensibility points
  • Full SQL – ANSI SQL2003
    • SQL-like is often not enough
    • Integration with existing tools
        – Tableau, Excel, SAP Crystal Reports
        – Use standard ODBC/JDBC driver
  • Nested Data
    • Nested data is becoming prevalent
        – JSON/BSON, XML, ProtoBuf, Avro
        – Some data sources support it natively (MongoDB, etc.)
        – Innovation through Dremel
    • Flattening nested data is error-prone
    • Apache Drill supports nested data as an extension to ANSI SQL2003
  • Optional Schema
    • Many data sources don’t have rigid schemas
        – Schema changes rapidly
        – Different schema per record (e.g. HBase)
    • Apache Drill supports queries against unknown schemas
    • The schema can be defined by the user or obtained via discovery
  • Extensibility Points
    • Query language (parser) - UDFs
    • Data sources/formats (scanner)
    • Optimizer
    • Custom operators (logical plan)
    • Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
  • Demo
    • data source: donuts.json
      { "id": "0001", "type": "donut", "name": "Cake", "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …
    • logical plan: simple_plan.json
      query: [ { op: "sequence", do: [ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …
    • result: out.json
      { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
      { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
      { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
    • https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
  • Status
    • Heavy development by multiple orgs
    • Logical plan, reference interpreter available
    • SQL interpreter, storage engine implementations (Accumulo, Cassandra, HBase, etc.) are WIP
    • Schedule:
        – Prototype Q1
        – Alpha Q2
  • Why do we do it?
  • Engage!
    • Follow @ApacheDrill on Twitter
    • Sign up for the mailing lists (user|dev): http://incubator.apache.org/drill/mailing-lists.html
    • Keep an eye on http://drill-user.org/
    • Ping me: mhausenblas@maprtech.com
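
To make the scan-then-filter plan from the demo slide more concrete, here is a minimal Python sketch of how such a "sequence" plan could be interpreted. It is not Drill's reference interpreter: the functions `compile_expr` and `run_plan`, the restricted expression grammar, and the sample records are assumptions made for illustration. Only the operator names (`sequence`, `scan`, `filter`) and the expression `donuts.ppu < 2.00` come from the slides.

```python
import operator
import re

# Comparison operators supported by this toy expression parser (an assumption).
OPS = {"<": operator.lt, ">": operator.gt}


def compile_expr(expr):
    """Turn an expression like 'donuts.ppu < 2.00' into a record predicate."""
    match = re.fullmatch(r"\s*(\w+)\.(\w+)\s*([<>])\s*([\d.]+)\s*", expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    _ref, field, op, value = match.groups()
    compare = OPS[op]
    threshold = float(value)
    return lambda record: compare(record[field], threshold)


def run_plan(plan, sources):
    """Apply the steps of a 'sequence' plan one after another."""
    records = []
    for step in plan["do"]:
        if step["op"] == "scan":
            records = list(sources[step["source"]])
        elif step["op"] == "filter":
            keep = compile_expr(step["expr"])
            records = [r for r in records if keep(r)]
        else:
            raise ValueError(f"unknown operator: {step['op']!r}")
    return records


PLAN = {
    "op": "sequence",
    "do": [
        {"op": "scan", "ref": "donuts", "source": "local-logs"},
        {"op": "filter", "expr": "donuts.ppu < 2.00"},
    ],
}

# Hypothetical records standing in for the contents of donuts.json.
SOURCES = {
    "local-logs": [
        {"name": "Cake", "ppu": 0.55},
        {"name": "Deluxe", "ppu": 2.50},
    ]
}

OUTPUT = run_plan(PLAN, SOURCES)
```

With these made-up records, only the entry whose `ppu` is below 2.00 survives the filter. A real storage engine would replace the in-memory `SOURCES` lookup in the scan step.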