Introduction to Apache Drill - interactive query and analysis at scale


  1. Introduction to Apache Drill - interactive query and analysis at scale
     Michael Hausenblas, MapR EMEA
     2013-02-22, HUG Munich
  2. About Michael
     • Background in large-scale data integration
     • Chief Data Engineer EMEA, MapR
     • Apache Drill contributor
  3. Workloads
     • Batch processing (MapReduce)
     • Light-weight OLTP (HBase, Cassandra)
     • Stream processing (Storm, S4)
     • Search (Solr, Elasticsearch)
     • Interactive analysis
  4. Interactive Query at scale (diagram labels: Impala, low-latency)
  5. Use Case
     • Jane, a marketing analyst
     • Determine target segments
     • Data from different sources
  6. Today’s Solutions
     • RDBMS-focused
       – ETL data from MongoDB/Hadoop
       – Query with SQL
     • MapReduce-focused
       – ETL from RDBMS/MongoDB
       – Use Hive
  7. Requirements
     • Support for different data sources
     • Support for different query interfaces
     • Low-latency/real-time
     • Ad-hoc queries
     • Scalable and fast
     • Reliable
  8. Google’s Dremel
     http://research.google.com/pubs/pub36632.html
  9. Apache Drill Overview
     • Inspired by Google Dremel
     • Standard SQL2003 support
     • … other query languages (DSL, etc.) possible
     • Pluggable data sources
     • Support for nested data (JSON, etc.)
     • Schema is optional
     • Community driven, open, hundreds involved
  10. Apache Drill Overview
  11. High-level Architecture
  12. How does it work?
      • Drillbits per node, maximize data locality
      • Co-ordination, query planning, optimization, scheduling, execution are distributed
      Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
      • Source query languages: SQL 2003, DrQL, MongoQL, DSL
      • Logical plan fragment: query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" }, …
      • Other diagram labels: topology, Scanner API
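
The logical plan fragment on this slide (a "sequence" of a "scan" followed by a "filter") can be illustrated with a toy interpreter. This is only a sketch under stated assumptions, not Drill's reference interpreter: the in-memory `SOURCES` table, the function names, and the restriction of conditions to the form `<field> <op> <number>` are all hypothetical choices made for the example.

```python
import operator

# Hypothetical in-memory "storage": source name -> list of records.
# Real Drill would read these through a scanner; this is an assumption.
SOURCES = {"logs": [{"x": 1}, {"x": 4}, {"x": 7}]}

COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def parse_condition(text):
    """Parse a condition of the simplified form '<field> <op> <number>'."""
    field, op, value = text.split()
    return field, COMPARATORS[op], float(value)


def run_sequence(plan, sources):
    """Apply the operators listed in plan['do'] one after another."""
    records = []
    for step in plan["do"]:
        if step["op"] == "scan":
            # A scan replaces the current record set with the source's records.
            records = list(sources[step["source"]])
        elif step["op"] == "filter":
            field, compare, value = parse_condition(step["condition"])
            records = [r for r in records if compare(r[field], value)]
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return records


# The plan from the slide, written as a Python dict.
plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

result = run_sequence(plan, SOURCES)
print(result)
```

The point of the sketch is that the plan is plain data: any front end (SQL, a DSL) that can emit this structure could be executed by the same back end.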
  13. How does it work?
  14. Key Features
      • Full SQL
      • Nested data
      • Optional schema
      • Extensibility points
  15. Full SQL – ANSI SQL2003
      • SQL-like is often not enough
      • Integration with existing tools
        – Tableau, Excel, SAP Crystal Reports
        – Use standard ODBC/JDBC driver
  16. Nested Data
      • Nested data becoming prevalent
        – JSON/BSON, XML, ProtoBuf, Avro
        – Some data sources support it natively (MongoDB, etc.)
        – Innovation through Dremel
      • Flattening nested data is error-prone
      • Apache Drill supports nested data, extension to ANSI SQL2003
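
To see why flattening is error-prone, consider the donut record from the demo slide. The following sketch is not Drill code; `get_path` and `flatten` are hypothetical helpers. It flattens the nested `batters.batter` list into rows and compares that with reading the nested values directly.

```python
def get_path(record, path):
    """Follow a dotted path such as 'batters.batter' into nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def flatten(record, list_path):
    """Emit one flat row per element of the list found at list_path.

    Scalar top-level fields are copied into every row, which is exactly
    what makes later counts and sums over the parent easy to get wrong.
    """
    parent = {k: v for k, v in record.items() if not isinstance(v, (dict, list))}
    prefix = list_path.split(".")[-1]
    rows = []
    for child in get_path(record, list_path):
        row = dict(parent)
        for key, val in child.items():
            # Prefix child keys so they do not overwrite parent keys
            # (both levels have "id" and "type").
            row[f"{prefix}_{key}"] = val
        rows.append(row)
    return rows


donut = {
    "id": "0001",
    "type": "donut",
    "name": "Cake",
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}

rows = flatten(donut, "batters.batter")
nested_types = [b["type"] for b in get_path(donut, "batters.batter")]

print(len(rows))
print(nested_types)
```

After flattening, the single donut appears once per batter, so a naive row count no longer counts donuts. Without the key prefix, the batter's `id` and `type` would also silently overwrite the donut's. Querying the nested structure directly avoids both problems.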
  17. Optional Schema
      • Many data sources don’t have rigid schemas
        – Schema changes rapidly
        – Different schema per record (e.g. HBase)
      • Apache Drill supports queries against unknown schema
      • The user can define the schema, or it can be obtained via discovery
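
Schema discovery over records that do not share a fixed schema might look roughly like the sketch below. It is an illustration only, not how Drill performs discovery: `discover_schema` and the two sample records are hypothetical, and the "schema" is simply the set of Python type names observed per field.

```python
def discover_schema(records):
    """Map each field name to the sorted list of type names seen for it."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical records with a different schema per record.
records = [
    {"id": "0001", "ppu": 0.55},
    {"id": "0002", "ppu": 1, "topping": "Sugar"},
]

schema = discover_schema(records)
print(schema)
```

A field that appears in only some records, or with more than one type, shows up in the discovered schema instead of causing an error up front.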
  18. Extensibility Points
      • Query language (parser) - UDFs
      • Data sources/formats (scanner)
      • Optimizer
      • Custom operators (logical plan)
      Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
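
One common way to offer an extension point such as custom operators is a registry that maps operator names to functions. The sketch below shows that general pattern only. It makes no claim about Drill's actual plug-in API: `OPERATORS`, `register`, `execute`, and the `limit` operator are all hypothetical.

```python
# Registry mapping operator names to functions (hypothetical, not Drill's API).
OPERATORS = {}


def register(name):
    """Decorator that adds a function to the operator registry under name."""
    def wrap(func):
        OPERATORS[name] = func
        return func
    return wrap


@register("scan")
def scan(records, step):
    # Ignore the incoming records and start from the step's inline data.
    return list(step["data"])


@register("limit")
def limit(records, step):
    # A "custom" operator added without touching the executor.
    return records[: step["n"]]


def execute(steps):
    """Run each step by looking up its operator in the registry."""
    records = []
    for step in steps:
        records = OPERATORS[step["op"]](records, step)
    return records


out = execute([{"op": "scan", "data": [1, 2, 3]}, {"op": "limit", "n": 2}])
print(out)
```

The executor never names individual operators, so adding one is a matter of registering a new function.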
  19. Demo
      Input record (data source: donuts.json):
      { "id": "0001", "type": "donut", "name": "Cake", "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …
      Logical plan (simple_plan.json):
      query: [ { op: "sequence", do: [ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …
      Result (out.json):
      { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
      { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
      { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
      https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
  20. Status
      • Heavy development by multiple orgs
      • Logical plan, reference interpreter available
      • SQL interpreter, storage engine implementations (Accumulo, Cassandra, HBase, etc.) are WIP
      • Schedule:
        – Prototype Q1
        – Alpha Q2
  21. Why do we do it?
  22. Engage!
      • Follow @ApacheDrill on Twitter
      • Sign up at mailing lists (user|dev) http://incubator.apache.org/drill/mailing-lists.html
      • Keep an eye on http://drill-user.org/
      • Ping me: mhausenblas@maprtech.com
