1. Using realtime SQL2003 to query JSON on Hadoop with Apache Drill
January 28, 2013
Jacques Nadeau
Apache Drill Contributor @ MapR Technologies
2. Me
• Apache Drill and HBase Contributor
• Sponsored by MapR Technologies to lead Apache Drill contributions
– Enterprise-grade, high-performance distribution for Hadoop
– Open source plus standards-based extensions
– Large number of Fortune 100 customers, and startups too
– Free distribution for unlimited nodes
– Partnered to provide the distribution on Google Compute Engine and Amazon Elastic MapReduce
3. Jane works as an Analyst at an ecommerce website
• How does she figure out good targeting segments for the next marketing campaign?
• She has some ideas and lots of data: transaction information, user profiles, and access logs
4. Let’s try using existing options
• Use Oracle
– Write a flattening MongoDB query for export and generate a giant CSV. Work with the MapReduce team to build a MapReduce job that provides the export. Contact the DBA to import the data exports. Use Oracle SQL to determine answers.
• Use Hive
– Pull up Hive. Start writing queries. Realize that the Hive/Mongo interconnector doesn’t support nested data. Realize that Hive doesn’t have a JDBC/ODBC storage handler. Query data from Oracle and copy it to Hadoop. Query flattened Mongo data and copy it into Hadoop. Write a HiveQL query (see the flattening sketch after this list). Wait 30 minutes for the result. Repeat until the desired outcome. Try to avoid frustration along the way with the flattened Mongo data, the partial Oracle extraction, and the lack of major portions of SQL syntax.
• Use a Data Virtualization Solution
– Write a SQL query against the virtualization interface. Realize that you still need to ETL the Mongo data since it isn’t natively supported. The query runs slowly since the virtualization solution doesn’t run locally against Hadoop data and fails to effectively distribute your query.
• Use MapReduce
– Work with Engineering to define a specification for your needs. Use Sqoop to set up regular ETL from Oracle. Define a custom MapReduce job to import the Mongo data.
– Look at the output, realize that different analyses should be done, and repeat the cycle (or learn Java).
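
To make the Hive flattening step concrete, here is a minimal HiveQL sketch, assuming the nested Mongo documents have already been landed as a Hive table; the users table, its orders array, and the sku/amount fields are hypothetical names chosen for illustration:

-- Hypothetical table landed from Mongo:
-- CREATE TABLE users (name STRING, orders ARRAY<STRUCT<sku:STRING, amount:DOUBLE>>);
-- LATERAL VIEW explode() turns each user's orders array into one row per order,
-- the flattening required before ordinary SQL aggregation can apply.
SELECT u.name,
       o.item.sku,
       SUM(o.item.amount) AS total_spent
FROM users u
LATERAL VIEW explode(u.orders) o AS item
GROUP BY u.name, o.item.sku;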
5. Why are things so hard?
• Slow
– Virtualization solutions don’t support data locality and pushdown
– MapReduce sacrifices performance to support long running jobs, recoverability, and
ultimate flexibility
• Old
– Most systems assume flat data with well-defined static schemas
• Hard
– Write queries in multiple languages (does anybody know MongoQL, CQL, HiveQL and SQL?)
– Analysts often need custom development help
• Error Prone
– ETL leads to data synchronization issues
– Lack of query transparency leads to incorrect assumptions and bad business conclusions
• Expensive
– Commercial solutions are very expensive
– Typically provide poor compatibility with newer NoSQL technologies
6. Open Source Mantra: WWGD?
            Distributed    Datastore   Interactive   Batch
            file system                analysis      processing
Google      GFS            BigTable    Dremel        MapReduce
Hadoop      HDFS           HBase       ?             MapReduce
Build Apache Drill to provide a true open source
solution to interactive analysis of Big Data
7. Apache Drill Overview
• Drill overview
– Low latency interactive queries
– Standard ANSI SQL2003 support
– Domain Specific Languages / Your own QL
– Inspired by, compatible with Google BigQuery/Dremel
– Supports Nested/Hierarchical Data Formats
– Supports RDBMS, Hadoop and NoSQL alike
• Open-Source and Flexible
– Apache Incubator
– Hundreds involved across the US and Europe
– Community consensus on API, functionality
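
Since Drill queries nested data without a predefined schema, here is a minimal sketch of the kind of query it targets; the dfs workspace, file path, and field names are illustrative assumptions, and the backtick path/field syntax follows Drill's eventual SQL dialect rather than anything final at the time of this talk:

-- Query raw JSON in place: no ETL, no up-front schema definition.
-- dfs.`/data/events.json` and the nested `user` record are hypothetical.
SELECT t.`user`.`name`    AS user_name,
       t.`user`.`segment` AS segment
FROM dfs.`/data/events.json` t
WHERE t.`user`.`age` > 30
LIMIT 10;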
8. Why do we need another tool?
• Point queries (0 – 100 ms): per-system interfaces
• Interactive queries (100 ms – 3 minutes): Apache Drill
• Data analyst & reporting queries (3 minutes – 20 minutes): Apache Drill
• Data mining and major ETL (20 minutes – 20 hours): MapReduce, Hive and Pig
9. Why not improve Hive or Pig?
• Different Goals
• SQL should be first class concern
• MapReduce severely hampers processing model and performance
– Startup cost is high
– Map:Reduce recoverability and barrier disadvantages
– Job:Job recoverability and barrier disadvantages (chained jobs)
• Need to build around an in-memory representation
– Two canonical in-memory formats (row-based and columnar)
– Support much larger memory sizes
– Smaller memory footprint per record
– Avoid serialization/deserialization and object creation costs between nodes and operations
• Performance of interactive queries is critical
– Evaluation and Operator code generation & compilation
• First class recognition of nested types without metadata requirement
– Schema Discovery and standard schema representation
• Clear delineation between important stages
– Support for multiple optimizers and researcher experimentation
10. How does it work?
• Drillbits run on each node to minimize network transfer
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization, scheduling, and execution are distributed

Example query spanning three systems:

SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
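
Extending the example above, a hedged sketch of how Jane might correlate her three datasets in a single query; every table and column name here (user_id, txn_id, event_id, segment) is a hypothetical stand-in:

-- Join user profiles (Mongo), transactions (Oracle), and access logs (HDFS)
-- in place, with no export or synchronization step in between.
SELECT u.segment,
       COUNT(DISTINCT t.txn_id) AS purchases,
       COUNT(e.event_id)        AS page_views
FROM mongo.users u
JOIN oracle.transactions t ON t.user_id = u.user_id
JOIN hdfs.events e ON e.user_id = u.user_id
GROUP BY u.segment;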
12. Apache Drill currently in development
• Heavy, active development by multiple supporting organizations
• Available
– Logical plan syntax and interpreter
– Reference Interpreter
• In progress
– SQL interpreter
– Storage Engine implementations for Accumulo,
Cassandra, HBase, and HDFS file formats
13. Conclusion & Questions
• Put Apache Drill on your roadmap; we’ll make your life easier
• Join the community
– Code: http://github.com/apache/incubator-drill
– Mailing List: drill-user@incubator.apache.org
– Wiki: https://cwiki.apache.org/confluence/display/DRILL
• Access this presentation: http://bit.ly/Wo6DLd
• Contact Me:
– jacques.drill@gmail.com