Low latency data processing with Impala
Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), JDBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
3. Beyond Batch
For some things MapReduce is just too slow
Apache Hive:
MapReduce execution engine
High-latency, low throughput
High runtime overhead
Google realized this early on
Analysts wanted fast, interactive results
Tuesday, July 2, 13
4. Dremel
Google paper (2010)
“scalable, interactive ad-hoc query system for analysis of read-only nested data”
Columnar storage format
Distributed scalable aggregation
“capable of running aggregation queries over trillion-row tables in seconds”
http://research.google.com/pubs/pub36632.html
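A minimal Python sketch of why a columnar layout helps aggregation queries. This is not Dremel's actual format (which also encodes nested records); the table, field names, and values are made up for illustration.

```python
# Hypothetical table; field names and values are illustrative only.
rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (column-oriented)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited as a whole to sum a single field.
row_total = sum(record["bytes"] for record in rows)

# Column layout: only the "bytes" column is read; other fields are never touched.
column_total = sum(columns["bytes"])

print(row_total, column_total)
```

Both totals are the same; the difference is how much data has to be read to compute them, which is what matters when the table lives on disk.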
5. Impala: Goals
General-purpose SQL query engine for Hadoop
For analytical and transactional workloads
Support queries that take milliseconds to hours
Run directly with Hadoop
Collocated daemons
Same file formats
Same storage managers (HDFS NameNode, Hive metastore)
6. Impala: Goals
High performance
C++
runtime code generation (LLVM)
direct access to data (no MapReduce)
Retain user experience
easy for Hive users to migrate
100% open-source
7. Impala: Capability
HiveQL (subset of SQL92)
select, project, join, union, subqueries, aggregation, insert, order by (with limit)
DDL
Directly queries data in HDFS & HBase
Text files (compressed)
Sequence files (snappy/gzip)
Avro & Trevni
GA features
8. Impala: Capability
Familiar and unified platform
Uses Hive’s metastore
Submit queries via ODBC or the Beeswax Thrift API
Query is distributed to nodes with relevant data
Process-to-process data exchange
Kerberos authentication
No fault tolerance
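The idea of sending a query only to nodes that hold relevant data, then combining their results, can be sketched in Python as a two-phase aggregation. This is an illustrative toy, not Impala's planner or exchange code; the node names, data placement, and function names are all assumptions.

```python
from collections import Counter

# Hypothetical placement of data on nodes; names and values are illustrative.
blocks_by_node = {
    "node1": [("US", 120), ("DE", 300)],
    "node2": [("US", 80), ("FR", 50)],
    "node3": [],  # holds no relevant data, so it receives no work
}


def local_aggregate(block):
    """Runs where the data lives: SUM(bytes) GROUP BY country for one node."""
    partial = Counter()
    for country, num_bytes in block:
        partial[country] += num_bytes
    return partial


def run_query(placement):
    """Coordinator: dispatch only to nodes with data, then merge partial sums."""
    total = Counter()
    for node, block in placement.items():
        if not block:
            continue  # skip nodes without relevant data
        total.update(local_aggregate(block))  # Counter.update adds the counts
    return dict(total)


result = run_query(blocks_by_node)
print(result)
```

In this sketch an exception in any `local_aggregate` call aborts the whole query, which mirrors the "no fault tolerance" point above: there is no retry of a failed fragment.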
9. Impala: Performance
Greater disk throughput
~100MB/sec/disk
I/O-bound workloads faster by 3-4x
Queries that require multiple MapReduce phases in Hive are significantly faster in Impala (up to 45x)
Queries that run against in-memory cached data see a significant speedup (up to 90x)
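As a rough back-of-the-envelope check on what ~100MB/sec/disk means for an I/O-bound scan, here is a small Python estimate. It assumes data is spread evenly across all disks and that every disk sustains the quoted rate; the cluster size and data volume are hypothetical, not figures from the talk.

```python
def scan_seconds(data_gb, nodes, disks_per_node, mb_per_sec_per_disk=100):
    """Estimate full-scan time when data is spread evenly over all disks."""
    total_mb = data_gb * 1024
    aggregate_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
    return total_mb / aggregate_mb_per_sec


# Hypothetical cluster: 10 nodes with 12 disks each, scanning 1 TB (1024 GB).
estimate = scan_seconds(data_gb=1024, nodes=10, disks_per_node=12)
print(round(estimate, 1))
```

The estimate ignores CPU cost, network exchange, and compression, so it is only a lower bound on scan time under these assumptions.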
10. Impala: Architecture
impalad
runs on every node
handles client requests (ODBC, Thrift)
handles query planning & execution
statestored
provides name service
metadata distribution
used for finding data
15. Current limitations
1.0.1 (available since May 2013)
No SerDes
No user-defined functions (UDFs)
impalad daemons read statestored metadata only at startup
16. Futures
DDL support (CREATE)
Rudimentary cost-based optimizer (CBO)
metadata distribution through statestored
Doug Cutting’s Trevni
Columnar storage format like Dremel’s
Impala + Trevni = Dremel superset