Performance optimizations in Apache Impala
© Cloudera, Inc. All rights reserved.
Mostafa Mokhtar (mmokhtar@cloudera.com)
@mostafamokhtar7
Outline
• History and motivation
• Impala architecture
• Performance-focused overview of:
  • Front-end
    • Query planning overview
    • Query optimization
    • Metadata and statistics
  • Back-end
    • Partitioning and sorting for selective scans
    • Code generation using LLVM
    • Streaming aggregation
    • Runtime filters
    • Handling cache misses for joins and aggregations
“Big data” revolution
SQL on Apache Hadoop
• SQL
• Ran on top of HDFS
• Supported various file formats
• Converted query operators to map-reduce jobs
• Ran at scale
• Fault-tolerant
• High startup costs / materialization overhead (slow...)
Impala: An open source SQL engine for Apache Hadoop
• Fast
  • C++ instead of Java
  • Runtime code generation
  • Supports interactive BI and analytical workloads
  • Supports queries that take from milliseconds to hours
• Scalable
  • Runs directly on “big” Hadoop clusters (100+ nodes)
• Flexible
  • Supports multiple storage engines (HDFS, S3, ADLS, Apache Kudu, ...)
  • Supports multiple file formats (Parquet, Text, Sequence, Avro, ...)
  • Supports both structured and semi-structured data
• Enterprise-grade
  • Authorization/authentication/lineage/audit
Impala’s history
• First commit in May 2011
• Public beta in October 2012
• Over a million downloads since then
• Most recently released version is 2.10, shipped with CDH 5.13
• November 2017: Apache® Impala™ graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.
SQL on “big data”. The race is on...
[Benchmark chart; compared systems include Hive on Tez and Hive on Tez + LLAP]
Impala - Logical view
[Architecture diagram] Each impalad contains a Query Compiler, Query Coordinator, and Query Executor, plus a cached copy of the metadata; the front-end (FE) is written in Java, the back-end (BE) in C++. Every impalad reads directly from the storage layer: HDFS, Kudu, S3/ADLS, and HBase. Shared metadata/control services sit alongside the impalads: the Hive Metastore (e.g. DB: db1, Tbl: foo (a int, ...) location ‘hdfs://…’; DB: db2, Tbl: blah (b string, ...) location ‘s3://…’), the HDFS NameNode (e.g. Dir: hdfs://db1/foo, File: foo1.csv, Blocks: ...., Size: 100B; Dir: S3://bucket/db2/blah, File: blah1.parq, Size: 1GB), the StateStore, the Catalog service, and Sentry for authorization (e.g. Role: admin, Privs: R/W access on ALL; Role: user, Privs: R access on TABLE db1.foo, ...).
Impala in action - DDL
[Architecture diagram] A SQL application issues CREATE TABLE db.foo (...) location ‘hdfs://…..’; over ODBC/JDBC to one impalad. The table definition is persisted through the Hive Metastore, and the updated metadata is propagated to the other impalads’ caches via the Catalog service and StateStore.
Impala in action - Select query
[Architecture diagram] A SQL application submits select * from db.foo; over ODBC/JDBC to one impalad, which acts as the query coordinator. The query is compiled and planned against the cached metadata, plan fragments are distributed to the executors on the other impalads, each executor scans its portion of the data directly from storage (HDFS, Kudu, S3/ADLS, HBase), and results stream back through the coordinator to the client.
Impala release to release performance trend
● Track record of improving performance release to release
● 10x speedup in standard benchmarks over the last 24 months
● Continued to add functionality without introducing regressions
Query planning
2-phase cost-based optimizer:
• Phase 1: generate a single-node plan (transformations, join ordering, static partition pruning, runtime filters)
• Phase 2: convert the single-node plan into a distributed query execution plan (add exchange nodes, decide join strategy)
Query fragments (units of work):
• Parts of the query execution tree
• Each fragment is executed on one or more impalads
Metadata & Statistics
Metadata
• Table metadata (HMS) and block-level (HDFS) information are cached to speed up query time
• Cached data is stored in FlatBuffers to save space and avoid excessive GC
• Metadata loading from HDFS/S3/ADLS uses a thread pool to speed up the operation when needed
Statistics
• Impala uses a HyperLogLog (HLL) sketch to compute the number of distinct values (NDV)
• HLL is much faster than COUNT(DISTINCT ...) and uses a constant amount of memory, so it is far less memory-intensive for columns with high cardinality
• HLL size is 1KB per column
• A novel implementation of sampled stats is coming soon
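The constant-memory NDV idea can be sketched as a minimal HyperLogLog. This is a simplified, hypothetical version (Impala's real implementation differs in hashing and bias handling); with p = 10 it keeps 2^10 one-byte registers, matching the 1KB-per-column figure above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal HyperLogLog sketch (illustrative, not Impala's implementation).
// Memory stays fixed at 2^p registers no matter how many values are seen.
class Hll {
 public:
  explicit Hll(int p) : p_(p), registers_(1u << p, 0) {}

  void Update(uint64_t value) {
    uint64_t h = Mix(value);
    uint32_t idx = static_cast<uint32_t>(h >> (64 - p_));  // top p bits pick a register
    uint64_t rest = h << p_;                               // remaining bits
    uint8_t rank = rest == 0 ? static_cast<uint8_t>(64 - p_ + 1)
                             : static_cast<uint8_t>(__builtin_clzll(rest) + 1);
    if (rank > registers_[idx]) registers_[idx] = rank;
  }

  double Estimate() const {
    double m = static_cast<double>(registers_.size());
    double sum = 0;
    int zeros = 0;
    for (uint8_t r : registers_) {
      sum += std::ldexp(1.0, -r);  // 2^-r
      if (r == 0) ++zeros;
    }
    double est = (0.7213 / (1 + 1.079 / m)) * m * m / sum;
    // Small-range correction (linear counting) for low cardinalities.
    if (est <= 2.5 * m && zeros > 0) est = m * std::log(m / zeros);
    return est;
  }

 private:
  // splitmix64 finalizer: well-mixed bits even for sequential inputs.
  static uint64_t Mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  int p_;
  std::vector<uint8_t> registers_;
};

// Feed the same distinct values several times: the estimate tracks NDV,
// not the row count, and memory never grows past the ~1KB of registers.
double EstimateNdv(int n_distinct, int repeats) {
  Hll hll(10);  // 2^10 registers ≈ 1KB, as on the slide
  for (int r = 0; r < repeats; ++r)
    for (int v = 0; v < n_distinct; ++v) hll.Update(static_cast<uint64_t>(v));
  return hll.Estimate();
}
```

Note how repeating the input leaves the estimate unchanged: only distinct values move the registers, which is why HLL scales to high-cardinality columns where an exact distinct count would not.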
Query optimizations based on statistics
● Order scan predicates by selectivity and cost; mitigate correlated predicates (exponential backoff)
● Detect the common join pattern of primary key/foreign key joins
● Compute the selectivity of predicates for scans as well as joins
● Determine the build and probe side for equi-joins
● Select the join type that minimizes resource utilization:
  ○ Broadcast join
  ○ Partitioned join
● Identify joins which can benefit from runtime filters
● Determine the optimal join order

Example query:

SELECT l_shipmode,
       sum(l_extendedprice)
FROM orders, lineitem
WHERE o_orderkey = l_orderkey
  AND l_comment LIKE '%long string%'
  AND l_receiptdate >= '1994-01-01'
  AND l_partkey < 100
  AND o_orderdate < '1993-01-01'
GROUP BY l_shipmode
ORDER BY l_shipmode

Plan:

02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: l_orderkey = o_orderkey
|  fk/pk conjuncts: l_orderkey = o_orderkey
|  runtime filters: RF000 <- o_orderkey
|  tuple-ids=1,0 row-size=113B cardinality=27,381,196
|
|--05:EXCHANGE [BROADCAST]
|  |  hosts=20 per-host-mem=0B
|  |  tuple-ids=0 row-size=8B cardinality=68,452,805
|  |
|  00:SCAN HDFS [tpch_3000_parquet.orders, RANDOM]
|     partitions=366/2406 files=366 size=28.83GB
|     predicates: tpch_3000_parquet.orders.o_orderkey < 100
|     table stats: 4,500,000,000 rows total
|     column stats: all
|     tuple-ids=0 row-size=8B cardinality=68,452,805
|
01:SCAN HDFS [tpch_3000_parquet.lineitem, RANDOM]
   partitions=2526/2526 files=2526 size=1.36TB
   predicates: l_partkey < 100, l_receiptdate >= '1994-01-01', l_comment LIKE '%long string%'
   runtime filters: RF000 -> l_orderkey
   table stats: 18,000,048,306 rows total
   column stats: all
   tuple-ids=1 row-size=105B cardinality=1,800,004,831
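The predicate-ordering idea can be sketched as follows. The ranking formula and the selectivity/cost numbers here are illustrative only, not Impala's actual cost model: predicates that reject more rows per unit of evaluation cost should run first.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Illustrative predicate ordering (not Impala's actual cost model):
// rank each predicate by rows rejected per unit of evaluation cost.
struct ScanPredicate {
  std::string name;
  double selectivity;  // fraction of rows that SURVIVE the predicate
  double cost;         // relative evaluation cost per row
};

void OrderBySelectivityAndCost(std::vector<ScanPredicate>* preds) {
  std::sort(preds->begin(), preds->end(),
            [](const ScanPredicate& a, const ScanPredicate& b) {
              return (1.0 - a.selectivity) / a.cost >
                     (1.0 - b.selectivity) / b.cost;
            });
}

// For the lineitem scan above (made-up estimates): the cheap, selective
// l_partkey < 100 should run before the expensive LIKE '%long string%'.
std::string FirstPredicate() {
  std::vector<ScanPredicate> preds = {
      {"l_comment LIKE '%long string%'", 0.5, 10.0},
      {"l_partkey < 100", 0.1, 1.0},
      {"l_receiptdate >= '1994-01-01'", 0.9, 1.0},
  };
  OrderBySelectivityAndCost(&preds);
  return preds[0].name;
}
```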
Optimization for metadata-only queries
• Use metadata to avoid table accesses for partition-key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable to:
  • min(), max(), ndv(), and aggregate functions with the DISTINCT keyword
  • partition keys only

Plan without optimization:

03:AGGREGATE [FINALIZE]
|  output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
|  output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB

Plan with optimization:

01:AGGREGATE [FINALIZE]
|  output: min(month), max(year)
|
00:UNION
   constant-operands=24
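The rewrite amounts to answering the aggregate from the cached partition list instead of scanning files. A sketch with hypothetical metadata structures (the 2009-2010 x 12-month layout mirrors the 24-partition example table):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical cached partition metadata: one (year, month) key pair per
// partition, as the catalog would hold for a table partitioned by both.
using PartitionKeys = std::vector<std::pair<int, int>>;  // (year, month)

// min(month), max(year) over partition keys only: iterate the cached
// metadata, never the data files.
std::pair<int, int> MinMonthMaxYear(const PartitionKeys& parts) {
  int min_month = parts[0].second;
  int max_year = parts[0].first;
  for (const auto& [year, month] : parts) {
    min_month = std::min(min_month, month);
    max_year = std::max(max_year, year);
  }
  return {min_month, max_year};
}

// 24 partitions: 2 years x 12 months, as in the example plan above.
std::pair<int, int> Demo() {
  PartitionKeys parts;
  for (int year = 2009; year <= 2010; ++year)
    for (int month = 1; month <= 12; ++month) parts.push_back({year, month});
  return MinMonthMaxYear(parts);
}
```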
Scanner: extract common conjuncts from disjunctions
● Extract common conjuncts from disjunctions:
  (a AND b) OR (a AND b) ==> a AND b
  (a AND b AND c) OR (c) ==> c
● >100x speedup for TPC-DS Q13, Q48 & TPC-H Q19

Example (TPC-H Q19):

SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM lineitem, part
WHERE (p_partkey = l_partkey
       AND p_brand = 'Brand#32'
       AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
       AND l_quantity >= 7 AND l_quantity <= 7 + 10
       AND p_size BETWEEN 1 AND 5
       AND l_shipmode IN ('AIR', 'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')
   OR (p_partkey = l_partkey
       AND p_brand = 'Brand#35'
       AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
       AND l_quantity >= 15 AND l_quantity <= 15 + 10
       AND p_size BETWEEN 1 AND 10
       AND l_shipmode IN ('AIR', 'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')
   OR (p_partkey = l_partkey
       AND p_brand = 'Brand#24'
       AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
       AND l_quantity >= 26 AND l_quantity <= 26 + 10
       AND p_size BETWEEN 1 AND 15
       AND l_shipmode IN ('AIR', 'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')

Scan after the rewrite (the common conjuncts are evaluated in the scanner):

00:SCAN HDFS [tpch_300_parquet.lineitem, RANDOM]
   partitions=1/1 files=259 size=63.71GB
   predicates: l_shipmode IN ('AIR', 'AIR REG'), l_shipinstruct = 'DELIVER IN PERSON'
   runtime filters: RF000 -> l_partkey
   table stats: 1799989091 rows total
   hosts=7 per-host-mem=528.00MB
   tuple-ids=0 row-size=80B cardinality=240533660
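The rewrite can be modeled as a set intersection over the OR branches. This is a simplified sketch (Impala operates on expression trees, not strings): any conjunct present in every branch is implied by the whole disjunction and can be hoisted.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Simplified model of the rewrite: each inner vector is one AND-branch of
// the OR; a conjunct appearing in every branch is implied by the whole
// disjunction, so it can be hoisted and pushed to the scanner.
std::set<std::string> ExtractCommonConjuncts(
    const std::vector<std::vector<std::string>>& branches) {
  if (branches.empty()) return {};
  std::set<std::string> common(branches[0].begin(), branches[0].end());
  for (size_t i = 1; i < branches.size(); ++i) {
    std::set<std::string> branch(branches[i].begin(), branches[i].end());
    std::set<std::string> kept;
    std::set_intersection(common.begin(), common.end(), branch.begin(),
                          branch.end(), std::inserter(kept, kept.begin()));
    common.swap(kept);
  }
  return common;
}
```

The two identities on the slide fall out directly: (a AND b) OR (a AND b) yields {a, b}, and (a AND b AND c) OR (c) yields {c}.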
Query execution
• Not based on map-reduce
• Single-threaded, row-based, Volcano-style (iterator-based with batches) query execution engine
• Multi-threaded scans
• Utilizes HDFS short-circuit local reads & a multi-threaded I/O subsystem
• Supports multiple file formats (Parquet, Avro, Text, Sequence, …)
• Supports nested types (only in Parquet)
• Code generation using LLVM
• No transactions
• No indices
Partitioning
• Impala does not have native indexes (e.g. B+-tree), but it does allow a type of indexing by partitions
• What it is: physically dividing your data so that queries only need to access a subset
• The partitioning scheme is expressed through DDL
• Pruning is applied automatically when queries contain matching predicates

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT);

SELECT …
FROM Sales
WHERE year >= 2012
  AND month IN (1, 2, 3)

or

CREATE TABLE Sales (...)
PARTITIONED BY (date_key INT);

SELECT …
FROM Sales JOIN DateDim d
  USING (date_key)
WHERE d.year >= 2012
  AND d.month IN (1, 2, 3)
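Pruning itself is a cheap metadata operation, sketched here with hypothetical structures: the planner walks the cached partition list and keeps only partitions whose key values can satisfy the predicate, so pruned directories are never listed, opened, or scanned.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical partition metadata for the Sales table partitioned by
// (year, month): one entry per partition directory.
struct SalesPartition {
  int year;
  int month;
  std::string path;
};

// Static partition pruning for: WHERE year >= 2012 AND month IN (1, 2, 3).
std::vector<SalesPartition> Prune(const std::vector<SalesPartition>& parts) {
  std::vector<SalesPartition> kept;
  for (const auto& p : parts) {
    bool month_in = p.month == 1 || p.month == 2 || p.month == 3;
    if (p.year >= 2012 && month_in) kept.push_back(p);
  }
  return kept;
}

// Two years x 12 months = 24 partitions; only 2012's first quarter survives,
// so the scan touches 3 directories instead of 24.
size_t PartitionsScanned() {
  std::vector<SalesPartition> parts;
  for (int year = 2011; year <= 2012; ++year)
    for (int month = 1; month <= 12; ++month)
      parts.push_back({year, month, "sales/year=" + std::to_string(year) +
                                        "/month=" + std::to_string(month)});
  return Prune(parts).size();
}
```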
Optimizations for selective scans: Sorting
● Sorting data files improves the effectiveness of file statistics (min/max) and compression (e.g., delta encoding).
● Predicates are evaluated against min/max statistics as well as dictionary-encoded columns; this approximates lazy materialization.
● Sorting can be used on columns which have too many values to qualify for partitioning.
● Create sorted data files by adding the SORT BY clause during table creation.
● The Parquet community is working on extending the format for efficient point lookups.

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT)
SORT BY (day, hour)
STORED AS PARQUET;
Optimizations for selective scans: Augment partitioning
Business question: find the top 10 customers in terms of revenue who made purchases on Christmas Eve in a given time window.
SORT BY helps meet query SLAs without over-partitioning the table.

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT)
SORT BY (day, hour)
STORED AS PARQUET;

SELECT sum(ss_ext_sales_price) AS revenue,
       count(ss_quantity) AS quantity_count,
       customer_name
FROM Sales
WHERE year = 2017 AND month = 12
  AND day = 24 AND hour BETWEEN 1 AND 4
GROUP BY customer_name
ORDER BY revenue DESC LIMIT 10;

Speedup from partitioning + sorting:
• Elapsed time: 3x
• CPU time (seconds): 5x
• HDFS MBytes read: 17x
Optimizations for selective scans: Complement partitioning
Business question: find interactions for a specific customer in a given time window.
SORT BY helps meet query SLAs without over-partitioning the table.

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT)
SORT BY (customer_id)
STORED AS PARQUET;

SELECT *
FROM Sales
WHERE year = 2016 AND customer_id = 4976004;

Speedup from partitioning + sorting:
• Elapsed time: 18x
• HDFS MBytes read: 22x
LLVM Codegen in Impala
• Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance.
• In particular, code generation is applied to “inner loop” functions.
• Code generation (codegen) lets Impala use query-specific information to do less work:
  • Remove conditionals
  • Propagate constant offsets, pointers, etc.
  • Inline virtual function calls
LLVM Codegen in Impala
Operations:
• Hash join
• Aggregation
• Scans: Parquet, Text, Sequence, Avro
• Expressions in all operators
• Union
• Sort
• Top-N
• Runtime filters
Data types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL
Further optimizations:
• Don’t codegen short-running queries (relies on cardinality estimates)
• Limit the amount of inlining for long and complex expressions (e.g. GROUP BY over 1K columns)
Codegen for Order by & Top-N

select l_extendedprice, l_orderkey
from lineitem
order by l_orderkey
limit 100

Sample lineitem rows:
l_orderkey | l_extendedprice | l_shipdate | l_shipmode
26039617 | 13515 | 12/8/1997 | FOB
30525093 | 16218 | 12/16/1997 | REG AIR
28809990 | 7208 | 10/19/1997 | AIR
The original, interpreted comparator called for every pair of rows:

int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value,
        sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;
    // Otherwise, try the next Expr
  }
  return 0;  // fully equivalent key
}
Each GetValue() call switches on the expression type:

void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: {
      ...
    }
    case TYPE_TINYINT: {
      ...
    }
    case TYPE_INT: {
      ...
And RawValue::Compare() switches on the column type for every comparison:

int RawValue::Compare(const void* v1, const void* v2,
                      const ColumnType& type) {
  switch (type.type) {
    case TYPE_INT:
      i1 = *reinterpret_cast<const int32_t*>(v1);
      i2 = *reinterpret_cast<const int32_t*>(v2);
      return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
    case TYPE_BIGINT:
      b1 = *reinterpret_cast<const int64_t*>(v1);
      b2 = *reinterpret_cast<const int64_t*>(v2);
      return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);
      ...
Codegen produces a comparator specialized to the query (here: a single ascending BIGINT sort key):

int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs);
  int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 :
      (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  // Otherwise, try the next Expr
  return 0;  // fully equivalent key
}
Compared to the original code, the generated comparator:
• Perfectly unrolls the “for each sort column” loop
• Does no switching on input type(s)
• Removes branching on ASCENDING/DESCENDING and NULLS FIRST/LAST
The result: roughly 10x more efficient comparison code.
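What codegen buys can be imitated in plain C++ as an analogy (this is not how Impala actually generates code): the generic path re-dispatches on a runtime type tag for every value, while the specialized path is fixed to the query's single BIGINT sort key.

```cpp
#include <cassert>
#include <cstdint>

// Generic path: every comparison re-dispatches on a runtime type tag,
// mirroring the switch in RawValue::Compare above.
enum class TypeTag { kInt, kBigInt };

int GenericCompare(TypeTag t, const void* v1, const void* v2) {
  switch (t) {
    case TypeTag::kInt: {
      int32_t a = *static_cast<const int32_t*>(v1);
      int32_t b = *static_cast<const int32_t*>(v2);
      return a > b ? 1 : (a < b ? -1 : 0);
    }
    case TypeTag::kBigInt: {
      int64_t a = *static_cast<const int64_t*>(v1);
      int64_t b = *static_cast<const int64_t*>(v2);
      return a > b ? 1 : (a < b ? -1 : 0);
    }
  }
  return 0;
}

// Specialized path: type, direction, and column count are baked in, so the
// compiler sees straight-line integer code with no dispatch.
inline int SpecializedCompare(int64_t a, int64_t b) {
  return a > b ? 1 : (a < b ? -1 : 0);
}

// The two paths agree on BIGINT keys; the specialized one just skips the
// per-value dispatch.
bool AgreesOnBigInt(int64_t x, int64_t y) {
  return GenericCompare(TypeTag::kBigInt, &x, &y) == SpecializedCompare(x, y);
}
```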
Distributed Aggregation in MPP
• Aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value
  • E.g. many sales per customer

Example: select cust_id, sum(dollars) from sales group by cust_id;
[Diagram: each node runs Scan → Preagg; partial results are exchanged over the network and combined by Merge operators]
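The two phases can be sketched with hash maps, a toy model of the sum(dollars) group-by above: each node pre-aggregates its local rows, and the merge phase combines the partial sums that arrive over the network.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using Rows = std::vector<std::pair<int, double>>;  // (cust_id, dollars)
using Partial = std::map<int, double>;             // cust_id -> sum(dollars)

// Phase 1: pre-aggregate the rows local to one node before the exchange.
Partial PreAggregate(const Rows& local_rows) {
  Partial agg;
  for (const auto& [cust_id, dollars] : local_rows) agg[cust_id] += dollars;
  return agg;
}

// Phase 2: merge the partial aggregates shipped across the network.
Partial MergeAggregate(const std::vector<Partial>& partials) {
  Partial merged;
  for (const auto& p : partials)
    for (const auto& [cust_id, sum] : p) merged[cust_id] += sum;
  return merged;
}

// Two "nodes" worth of sales: many rows per customer collapse to one
// partial row per (node, customer) on the wire.
Partial Demo() {
  Partial node1 = PreAggregate({{1, 2.0}, {1, 3.0}, {2, 1.0}});
  Partial node2 = PreAggregate({{1, 1.0}, {2, 4.0}});
  return MergeAggregate({node1, node2});
}
```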
Downside of pre-aggregations
• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct over nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad: better to send rows to the merge aggregation than to disk

Example: select distinct * from sales;
[Diagram: each node runs Scan → Preagg; results are exchanged over the network to Merge operators]
Streaming Pre-aggregations in Impala
• The reduction factor is dynamically estimated based on the actual data processed
• The pre-aggregation expands memory usage only if the reduction factor is good
• Benefits:
  • Certain aggregations with a low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk

Example: select distinct * from sales;
[Diagram: each node runs Scan → Stream (streaming pre-aggregation); results are exchanged over the network to Merge operators]
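The dynamic decision can be sketched as follows. The threshold here is made up for illustration and Impala's actual heuristics differ: if the observed rows-per-group ratio stays low, the pre-aggregation stops growing its hash table and streams rows through to the merge phase instead.

```cpp
#include <cassert>

// Illustrative heuristic (not Impala's actual thresholds): with the hash
// table at capacity, decide whether expanding it is worthwhile based on
// the reduction factor observed so far.
bool ShouldExpandPreAgg(long input_rows_seen, long groups_so_far,
                        double min_reduction = 2.0) {
  if (groups_so_far == 0) return true;  // nothing observed yet
  double reduction = static_cast<double>(input_rows_seen) / groups_so_far;
  // Below the threshold, pass rows through unaggregated (stream) rather
  // than spend memory on a table that barely reduces the data.
  return reduction >= min_reduction;
}
```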
Optimizations for selective joins: Runtime filters
• When are runtime filters useful?
  • To optimize selective equi-joins against large partitioned or unpartitioned tables
• General idea:
  • Avoid unnecessary I/O to read partition data, and avoid unnecessary network transmission, by sending only the subset of rows that match the join keys across the network
  • Some predicates can only be computed at runtime
  • Use a Bloom filter, which uses a probability-based algorithm to store and query values for the join column(s)
Runtime filters in action

SELECT c_email_address,
       sum(ss_ext_sales_price) sum_agg
FROM store_sales,
     customer,
     customer_demographics
WHERE ss_customer_sk = c_customer_sk
  AND cd_demo_sk = c_current_cdemo_sk
  AND cd_gender = 'M'
  AND cd_purchase_estimate = 10000
  AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC

[Plan diagram] store_sales (43 billion rows) and customer (3.8 million rows) are shuffled into Join #1; its 43-billion-row output feeds Join #2 against the broadcast customer_demographics side (2,400 rows), whose 49-million-row output feeds the aggregate.
44© Cloudera, Inc. All rights reserved.
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
Joins #1 and #2 are expensive
since the left side of each
join has 43 billion rows
store_sales
43 billion rows
customer
3.8 million
Shuffle Shuffle
store_sales
customer customer_demo
Runtime filters in action
45© Cloudera, Inc. All rights reserved.
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
Create bloom filter from
Join #2 on cd_demo_sk and
push down to customer
table scan
store_sales
43 billion rows
customer
3.8 million
Shuffle Shuffle
store_sales
customer customer_demo
Runtime filters in action
46© Cloudera, Inc. All rights reserved.
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
Reduced customer rows by
826X
3.8 million to 4,600 rows
store_sales
43 billion rows
customer
4,600 rows
Shuffle Shuffle
store_sales
customer customer_demo
Runtime filters in action
47© Cloudera, Inc. All rights reserved.
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Shuffle
Join #1
43 billion rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
store_sales
43 billion rows
customer
4,600 rows
Shuffle Shuffle
Create bloom filter from
Join #1 on c_customer_sk
and push down to
store_sales table scan
store_sales
customer customer_demo
Runtime filters in action
48© Cloudera, Inc. All rights reserved.
SELECT c_email_address
,sum(ss_ext_sales_price) sum_agg
FROM store_sales
,customer
,customer_demographics
WHERE ss_customer_sk = c_customer_sk
AND cd_demo_sk = c_current_cdemo_sk
AND cd_gender = 'M'
AND cd_purchase_estimate = 10000
AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Shuffle
Join #1
49 million rows
customer_demo
2,400 rows
Broadcast
Join #2
Aggregate
49 million rows
store_sales
49 million rows
customer
4,600 rows
Shuffle Shuffle
877x reduction in rows
43 billion -> 49 million rows
store_sales
customer customer_demo
Runtime filters in action
49© Cloudera, Inc. All rights reserved.
• Cache friendly, hash each key to a Bloom filter the size of a cache line or smaller
• Use AVX2 instruction set for Insert and Find operations
• Planner uses a cost model to decide which joins benefit from Runtime Filters
• Bloom filters are sized based on estimated number of distinct values for the build
side (1MB-16MB by default)
• Ineffective runtime filters are disabled dynamically when
• The Bloom filter has a high false-positive rate
• Filter selectivity is poor (reduces CPU overhead)
• Track min and max values from the build side and push new implied predicates to
the scan of the probe side (supported for Kudu in CDH 5.14, Parquet support
coming soon)
• Runtime filters are codegen'd, avoiding virtual function calls and branching in the
inner loop
Optimizations for selective joins : Runtime filters
50© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
51© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Most of the time is spent
in L1 & L2 cache misses!
52© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Old hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
Read probe join key
from input batch
53© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Old hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
Compute hash value
54© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Old hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
Find the partition ID
55© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Old hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
Probe hash table
56© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Old hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
Probe hash table
CACHE-MISS
Data dependency
resulting in poor
pipelining
57© Cloudera, Inc. All rights reserved.
Joins and aggregations under the hood
Old hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
Probe hash table
CACHE-MISS
Data dependency
resulting in poor
pipelining
58© Cloudera, Inc. All rights reserved.
Joins and aggregations in Impala : Current state
Old hash table probe pseudo code Current hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
int hash_values[], probe_values[], partition_ids[];
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_values[i] = probe_batch_->GetRow(i);
// Compute hash value
hash_values[i] = Hash(probe_values[i]);
// Find the matching partition in the hash table
partition_ids[i] = hash_values[i] >> (32 - NUM_PARTITIONING_BITS);
// Prefetch the hash value into LLC
hash_tbls_[partition_ids[i]]->Prefetch(hash_values[i]);
}
for (int i = 0; i < probe_batch_size; i++) {
// Probe the hash table
iterator = hash_tbls_[partition_ids[i]]->FindProbeRow(
ht_ctx, hash_values[i],
probe_values[i]);
}
Prefetch the hash
value
59© Cloudera, Inc. All rights reserved.
Joins and aggregations in Impala : Current state
Old hash table probe pseudo code Current hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
int hash_values[], probe_values[], partition_ids[];
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_values[i] = probe_batch_->GetRow(i);
// Compute hash value
hash_values[i] = Hash(probe_values[i]);
// Find the matching partition in the hash table
partition_ids[i] = hash_values[i] >> (32 - NUM_PARTITIONING_BITS);
// Prefetch the hash value into LLC
hash_tbls_[partition_ids[i]]->Prefetch(hash_values[i]);
}
for (int i = 0; i < probe_batch_size; i++) {
// Probe the hash table
iterator = hash_tbls_[partition_ids[i]]->FindProbeRow(
ht_ctx, hash_values[i],
probe_values[i]);
}
Break up the loop into two
parts to reduce the data
dependency
Batch size is 1024 rows, giving
enough time to ensure
prefetched data is in LLC
Prefetch the hash
value
60© Cloudera, Inc. All rights reserved.
Joins and aggregations in Impala : Current state
Old hash table probe pseudo code Current hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
int hash_values[], probe_values[], partition_ids[];
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_values[i] = probe_batch_->GetRow(i);
// Compute hash value
hash_values[i] = Hash(probe_values[i]);
// Find the matching partition in the hash table
partition_ids[i] = hash_values[i] >> (32 - NUM_PARTITIONING_BITS);
// Prefetch the hash value into LLC
hash_tbls_[partition_ids[i]]->Prefetch(hash_values[i]);
}
for (int i = 0; i < probe_batch_size; i++) {
// Probe the hash table
iterator = hash_tbls_[partition_ids[i]]->FindProbeRow(
ht_ctx, hash_values[i],
probe_values[i]);
}
Prefetch the hash
value
CACHE-HIT
Break up the loop into two
parts to reduce the data
dependency
Batch size is 1024 rows, giving
enough time to ensure
prefetched data is in LLC
61© Cloudera, Inc. All rights reserved.
Joins and aggregations in Impala : Current state
Old hash table probe pseudo code Current hash table probe pseudo code
int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_value = probe_batch_->GetRow(i);
// Compute hash value
hash_value = Hash(probe_value);
// Find the matching partition in the hash table
partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
iterator = hash_tbls_[partition_id]->FindProbeRow(
ht_ctx, hash_value,
probe_value);
}
int hash_values[], probe_values[], partition_ids[];
for (int i = 0; i < probe_batch_size; i++) {
// Read probe key from input batch
probe_values[i] = probe_batch_->GetRow(i);
// Compute hash value
hash_values[i] = Hash(probe_values[i]);
// Find the matching partition in the hash table
partition_ids[i] = hash_values[i] >> (32 - NUM_PARTITIONING_BITS);
// Prefetch the hash value into LLC
hash_tbls_[partition_ids[i]]->Prefetch(hash_values[i]);
}
for (int i = 0; i < probe_batch_size; i++) {
// Probe the hash table
iterator = hash_tbls_[partition_ids[i]]->FindProbeRow(
ht_ctx, hash_values[i],
probe_values[i]);
}
Prefetch the hash
value
CACHE-HIT
62© Cloudera, Inc. All rights reserved.
Lessons learnt
● Don’t underestimate the impact of cache misses
● Prefetching is useful when done before the data needs to be read from memory
○ 30-40% speedup for join and aggregation operations
● Removing data dependencies opens opportunities for improvement
Caveats
● TLB misses are still an issue
● Prefetching is only as good as the hash function used; it won’t handle chaining
Joins and aggregations in Impala : Current state
63© Cloudera, Inc. All rights reserved.
Impala Roadmap Focus
• Performance & Scalability
• Parquet scanner performance
• Metadata
• RPC Layer
• Reliability
• Node decommission
• Better resource management
• Cloud
64© Cloudera, Inc. All rights reserved.
Thank you
https://github.com/apache/impala
http://impala.apache.org/
Impala: A Modern, Open-Source SQL Engine
for Hadoop
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Kürzlich hochgeladen

%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...chiefasafspells
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 

Kürzlich hochgeladen (20)

%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 

Performance Optimizations in Apache Impala

  • 1. 1© Cloudera, Inc. All rights reserved. Performance optimizations in Apache Impala Mostafa Mokhtar (mmokhtar@cloudera.com) @mostafamokhtar7
  • 2. 2© Cloudera, Inc. All rights reserved. • History and motivation • Impala architecture • Performance focused overview on • Front-end • Query planning overview • Query optimization • Metadata and statistics • Back-end • Partitioning and sorting for Selective scans • Code-generation using LLVM • Streaming Aggregation • Runtime filters • Handling cache misses for Joins and Aggs Outline
  • 3. 3© Cloudera, Inc. All rights reserved. “Big data” revolution
  • 4. 4© Cloudera, Inc. All rights reserved. SQL on Apache Hadoop • SQL • Run on top of HDFS • Supported various file formats • Converted query operators to map-reduce jobs • Run at scale • Fault-tolerant • High startup-costs/materialization overhead (slow...)
  • 5. 5© Cloudera, Inc. All rights reserved. Impala: An open source SQL engine for Apache Hadoop • Fast • C++ instead of Java • Runtime code generation • Support interactive BI and analytical workloads • Supports queries that take from milliseconds to hours • Scalable • Run directly on “big” Hadoop clusters (100+ nodes) • Flexible • Support multiple storage engines (HDFS, S3, ADLS, Apache Kudu, ...) • Support multiple file formats (Parquet, Text, Sequence, Avro, ...) • Support both structured and semi-structured data • Enterprise-grade • Authorization/authentication/lineage/audit
  • 6. 6© Cloudera, Inc. All rights reserved. Impala’s history • First commit in May 2011 • Public beta in October 2012 • Over a million downloads since then • The most recent release is 2.10, shipped with CDH 5.13 • November 2017 • Apache® Impala™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.
  • 7. 7© Cloudera, Inc. All rights reserved. SQL on “big data”. The race is on... Hive on Tez Hive on Tez + LLAP
  • 8. 8© Cloudera, Inc. All rights reserved. Impala - Logical view Query Compiler Query Executor Query Coordinator Metadata HDFS Kudu S3/ADLS HBase Query Compiler Query Executor Query Coordinator Metadata Query Compiler Query Executor Query Coordinator Metadata Hive Metastore HDFS NameNode StateStore Catalog FE (Java) BE (C++) HDFS Kudu S3/ADLS HBase HDFS Kudu S3/ADLS HBase Metadata Impalad Impalad Impalad Metadata/control Execution Storage Sentry Role: admin, Privs: R/W access on ALL Role: user, Privs: R access on TABLE db1.foo, ... Dir: hdfs://db1/foo, File: foo1.csv, Blocks:...., Size: 100B Dir: S3://bucket/db2/blah, File: blah1.parq, Size: 1GB DB: db1, Tbl: foo (a int,...) location ‘hdfs://…’ DB: db2, Tbl: blah (b string..) location ‘s3://…’
  • 9. 9© Cloudera, Inc. All rights reserved. Metadata Impala in action - DDL Query Compiler Query Executor Query Coordinator HDFS Kudu S3/ADLS HBase Query Compiler Query Executor Query Coordinator Metadata Query Compiler Query Executor Query Coordinator Metadata Hive Metastore HDFS NameNode StateStore Catalog HDFS Kudu S3/ADLS HBase HDFS Kudu S3/ADLS HBase Metadata Impalad Impalad Impalad SQL App ODBC/JDBC CREATE TABLE db.foo (...) location ‘hdfs://…..’;
  • 10. 11© Cloudera, Inc. All rights reserved. Impala in action - Select query Query Compiler Query Executor Query Coordinator Metadata HDFS Kudu S3/ADLS HBase Query Compiler Query Executor Query Coordinator Metadata Query Compiler Query Executor Query Coordinator Metadata Hive Metastore HDFS NameNode StateStore Catalog HDFS Kudu S3/ADLS HBase HDFS Kudu S3/ADLS HBase Metadata Impalad Impalad Impalad SQL App ODBC/JDBC select * from db.foo; 0 1 2 34 5 6 7
  • 11. 12© Cloudera, Inc. All rights reserved. Impala release-to-release performance trend ● Track record of improving release-to-release performance ● 10x speedup in standard benchmarks over the last 24 months ● Continued to add functionality without introducing regressions
  • 12. 13© Cloudera, Inc. All rights reserved. 2-phase cost-based optimizer • Phase 1: Generate a single node plan (transformations, join ordering, static partition pruning, runtime filters) • Phase 2: Convert the single node plan into a distributed query execution plan (add exchange nodes, decide join strategy) Query fragments (units of work): • Parts of query execution tree • Each fragment is executed in one or more impalads Query planning
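The second phase decides, among other things, between a broadcast and a partitioned (shuffle) join. The trade-off can be sketched with a toy cost model (the formulas and names here are illustrative, not Impala's actual planner cost model):

```python
def choose_join_strategy(build_rows, build_row_size, probe_bytes, num_nodes):
    """Toy cost comparison: a broadcast join ships the whole build side
    to every node, while a partitioned join ships both sides over the
    network exactly once (hash-partitioned)."""
    build_bytes = build_rows * build_row_size
    broadcast_cost = build_bytes * num_nodes      # build side replicated N times
    partitioned_cost = build_bytes + probe_bytes  # both sides moved once
    return "BROADCAST" if broadcast_cost <= partitioned_cost else "PARTITIONED"

# A small dimension table joined to a huge fact table favors broadcast;
# two large tables favor a partitioned join.
print(choose_join_strategy(1_000, 100, 10**12, 20))      # small build side
print(choose_join_strategy(10**9, 100, 10**9, 20))       # large build side
```

In real planners the decision also accounts for memory on the build side and per-node parallelism; this sketch only captures the network-bytes intuition.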
  • 13. 14© Cloudera, Inc. All rights reserved. Metadata • Table metadata (HMS) and block-level (HDFS) information are cached to speed up query time • Cached data is stored in FlatBuffers to save space and avoid excessive GC • Metadata loading from HDFS/S3/ADLS uses a thread pool to speed up the operation when needed Statistics • Impala uses HyperLogLog (HLL) to compute the number of distinct values (NDV) • HLL is much faster than the combination of COUNT and DISTINCT, uses a constant amount of memory, and is thus less memory-intensive for columns with high cardinality • HLL size is 1KB per column • A novel implementation of sampled stats is coming soon Metadata & Statistics
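The constant-memory NDV estimate can be illustrated with a minimal HyperLogLog sketch in Python (illustrative only; Impala's C++ implementation, hash function, and register sizing differ):

```python
import hashlib
import math

def hll_estimate(values, p=10):
    """Minimal HyperLogLog: estimate the number of distinct values using
    m = 2**p one-byte registers (~1KB at p=10), regardless of input size."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                 # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)    # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # small-cardinality correction (linear counting)
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return int(est)

print(hll_estimate(range(10_000)))  # close to 10000, using only 1024 registers
```

The standard error at p=10 is roughly 1.04/sqrt(1024) ≈ 3%, which is accurate enough for cardinality estimation in a query planner.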
  • 14. 15© Cloudera, Inc. All rights reserved. ● Order scan predicates by selectivity and cost, mitigate correlated predicates (exponential backoff) ● Detection of common join pattern of Primary key/Foreign key joins ● Compute selectivity of predicates for scans as well as joins ● Determine build and probe side for equi joins ● Select the ideal join type that minimizes resource utilization ○ Broadcast Join ○ Partition Join ● Identify joins which can benefit from Runtime filters ● Determine optimal join order Query optimizations based on statistics 02:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: l_partkey = o_orderkey | fk/pk conjuncts: l_orderkey = o_orderkey | runtime filters: RF000 <- o_orderkey | tuple-ids=1,0 row-size=113B cardinality=27,381,196 | |--05:EXCHANGE [BROADCAST] | | hosts=20 per-host-mem=0B | | tuple-ids=0 row-size=8B cardinality=68,452,805 | | | 00:SCAN HDFS [tpch_3000_parquet.orders, RANDOM] | partitions=366/2406 files=366 size=28.83GB | predicates: tpch_3000_parquet.orders.o_orderkey < 100 | table stats: 4,500,000,000 rows total | column stats: all | tuple-ids=0 row-size=8B cardinality=68,452,805 | 01:SCAN HDFS [tpch_3000_parquet.lineitem, RANDOM] partitions=2526/2526 files=2526 size=1.36TB predicates: l_partkey < 100, l_receiptdate >= '1994-01-01', l_comment LIKE '%long string% runtime filters: RF000 -> l_orderkey table stats: 18,000,048,306 rows total column stats: all tuple-ids=1 row-size=105B cardinality=1,800,004,831 SELECT l_shipmode, Sum(l_extendedprice) FROM orders, lineitem WHERE o_orderkey = l_orderkey AND l_comment LIKE '%long string%' AND l_receiptdate >= '1994-01-01' AND l_partkey < 100 AND o_orderdate < '1993-01-01' GROUP BY l_shipmode ORDER BY l_shipmode
  • 20. Optimization for metadata-only queries

• Use metadata to avoid table accesses for partition key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by the query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable to:
  • min(), max(), ndv() and aggregate functions with the DISTINCT keyword
  • partition keys only

Plan with optimization:
01:AGGREGATE [FINALIZE]
|  output: min(month), max(year)
|
00:UNION
   constant-operands=24

Plan without optimization:
03:AGGREGATE [FINALIZE]
|  output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
|  output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB
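The metadata-only rewrite can be illustrated with a minimal sketch (not Impala's actual code; `PartitionMeta` and `MinMonthMaxYear` are hypothetical names): the answer is computed from catalog partition entries alone, never touching data files.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <utility>
#include <vector>

// Hypothetical stand-in for catalog metadata: one entry per partition.
struct PartitionMeta { int year; int month; };

// min(month), max(year) answered purely from partition metadata, the way
// OPTIMIZE_PARTITION_KEY_SCANS replaces the scan with a UNION of the
// distinct partition-key values.
std::pair<int, int> MinMonthMaxYear(const std::vector<PartitionMeta>& parts) {
  int min_month = INT_MAX, max_year = INT_MIN;
  for (const auto& p : parts) {
    min_month = std::min(min_month, p.month);  // no file I/O needed
    max_year = std::max(max_year, p.year);
  }
  return {min_month, max_year};
}
```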
  • 21. Scanner: extract common conjuncts from disjunctions

● Extract common conjuncts from disjunctions:
  (a AND b) OR (a AND b)  ==>  a AND b
  (a AND b AND c) OR (c)  ==>  c
● >100x speedup for TPC-DS Q13, Q48 & TPC-H Q19

SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM lineitem, part
WHERE (p_partkey = l_partkey AND p_brand = 'Brand#32'
       AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
       AND l_quantity >= 7 AND l_quantity <= 7 + 10
       AND p_size BETWEEN 1 AND 5
       AND l_shipmode IN ('AIR', 'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')
   OR (p_partkey = l_partkey AND p_brand = 'Brand#35'
       AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
       AND l_quantity >= 15 AND l_quantity <= 15 + 10
       AND p_size BETWEEN 1 AND 10
       AND l_shipmode IN ('AIR', 'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')
   OR (p_partkey = l_partkey AND p_brand = 'Brand#24'
       AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
       AND l_quantity >= 26 AND l_quantity <= 26 + 10
       AND p_size BETWEEN 1 AND 15
       AND l_shipmode IN ('AIR', 'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')

Resulting scan with the common conjuncts pushed down:
00:SCAN HDFS [tpch_300_parquet.lineitem, RANDOM]
   partitions=1/1 files=259 size=63.71GB
   predicates: l_shipmode IN ('AIR', 'AIR REG'), l_shipinstruct = 'DELIVER IN PERSON'
   runtime filters: RF000 -> l_partkey
   table stats: 1799989091 rows total
   hosts=7 per-host-mem=528.00MB
   tuple-ids=0 row-size=80B cardinality=240533660
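The extraction itself can be sketched under the simplifying assumption that predicates are opaque strings (the function name `CommonConjuncts` is made up for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// A disjunction represented as one conjunct set per OR branch:
// (a AND b) OR (a AND c) is {{"a","b"}, {"a","c"}}.
// The conjuncts present in every branch can be hoisted out of the OR
// and pushed down to the scan.
std::set<std::string> CommonConjuncts(const std::vector<std::set<std::string>>& branches) {
  if (branches.empty()) return {};
  std::set<std::string> common = branches[0];
  for (size_t i = 1; i < branches.size(); ++i) {
    std::set<std::string> next;
    std::set_intersection(common.begin(), common.end(),
                          branches[i].begin(), branches[i].end(),
                          std::inserter(next, next.begin()));
    common = std::move(next);
  }
  return common;
}
```

For the TPC-H Q19 query above, every branch shares the `p_partkey = l_partkey`, `l_shipmode IN (...)` and `l_shipinstruct = ...` conjuncts, which is exactly what ends up in the lineitem scan's predicates.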
  • 22. Query execution

• Not based on Map-Reduce
• Single-threaded, row-based, Volcano-style (iterator-based with batches) query execution engine
• Multi-threaded scans
• Utilizes HDFS short-circuit local reads & a multi-threaded I/O subsystem
• Supports multiple file formats (Parquet, Avro, Text, Sequence, …)
• Supports nested types (only in Parquet)
• Code generation using LLVM
• No transactions
• No indices
  • 23. Partitioning

• Impala does not have native indexes (e.g. B+-tree), but it does allow a type of indexing by partitions
• What it is: physically dividing your data so that queries only need to access a subset
• Partitioning schema is expressed through DDL
• Pruning is applied automatically when queries contain matching predicates

CREATE TABLE Sales (...) PARTITIONED BY (year INT, month INT);
or
CREATE TABLE Sales (...) PARTITIONED BY (date_key INT);

SELECT … FROM Sales
WHERE year >= 2012 AND month IN (1, 2, 3)

SELECT … FROM Sales JOIN DateDim d USING (date_key)
WHERE d.year >= 2012 AND d.month IN (1, 2, 3)
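Pruning can be sketched as a pure metadata filter (a toy model; Impala does this in the planner against the metastore's partition list, and the names here are illustrative):

```cpp
#include <cassert>
#include <vector>

// One entry per partition directory, e.g. /year=2012/month=3/.
struct SalesPartition { int year; int month; };

// Evaluates WHERE year >= 2012 AND month IN (1, 2, 3) against partition
// keys only; files in non-matching directories are never opened.
std::vector<SalesPartition> PrunePartitions(const std::vector<SalesPartition>& parts) {
  std::vector<SalesPartition> kept;
  for (const auto& p : parts) {
    if (p.year >= 2012 && p.month >= 1 && p.month <= 3) kept.push_back(p);
  }
  return kept;
}
```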
  • 24. Optimizations for selective scans: sorting

● Sorting data files improves the effectiveness of file statistics (min/max) and compression (e.g., delta encoding).
● Predicates are evaluated against min/max statistics as well as dictionary-encoded columns; this approximates lazy materialization.
● Sorting can be used on columns which have too many values to qualify for partitioning.
● Create sorted data files by adding the SORT BY clause during table creation.
● The Parquet community is working on extending the format for efficient point lookups.

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT)
SORT BY (day, hour)
STORED AS PARQUET;
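The effect of SORT BY on min/max statistics can be sketched as row-group skipping (`RowGroupStats` and `GroupsToRead` are illustrative names, not Impala's types):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per-row-group footer statistics, as Parquet keeps for each column chunk.
struct RowGroupStats { int64_t min_day; int64_t max_day; };

// With SORT BY (day, ...), each row group covers a narrow day range, so an
// equality predicate can skip most groups without decoding any data pages.
// Returns how many groups must actually be read.
int GroupsToRead(const std::vector<RowGroupStats>& groups, int64_t day) {
  int to_read = 0;
  for (const auto& g : groups) {
    if (day >= g.min_day && day <= g.max_day) ++to_read;  // may contain matches
  }
  return to_read;
}
```

On unsorted data the day ranges of all row groups overlap and nothing is skipped; on sorted data the ranges are nearly disjoint and only a few groups survive.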
  • 25. Optimizations for selective scans: augment partitioning

Business question: find the top 10 customers in terms of revenue who made purchases on Christmas eve in a given time window.

SORT BY helps meet query SLAs without over-partitioning the table.

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT)
SORT BY (day, hour)
STORED AS PARQUET;

SELECT sum(ss_ext_sales_price) AS revenue,
       count(ss_quantity) AS quantity_count,
       customer_name
FROM Sales
WHERE year = 2017 AND month = 12 AND day = 24 AND hour BETWEEN 1 AND 4
GROUP BY customer_name
ORDER BY revenue DESC
LIMIT 10;

Metric                 Partition + Sorting Speedup
Elapsed time           3x
CPU time (seconds)     5x
HDFS MBytes read       17x
  • 26. Optimizations for selective scans: complement partitioning

Business question: find interactions for a specific customer in a given time window.

SORT BY helps meet query SLAs without over-partitioning the table.

CREATE TABLE Sales (...)
PARTITIONED BY (year INT, month INT)
SORT BY (customer_id)
STORED AS PARQUET;

SELECT * FROM Sales
WHERE year = 2016 AND customer_id = 4976004;

Metric                 Partition + Sorting Speedup
Elapsed time           18x
HDFS MBytes read       22x
  • 27. LLVM Codegen in Impala

• Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance.
• In particular, code generation is applied to “inner loop” functions.
• Code generation (codegen) lets Impala use query-specific information to do less work:
  • Remove conditionals
  • Propagate constant offsets, pointers, etc.
  • Inline virtual function calls
  • 28. LLVM Codegen in Impala

Operations:
• Hash join
• Aggregation
• Scans: Parquet, Text, Sequence, Avro
• Expressions in all operators
• Union
• Sort
• Top-N
• Runtime filters

Data types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL

Further optimizations:
• Don’t codegen short-running queries (relies on cardinality estimates)
• Limit the amount of inlining for long and complex expressions (e.g. GROUP BY over 1K columns)
  • 29. Codegen for Order by & Top-N

select l_extendedprice, l_orderkey from lineitem order by l_orderkey limit 100

l_orderkey | l_extendedprice | l_shipdate | l_shipmode
26039617   | 13515           | 12/8/1997  | FOB
30525093   | 16218           | 12/16/1997 | REG AIR
28809990   | 7208            | 10/19/1997 | AIR
  • 30. Codegen for Order by & Top-N

select l_extendedprice, l_orderkey from lineitem order by l_orderkey limit 100

int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value, sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;  // Otherwise, try the next Expr
  }
  return 0;  // fully equivalent key
}
  • 31. Codegen for Order by & Top-N

Compare() calls ExprContext::GetValue(), which switches on the expression type for every row:

void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: { .. .. }
    case TYPE_TINYINT: { .. .. }
    case TYPE_INT: { .. .
  • 33. Codegen for Order by & Top-N

Compare() also calls RawValue::Compare(), another per-row type switch:

int RawValue::Compare(const void* v1, const void* v2, const ColumnType& type) {
  switch (type.type) {
    case TYPE_INT:
      i1 = *reinterpret_cast<const int32_t*>(v1);
      i2 = *reinterpret_cast<const int32_t*>(v2);
      return i1 > i2 ? 1 : (i1 < i2 ? -1 : 0);
    case TYPE_BIGINT:
      b1 = *reinterpret_cast<const int64_t*>(v1);
      b2 = *reinterpret_cast<const int64_t*>(v2);
      return b1 > b2 ? 1 : (b1 < b2 ? -1 : 0);
  • 34. Codegen for Order by & Top-N

Original code:
int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value, sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;  // Otherwise, try the next Expr
  }
  return 0;  // fully equivalent key
}

Codegen code:
int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs);  // loop unrolled for the single sort key
  int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  return 0;  // fully equivalent key
}
  • 37. Codegen for Order by & Top-N

Codegen code (vs. the original Compare() above):
int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs);
  int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  return 0;  // fully equivalent key
}

• No switching on input type(s)
• Perfectly unrolls the “for each grouping column” loop
• Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST

10x more efficient code
  • 38. Distributed Aggregation in MPP

select cust_id, sum(dollars) from sales group by cust_id;

• Aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
  • E.g. many sales per customer.

[Plan diagram: on every node, Scan -> Preagg -> network exchange -> Merge]
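The two phases can be sketched with plain hash maps (a toy model of the operators, not Impala's row batches; all names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using PartialAgg = std::unordered_map<int, int64_t>;  // cust_id -> sum(dollars)

// Phase 1: runs on each node over its local scan output; collapses the many
// rows per customer before anything crosses the network.
PartialAgg PreAggregate(const std::vector<std::pair<int, int64_t>>& rows) {
  PartialAgg out;
  for (const auto& r : rows) out[r.first] += r.second;
  return out;
}

// Phase 2: after the exchange hash-partitions partials by cust_id, each node
// merges the partials it received into final sums.
PartialAgg MergeAggregate(const std::vector<PartialAgg>& partials) {
  PartialAgg out;
  for (const auto& p : partials)
    for (const auto& kv : p) out[kv.first] += kv.second;
  return out;
}
```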
  • 39. Downside of Pre-aggregations

select distinct * from sales;

• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct over nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad - better to send rows to the merge agg than to disk
  • 40. Streaming Pre-aggregations in Impala

select distinct * from sales;

• The reduction factor is dynamically estimated based on the actual data processed
• The pre-aggregation expands memory usage only if the reduction factor is good
• Benefits:
  • Certain aggregations with a low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk
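The pass-through behaviour can be sketched as follows (`StreamingPreAgg` is an illustrative toy under the assumption of a fixed group-count budget; the real operator works on row batches and codegens the hot path): once the group table hits its budget, rows with new keys are streamed to the merge phase instead of growing or spilling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct StreamingPreAgg {
  std::unordered_map<int, int64_t> groups;               // local partial sums
  size_t max_groups;                                     // memory budget as a group count
  std::vector<std::pair<int, int64_t>> passed_through;   // stand-in for the exchange

  explicit StreamingPreAgg(size_t cap) : max_groups(cap) {}

  void Consume(int key, int64_t val) {
    auto it = groups.find(key);
    if (it != groups.end()) { it->second += val; return; }   // still reducing
    if (groups.size() < max_groups) { groups.emplace(key, val); return; }
    // Low reduction factor: don't grow (and never spill) - forward the row.
    passed_through.emplace_back(key, val);
  }
};

// Returns {distinct groups held locally, rows forwarded to the merge agg}.
std::pair<size_t, size_t> RunPreAgg(const std::vector<int>& keys, size_t cap) {
  StreamingPreAgg agg(cap);
  for (int k : keys) agg.Consume(k, 1);
  return {agg.groups.size(), agg.passed_through.size()};
}
```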
  • 41. Optimizations for selective joins: runtime filters

• When are runtime filters useful?
  • To optimize selective equi-joins against large partitioned or unpartitioned tables
• General idea:
  • Avoid unnecessary I/O to read partition data, and avoid unnecessary network transmission by sending only the subset of rows that match the join keys across the network
  • Some predicates can only be computed at runtime
  • Use a Bloom filter, which uses a probability-based algorithm to store and query values for the join column(s)
  • 42. Runtime filters in action

SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
FROM store_sales, customer, customer_demographics
WHERE ss_customer_sk = c_customer_sk
  AND cd_demo_sk = c_current_cdemo_sk
  AND cd_gender = 'M'
  AND cd_purchase_estimate = 10000
  AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC

[Plan diagram: store_sales (43 billion rows, shuffled) and customer (3.8 million rows, shuffled) feed Join #1; customer_demo (2,400 rows, broadcast) feeds Join #2; the result feeds an Aggregate producing 49 million rows]
  • 43. Runtime filters in action

Join #1 & #2 are expensive joins, since the left side of each join has 43 billion rows.
  • 44. Runtime filters in action

Create a bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.
  • 45. Runtime filters in action

Customer rows reduced by 826x: 3.8 million down to 4,600 rows.
  • 46. Runtime filters in action

Create a bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.
  • 47. Runtime filters in action

877x reduction in rows fed to Join #1: 43 billion -> 49 million rows.
  • 48. Optimizations for selective joins: runtime filters

• Cache friendly: hash each key to a Bloom filter the size of a cache line or smaller
• Use the AVX2 instruction set for Insert and Find operations
• The planner uses a cost model to decide which joins benefit from runtime filters
• Bloom filters are sized based on the estimated number of distinct values on the build side (1MB-16MB by default)
• Ineffective runtime filters are disabled dynamically (to reduce CPU overhead) when:
  • the Bloom filter has a high false positive rate
  • filter selectivity is poor
• Track min and max values from the build side and push new implied predicates to the scan of the probe side (supported for Kudu in CDH 5.14, Parquet support coming soon)
• Runtime filters are codegened, avoiding virtual function calls and branching in the inner loop
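The cache-line-sized bucket idea can be sketched in scalar code. This is a toy, not Impala's implementation: the real filter sets one bit in each of eight 32-bit lanes using AVX2, while this illustrative `BlockedBloomFilter` sets four bits inside one 64-byte block, so any `Find` touches at most one cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class BlockedBloomFilter {
 public:
  // Each block is 8 x 64-bit words = one 64-byte cache line.
  explicit BlockedBloomFilter(size_t num_blocks) : blocks_(num_blocks * 8, 0) {}

  void Insert(uint64_t hash) {
    size_t block = (hash >> 32) % (blocks_.size() / 8);  // pick one cache line
    for (int i = 0; i < 4; ++i)                          // set 4 bits within it
      blocks_[block * 8 + ((hash >> (i * 8)) & 7)] |= 1ull << ((hash >> (i * 8 + 3)) & 63);
  }

  bool Find(uint64_t hash) const {
    size_t block = (hash >> 32) % (blocks_.size() / 8);
    for (int i = 0; i < 4; ++i)
      if (!(blocks_[block * 8 + ((hash >> (i * 8)) & 7)] & (1ull << ((hash >> (i * 8 + 3)) & 63))))
        return false;  // definitely not on the build side
    return true;       // maybe present (false positives possible, never negatives)
  }

 private:
  std::vector<uint64_t> blocks_;
};

// Helper for a quick round-trip check.
bool RoundTrip(uint64_t hash) {
  BlockedBloomFilter f(4);
  f.Insert(hash);
  return f.Find(hash);
}
```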
  • 50. Joins and aggregations under the hood

Most of the time is spent in L1 & L2 cache misses!
  • 51. Joins and aggregations under the hood

Old hash table probe pseudo code:

int hash_value, probe_value, partition_id;
for (int i = 0; i < probe_batch_size; i++) {
  // Read probe key from input batch
  probe_value = probe_batch_->GetRow(i);
  // Compute hash value
  hash_value = Hash(probe_value);
  // Find the matching partition in the hash table
  partition_id = hash_value >> (32 - NUM_PARTITIONING_BITS);
  // Probe the hash table
  iterator = hash_tbls_[partition_id]->FindProbeRow(ht_ctx, hash_value, probe_value);
}
  • 52. Joins and aggregations under the hood

(Same probe loop as above; highlighted step: compute the hash value.)
  • 53. Joins and aggregations under the hood

(Same probe loop as above; highlighted step: find the partition ID.)
  • 54. Joins and aggregations under the hood

(Same probe loop as above; highlighted step: probe the hash table.)
  • 55. Joins and aggregations under the hood

Probing the hash table is a CACHE-MISS: the data dependency between hashing a key and loading its bucket results in poor pipelining.
  • 57. Joins and aggregations in Impala: current state

Current hash table probe pseudo code:

int hash_values[BATCH_SIZE], probe_values[BATCH_SIZE], partition_ids[BATCH_SIZE];
for (int i = 0; i < probe_batch_size; i++) {
  // Read probe key from input batch
  probe_values[i] = probe_batch_->GetRow(i);
  // Compute hash value
  hash_values[i] = Hash(probe_values[i]);
  // Find the matching partition in the hash table
  partition_ids[i] = hash_values[i] >> (32 - NUM_PARTITIONING_BITS);
  // Prefetch the hash bucket into the LLC
  hash_tbls_[partition_ids[i]]->Prefetch(hash_values[i]);
}
for (int i = 0; i < probe_batch_size; i++) {
  // Probe the hash table
  iterator = hash_tbls_[partition_ids[i]]->FindProbeRow(ht_ctx, hash_values[i], probe_values[i]);
}
  • 58. Joins and aggregations in Impala: current state

Break the loop into two parts to reduce the data dependency. The batch size is 1024 rows, giving enough time to ensure prefetched data is in the LLC.
  • 59. Joins and aggregations in Impala: current state

With the prefetch issued a full pass ahead, the probe in the second loop is a CACHE-HIT.
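The split-loop probe can be made concrete with a small runnable sketch. This is a toy: the "hash table" is just a vector of values, the multiplicative hash constant is arbitrary, and `__builtin_prefetch` is the GCC/Clang intrinsic standing in for Impala's Prefetch call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1 hashes every probe key in the batch and issues a prefetch for its
// bucket; pass 2 does the dependent loads. By the time pass 2 touches a
// bucket, the prefetch issued a full pass earlier has pulled it into cache.
int64_t ProbeBatch(const std::vector<int64_t>& table,        // stand-in hash table
                   const std::vector<uint64_t>& probe_keys) {
  std::vector<size_t> bucket(probe_keys.size());
  for (size_t i = 0; i < probe_keys.size(); ++i) {
    bucket[i] = (probe_keys[i] * 0x9e3779b97f4a7c15ull) % table.size();
    __builtin_prefetch(&table[bucket[i]]);  // no dependent load yet
  }
  int64_t matches = 0;
  for (size_t i = 0; i < probe_keys.size(); ++i) {
    if (table[bucket[i]] != 0) ++matches;   // the actual probe
  }
  return matches;
}
```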
  • 61. Joins and aggregations in Impala: current state

Lessons learnt
● Don’t underestimate the impact of cache misses
● Prefetching is useful when done before the data needs to be read from memory
  ○ 30-40% speedup for join and aggregation operations
● Removing data dependencies opens opportunities for improvement

Caveats
● TLB misses are still an issue
● Prefetching is only as good as the hash function used; it won’t handle chaining
  • 62. Impala Roadmap Focus

• Performance & scalability
  • Parquet scanner performance
  • Metadata
  • RPC layer
• Reliability
  • Node decommission
  • Better resource management
• Cloud
  • 63. 64© Cloudera, Inc. All rights reserved. Thank you https://github.com/apache/impala http://impala.apache.org/ Impala: A Modern, Open-Source SQL Engine for Hadoop