Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
1© Cloudera, Inc. All rights reserved.
Greg Rahn | @gregrahn | Director of Product Management
Apache Impala:
Recent advanc...
2© Cloudera, Inc. All rights reserved.
Agenda
• Improvements in recent Impala versions
• Latest performance comparisons
• ...
3© Cloudera, Inc. All rights reserved.
Meet Impala
4© Cloudera, Inc. All rights reserved.
Apache Impala: Open Source & Open Standard
1 Millions of downloads since GA in May ...
5© Cloudera, Inc. All rights reserved.
The Leader in Interactive SQL Analytics for Hadoop
Multi-User Performance
& Usabili...
6© Cloudera, Inc. All rights reserved.
Key Benefits
An analytic database designed for Hadoop
High-Performance BI and SQL A...
7© Cloudera, Inc. All rights reserved.
Advancements in recent Impala versions
8© Cloudera, Inc. All rights reserved.
• Enables faster and more
complex queries
• 5.4x speedup in standard
benchmarks ove...
9© Cloudera, Inc. All rights reserved.
Major themes across releases
Performance Real-time Update Support
Cloud-Native Capa...
10© Cloudera, Inc. All rights reserved.
Latest Impala performance comparisons
April 2017
vs. analytic database (Greenplum)...
11© Cloudera, Inc. All rights reserved.
Testing approach
• Start with TPC-DS 10TB scale factor
• Use unmodified TPC-DS v2....
12© Cloudera, Inc. All rights reserved.
Software:
• Impala 2.8 from CDH 5.10
• Greenplum Database 4.3.9.1
• Spark 2.1
• Pr...
13© Cloudera, Inc. All rights reserved.
• Impala outperforms on
both single and multi-
user tests at 10TB
• Impala lead ex...
14© Cloudera, Inc. All rights reserved.
• The analytical db cohort
(Impala / GP) leads SQL
on Hadoop
• Impala outperforms
...
15© Cloudera, Inc. All rights reserved.
• At 1TB the analytic db
cohort (Impala / GP)
expands lead with
concurrency
• Pres...
16© Cloudera, Inc. All rights reserved.
Impala leads analytic database performance
• Impala leads in performance against t...
17© Cloudera, Inc. All rights reserved.
Impala performance optimizations and tips
Data types, partitioning, and runtime fi...
18© Cloudera, Inc. All rights reserved.
Data type selection
Data type selection affects performance:
• Computation: numeri...
19© Cloudera, Inc. All rights reserved.
Data type selection
Picking the incorrect data type can result in:
• Increase in o...
20© Cloudera, Inc. All rights reserved.
Data type selection
100% 100% 100% 100%
107%
100%
106%
164%
100%
124%
104%
159%
14...
21© Cloudera, Inc. All rights reserved.
Partitioning design considerations
• Date based partitioning is the most common
• ...
22© Cloudera, Inc. All rights reserved.
CREATE TABLE sales (...)
PARTITIONED BY (INT year, INT month)
SELECT ... FROM sale...
23© Cloudera, Inc. All rights reserved.
Partition wisely
Too few partitions (granularity is too large):
• Data elimination...
24© Cloudera, Inc. All rights reserved.
Business question:
How much was sold in June?
CREATE TABLE store_sales (...)
PARTI...
25© Cloudera, Inc. All rights reserved.
Business question:
How much was sold in June?
CREATE TABLE store_sales (...)
PARTI...
26© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON...
27© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON...
28© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON...
29© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON...
30© Cloudera, Inc. All rights reserved.
SELECT
d_year,
sum(ss_ext_sales_price) sum_agg
FROM date_dim d
JOIN store_sales ON...
31© Cloudera, Inc. All rights reserved.
In summary
• Impala offers
• Flexibility
• Cost-effective scale
• Open architectur...
32© Cloudera, Inc. All rights reserved.
Thank you
Downloads:
https://cloudera.com/downloads
Interested in contributing?
ht...
Nächste SlideShare
Wird geladen in …5
×

New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database

2.238 Aufrufe

Veröffentlicht am

Recording Link: http://bit.ly/LSImpala
Author: Greg Rahn, Cloudera Director of Product Management

In this session, we'll review the recent set of benchmark tests the Apache Impala (incubating) performance team completed that compare Apache Impala to a traditional analytic database (Greenplum), as well as to other SQL-on-Hadoop engines (Hive LLAP, Spark SQL, and Presto). We'll go over the methodology and results, and we'll also discuss some of the performance features and best practices that make this performance possible in Impala. Lastly, we'll look at some recent advancements in in Impala over the past few releases.

Veröffentlicht in: Software
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier

New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Analytic Database

  1. 1. 1© Cloudera, Inc. All rights reserved. Greg Rahn | @gregrahn | Director of Product Management Apache Impala: Recent advancements, performance benchmarks, and pro-tips
  2. 2. 2© Cloudera, Inc. All rights reserved. Agenda • Improvements in recent Impala versions • Latest performance comparisons • Performance recommendations and considerations
  3. 3. 3© Cloudera, Inc. All rights reserved. Meet Impala
  4. 4. 4© Cloudera, Inc. All rights reserved. Apache Impala: Open Source & Open Standard 1 Millions of downloads since GA in May 2013 2 Majority adoption across Cloudera customers 3 Certification across key application partners: 4 De facto standard with multi-vendor support: and others
  5. 5. 5© Cloudera, Inc. All rights reserved. The Leader in Interactive SQL Analytics for Hadoop Multi-User Performance & Usability ✔ • 10x vs. alternatives with latest benchmarks • Cost-based optimization allows for more users and tools to run a broader range of queries Compatibility ✔ • Provides both ANSI SQL and vendor-specific extensions • Compatibility with the leading BI partners Impala delivers the best of both worlds Flexibility ✔ • Supports the common native Hadoop file formats • Parquet provides best-of-breed columnar performance across Hadoop frameworks Native Integration ✔ • Unified with Hadoop resource management, metadata, security, and management
  6. 6. 6© Cloudera, Inc. All rights reserved. Key Benefits An analytic database designed for Hadoop High-Performance BI and SQL Analytics Flexibility for Data and Use Case Variety Cost-effective Scale for Today and Tomorrow Go Beyond SQL with an Open Architecture
  7. 7. 7© Cloudera, Inc. All rights reserved. Advancements in recent Impala versions
  8. 8. 8© Cloudera, Inc. All rights reserved. • Enables faster and more complex queries • 5.4x speedup in standard benchmarks over the last 14 months • Track record of constantly improving performance Impala keeps getting faster 100% 100% 100% 214% 226% 544% 0% 100% 200% 300% 400% 500% 600% TPC-H Nested TPC-H TPC-DS Impala Relative Performance Impala 2.3 Impala 2.5 Impala 2.6 Impala 2.7 Impala 2.8
  9. 9. 9© Cloudera, Inc. All rights reserved. Major themes across releases Performance Real-time Update Support Cloud-Native Capabilities v2.8 •Intra-node parallelism + codegen for TIMESTAMP = 8x faster compute stats •Constant folding predicate expressions = 3x speedup •Optimizations for long IN list predicates = 7.5x speedup •Optimizations for metadata loading for tables w/ large # of blocks = 9x speedup v2.6 •3x faster performance on secure clusters (beneficial for large data extractions) v2.8 •GA of Impala/Kudu integration •“Fast analytics on fast data” •INSERT, UPDATE, DELETE v2.6 •Support reads/writes on S3 •Enable fast, flexible ETL and BI analytics across all data in any environment •Performance Metrics: Analytics and BI on Amazon S3 with Apache Impala
  10. 10. 10© Cloudera, Inc. All rights reserved. Latest Impala performance comparisons April 2017 vs. analytic database (Greenplum) & other SQL-on-Hadoop (Hive LLAP, Spark SQL, Presto)
  11. 11. 11© Cloudera, Inc. All rights reserved. Testing approach • Start with TPC-DS 10TB scale factor • Use unmodified TPC-DS v2.4 query templates • Casting syntax • CONCAT() vs. || • Aliases for in-line views or columns • Determine common set of queries • Eliminate unsupported language and excessively long executions • Single-user test • Multi-user tests • Leverage best practices for all systems
  12. 12. 12© Cloudera, Inc. All rights reserved. Software: • Impala 2.8 from CDH 5.10 • Greenplum Database 4.3.9.1 • Spark 2.1 • Presto 0.160 • Hive 2.1 with LLAP from HDP v2.5 Hardware: • 7 worker nodes each with • CPU: 2 x E5-2698 v4 @ 2.20GHz • Storage: 8 x 2TB HDDs • Memory: 256GB RAM Workload: • Data: TPC-DS 10TB & 1TB stored in • Parquet for Impala & Spark SQL • ORCFile for LLAP & Presto • Columnar storage for Greenplum • Partitioned fact tables Queries: • 77 of the 99 TPC-DS queries using only permitted minor query modifications* • 22 queries not used that required non- minor modifications (valid alternates not used either) Environment details
  13. 13. 13© Cloudera, Inc. All rights reserved. • Impala outperforms on both single and multi- user tests at 10TB • Impala lead expands with concurrency • Other SQL on Hadoop engines failed at 10TB Streams Impala QpH Greenplum QpH Impala Times Better 2 41 20 2.05x 4 75 20 3.75x 8 133 16 8.31x Impala outperforms analytical database Metric Impala Greenplum Impala Times Better Total seconds 11,898 21,093 1.77x Geometric mean 33 92 2.79x Single-user Test – 10TB Multi-user Test – 10TB
  14. 14. 14© Cloudera, Inc. All rights reserved. • The analytical db cohort (Impala / GP) leads SQL on Hadoop • Impala outperforms • Presto by 8.3x • Hive w/ LLAP by 4x • Spark SQL by 2.8x TPC-DS 1TB: Single-user Impala Greenplum Spark SQL Hive with LLAP Presto Total 100% 97% 283% 405% 826% Geomean 100% 250% 367% 450% 1317% 0% 200% 400% 600% 800% 1000% 1200% 1400% RelativePerformance TPC-DS 1TB Single-user (Lower is Better)
  15. 15. 15© Cloudera, Inc. All rights reserved. • At 1TB the analytic db cohort (Impala / GP) expands lead with concurrency • Presto failed to complete • With 16 streams Impala outperforms • Spark by ~22x • Hive by ~20x • Greenplum by ~3x TPC-DS 1TB: Multi-user 4 streams 8 streams 16 streams Impala 495 865 1,315 Greenplum 381 417 462 Spark SQL 76 70 61 Hive with LLAP 58 62 66 0 200 400 600 800 1000 1200 1400 QueriesperHour Multi-user TPC-DS 1TB Queries/Hour (Higher is Better) 20x
  16. 16. 16© Cloudera, Inc. All rights reserved. Impala leads analytic database performance • Impala leads in performance against the traditional analytic database, including over 8x better performance for high concurrency workloads • Even greater difference compared to other SQL-on-Hadoop engines with Impala nearly 22x faster for multi-user workloads • Other SQL-on-Hadoop engines also required a simplified, smaller scale benchmark (with Hive even still requiring modifications and Presto unable to complete multi-user tests) • Impala delivers analytic database performance as well as: • Flexibility • Cost-effective scale • Open architecture
  17. 17. 17© Cloudera, Inc. All rights reserved. Impala performance optimizations and tips Data types, partitioning, and runtime filters
  18. 18. 18© Cloudera, Inc. All rights reserved. Data type selection Data type selection affects performance: • Computation: numerical types allow direct computation, string types require conversion • On-disk storage size: numerical types are more compact • More compact types also require less network traffic General guidelines: • Choose numerical types over character types for numerical data • Use smallest data type that will accommodate the largest possible value
  19. 19. 19© Cloudera, Inc. All rights reserved. Data type selection Picking the incorrect data type can result in: • Increase in on-disk storage by 40% • 80% slower scans • 80% slower aggregations • 150% slower joins • Increase in runtime memory utilization Data set: L_ORDERKEY column from TPC-H 3TB Values domain: 1 - 18,000,000,000 Number of distinct values: 4,500,000,000
  20. 20. 20© Cloudera, Inc. All rights reserved. Data type selection 100% 100% 100% 100% 107% 100% 106% 164% 100% 124% 104% 159% 144% 178% 179% 242% 144% 180% 177% 252% Size Scan Group by Join Numeric Data Type Performance Bigint Double Decimal String Varchar
  21. 21. 21© Cloudera, Inc. All rights reserved. Partitioning design considerations • Date based partitioning is the most common • WHERE event_dt BETWEEN 20000101 AND 20000201 • WHERE event_year = 2000 AND event_month = 1 • Partition keys can be column groups / multi-level. Eg. • By date, by hour • By date, by region
  22. 22. 22© Cloudera, Inc. All rights reserved. CREATE TABLE sales (...) PARTITIONED BY (INT year, INT month) SELECT ... FROM sales WHERE year >= 2012 AND month IN (1, 2, 3) CREATE TABLE sales (...) PARTITIONED BY (INT date_key) SELECT ... FROM sales s JOIN date_dim d USING (date_key) WHERE d.year >= 2012 AND d.month IN (1, 2, 3) Partitioning examples
  23. 23. 23© Cloudera, Inc. All rights reserved. Partition wisely Too few partitions (granularity is too large): • Data elimination is not effective • Increases minimum unit of work Too many partitions (granularity is too small): • Many small data files hurt large queries; scans less efficient; limits parallelism • Large number of files can cause metadata bloat and create bottlenecks on HDFS NameNode, Hive Metastore, Impala catalog service General guidelines: • Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency • Keep number of partitions under 20K (not a hard limit, mileage will vary)
  24. 24. 24© Cloudera, Inc. All rights reserved. Business question: How much was sold in June? CREATE TABLE store_sales (...) PARTITIONED BY (INT ss_sold_date_sk); SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate
  25. 25. 25© Cloudera, Inc. All rights reserved. Business question: How much was sold in June? CREATE TABLE store_sales (...) PARTITIONED BY (INT ss_sold_date_sk); SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows AggregateThe query planner does not know what values for d_date_sk will be returned and what fact table partitions need to be scanned (or eliminated). But there’s clearly an opportunity to save some work - why bother sending 28 billion of those rows to the joins? Runtime filters construct the partition pruning predicate at runtime.
  26. 26. 26© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 1: Planner tells Join #1 to produce Bloom filter for qualifying distinct values of d_date_sk Bloom filter: compact, probabilistic representation of a data set Essentially a sophisticated bitmap
  27. 27. 27© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 2: Join reads all rows from build side (right input), and populates Bloom filter containing all distinct values of d_date_sk
  28. 28. 28© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 3: Query coordinator sends filter to store_sales scan before the scan starts.
  29. 29. 29© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters STORE_SALES 28 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate Step 4: Scan eliminates all partitions that don’t have a match in the Bloom filter. Only 150 out of the 1824 partitions are read from store_sales.
  30. 30. 30© Cloudera, Inc. All rights reserved. SELECT d_year, sum(ss_ext_sales_price) sum_agg FROM date_dim d JOIN store_sales ON (d_date_sk = ss_sold_date_sk) WHERE d_moy = 6 GROUP BY d_year; Dynamic partition pruning with runtime filters Step 5: Rows coming out of the scan is reduced from 28 Billion to 1.3 Billion. STORE_SALES 1.3 billion rows DATE_DIM 6,000 rows Broadcast Join #1 1.3 Billion rows Aggregate
  31. 31. 31© Cloudera, Inc. All rights reserved. In summary • Impala offers • Flexibility • Cost-effective scale • Open architecture • Leading performance • Recent improvements/enhancements: • Performance • Real-time and cloud capabilities • Be sure to check out the Impala Cookbook (SlideShare) for more performance protips
  32. 32. 32© Cloudera, Inc. All rights reserved. Thank you Downloads: https://cloudera.com/downloads Interested in contributing? http://impala.io

×