Are you using the fastest query tool for Hadoop? This session provides and discusses the latest performance results of the industry-standard TPC-DS benchmarks, executed across an assortment of open source query tools such as Hive (on MR, Tez, LLAP, and Spark), Spark SQL, Presto, and Drill. The performance tests utilize a variety of data sizes, popular storage formats (ORC, Parquet, and Text), and compression codecs.
Background on Big Data at Comcast
1K+ users in enterprise data lake
Spectrum of use cases
Multi-tenant PaaS, SaaS, DaaS
24x7 Environment
Speed & Stability are King & Queen!
Petabytes of enterprise data available via Hive tables
Focus & Outcomes
Focus on SQL query engines that can be connected to traditional BI & Reporting Tools
Independent performance results from our lab.
Which engine(s) are the fastest running the TPC-DS dataset?
Are there significant performance differences in how the data is stored and compressed?
How do these engines perform in a memory limited environment?
Which engine(s) would you offer to your executive staff?
Test Data & Queries
Started with the Hortonworks hive-testbench repository
https://github.com/hortonworks/hive-testbench
Generated base 1TB TPC-DS partitioned tables
Used base data to build 14 test databases
Table and partition stats were collected on all schemas
Test Methodology
Utilized all 66 TPC-DS Queries defined in Hive Benchmark
Same SQL executed in all engines*
Presto had issues with casting dates
Presto currently does not support “use db;”
Care was taken to tune engine & configurations
We expect everyone can find additional optimizations
Queries are run against one engine at a time
Allowing each engine to utilize all environment resources
Each query was run 3 consecutive times
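The methodology above — one engine at a time, each query executed three consecutive times, with the engine free to use all cluster resources — can be sketched as a small harness. This is a minimal illustration, not the actual test driver; the `execute` callback and query dictionary are stand-ins for the engine-specific client.

```python
import time

def run_benchmark(queries, execute, runs=3):
    """Run each query `runs` consecutive times against a single engine,
    recording wall-clock seconds for every attempt."""
    timings = {}
    for name, sql in queries.items():
        attempts = []
        for _ in range(runs):
            start = time.perf_counter()
            execute(sql)  # engine-specific client call (stand-in here)
            attempts.append(time.perf_counter() - start)
        timings[name] = attempts
    return timings

# Stubbed executor, just to illustrate the shape of the output:
results = run_benchmark({"query3": "SELECT 1"}, execute=lambda sql: None)
```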
Performance Measurement
Time is measured as wall-clock execution time on the server
Queries invoked via Beeline
Utilized the !sh command to write out query timings
Implemented a simple SQL client for Presto
Failed queries were assigned a penalty time of 10 minutes
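The 10-minute penalty rule can be made concrete with a small helper. This is a sketch only; representing a failed query as `None` is an assumption of this illustration, not how the original harness recorded failures.

```python
PENALTY_SECONDS = 600.0  # 10-minute penalty assigned to each failed query

def cumulative_time(query_times):
    """Sum per-query wall-clock times, substituting the penalty for any
    failed query (represented here as None)."""
    return sum(PENALTY_SECONDS if t is None else t for t in query_times)

# Two successful queries plus one failure:
total = cumulative_time([12.5, None, 30.0])  # 12.5 + 600.0 + 30.0 = 642.5
```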
Results – Presto
PRESTO CUMULATIVE QUERY TIMES 1TB TPC-DS (IN SECONDS)
Smaller is better

ORC Zlib           6216.20
ORC Snappy         6267.88
ORC None           6480.80
Parquet Gzip       6528.14
Parquet Snappy     6529.59
Parquet None       6538.24
Text Snappy        8292.60
Seq Snappy         8695.51
Text Gzip          8735.74
Text None          8853.33
Seq Gzip           8978.22
Seq None          11150.02
Seq Bzip          22124.66
Text Bzip         23681.03

Chart annotations, relative to ORC Zlib (the fastest): 0.8 %, 5 %, 33.4 %, 281.0 % slower
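The chart percentages can be reproduced from the cumulative times: each annotated combination's slowdown relative to ORC Zlib, the fastest. A quick arithmetic check, not part of the original test harness:

```python
times = {
    "ORC Zlib": 6216.20,     # fastest combination in the Presto run
    "ORC Snappy": 6267.88,
    "Parquet Gzip": 6528.14,
    "Text Snappy": 8292.60,
    "Text Bzip": 23681.03,
}
fastest = min(times.values())
# Percent slower than the fastest combination:
slowdown = {fmt: round((t / fastest - 1) * 100, 1) for fmt, t in times.items()}
# ORC Snappy 0.8 %, Parquet Gzip 5.0 %, Text Snappy 33.4 %, Text Bzip 281.0 %
```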
Results – Spark SQL with Spark Thrift Server (STS)
The Spark Thrift Server proved to be inconsistent in our test environment
Required monitoring and restarts to address long garbage collection pauses
Achieving repeatable results through the STS proved to be very problematic
We were forced to scratch the results
This is a relatively new technology that is under construction – stay tuned
Observations
Performance tuning engines is still more art than science
High level of confidence that more performance gains are out there
Performance gains are still achievable for memory-intensive engines running in lower-memory environments
LLAP and Presto are solid engines, with zero issues encountered over the several months of testing
All the engines played well together in our test environment
Losers
Map Reduce SQL workloads
BZip compressed TEXT and SEQUENCE files
Spark Thrift Server (temporary)
Winners
Solution for the C-Suite - LLAP
Runner up is Presto
Columnar Storage – ORC Zlib
Runner up is Parquet (no clear compression winner)
Map Reduce would have turned in a faster performance number if it had failed all the tests
For the non-columnar formats, Gzip wins on compression
For non-columnar formats, Snappy is the winner on query speed
Total caching size is 420 GB
Improvement of almost 1600 seconds on the high end and 600 seconds on the low end
That is an improvement of 10–26 minutes over the entire run, or 9–24 seconds on average per query
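The per-query averages follow directly from the run-level savings and the 66-query suite. An arithmetic check only:

```python
NUM_QUERIES = 66  # TPC-DS queries in the Hive benchmark suite

low_saving, high_saving = 600, 1600   # seconds saved over the entire run
per_query = (low_saving / NUM_QUERIES, high_saving / NUM_QUERIES)
minutes = (low_saving / 60, high_saving / 60)
# per_query is roughly (9.1, 24.2) s/query; minutes is (10.0, ~26.7) per run
```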