Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions
1. © Hortonworks Inc. 2013
Hortonworks
Stinger, Tez
Leveraging Hive & Yarn for High-Performance/Interactive Querying & Analytical Extensions
3. © Hortonworks Inc. 2013
What Are the Stinger and Tez Initiatives?
• Collection of development threads in the Hive community for
–Improved SQL Interface
–Updated Query Engine
–Optimized File Format
–Always on Services
4. © Hortonworks Inc. 2013
Stinger Initiative: 2-Pronged Approach
Improve Latency and Throughput
Tez
• New primitives move beyond map-reduce and beyond batch
• Avoid unnecessary persistence of temporary data
• Hive, Pig and others generate Tez plans for high performance
Query Engine Improvements
• Cost-based optimizer
• In-memory joins
State-of-the-art Column Store
• "Optimized RCFile", or ORCFile
• Minimizes disk IO and deserialization
Tez Service
• Always-on service providing query interactivity
Extend Deep Analytical Ability
Analytics Functions
• SQL:2003 compliant
• OVER with PARTITION BY and ORDER BY
• Wide variety of windowing functions:
–RANK
–LEAD/LAG
–ROW_NUMBER
–FIRST_VALUE
–LAST_VALUE
–Many more
• Aligns well with BI ecosystem
Improved SQL Coverage
• Subqueries within IN / HAVING
• Expanded SQL types including DATETIME, VARCHAR, etc.
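As a rough illustration of what the windowing functions listed on this slide compute, here is a minimal Python sketch of ROW_NUMBER, RANK and LAG evaluated per partition with a descending sort order. It is not Hive code; the function name `window` and the sample rows are hypothetical and exist only for this example.

```python
from itertools import groupby
from operator import itemgetter

def window(rows, partition_key, order_key):
    """Annotate rows with ROW_NUMBER, RANK and LAG(order_key) per partition."""
    out = []
    # PARTITION BY: group rows that share the partition key.
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        # ORDER BY ... DESC inside each partition (stable for ties).
        part = sorted(group, key=itemgetter(order_key), reverse=True)
        prev, rank = None, 0
        for i, row in enumerate(part, start=1):
            # RANK: tied rows share a rank; the next distinct value skips ahead.
            if prev is None or row[order_key] != prev[order_key]:
                rank = i
            out.append({**row, "row_number": i, "rank": rank,
                        "lag": prev[order_key] if prev else None})
            prev = row
    return out

# Hypothetical sample data.
rows = [
    {"state": "CA", "rep": "a", "sales": 300},
    {"state": "CA", "rep": "b", "sales": 300},
    {"state": "CA", "rep": "c", "sales": 100},
    {"state": "NY", "rep": "d", "sales": 200},
]
result = window(rows, "state", "sales")
```

The two tied CA rows both receive rank 1 and the third row receives rank 3, which is the behaviour that separates RANK from ROW_NUMBER.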
Making Hive Best for Interactive Query
6. © Hortonworks Inc. 2013
Where We Are
• Key features in Hive 0.11
–ORC File
–Improved Data Types
–Analytic Functions
– RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE and more
– Aggregate OVER functions with PARTITION BY and ORDER BY
–Joins improved in Hive 0.11
– Broadcast join and the SMB join work without user hints
• Tez Alpha Released
7. © Hortonworks Inc. 2013
Stinger: Enhance Hive for BI Use Cases
[Figure: BI use cases (Enterprise Reports, Dashboard / Scorecard, Parameterized Reports, Visualization, Data Mining) arranged between Interactive and Batch; goal: More SQL & Better Performance]
8. © Hortonworks Inc. 2013
Hive Performance: Intelligent Optimizer
• For joins where one side fits in memory:
–In-memory hash join: Hive reads the small table into a hash table and makes it available to all participating nodes via the distributed cache.
–Each node then scans through the big file to produce the output.
• Users often don’t know how to provide Hive hints
–They end up with a long pipeline of MapReduce jobs.
–The need for many hints has been removed.
• Star-schema joins
–Dimension Tables loaded to memory/distributed via distributed cache.
–Scatter-Gather without distributed joins (resolved locally).
• Improvements
–Lower the footprint of the fact tables in memory.
–Enable the optimizer to automatically pick map joins.
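The in-memory hash join described on this slide can be sketched in a few lines of Python. This is a single-process illustration of the build and probe phases only, not Hive's implementation; the function `map_join` and the sample tables are hypothetical, and the distributed cache is not modelled.

```python
from collections import defaultdict

def map_join(small, big, key):
    """Join two lists of dict rows on `key` using an in-memory hash table."""
    # Build phase: load the small (dimension) table into a hash table.
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    # Probe phase: stream the big (fact) table once; no shuffle or reduce step.
    for row in big:
        for match in table.get(row[key], ()):
            yield {**match, **row}

# Hypothetical sample tables.
dim = [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}]
fact = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 5},
    {"id": 1, "amount": 7},
    {"id": 3, "amount": 1},  # no matching dimension row, dropped by the inner join
]
joined = list(map_join(dim, fact, "id"))
```

Because only the small table is held in memory, the large table is read exactly once and never has to be redistributed by key.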
9. © Hortonworks Inc. 2013
Some New Benchmarking Results . . .
Incremental changes adding up to BIG improvements:
• JIRA: HIVE-3784 – Remove need to explicitly provide “hint” to optimizer.
• JIRA: HIVE-3952 – MapJoins for multiple small tables joining large table.
• JIRA: HIVE-2340 – Collapse Order By/Group By into single task . . .
In this case, six MapReduce jobs are reduced to one.
10. © Hortonworks Inc. 2013
ORCFile - Optimized Column Storage
• JIRA: HIVE-3874 – Make a better columnar storage file
–Evolve based on Google Dremel format
• Decompose complex row types into primitive fields
–Better compression and projection
• Only read bytes from HDFS for the required columns.
• Store column level aggregates in the files
–Only need to read the file meta information for common queries
–Stored both for file and each section of a file
–Aggregates: min, max, sum, average, count
–Allows fast access by sorted columns
• Ability to add bloom filters for columns
–Enables quick checks for whether a value is present
–Accelerates searches on alternate keys
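To show why column-level aggregates stored per file section help, here is a minimal Python sketch under simplifying assumptions: each section keeps min, max, count and sum for one column, and a lookup skips any section whose min/max range cannot contain the target. The `Section` class and `count_equal` function are hypothetical and do not reflect the actual ORC file layout or reader API.

```python
from dataclasses import dataclass

@dataclass
class Section:
    """One section of a file plus the column aggregates stored alongside it."""
    values: list

    def __post_init__(self):
        # Aggregates are computed once at write time and kept as metadata.
        self.min, self.max = min(self.values), max(self.values)
        self.count, self.sum = len(self.values), sum(self.values)

def count_equal(sections, target):
    """Count rows equal to target, reading only sections that may contain it."""
    hits, sections_read = 0, 0
    for s in sections:
        if target < s.min or target > s.max:
            continue  # metadata alone rules this section out
        sections_read += 1
        hits += s.values.count(target)
    return hits, sections_read

# Hypothetical data, sorted on this column so the section ranges do not overlap.
sections = [Section([1, 2, 3]), Section([10, 11, 12]), Section([20, 21, 21])]
total_rows = sum(s.count for s in sections)  # answered from metadata only
hits, sections_read = count_equal(sections, 21)
```

With sorted data only one of the three sections is read, and a query such as a row count needs no column data at all.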
13. © Hortonworks Inc. 2013
Tez – Moving Hive Beyond MapReduce
• Low level data-processing execution engine
• Serves as the base for MapReduce, Hive, Pig, Cascading, etc.
• Enables efficient pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
–Performance-oriented jobs aren’t forced into interleaving model
• Does not write intermediate output to HDFS
–Much lighter disk and network usage
–Appropriate for shorter-running jobs, where performance is more important than being able to restart a failed job where it left off
• Built on YARN
14. © Hortonworks Inc. 2013
YARN – The Foundation for Tez
[Figure: YARN architecture diagram showing a Client, the Resource Manager, three Node Managers, a Container and an App Mstr (application master), with flows labeled Job Submission, MapReduce Status, Node Status and Resource Request]
Tez is a YARN application . . . Instances run on all nodes hosting data targeted for accelerated query processing
15. © Hortonworks Inc. 2013
Pig/Hive-MR versus Pig/Hive-Tez
[Figure: the query below executed as Pig/Hive - MR (I/O Synchronization Barrier) versus Pig/Hive - Tez (I/O Pipelining)]
SELECT a.state, COUNT(*)
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.state
16. © Hortonworks Inc. 2013
Result: Massive Performance Uplift
Step                 Existing Hive   Interactive Hive   Interactive Hive & Tez   Interactive Hive & Tez I/O
Parse Query          0.5s            0.5s               0.5s                     0.5s
Create Plan          0.5s            0.5s               0.5s                     0.5s
Launch Map-Reduce    35s             35s                -                        -
Submit to Service    -               -                  0.1s                     0.1s
Process Map-Reduce   102s            7s                 7s                       3.5s (No Disk I/O)
Total                138s            43s                8.1s                     4.6s
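The totals in the breakdown above can be checked, and the implied speedup over existing Hive computed, with a short Python sketch. The stage timings are copied from the slide; the dictionary layout and variable names are just a convenience for this example.

```python
# Per-stage timings in seconds, in slide order:
# parse query, create plan, launch or submit, process.
stages = {
    "Existing Hive": [0.5, 0.5, 35, 102],
    "Interactive Hive": [0.5, 0.5, 35, 7],
    "Interactive Hive & Tez": [0.5, 0.5, 0.1, 7],
    "Interactive Hive & Tez I/O": [0.5, 0.5, 0.1, 3.5],
}

# Sum each configuration's stages (rounded to avoid float noise).
totals = {name: round(sum(t), 1) for name, t in stages.items()}

# Speedup of each configuration relative to existing Hive.
baseline = totals["Existing Hive"]
speedup = {name: round(baseline / total, 1) for name, total in totals.items()}

for name in stages:
    print(f"{name}: {totals[name]}s total, {speedup[name]}x vs. existing Hive")
```

Replacing the 35s Map-Reduce launch with a 0.1s submission to the always-on service accounts for most of the drop from 43s to 8.1s.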
17. © Hortonworks Inc. 2013
FastQuery: Beyond Batch with YARN
Tez Generalizes Map-Reduce
Simplified execution plans process data more efficiently
Always-On Tez Service
Low latency processing for all Hadoop data processing
18. © Hortonworks Inc. 2013
Tez Service
• MR Query Startup Expensive
–Job-launch and task-launch latencies are fatal for short queries (on the order of 5s to 30s)
• Solution
–Tez Service
– Removes task-launch overhead
– Removes job-launch overhead
–Hive/Pig – Submit query-plan to Tez Service
–Native Hadoop service, not ad-hoc
• An architecture that can be extended to the next level of performance
–Potential for future memory-based performance optimizations based on staging/pre-loading designated tables, indexes, and aggregates . . .
Editor's Notes
• Enterprise Reports – Your cell phone bill is an example
• Dashboard – KPI tracking
• Parameterized Reports – What are the hot prospects in my region?
• Visualization – Visual exploration of data
• Data Mining – Large scale data processing and extraction usually fed to other tools
How?
• Improve Latency & Throughput
–Query engine improvements
–New “Optimized RCFile” column store
–Next-gen runtime (eliminates M/R latency)
• Extend Deep Analytical Ability
–Analytics functions
–Improved SQL coverage
–Continued focus on core Hive use cases
Add statistics on Compression . . .
For illustration, here’s a quick glance at benchmarking. This is, of course, a very active area of R&D for us. The point is that we are seeing 10x and upwards of performance uplift when all is said and done. This will only get better.