Achieving 100k Queries per Hour on Hive on Tez

September 6, 2016

About Yahoo! JAPAN
The Largest Portal Site in Japan
65 billion pageviews / month
2.1 billion pageviews / day

YDN Report
What is YDN Report?
• Reporting for the Yahoo! Display Ad Network (YDN)
Batch Reporting over Massive Dataset
• 13 months, 800B+ rows of data
• Adding 3.3B+ rows of data per day
Highly Parallel Workload
• 100K reports per hour
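
For a sense of scale, the workload figures above can be converted to per-second rates. This is only back-of-the-envelope arithmetic on the numbers from the slide; the function and variable names are illustrative and do not come from the talk.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count, period_seconds):
    """Average rate per second for `count` events spread over a period."""
    return count / period_seconds


# Figures taken from the slide: 100K reports per hour, 3.3B+ new rows per day.
reports_per_second = per_second(100_000, SECONDS_PER_HOUR)
rows_per_second = per_second(3.3e9, SECONDS_PER_DAY)

print(f"{reports_per_second:.1f} reports/s")      # 27.8 reports/s
print(f"{rows_per_second:,.0f} rows/s ingested")  # 38,194 rows/s ingested
```

In other words, the target is roughly 28 completed reports every second, sustained.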

YDN Report Query
Typical Query
• Query is Relatively Simple
• Answer “How many clicks did I get last week?”
SELECT account, yyyymmdd,
sum(total_imps),
sum(total_click),
...
FROM table_x
WHERE yyyymmdd >= xxx
AND yyyymmdd < xxx
AND account = xxx
...
GROUP BY account, yyyymmdd, ...;

Hive Performance Recap
Hive is fast: interactive response
• ORC columnar file format
• Cost based optimizer (CBO)
• Vectorized SQL engine
• Tez execution engine (replacing MapReduce)
Hive 0.10 (batch processing) -> Hive 1.2 (human interactive, 5 seconds): 100-150x query speedup

Hive on Tez Query Execution
A query execution is essentially made up of
• Client execution [ 0s if done correctly ]
• Optimization [HiveServer2] [~ 0.1s]
• Metadata lookups [HCatalog, Metastore] [ very fast in Hive 0.14 ]
• Application Master creation [4-5s]
• Container Allocation [3-5s]
• Tez task execution on YARN
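
The phase timings above can be added up to see how much fixed latency a query pays before any Tez task runs, if a new Application Master and new containers are needed every time. The sketch below only sums the ranges from the slide; treating "very fast" metadata lookups as 0 s is an assumption, and the names are illustrative.

```python
# (low, high) latency in seconds for each fixed phase, taken from the slide.
# Assumption: "very fast" metadata lookups are counted as 0 s.
PHASES = {
    "client execution": (0.0, 0.0),
    "optimization": (0.1, 0.1),
    "metadata lookups": (0.0, 0.0),
    "application master creation": (4.0, 5.0),
    "container allocation": (3.0, 5.0),
}


def fixed_overhead(phases):
    """Return the (low, high) total of all phase latencies in seconds."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


low, high = fixed_overhead(PHASES)
print(f"fixed overhead before any task runs: {low:.1f}-{high:.1f} s")
```

Under these assumptions a cold start costs about 7 to 10 seconds, almost all of it in Application Master creation and container allocation, so a 5-second interactive response is only possible when those two phases are avoided or amortized across queries.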
Diagram: a client running the testing tool opens N connections to each of two HiveServer2 instances (Server #1 and Server #2), which use the Metastore and Metastore DB and run a Tez AM and Tez containers on YARN and HDFS.

Mini Test
Mini Setup Tested
• 50 nodes
• 450B rows dataset
• Achieved 15K queries per hour
So, can we get 100K qph on 700 nodes?
We thought it should be easy, but…
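
The expectation behind "it should be easy" can be written down as a naive linear projection from the mini test. This is a sketch that assumes throughput scales in proportion to node count, which is exactly the assumption that did not hold; the function name is illustrative.

```python
def linear_projection(measured_qph, measured_nodes, target_nodes):
    """Project throughput to a larger cluster, assuming perfect linear scaling."""
    return measured_qph * target_nodes / measured_nodes


# Mini test from the slide: 15K queries per hour on 50 nodes.
projected = linear_projection(15_000, 50, 700)
per_node = 15_000 / 50

print(f"{per_node:.0f} qph per node in the mini test")           # 300
print(f"{projected:,.0f} qph projected on 700 nodes (linear)")   # 210,000
```

A linear projection gives about twice the 100K target, so the shortfall on the large cluster points at components that do not scale with the number of worker nodes.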

The Bottlenecks at Scale
Challenges at Scale
• Hive Metastore Server
• YARN Resource Manager
• Datanode Hotspot
• YARN ATS

Hive Metastore Server
Use Local Metastore
• Before: HS2 -> Metastore Server -> Metastore DB
• After: HS2 (local metastore) -> Metastore DB
• Throughput: 7.6K -> 22K qph

Pending Apps
YARN ResourceManager Scalability
• Too many pending apps
• Resolution: increase yarn.resourcemanager.amlauncher.thread-count
• Throughput: 22K -> 26K qph

Pending Containers
• Too many pending containers
• Resolution: increase tez.am.am-rm.heartbeat.interval-ms.max
• Throughput: 26K -> 72.5K qph

Datanode Hotspot
Last Hour Problem
• Connection timeout and disk access error
• Many queries access recently added data
• Resolution: increase the HDFS replication factor
• Throughput: 72.5K -> 95K qph
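
The throughput figures reported for each fix can be lined up to see which change contributed most. The sketch below uses only the qph numbers from the slides; the step labels are paraphrases and the helper name is illustrative.

```python
# (change applied, throughput in queries per hour after the change),
# starting from the 7.6K qph measured before any of the fixes.
STEPS = [
    ("baseline (remote metastore server)", 7_600),
    ("local metastore", 22_000),
    ("more AM launcher threads", 26_000),
    ("larger max AM-RM heartbeat interval", 72_500),
    ("higher HDFS replication factor", 95_000),
]


def step_gains(steps):
    """Return (change, multiplicative gain over the previous step) pairs."""
    return [
        (name, after / before)
        for (_, before), (name, after) in zip(steps, steps[1:])
    ]


for name, gain in step_gains(STEPS):
    print(f"{name}: x{gain:.2f}")

overall = STEPS[-1][1] / STEPS[0][1]
print(f"overall: x{overall:.1f}")
```

By these numbers the local metastore and the Tez heartbeat interval are the two largest single gains (each close to 3x), and the four fixes together raise throughput by about 12.5x.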

Other Tunings
Other Tunings We Did
• Container reuse timeout
• YARN capacity scheduler node locality delay
• Tez shuffle keep alive
• TCP fin_wait
Notes on YARN ATS
• Disabling YARN ATS gives higher throughput
• Trade-off: losing YARN log aggregation

End of first half

Real-life Hive LLAP at Yahoo! JAPAN
Yohei Abe @ Yahoo! JAPAN
Aug 2016

Agenda
• Hive LLAP at Yahoo! JAPAN
• Tuning
• Performance Result
• Future Work

Hive on Tez
Hive on Tez is able to produce 100K reports/hour

Hive on Tez+LLAP
How does Hive on Tez+LLAP handle 100K reports?
• How many servers?
• What tuning?

What is LLAP?
LLAP is for sub-second query processing
• Persistent daemons
• Caching data

What is LLAP?
• Tez: Tez AppMaster + Tez containers (created dynamically)
• Tez+LLAP: Tez AppMaster + LLAP daemons (persistent daemons)

LLAP test cluster
• Server node: Xeon E5-2660v2 2.2GHz / 2 CPUs / 128GB memory / 10GBase-T 2 ports
• Slave nodes: 45 nodes
• HiveServer2 nodes: 10 nodes
• Software: Hadoop 2.7.1, Hive 2.1.0-snapshot, Tez 0.8.3

Parameters
Some basic parameters need to be changed: performance is very slow with the default values.

Threading model
• hive.llap.daemon.num.executors: executor threads
• hive.llap.io.threadpool.size: I/O threads (data)

Executor thread pool
hive.llap.daemon.num.executors (default 4)
• The number of JVM threads for query execution
• Set this to the number of vCPUs
• 40 on our CPUs

Performance: executor thread
executor threads - QPS (higher is better)
• 4 (default): 5.49 QPS
• 40 (our CPU): 23.68 QPS
• 72: 23.42 QPS

I/O thread pool
hive.llap.io.threadpool.size (default 10)
• The number of I/O threads
• Set this to the number of vCPUs
• 40 in our case

Performance: I/O thread
I/O threads - QPS (higher is better)
• 10 (default): 12.82 QPS
• 40 (our CPU): 23.42 QPS

Memory
• hive.llap.daemon.memory.per.instance.mb -> java -Xmx … (executors, JVM on-heap)
• hive.llap.io.memory.size (I/O, JVM off-heap)

Performance: QPS
Chart: QPS (0 to 30) versus number of clients (32 to 384), comparing LLAP and Tez.

100K / hour ?
LLAP 45 nodes (test cluster)
max: 24 qps ≈ 87K queries/hour
70 nodes for 100K
(if it scales linearly)

Advanced Tuning
hive.llap.client.consistent.splits
• false (default): use file locality to select the LLAP daemon
• true: LLAP daemons are selected evenly (by hash distribution)

Recap: LLAP
Each node runs both an LLAP daemon and a DataNode

hive.llap.client.consistent.splits - QPS (higher is better)
• false (default, use file locality): 17.5 QPS
• true (no locality): 23.42 QPS
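
The three LLAP tuning measurements reported in this half of the talk can be compared as speedup factors. The sketch below uses only the QPS values from the slides; the slides do not say which other settings were in effect for each measurement, so the factors are shown side by side and should not be multiplied together. The helper name is illustrative.

```python
# parameter change -> (QPS before, QPS after), values as reported on the slides
MEASUREMENTS = {
    "hive.llap.daemon.num.executors: 4 -> 40": (5.49, 23.68),
    "hive.llap.io.threadpool.size: 10 -> 40": (12.82, 23.42),
    "hive.llap.client.consistent.splits: false -> true": (17.5, 23.42),
}


def speedup(before_qps, after_qps):
    """Multiplicative throughput gain from a single parameter change."""
    return after_qps / before_qps


SPEEDUPS = {name: speedup(*qps) for name, qps in MEASUREMENTS.items()}

for name, factor in SPEEDUPS.items():
    print(f"{name}: x{factor:.2f}")
```

On these figures the executor thread count matters most (a bit over 4x), followed by the I/O thread pool and then consistent splits.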

Web UI (HIVE-11526)
• LLAP daemon exposes basic metrics on port 15002 (default)
• Included in Hive 2.1
• Contributed by Yahoo! JAPAN

Web UI (HIVE-14030)
• HIVE-11526 covers only a single daemon
• HIVE-14030 provides an aggregated view of an LLAP cluster (not yet in master)
• Contributed by Yahoo! JAPAN

Hive Column-level ACL
GOAL: Column-level ACL
ANSWER(?): HiveServer2 can do it
Diagram: SQL access through HS2 and LLAP, on top of YARN and HDFS.
Direct Access to HDFS Breaks Everything
Diagram: SQL goes through HS2 and LLAP on YARN and HDFS; M/R, Pig, and Spark access HDFS under Storage Based Authorization and break SQL Standard Based ACLs.
• Direct access to HDFS (not from Hive) breaks the security model.
• Other solutions (not only Hive) are necessary.

Future Directions
Diagram: M/R, Pig, and Spark go through LlapInputFormat, which checks SQL Based ACLs; HS2 and LLAP run on YARN and HDFS.
• LlapInputFormat checks ACLs against HS2 on behalf of other applications.
• See HIVE-13441, HIVE-12991, and LlapDump.java

Summary
• Throughput is greatly improved by LLAP
• Some tuning is necessary
• LLAP is also effective for batch processing
