Benchmarking Hive at Yahoo Scale
Presented by Mithun Radhakrishnan | June 4, 2014
2014 Hadoop Summit, San Jose, California
About myself
 HCatalog Committer, Hive contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion
 Other odds and ends
› DistCp
 mithun@apache.org
About this talk
 Introduction to “Yahoo Scale”
 The use-case in Yahoo
 The Benchmark
 The Setup
 The Observations (and, possibly, lessons)
 Fisticuffs
The Y!Grid
 16 Hadoop Clusters in YGrid
› 32500 Nodes
› 750K jobs a day
 Hadoop 0.23.10.x, 2.4.x
 Large Datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance
 Pig 0.11
 Hive 0.10 / HCatalog 0.5
› => Hive 0.12
Data Processing Use cases
 Pig for Data Pipelines
› Imperative paradigm
› ~45% Hadoop Jobs on Production Clusters
• M/R + Oozie = 41%
 Hive for Ad hoc queries
› SQL
› Relatively smaller number of jobs
• *Major* Uptick
 Use HCatalog for Inter-op
Hive is Currently the Fastest Growing Product on the Grid
[Chart: all grid jobs (in millions) and Hive jobs (% of all jobs), by month, Mar-13 to May-14. Callout: 2.4 million Hive jobs.]
Business Intelligence Tools
 {Tableau, MicroStrategy, Excel, … }
 Challenges:
› Security
• ACLs, Authentication, Encryption over the wire, Full-disk Encryption
› Bandwidth
• Transporting results over ODBC
› Query Latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries
The Benchmark
 TPC-H
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen -s 1000 -S 3
• Parallelizable
 Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9
Relational Diagram
TPC-H tables (column prefix), cardinality in rows (SF = scale factor), and columns:
 PART (P_), SF*200,000
› PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT
 PARTSUPP (PS_), SF*800,000
› PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT
 SUPPLIER (S_), SF*10,000
› SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT
 LINEITEM (L_), SF*6,000,000
› ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT
 ORDERS (O_), SF*1,500,000
› ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY, SHIPPRIORITY, CLERK, COMMENT
 CUSTOMER (C_), SF*150,000
› CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT
 NATION (N_), 25
› NATIONKEY, NAME, REGIONKEY, COMMENT
 REGION (R_), 5
› REGIONKEY, NAME, COMMENT
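Since most cardinalities in the diagram scale linearly with the scale factor, the nominal row counts behind a dbgen run can be worked out directly. The following is a minimal Python sketch, assuming the value passed to dbgen with -s (1000 in this talk) is the SF in the diagram. The names ROWS_PER_SF, FIXED_ROWS and tpch_row_counts are illustrative and not part of dbgen, and the LINEITEM figure is the diagram's nominal value rather than an exact count.

```python
# Rows per unit of scale factor, as given in the relational diagram.
ROWS_PER_SF = {
    "PART": 200_000,
    "PARTSUPP": 800_000,
    "LINEITEM": 6_000_000,  # nominal figure from the diagram
    "ORDERS": 1_500_000,
    "CUSTOMER": 150_000,
    "SUPPLIER": 10_000,
}

# NATION and REGION have a fixed size, independent of the scale factor.
FIXED_ROWS = {"NATION": 25, "REGION": 5}


def tpch_row_counts(scale_factor):
    """Return the nominal row count of every TPC-H table for a scale factor."""
    counts = {name: rows * scale_factor for name, rows in ROWS_PER_SF.items()}
    counts.update(FIXED_ROWS)
    return counts


if __name__ == "__main__":
    # Scale factor 1000, matching "dbgen -s 1000" on the benchmark slide.
    for table, rows in tpch_row_counts(1000).items():
        print(f"{table:<9}{rows:>16,}")
```

At a scale factor of 1000, LINEITEM dominates with a nominal 6 billion rows, which is why most of the query time in the charts is spent scanning and joining that table.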
The Setup
› 350 Node cluster
• Xeon boxen: 2 sockets with E5530s => 16 CPUs
• 24GB memory
– NUMA enabled
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, Security turned off
• Tez 0.3.x
• 128MB HDFS block-size
› Downscale tests: 100 Node cluster
• hdfs-balancer.sh
The Prep
 Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
• Only for 1TB, 10TB cases
• Perils of dynamic partitioning
› ORC File:
• 64MB stripes, ZLIB Compression
Observations
[Chart: TPC-H 100GB. Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.10 (Text), Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
100 GB
› 18x speedup over Hive 0.10 (Textfile)
• 6-50x
› 11.8x speedup over Hive 0.10 (RCFile)
• 5-30x
› Average query time: 28 seconds
• Down from 530 seconds (Hive 0.10 Text)
› 85% queries completed in under a minute
[Chart: TPC-H 1TB. Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.10 RC File, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
1 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5x and 17x
› Average query time: 172 seconds
• Between 5 and 947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes
› 81% queries completed in under 4 minutes
[Chart: TPC-H 10TB. Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
10 TB
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 1.6x and 10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2129 seconds with Hive 0.10 RCFile
– (1712 seconds excluding outliers)
› 61% queries completed in under 5 minutes
› 71% queries completed in under 10 minutes
› Q6 still completes in 12 seconds!
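The slides do not say how the headline speedups were aggregated. As a rough cross-check, the Python sketch below computes the ratio of the quoted average query times at each scale. The dictionary and function names are illustrative, and the only inputs are the averages quoted on the 100 GB, 1 TB and 10 TB slides. The results need not match the headline figures: a ratio of averages is dominated by the slowest queries, whereas a mean of per-query speedups weights every query equally. Which one the talk used is an assumption the slides do not settle.

```python
# (before, after) average query times in seconds, copied from the slides.
AVERAGES = {
    "100GB, vs Hive 0.10 Text": (530, 28),
    "1TB, vs Hive 0.10 RCFile": (729, 172),
    "10TB, vs Hive 0.10 RCFile": (2129, 908),
    "10TB, excluding outliers": (1712, 426),
}


def ratio_of_averages(before, after):
    """Speedup implied by two average run times (old average / new average)."""
    return before / after


if __name__ == "__main__":
    for label, (before, after) in AVERAGES.items():
        print(f"{label}: {ratio_of_averages(before, after):.1f}x")
```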
Explaining the speed-ups
 Hadoop 2.x, et al.
 Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
• Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up
 Hive
› Statistics
› “Vector-ized” Execution
 ORC
› PPD (predicate push-down)
[Chart: Vectorization. Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.13 Tez ORC, Hive 0.13 Tez ORC Vec.]
ORC File Layout
 Data is composed of multiple streams per column
 Index allows for skipping rows (default to every 10,000 rows), keeping position in each stream, and min-max for each column
 Footer contains directory of stream locations, and the encoding for each column
 Integer columns are serialized using run-length encoding
 String columns are serialized using dictionary for column values, and the same run-length encoding
 Stripe footer is used to find the requested column’s data streams, and adjacent stream reads are merged
[Diagram: an ORC file as a sequence of 256MB stripes, each holding Index Data, Row Data and a Stripe Footer, followed by the File Footer and Postscript. Row data is laid out by column (Column 1 to Column 8), and each column is made up of streams (Stream 2.1 to Stream 2.4).]
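The two encodings named above can be illustrated with a toy example. The Python sketch below is a deliberate simplification and not ORC's actual on-disk format: it only shows why a string column with few distinct values shrinks once it is dictionary-encoded and the resulting integer ids are run-length encoded. The function names and sample data are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def dictionary_encode(strings):
    """Replace each string by its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(strings))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[s] for s in strings]


if __name__ == "__main__":
    # Hypothetical "state" column, like the addresses table in the talk.
    states = ["CA", "CA", "CA", "NY", "NY", "CA"]
    dictionary, ids = dictionary_encode(states)
    print(dictionary)              # distinct values, stored once
    print(run_length_encode(ids))  # integer ids, stored as runs
```

Sorted or clustered data produces longer runs, so the same column can encode very differently depending on row order.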
ORC Usage
CREATE TABLE addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
)
STORED AS orc
LOCATION '/path/to/addresses'
TBLPROPERTIES ("orc.compress" = "ZLIB");

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc;

SET hive.default.fileformat = orc;

-- ORC writers are allowed 50% of the JVM heap size by default
SET hive.exec.orc.memory.pool = 0.50;

-- Long form of STORED AS orc:
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
Key | Default | Comments
orc.compress | ZLIB | High-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size | 262,144 (256 KB) | Number of bytes in each compression chunk
orc.stripe.size | 67,108,864 (64 MB) | Number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
orc.row.index.stride | 10,000 | Number of rows between index entries (must be >= 1,000). A larger stride makes it more likely that a stride cannot be skipped for a given predicate
orc.create.index | true | Whether to create row indexes, which are used for predicate push-down. If data is frequently accessed or filtered on a certain column, sorting on that column and using index filters makes column filters work faster
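The interaction between the row index, predicate push-down and sorting can be sketched in a few lines of Python. This is a simplified model and not ORC's reader: it keeps a min and max per group of rows (one group per index stride) and skips any group whose range cannot satisfy a range predicate. The function names and the sample columns are hypothetical.

```python
def build_row_index(column, stride):
    """Record (start, end, min, max) for each group of `stride` rows."""
    index = []
    for start in range(0, len(column), stride):
        group = column[start:start + stride]
        index.append((start, start + len(group), min(group), max(group)))
    return index


def strides_to_read(index, lo, hi):
    """Keep only the row groups whose [min, max] can overlap lo <= value <= hi."""
    return [(start, end) for start, end, gmin, gmax in index
            if gmax >= lo and gmin <= hi]


if __name__ == "__main__":
    stride = 10_000  # the default orc.row.index.stride
    sorted_col = list(range(100_000))
    # Same values, interleaved so every stride spans almost the full range.
    unsorted_col = [(i % 10) * 10_000 + i // 10 for i in range(100_000)]

    for name, column in (("sorted", sorted_col), ("unsorted", unsorted_col)):
        index = build_row_index(column, stride)
        kept = strides_to_read(index, 42_000, 42_500)
        print(f"{name}: read {len(kept)} of {len(index)} strides")
```

With the sorted column only one stride overlaps the predicate, while the interleaved column forces every stride to be read, which is the effect the comment on orc.create.index describes.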
[Chart: Effects of Compression (1TB). Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.13 Uncompressed ORC, Hive 0.13 ZLIB Compressed.]
[Chart: Effects of Compression (10TB). Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.13 Uncompressed, Hive 0.13 Compressed.]
Configuring ORC
 set hive.merge.mapredfiles=true
 set hive.merge.mapfiles=true
 set orc.stripe.size=67108864 (64MB)
› Half the HDFS block-size
• Prevent cross-block stripe-read
• Tangent: DistCp
 set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored
 YMMV
› Experiment
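The reasoning behind a stripe size of half the HDFS block size can be checked with a little arithmetic. The Python sketch below is a simplified model: it assumes every stripe is exactly the configured size and that stripes are packed back to back from offset 0, which real ORC files only approximate. The function name is hypothetical. The 128MB block size comes from the setup slide.

```python
MB = 1024 * 1024


def stripes_crossing_blocks(file_size, stripe_size, block_size):
    """Count stripes whose byte range spans more than one HDFS block.

    Simplification: stripes are exactly stripe_size bytes and are laid out
    contiguously from offset 0.
    """
    crossing = 0
    for start in range(0, file_size, stripe_size):
        end = min(start + stripe_size, file_size) - 1  # last byte of the stripe
        if start // block_size != end // block_size:
            crossing += 1
    return crossing


if __name__ == "__main__":
    file_size = 1024 * MB
    block_size = 128 * MB  # HDFS block size from the setup slide
    for stripe_mb in (64, 96, 256):
        n = stripes_crossing_blocks(file_size, stripe_mb * MB, block_size)
        print(f"{stripe_mb}MB stripes: {n} cross a block boundary")
```

Under these assumptions a stripe size that divides the block size never straddles a boundary, so a task reading one stripe touches a single block.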
[Chart: 100 vs 350 Nodes. Time (in seconds) for each of the 22 queries, q1_pricing_summary_report.hive through q22_global_sales_opportunity.hive. Series: Hive 0.13 100 Nodes, Hive 0.13 350 Nodes.]
Conclusions
Y!Grid sticking with Hive
 Familiarity
› Existing ecosystem
 Community
 Scale
 Multitenant
 Coming down the pike
› CBO
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?
We’re not done yet
 SQL compliance
 Scaling up the metastore performance
 Better BI Tool integration
 Faster transport
› HiveServer2 result-sets
References
 The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
 Code:
› https://github.com/mythrocks/hivebench (TPC-H scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-H gen)
› https://github.com/rxin/TPC-H-Hive (TPC-H scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-H JIRA)
Thank You
@mithunrk
mithun@apache.org
We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
I’m glad you asked.
Sharky comments
 Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets : Admirable performance
› 1TB/10TB: Tests did not run completely
• Failures, especially in 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> More tests ran through, but not completely
› nReducers: Not inferred
 Miscellany
› Security
› Multi-tenancy
› Compatibility
2014 Hadoop Summit, San Jose, California

Weitere ähnliche Inhalte

Was ist angesagt?

Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaTimothy Spann
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 

Was ist angesagt? (20)

Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 

Ähnlich wie Hive and Apache Tez: Benchmarked at Yahoo! Scale

June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale
June 2014 HUG : Hive On Tez - Benchmarked at Yahoo ScaleJune 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale
June 2014 HUG : Hive On Tez - Benchmarked at Yahoo ScaleYahoo Developer Network
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB MongoDB
 
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldLeonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldOutlyer
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
 
Hadoop Webinar 28July15
Hadoop Webinar 28July15Hadoop Webinar 28July15
Hadoop Webinar 28July15Edureka!
 
Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Is It A Right Time For Me To Learn Hadoop. Find out ?
Is It A Right Time For Me To Learn Hadoop. Find out ?Edureka!
 
Risk Analysis in the Financial Services Industry
Risk Analysis in the Financial Services IndustryRisk Analysis in the Financial Services Industry
Risk Analysis in the Financial Services IndustryRevolution Analytics
 
Blueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformBlueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformMatt Stubbs
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Hadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both WorldsHadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both WorldsInside Analysis
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...Big Data Montreal
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Marketing Digital Command Center
Marketing Digital Command CenterMarketing Digital Command Center
Marketing Digital Command CenterDataWorks Summit
 

Ähnlich wie Hive and Apache Tez: Benchmarked at Yahoo! Scale (20)

June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale
June 2014 HUG : Hive On Tez - Benchmarked at Yahoo ScaleJune 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale
June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Hive et Hadoop Usage chez Square
Hive et Hadoop Usage chez SquareHive et Hadoop Usage chez Square
Hive et Hadoop Usage chez Square
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB Webinar: Managing Real Time Risk Analytics with MongoDB
Webinar: Managing Real Time Risk Analytics with MongoDB
 
Leonard Austin (Ravelin) - DevOps in a Machine Learning World
Leonard Austin (Ravelin) - DevOps in a Machine Learning WorldLeonard Austin (Ravelin) - DevOps in a Machine Learning World
Hive and Apache Tez: Benchmarked at Yahoo! Scale

  • 1. Benchmarking Hive at Yahoo Scale. Presented by Mithun Radhakrishnan | June 4, 2014. 2014 Hadoop Summit, San Jose, California
  • 2. About myself
       - HCatalog Committer, Hive contributor
         › Metastore, Notifications, HCatalog APIs
         › Integration with Oozie, Data Ingestion
       - Other odds and ends
         › DistCp
       - mithun@apache.org
  • 3. About this talk
       - Introduction to “Yahoo Scale”
       - The use-case in Yahoo
       - The Benchmark
       - The Setup
       - The Observations (and, possibly, lessons)
       - Fisticuffs
  • 4. The Y!Grid
       - 16 Hadoop Clusters in YGrid
         › 32,500 Nodes
         › 750K jobs a day
       - Hadoop 0.23.10.x, 2.4.x
       - Large Datasets
         › Daily, hourly, minute-level frequencies
         › Terabytes of data, 1000s of files, per dataset instance
       - Pig 0.11
       - Hive 0.10 / HCatalog 0.5
         › => Hive 0.12
  • 5. Data Processing Use cases
       - Pig for Data Pipelines
         › Imperative paradigm
         › ~45% Hadoop Jobs on Production Clusters
           • M/R + Oozie = 41%
       - Hive for Ad hoc queries
         › SQL
         › Relatively smaller number of jobs
           • *Major* Uptick
       - Use HCatalog for Inter-op
  • 6. Hive is Currently the Fastest Growing Product on the Grid
       [Chart: All Grid Jobs (in Millions) and Hive Jobs (% of All Jobs), by month, Mar-13 to May-14. Callout: 2.4 million Hive jobs.]
  • 7. Business Intelligence Tools
       - {Tableau, MicroStrategy, Excel, … }
       - Challenges:
         › Security
           • ACLs, Authentication, Encryption over the wire, Full-disk Encryption
         › Bandwidth
           • Transporting results over ODBC
         › Query Latency
           • Query execution time
           • Cost of query “optimizations”
           • “Bad” queries
  • 8. The Benchmark
       - TPC-H
         › Industry standard (tpc.org/tpch)
         › 22 queries
         › dbgen -s 1000 -S 3
           • Parallelizable
       - Reynold Xin’s excellent work:
         › https://github.com/rxin
         › Transliterated queries to suit Hive 0.9
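The "Parallelizable" note on the dbgen line can be illustrated with a small sketch that builds one command line per chunk of the data set. The slide shows only `-s` and `-S`; pairing `-S` with dbgen's `-C` (number of chunks) option is an assumption about how the run was split, and the chunk count of 4 is purely illustrative.

```python
import shlex
from typing import List


def dbgen_commands(scale_factor: int, chunks: int) -> List[str]:
    """Build one dbgen command line per chunk of a TPC-H data set.

    -s is the scale factor, -C the total number of chunks, and -S the
    chunk (step) that a particular invocation should generate.
    """
    if scale_factor < 1 or chunks < 1:
        raise ValueError("scale_factor and chunks must be positive")
    return [
        shlex.join(
            ["dbgen", "-s", str(scale_factor), "-C", str(chunks), "-S", str(step)]
        )
        for step in range(1, chunks + 1)
    ]


if __name__ == "__main__":
    # Hypothetical split of the 1TB data set into 4 independent steps.
    for command in dbgen_commands(1000, 4):
        print(command)
```

Each command is independent of the others, so the steps could be handed to separate tasks (for example, one per mapper) and run concurrently.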
  • 9. Relational Diagram (TPC-H tables, column prefix, row count, columns; SF = scale factor)
       PART (P_), SF*200,000: PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT
       PARTSUPP (PS_), SF*800,000: PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT
       SUPPLIER (S_), SF*10,000: SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT
       LINEITEM (L_), SF*6,000,000: ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT
       ORDERS (O_), SF*1,500,000: ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY, CLERK, SHIPPRIORITY, COMMENT
       CUSTOMER (C_), SF*150,000: CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT
       NATION (N_), 25: NATIONKEY, NAME, REGIONKEY, COMMENT
       REGION (R_), 5: REGIONKEY, NAME, COMMENT
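The row counts in the diagram scale linearly with the scale factor (SF), except for NATION and REGION, which are fixed. A minimal sketch of that arithmetic follows; it assumes the usual TPC-H convention that SF 1 is roughly 1GB of raw data, so the 100GB, 1TB and 10TB runs correspond to SF 100, 1000 and 10000.

```python
# Rows per unit of scale factor, as given in the relational diagram.
ROWS_PER_SCALE_FACTOR = {
    "PART": 200_000,
    "PARTSUPP": 800_000,
    "LINEITEM": 6_000_000,
    "ORDERS": 1_500_000,
    "CUSTOMER": 150_000,
    "SUPPLIER": 10_000,
}

# Tables whose size does not depend on the scale factor.
FIXED_ROWS = {"NATION": 25, "REGION": 5}


def table_rows(scale_factor: int) -> dict:
    """Return the nominal row count of every TPC-H table at a scale factor."""
    rows = {name: per_sf * scale_factor for name, per_sf in ROWS_PER_SCALE_FACTOR.items()}
    rows.update(FIXED_ROWS)
    return rows


if __name__ == "__main__":
    for sf in (100, 1000, 10000):
        print(sf, table_rows(sf)["LINEITEM"])
```

At SF 1000 this gives a nominal 6 billion LINEITEM rows, which is why LINEITEM dominates the scan cost of most of the 22 queries.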
  • 10. The Setup
       › 350 Node cluster
         • Xeon boxen: 2 Slots with E5530s => 16 CPUs
         • 24GB memory, NUMA enabled
         • 6 SATA drives, 2TB, 7200 RPM Seagates
         • RHEL 6.4
         • JRE 1.7 (-d64)
         • Hadoop 0.23.7+/2.3+, Security turned off
         • Tez 0.3.x
         • 128MB HDFS block-size
       › Downscale tests: 100 Node cluster
         • hdfs-balancer.sh
  • 11. The Prep
       - Data generation:
         › Text data: dbgen on MapReduce
         › Transcode to RCFile and ORC: Hive on MR
           • insert overwrite table orc_table partition( … ) select * from text_table;
         › Partitioning:
           • Only for 1TB, 10TB cases
           • Perils of dynamic partitioning
         › ORC File:
           • 64MB stripes, ZLIB Compression
  • 13. [Chart: TPC-H 100GB. Time (in seconds) per query. Series: Hive 0.10 (Text), Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez. Queries: q1_pricing_summary_report.hive, q2_minimum_cost_supplier.hive, q3_shipping_priority.hive, q4_order_priority, q5_local_supplier_volume.hive, q6_forecast_revenue_change.hive, q7_volume_shipping.hive, q8_national_market_share.hive, q9_product_type_profit.hive, q10_returned_item.hive, q11_important_stock.hive, q12_shipping.hive, q13_customer_distribution.hive, q14_promotion_effect.hive, q15_top_supplier.hive, q16_parts_supplier_relationship.hive, q17_small_quantity_order_revenue.hive, q18_large_volume_customer.hive, q19_discounted_revenue.hive, q20_potential_part_promotion.hive, q21_suppliers_who_kept_orders_waiting.hive, q22_global_sales_opportunity.hive.]
  • 14. 100 GB
       › 18x speedup over Hive 0.10 (Textfile)
         • 6-50x
       › 11.8x speedup over Hive 0.10 (RCFile)
         • 5-30x
       › Average query time: 28 seconds
         • Down from 530 (Hive 0.10 Text)
       › 85% queries completed in under a minute
  • 15. [Chart: TPC-H 1TB. Time (in seconds) per TPC-H query (q1 to q22). Series: Hive 0.10 RC File, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
  • 16. 1 TB
       › 6.2x speedup over Hive 0.10 (RCFile)
         • Between 2.5-17x
       › Average query time: 172 seconds
         • Between 5-947 seconds
         • Down from 729 seconds (Hive 0.10 RCFile)
       › 61% queries completed in under 2 minutes
       › 81% queries completed in under 4 minutes
  • 17. [Chart: TPC-H 10TB. Time (in seconds) per TPC-H query (q1 to q22). Series: Hive 0.10 RCFile, Hive 0.11 ORC, Hive 0.12 ORC, Hive 0.13 ORC MR, Hive 0.13 ORC Tez.]
  • 18. 10 TB
       › 6.2x speedup over Hive 0.10 (RCFile)
         • Between 1.6-10x
       › Average query time: 908 seconds (426 seconds excluding outliers)
         • Down from 2129 seconds with Hive 0.10 RCFile (1712 seconds excluding outliers)
       › 61% queries completed in under 5 minutes
       › 71% queries completed in under 10 minutes
       › Q6 still completes in 12 seconds!
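The headline speedups on the result slides are not simply the ratio of the average query times quoted next to them, as a quick calculation shows. One possible reading, which is an assumption and not stated in the slides, is that the headline figure averages the per-query speedups. The sketch below computes the ratio of averages from the slide numbers and contrasts the two definitions on hypothetical per-query timings.

```python
from typing import Sequence


def ratio_of_averages(baseline_avg: float, new_avg: float) -> float:
    """Speedup computed from two average query times."""
    return baseline_avg / new_avg


def mean_of_ratios(baseline: Sequence[float], new: Sequence[float]) -> float:
    """Speedup computed as the mean of the per-query speedups."""
    if len(baseline) != len(new) or not baseline:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(b / n for b, n in zip(baseline, new)) / len(baseline)


# (baseline average, new average) in seconds, taken from the result slides.
SLIDE_AVERAGES = {
    "100GB vs Hive 0.10 Text": (530, 28),
    "1TB vs Hive 0.10 RCFile": (729, 172),
    "10TB vs Hive 0.10 RCFile": (2129, 908),
}

if __name__ == "__main__":
    for label, (old, new) in SLIDE_AVERAGES.items():
        print(label, round(ratio_of_averages(old, new), 2))

    # Hypothetical timings (not from the benchmark): one short query that
    # improves 10x and one long query that improves 2x.
    old_times, new_times = [100, 1000], [10, 500]
    print(mean_of_ratios(old_times, new_times))
    print(round(ratio_of_averages(sum(old_times) / 2, sum(new_times) / 2), 2))
```

With the hypothetical timings, the mean of ratios is much larger than the ratio of averages because short queries with large relative gains count as much as long ones. Whichever definition a benchmark uses, it helps to state it alongside the number.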
  • 19. Explaining the speed-ups
       - Hadoop 2.x, et al.
       - Tez
         › (Arbitrary DAG)-based Execution Engine
         › “Playing the gaps” between M&R
           • Temporary data and the HDFS
         › Feedback loop
         › Smart scheduling
         › Container re-use
         › Pipelined job start-up
       - Hive
         › Statistics
         › “Vector-ized” Execution
       - ORC
         › PPD (predicate push-down)
  • 20. [Chart: Vectorization. Time (in seconds) per TPC-H query (q1 to q22). Series: Hive 0.13 Tez ORC, Hive 0.13 Tez ORC Vec.]
  • 21. ORC File Layout
       - Data is composed of multiple streams per column
       - Index allows for skipping rows (defaults to every 10,000 rows), keeping position in each stream, and min-max for each column
       - Footer contains directory of stream locations, and the encoding for each column
       - Integer columns are serialized using run-length encoding
       - String columns are serialized using a dictionary for column values, and the same run-length encoding
       - Stripe footer is used to find the requested column’s data streams, and adjacent stream reads are merged
       [Diagram: an ORC file as a sequence of 256MB stripes, each made up of Index Data, Row Data and a Stripe Footer, followed by the File Footer and Postscript. Row data is laid out by column (Column 1 to Column 8), and each column consists of several streams (Stream 2.1 to Stream 2.4).]
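The min-max index described above is what makes row skipping possible: a reader can ignore any group of rows whose recorded range cannot contain the value being searched for. The following is a toy, in-memory sketch of that idea, not ORC's actual reader; `RowGroup`, `build_index` and `scan_equal` are hypothetical names, and the stride of 10 rows is chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    start: int          # position of the first row in the group
    minimum: int        # smallest value in the group
    maximum: int        # largest value in the group
    values: List[int]   # the column values themselves


def build_index(values: List[int], stride: int = 10_000) -> List[RowGroup]:
    """Split a column into groups of `stride` rows and record min/max per group."""
    groups = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        groups.append(RowGroup(start, min(chunk), max(chunk), chunk))
    return groups


def scan_equal(index: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return (matching row positions, number of row groups actually read)."""
    matches, groups_read = [], 0
    for group in index:
        # Skip the group if its min-max range cannot contain the target.
        if not (group.minimum <= target <= group.maximum):
            continue
        groups_read += 1
        for offset, value in enumerate(group.values):
            if value == target:
                matches.append(group.start + offset)
    return matches, groups_read


if __name__ == "__main__":
    sorted_column = list(range(100))
    # Same values, interleaved so that every group spans almost the full range.
    interleaved_column = [(i % 10) * 10 + i // 10 for i in range(100)]
    print(scan_equal(build_index(sorted_column, stride=10), 42))
    print(scan_equal(build_index(interleaved_column, stride=10), 42))
```

On the sorted column only one of the ten groups is read, while on the interleaved column all ten must be read, even though both hold the same values. This is the reason sorting on a frequently filtered column tends to make index-based skipping more effective.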
  • 22. ORC Usage
       CREATE TABLE addresses (
         name string,
         street string,
         city string,
         state string,
         zip int
       )
       STORED AS orc
       LOCATION '/path/to/addresses'
       TBLPROPERTIES ("orc.compress" = "ZLIB");

       ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc

       SET hive.default.fileformat = orc
       SET hive.exec.orc.memory.pool = 0.50   (ORC writer is allowed 50% of JVM heap size by default)

       ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
       STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
       OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

       Key | Default | Comments
       orc.compress | ZLIB | High-level compression (one of NONE, ZLIB, SNAPPY)
       orc.compress.size | 262,144 (256 KB) | Number of bytes in each compression chunk
       orc.stripe.size | 67,108,864 (64 MB) | Number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
       orc.row.index.stride | 10,000 | Number of rows between index entries (must be >= 1,000). A larger stride increases the probability of not being able to skip the stride for a given predicate.
       orc.create.index | true | Whether to create row indexes. This is for predicate push-down. If data is frequently accessed or filtered on a certain column, then sorting on that column and using index filters makes column filters work faster.
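The constraints in the table above (the allowed compression codecs, and a row index stride of at least 1,000) are easy to get wrong when table definitions are generated by scripts. As a minimal sketch, the hypothetical helper below validates the values against those constraints and renders a `TBLPROPERTIES` clause; the function name and its defaults are assumptions for illustration, with defaults mirroring the table.

```python
VALID_COMPRESSION = ("NONE", "ZLIB", "SNAPPY")


def orc_tblproperties(
    compress: str = "ZLIB",
    stripe_size: int = 64 * 1024 * 1024,
    row_index_stride: int = 10_000,
    create_index: bool = True,
) -> str:
    """Render a TBLPROPERTIES clause for an ORC table after basic validation."""
    compress = compress.upper()
    if compress not in VALID_COMPRESSION:
        raise ValueError(f"orc.compress must be one of {VALID_COMPRESSION}")
    if stripe_size <= 0:
        raise ValueError("orc.stripe.size must be positive")
    if row_index_stride < 1_000:
        raise ValueError("orc.row.index.stride must be >= 1,000")
    properties = {
        "orc.compress": compress,
        "orc.stripe.size": str(stripe_size),
        "orc.row.index.stride": str(row_index_stride),
        "orc.create.index": "true" if create_index else "false",
    }
    rendered = ", ".join(f'"{key}"="{value}"' for key, value in properties.items())
    return f"TBLPROPERTIES ({rendered})"


if __name__ == "__main__":
    print(orc_tblproperties())
    print(orc_tblproperties(compress="snappy", stripe_size=32 * 1024 * 1024))
```

The rendered clause can be appended to a `CREATE TABLE ... STORED AS orc` statement such as the one on the slide.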
  • 23. [Chart: Effects of Compression (1TB). Y-axis: Time (in seconds), 0 to 1000. X-axis: the 22 TPC-h queries, q1_pricing_summary_report through q22_global_sales_opportunity. Series: Hive 0.13 Uncompressed ORC; Hive 0.13 ZLIB Compressed.]
  • 24. [Chart: Effects of Compression (10TB). Y-axis: Time (in seconds), 0 to 3000. X-axis: the 22 TPC-h queries, q1_pricing_summary_report through q22_global_sales_opportunity. Series: Hive 0.13 Uncompressed; Hive 0.13 Compressed.]
  • 25. Configuring ORC
     set hive.merge.mapredfiles=true
     set hive.merge.mapfiles=true
     set orc.stripe.size=67108864 (64 MB)
      › Half the HDFS block-size
        • Prevent cross-block stripe-read
        • Tangent: DistCp
     set orc.compress=???
      › Depends on size and distribution
      › Snappy compression hasn't been explored
     YMMV
      › Experiment
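The reasoning behind "half the HDFS block-size" can be checked with a little arithmetic. The sketch below is an idealised model, not a description of how ORC or HDFS lay out bytes: it assumes stripes are exactly orc.stripe.size bytes and packed back to back from offset 0, and it takes the block size as 128 MB (twice the 64 MB stripe size, as the slide implies). The function name is hypothetical.

```python
MB = 1024 * 1024


def stripes_crossing_blocks(file_size: int, stripe_size: int, block_size: int) -> int:
    """Count stripes whose byte range spans more than one HDFS block.

    Idealised: stripes are exactly `stripe_size` bytes and start at offset 0.
    """
    crossing = 0
    for start in range(0, file_size, stripe_size):
        end = min(start + stripe_size, file_size) - 1  # last byte of this stripe
        if start // block_size != end // block_size:
            crossing += 1
    return crossing


block = 128 * MB  # assumed block size: twice the 64 MB stripe size
print(stripes_crossing_blocks(1024 * MB, 64 * MB, block))  # 0
print(stripes_crossing_blocks(1024 * MB, 96 * MB, block))  # 5
```

When the stripe size divides the block size evenly, no stripe straddles a block boundary in this model, so a map task can read its stripe from a single block. A 96 MB stripe would leave 5 of 11 stripes split across two blocks in a 1 GB file.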
  • 26. [Chart: 100 vs 350 Nodes. Y-axis: Time (in seconds), 0 to 1000. X-axis: the 22 TPC-h queries, q1_pricing_summary_report through q22_global_sales_opportunity. Series: Hive 0.13 100 Nodes; Hive 0.13 350 Nodes.]
  • 28. Y!Grid sticking with Hive
     Familiarity
      › Existing ecosystem
     Community
     Scale
     Multitenant
     Coming down the pike
      › CBO
      › In-memory caching solutions atop HDFS
        • RAMfs a la Tachyon?
  • 29. We're not done yet
     SQL compliance
     Scaling up the metastore performance
     Better BI Tool integration
     Faster transport
      › HiveServer2 result-sets
  • 30. References
     The YDN blog post:
      › http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
     Code:
      › https://github.com/mythrocks/hivebench (TPC-h scripts, datagen, transcode utils)
      › https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-h gen)
      › https://github.com/rxin/TPC-H-Hive (TPC-h scripts for Hive)
      › https://issues.apache.org/jira/browse/HIVE-600 (Yuntao's initial TPC-h JIRA)
  • 31. Thank You @mithunrk mithun@apache.org We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
  • 32. I’m glad you asked.
  • 33. Sharky comments
     Testing with Shark 0.7.x and Shark 0.8
      › Compatible with Hive Metastore 0.9
      › 100GB datasets: Admirable performance
      › 1TB/10TB: Tests did not run completely
        • Failures, especially in 10TB cases
        • Hangs while shuffling data
        • Scaled back to 100 nodes -> More tests ran through, but not completely
      › nReducers: Not inferred
     Miscellany
      › Security
      › Multi-tenancy
      › Compatibility

Editor's notes

  1. Gopal was supposed to be presenting this with me, to talk about Tez. Point to Gopal/Jitendra’s talk on Hive/Tez for details on things I’ll have to skim over. Also, acknowledge Thomas Graves, who’s talking today about the excellent work he’s doing on driving Spark on Yarn.
  2. There are several sides to query latency. Query execution time: addressed in the physical query-execution layer. Query optimizations: the first step while optimizing the query plan seems to be to query for all partition instances, which is very expensive for “Project Benzene”. Bad queries: Tableau, I'm looking at you.
  3. The Transaction Processing Performance Council (inexplicably abbreviated to TPC) suggests a set of benchmarks for query processing. Many have adopted TPC-DS to showcase performance. We chose TPC-h to complement. (Also, 22 is a much smaller number to deal with than… 90?) Transliteration: Evita and Kylie Minogue
  4. Lineitem and Orders are extremely large Fact tables. Nation and Region are the smallest dimension tables.
  5. Tangent: Funny story: 1. About hard-drives: Can set up MR intermediate directories and HDFS data-node directories to be on different disks. Traffic from one doesn’t affect the other. But on the other hand, total read bandwidth might be reduced.
  6. Line-item: partitioned on ship-date. Orders: on order-date. Customers: by market-segment. Suppliers: on their region-key.
  7. q5 and q21 are anomalous. q21: Hit a trailing reducer across all versions of Hive tested. Perhaps this can be improved with a better plan. q5: Slow reducer that hit only Hive 13. Could be a bad plan. Could be a difference in data distribution when data was regenerated for the Hadoop 2 cluster.
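Partitioning on a date column pays off when a query's date filter lets the planner ignore most partitions. A rough sketch of that pruning step follows; the directory layout, the /warehouse path and the helper name are assumptions made for illustration (only the key=value directory style and the TPC-h column name l_shipdate are standard), and this is not Hive's planner code.

```python
from datetime import date
from typing import Dict, List


def prune_partitions(partitions: Dict[date, str], first: date, last: date) -> List[str]:
    """Keep only partition directories whose ship date lies in [first, last]."""
    return [path for day, path in sorted(partitions.items()) if first <= day <= last]


# Hypothetical layout: one directory per ship date, named in Hive's key=value style.
lineitem_partitions = {
    date(1995, 1, d): f"/warehouse/lineitem/l_shipdate=1995-01-{d:02d}"
    for d in range(1, 8)
}

selected = prune_partitions(lineitem_partitions, date(1995, 1, 3), date(1995, 1, 4))
print(selected)
```

Only two of the seven directories are scanned for this range. The cost that remains is enumerating the partitions in the first place, which grows with the number of partition instances.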
  8. Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
  9. Vectorization: On average: 1.2x.
  10. Except for a few outliers, ZLIB compression actually reduced performance for a 1TB dataset. Uncompressed was 1.3x faster than Compressed.
  11. The situation reverses at the 10 TB level. The cost of decompression is more than offset by the reduced disk-read time. The long tail in 10TB/q21 threw the scale of the graph off, so I've excluded it from the results.
  12. Talk about file-coalesce, small-file generation, Namenode pressure and parallelism. You don't want to read an ORC stripe from a different node. Talk about distcp -pgrub, for ORC files. Mention that SNAPPY's license is not Apache. Also, Yoda.
  13. At 100 nodes, it performs at 0.9x the 350 node performance.
  14. We’ve seen Hive and Tez scale down for latency, scale up for data-size, and scale out across larger clusters.
  15. Familiarity: We have an existing ecosystem with Hive, HCatalog, Pig and Oozie that delivers revenue to Yahoo today. It's hard to rock the boat. Community: The Apache Hive community is large, active and thriving. They've been solving issues with query latency for ages now. The switch to using the Tez execution engine was a solution within the Apache Hive project. This wasn't a fork of Hive. This is Hive, proper. Scale: We've seen Hive and Tez perform at scale. Heck, we've seen Pig perform on Tez. Multitenant: Yahoo's use-case is unique, and not just because of data-scale. There are hundreds of active users and genuine multitenancy and security concerns. Design: We think the Hive community has tackled the right problems first, rather than throwing RAM at the problem.
  16. Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
  17. Security: Kerberos support was patched in, after the benchmarks were run. Multi-tenancy: Data needs to be explicitly pinned into memory as RDDs. In a multi-tenant system, how would pinning work? Eviction policy for data. Compatibility: Needs to work with Metastore versions 12 and 13. Shark's gone to 0.11 just recently. Integration with the rest of the stack: Oozie and Pig. Overall, we wanted a solution that works with high dynamic range, i.e. one that works well with small datasets (100s of GBs) and also scales to multi-terabyte datasets. We have a familiar system that seems to fit that bill. It doesn't quite rock the boat. It's not perfect yet. There are bugs that we're working on. And we still haven't solved the problem of data-volume/BI. By the way, I really like the idea of BlinkDB. I saw the JIRA.