TPC-H ANALYTICS’ SCENARIOS
AND PERFORMANCES ON
HADOOP DATA CLOUDS
RIM MOUSSA
rim.moussa@esti.rnu.tn
LATICE, UNIV. OF TUNIS
TUNISIA

24th April 2012

4th International Conference on Networked Digital Technologies
NDT’2012, Dubai, UAE.

OUTLINE

1. Business Intelligence
2. Motivation
   Data management issues, NoSQL, clouds
   OLAP in the cloud
3. Implementation of OLAP in the cloud
   TPC-H Benchmark
   Analytics Scenarios
   Performance Measurements
4. Related Work
5. Conclusion
6. Future Work

BUSINESS INTELLIGENCE

Business intelligence aims to support better business decision-making.
Common functions of business intelligence technologies are
• On-Line Analytical Processing,
• data mining, process mining,
• business performance management,
• text mining and predictive analytics, …

Market share
• Gartner Research reports that BI market revenue hit $12.2 billion in 2011

MOTIVATION

Decision Support Systems
• Incessant data & complex workload
• Complex DB schema
Scalability issues
• Ideally, linear speed-up & linear scale-up
• DBMSs do not scale linearly
• OLAP technologies do not scale
Hardware
• I/O bottleneck: I/O-bound data storage systems
• Gilder's law: bandwidth triples every 3 years
• Moore's law: computing and storage capacities double every 18 months; obsolete by 2017
• Vertical scaling cost >> horizontal scaling cost
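
The two scalability targets named on the Motivation slide can be made concrete with the usual textbook ratios. The sketch below is illustrative only: the function names and the sample timings are assumptions, not measurements from this study.

```python
def speed_up(elapsed_small_system: float, elapsed_large_system: float) -> float:
    """Speed-up: the SAME workload run on a larger system.

    Linear speed-up means an N-times larger system yields a ratio of N.
    """
    return elapsed_small_system / elapsed_large_system


def scale_up(elapsed_small_problem: float, elapsed_large_problem: float) -> float:
    """Scale-up: an N-times larger workload run on an N-times larger system.

    Linear scale-up means the ratio stays at 1.0 (elapsed time unchanged).
    """
    return elapsed_small_problem / elapsed_large_problem


# Hypothetical elapsed times in seconds, for illustration only.
observed_speed_up = speed_up(120.0, 40.0)   # e.g. 1 node vs 3 nodes
observed_scale_up = scale_up(100.0, 125.0)  # below 1.0: sub-linear scale-up
print(observed_speed_up, observed_scale_up)
```

A system that scales linearly keeps the first ratio equal to the growth in cluster size and the second ratio equal to 1.0; the later measurements can be read against these two yardsticks.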

NOSQL

Big challenges related to velocity
• How fast can huge volumes of data be processed?
NoSQL solutions
• Adopted by Google, Facebook, Amazon, …
• Dynamic horizontal scale-up
  Nodes are added without bringing the cluster down
• Shared-nothing architecture
  Independent computing & storage nodes interconnected via a high-speed network
• Distributed programming framework: MapReduce (Google)

CLOUD COMPUTING

Cloud computing is a style of computing where scalable and elastic IT-enabled capabilities are provided "as a service" to external customers using Internet technologies.
• Broad network access
• Resource pooling (virtualization)
• Self-provisioning
• Rapid elasticity (scalable analytics)
• Measured service

Market share
• Forrester Research expects the global cloud computing market to reach $241 billion in 2020. In particular, the SaaS market is expected to grow to $92.8 billion by 2016.
• Gartner Group expects the cloud computing market to reach US$150.1 billion in 2013, with a compound annual growth rate of 26.5%.

OLAP IN THE CLOUD

OLAP constraints
• Big data analytics’ obstacles
• Current systems & technologies do not scale
Key benefits of cloud computing
• Performance
  Much faster data analysis
  Dynamic and up-to-date hardware infrastructure
• More economical
  Organizations no longer need to expend capital upfront for hardware and software purchases
  Services are provided on a pay-per-use basis
[Figure: cloud computing, big data analytics, NoSQL]

TPC-H
DECISION-SUPPORT SYSTEM BENCHMARK

DATA
• Complex DB schema
• Scale factors 1, 10, …, 100,000 correspond respectively to 1 GB, 10 GB, …, 100 TB
• 8 data files {lineitem, customer, …, region}.tbl
• Broad industry-wide relevance
WORKLOAD
• 22 real-world business questions
• High degree of complexity
• Star queries (complex joins)
• Grouping
• Nested queries

TPC-H BENCHMARK

[Figure: TPC-H E/R schema]

HADOOP/PIG LATIN

APACHE HADOOP
• Framework for running applications on large clusters of commodity hardware
• Implements the MapReduce computational framework
• HDFS (Hadoop Distributed File System) stores data on the compute nodes
• Replication & job resubmission for failure handling
APACHE PIG LATIN
• High-level language for expressing data analysis programs (filter, projection, join, group, sort, union, …)
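
To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases (map, shuffle/sort, reduce). It is not Hadoop code: `map_reduce`, the sample records and the counting job are assumptions chosen for illustration, and nothing here is distributed.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map, shuffle/sort and reduce pass over an in-memory list."""
    groups = defaultdict(list)
    # Map phase: each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Shuffle/sort phase: values are grouped by key and keys are sorted
    # before the reduce phase consumes them.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


# Hypothetical (order key, order priority) records.
orders = [(1, "1-URGENT"), (2, "3-MEDIUM"), (3, "1-URGENT"), (4, "5-LOW")]

# Count orders per priority: emit (priority, 1), then sum the ones.
counts = map_reduce(
    orders,
    mapper=lambda order: [(order[1], 1)],
    reducer=lambda priority, ones: sum(ones),
)
print(counts)
```

On a real cluster the grouping and sorting step moves data between nodes, which is why its cost grows with the number of nodes involved.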

PIG LATIN BENCHMARK
5 TRANSLATION HINTS

1. Load data for immediate processing
   • Better memory management
2. Minimum relation scan
   • Conjunction/disjunction of predicates applied once
3. Unary operations prior to binary operations
   • Unary operations (projection, restriction) reduce data volume
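
Hint 3 can be illustrated outside Pig with a small sketch: applying the restriction and projection before the join gives the same answer while feeding fewer and narrower rows to the join. The helper `hash_join` and the sample rows are assumptions for illustration; column names follow TPC-H naming.

```python
def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dicts with an in-memory hash index on `right`."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical sample rows.
orders = [
    {"o_orderkey": 1, "o_orderpriority": "1-URGENT", "o_comment": "x"},
    {"o_orderkey": 2, "o_orderpriority": "5-LOW", "o_comment": "y"},
]
lineitems = [
    {"l_orderkey": 1, "l_quantity": 10},
    {"l_orderkey": 1, "l_quantity": 5},
    {"l_orderkey": 2, "l_quantity": 7},
]

# Late restriction: join everything, then filter and project.
joined = hash_join(lineitems, orders, "l_orderkey", "o_orderkey")
late = [
    (row["o_orderkey"], row["l_quantity"])
    for row in joined
    if row["o_orderpriority"] == "1-URGENT"
]

# Early restriction and projection: shrink `orders` before the join.
urgent = [
    {"o_orderkey": o["o_orderkey"]}
    for o in orders
    if o["o_orderpriority"] == "1-URGENT"
]
early = [
    (row["o_orderkey"], row["l_quantity"])
    for row in hash_join(lineitems, urgent, "l_orderkey", "o_orderkey")
]
print(late == early, len(joined), len(urgent))
```

Both plans return the same pairs, but the second one joins against a single one-column row instead of the full `orders` list.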

PIG LATIN BENCHMARK
5 TRANSLATION HINTS (CONTINUED)

4. Intra-operation parallelism
   • Partitioned join
5. Join algorithm
   • Algorithm: hash join, merge join, …
   • Star queries: join ordering
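
Hints 4 and 5 meet in the partitioned hash join. The sketch below is a single-process illustration, not Pig's implementation: both inputs are split on the join key so that each pair of partitions can be joined independently. The function names, the partition count and the sample rows are assumptions.

```python
def hash_join(left, right):
    """Join two lists of (key, payload) pairs on key using a hash table."""
    table = {}
    for key, payload in right:
        table.setdefault(key, []).append(payload)
    return [(key, lp, rp) for key, lp in left for rp in table.get(key, [])]


def partitioned_join(left, right, partitions):
    """Split both inputs by key, then join each pair of partitions.

    Rows with equal keys always land in the same partition, so the
    per-partition joins are independent and could run in parallel.
    Here they run one after the other to keep the sketch simple.
    """
    left_parts = [[] for _ in range(partitions)]
    right_parts = [[] for _ in range(partitions)]
    for row in left:
        left_parts[hash(row[0]) % partitions].append(row)
    for row in right:
        right_parts[hash(row[0]) % partitions].append(row)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        result.extend(hash_join(lp, rp))  # one independent unit of work
    return result


# Hypothetical (order key, payload) rows.
lineitems = [(1, "line-a"), (1, "line-b"), (2, "line-c"), (3, "line-d")]
orders = [(1, "order-1"), (2, "order-2"), (4, "order-4")]

single = sorted(hash_join(lineitems, orders))
split = sorted(partitioned_join(lineitems, orders, partitions=3))
print(single == split)
```

The partitioned plan returns the same rows as the single hash join; only the way the work is divided changes.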

NOMINAL ANALYTICAL SCENARIO

[Figure: Pig Latin script (business question) and its response]

NOMINAL ANALYTICAL SCENARIO

High cost
• Measured service: pay as you consume cloud resources (bandwidth, CPU, RAM)
Performance issues
• The same query (with the same or different parameters) is executed several times with no optimization
Discontinuity of service
• Network failure/congestion

COMPLEX!!

• How to reduce service cost?
• How to improve performance?
• How to prevent discontinuity of service?
Workload study for tuning
• Materialized views?
• Aggregated data replication
• OLAP or not?
Cost management
• Exploit organization hardware resources?
• Cloud right size?

TPC-H WORKLOAD NUMERICAL STUDY
TYPE A

*Order Priority Checking*
• order date dim × order priority dim × count orders measure
• Size: always 135, whatever the scale factor
• OLAP! + export to OLAP server + MV
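
A type A question can be answered from a small pre-aggregated cube instead of rescanning the base data on every request. The sketch below is a simplification under stated assumptions: the sample orders are made up, the date dimension is rolled up to quarters, and the additional line-item predicate of the real TPC-H question is left out.

```python
from collections import Counter

# Hypothetical orders as (order date "YYYY-MM-DD", order priority).
orders = [
    ("1993-07-05", "1-URGENT"),
    ("1993-08-19", "1-URGENT"),
    ("1993-09-30", "2-HIGH"),
    ("1993-10-02", "1-URGENT"),
]


def quarter(date: str) -> str:
    """Map an ISO date to a 'YYYY-Qn' label."""
    year, month = int(date[:4]), int(date[5:7])
    return f"{year}-Q{(month - 1) // 3 + 1}"


# Pre-aggregation, computed once: (quarter, priority) -> number of orders.
cube = Counter((quarter(date), priority) for date, priority in orders)


def order_count(qtr: str, priority: str) -> int:
    """Answer a parameterized question from the cube, without rescanning orders."""
    return cube[(qtr, priority)]


print(order_count("1993-Q3", "1-URGENT"))
```

Because the number of cells depends only on the number of quarters and priorities, the cube stays small however many orders are loaded.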

TPC-H WORKLOAD NUMERICAL STUDY
TYPE B

*Large Volume Orders*
• order dim × sum line qty measure
• SF × 1,500,000
• Almost 3.8 ppm of orders have ∑ line qty > 300, for SF = 1
• Not OLAP! + MV

TPC-H WORKLOAD NUMERICAL STUDY
TYPE C

*Minimum Cost Supplier*
• supplier dim × part dim × min supply cost measure
• SF² × 2,000,000,000
• OLAP!
• MV storage cost
• Best supplier in each region for each part!
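
The sizes quoted for types B and C follow from the base-table cardinalities fixed by the TPC-H specification (10,000 suppliers, 200,000 parts and 1,500,000 orders per unit of scale factor). A minimal sketch of the arithmetic, with hypothetical function names and assuming a dense cube with one cell per dimension combination:

```python
# Base-table cardinalities per unit of scale factor (TPC-H specification).
ROWS_PER_SF = {"supplier": 10_000, "part": 200_000, "orders": 1_500_000}


def rows(table: str, sf: int) -> int:
    """Number of rows of `table` at scale factor `sf`."""
    return ROWS_PER_SF[table] * sf


def supplier_part_cells(sf: int) -> int:
    """Cells of a dense supplier x part cube (type C, Minimum Cost Supplier)."""
    return rows("supplier", sf) * rows("part", sf)


def order_cells(sf: int) -> int:
    """Cells of a cube over the order dimension only (type B, Large Volume Orders)."""
    return rows("orders", sf)


for sf in (1, 10):
    print(sf, order_cells(sf), supplier_part_cells(sf))
```

The type B size grows linearly with the scale factor, while the type C size grows with its square, which is why materialized-view storage cost becomes the concern for type C.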

TPC-H WORKLOAD NUMERICAL STUDY

Type | Features | TPC-H business questions (OLAP cube)
A | Medium dimensionality; result is TPC-H scale factor independent | Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13, Q14, Q16, Q19, Q22 (13 business questions)
B | High dimensionality; few results, lots of empty cells | Q15, Q18 (2 business questions)
C | High dimensionality; result % of scale factor | Q2, Q9, Q10, Q11, Q17, Q20, Q21 (7 business questions)

CLOUD COST MANAGEMENT

Measured service
• Pay as you go: CPU + memory + bandwidth
• “When users understand the relationship between cost and consumption, everybody wins” (Ron Miller)
Emerging need to understand, manage and proactively control costs across the cloud
• Resource utilization monitoring
• Right size w.r.t. both performance & cost (client and provider)
• Green cloud through energy saving
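
A pay-as-you-go bill can be sketched as a weighted sum of the three metered resources. Everything numeric below is an assumption: the unit prices, the per-query consumption and the number of runs are made up, and the cost of retrieving pre-aggregated data is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    cpu_hours: float
    ram_gb_hours: float
    transfer_gb: float


# Hypothetical unit prices; real providers publish their own rates.
PRICE_CPU_HOUR = 0.10
PRICE_RAM_GB_HOUR = 0.02
PRICE_TRANSFER_GB = 0.05


def cost(usage: Usage, runs: int = 1) -> float:
    """Pay-as-you-go bill for `runs` executions with identical consumption."""
    per_run = (
        usage.cpu_hours * PRICE_CPU_HOUR
        + usage.ram_gb_hours * PRICE_RAM_GB_HOUR
        + usage.transfer_gb * PRICE_TRANSFER_GB
    )
    return runs * per_run


query = Usage(cpu_hours=10, ram_gb_hours=40, transfer_gb=20)
nominal = cost(query, runs=5)      # same question re-executed five times
precomputed = cost(query, runs=1)  # aggregated once in the cloud
print(nominal, precomputed)
```

Under these assumptions the bill scales with the number of executions, which is the argument for computing aggregates once rather than re-running the same question.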

BETTER SCENARIO

[Figure: Pig Latin script (generalized business question); pre-aggregated data; import data into an on-site OLAP server; interaction with the OLAP client]

PERFORMANCE MEASUREMENTS

TPC-H
• TPC-H Benchmark
• SF=1
  1.1GB source files
  4.5GB single big file
• SF=10
  11GB source files
  45GB single big file
G5K
• French GRID platform: a large-scale nation-wide infrastructure for Grid research
• Bordeaux site
• Borderel: 24GB RAM, 4 AMD CPUs, 2.27 GHz, and 4 cores/CPU
• Borderline: 32GB RAM, 4 Intel Xeon CPUs, 2.6 GHz, and 2 cores/CPU
• 10 Gbps Ethernet
Pig/HDFS
• Apache Hadoop 0.20
• N=3, 5 or 8: one Hadoop master and (2, 4 or 7) workers
• Apache Pig 0.8.1

PERFORMANCE MEASUREMENTS
Original TPC-H 1GB

• Except for business questions which do not perform join operations, there is no improvement when the cluster size doubles

PERFORMANCE MEASUREMENTS
Original TPC-H 11GB

• Improvement when the cluster size doubles
• With more data, it pays to have more storage & computing nodes

PERFORMANCE MEASUREMENTS
Original 1.1 vs 11GB

Elapsed times for SF=10 (11GB) are
• at most 5 times the elapsed times obtained for SF=1 (1.1GB)
• on average twice the elapsed times obtained for SF=1 (1.1GB)

PERFORMANCE MEASUREMENTS
Big File

Joining partitioned files is complex!
• Combine all files into one file
  SF=1: 4.5GB
  SF=10: 45GB
• Evaluation of Pig/MR without joins
Denormalization
• saves join cost
• increases required storage space (≈ ×4 for TPC-H)

PERFORMANCE MEASUREMENTS
Big File 4.5GB

• Compared to (SF=1, 1.1GB), improvements range from 10% to 80%
• Performance degrades with more than 4 workers (N=5): this is due to the MR framework (before the reduce phase, data is grouped and sorted, which has a cost when involving more storage and computing nodes)

PERFORMANCE MEASUREMENTS
Big File 45GB

• Compared to (SF=10, 11GB), after adding more nodes we obtain similar results
• Compared to (SF=1, 4.5GB), elapsed times are less than 10× for the same cluster size
• Performance degradation for queries which do not perform joins: Q1 executes over a 7GB lineitem file for (SF=10, 11GB); now it executes over a 45GB file

PERFORMANCE MEASUREMENTS
OLAP

Aggregated data
• TPC-H business questions of type A (SF-independent & small result set)
• TPC-H business questions of type B (very, very small result set)
• 15 business questions out of 22
Tradeoff between space & computation
• TPC-H business questions of type C
• Add derived fields
  Q2: check (true) minimum supplycost by supplier for a part in PartSupp
  Q17: average_line_quantity field for each part
  Q20: sum_lines_quantities_per_year for each supplier
  Q21: number of waiting orders for each supplier
• 7 business questions out of 22

PERFORMANCE MEASUREMENTS
OLAP 4.5GB

• Average degradation is 5%
• Done once + time to retrieve data

PERFORMANCE MEASUREMENTS
OLAP 45GB

• Average degradation is 60%
• Done once + time to retrieve data

RELATED WORK

Implementation of relational operations using the MR framework
• Kim et al., MRBench, 2008
  Nominal analytics scenario
  TPC-H benchmarking for SF=1, 3
Translation from SQL to Pig Latin
• Iu et al., Hadoop to SQL, 2010
• Lee et al., YSmart, 2011
Other Pig Latin use cases
• Schätzle et al., RDF data, 2011
• Loebman et al., astrophysical data, 2009

CONCLUSION

TPC-H in-depth numerical study
OLAP in the cloud
• Scenarios
• Implementation
  Pig / Hadoop Distributed File System
• Performances
  Various cluster sizes
  Various data volumes
  Various schemas (with and without joins)

FUTURE WORK
AQP

Approximate Query Processing in clouds
• Most distributed file systems implement replication for high availability
• MDS erasure codes outperform replication from two perspectives: (i) storage cost and (ii) minimal operation cost of redundant data management
  New Hadoop release
  Facebook
• Generalized framework for approximate data analytics in the cloud coping with nodes’ failures
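
The storage-cost argument for MDS erasure codes can be checked with simple arithmetic. In the sketch below the function names and the parameter choices (3 replicas, 4 data blocks plus 2 parity blocks) are illustrative assumptions, not the configuration of any particular Hadoop release.

```python
def replication_overhead(copies: int) -> float:
    """Stored bytes per byte of user data with `copies` full replicas."""
    return float(copies)


def replication_tolerance(copies: int) -> int:
    """Number of lost replicas that still leave one readable copy."""
    return copies - 1


def mds_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Stored bytes per byte of user data with k data and m parity blocks."""
    return (data_blocks + parity_blocks) / data_blocks


def mds_tolerance(parity_blocks: int) -> int:
    """An MDS code rebuilds the data from any k of its k + m blocks."""
    return parity_blocks


# Both schemes below survive the loss of any two blocks or replicas.
print(replication_overhead(3), replication_tolerance(3))
print(mds_overhead(4, 2), mds_tolerance(2))
```

For the same failure tolerance, the erasure-coded layout in this example stores half as many bytes as three-way replication.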

FUTURE WORK
PIG LATIN++

Most TPC-H business question scripts are composed of jobs which execute sequentially, so some branches of the DAG are unnecessarily blocked!
• Namely Q1, Q3, Q4, Q9, Q10, Q11, Q12, Q13, Q14, Q16, Q17 and Q18
Pig Latin enhancements
• Investigate intra-operation parallelism for better performance of Pig scripts
• Investigate better job definition strategies, in order to increase inter-job parallelism

FUTURE WORK
PIG LATIN++: Q16 EXAMPLE

[Figure: Q16’s Pig script and Q16’s DAG, with jobs J_0001 to J_0005]
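
Inter-job parallelism can be sketched by grouping the jobs of a DAG into waves, where every job in a wave has all its prerequisites finished and can therefore run concurrently with the others. The dependency edges below are hypothetical (the real edges of Q16's DAG are not given in the text); the sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies (job -> jobs it must wait for).
depends_on = {
    "J_0003": {"J_0001", "J_0002"},
    "J_0004": {"J_0003"},
    "J_0005": {"J_0004"},
}


def execution_waves(graph):
    """Group jobs into waves; jobs in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all jobs whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(depends_on)
print(waves)
```

With these assumed edges, a strictly sequential schedule takes five steps, while the wave schedule takes four because the two independent jobs share the first wave.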

THANK YOU FOR YOUR ATTENTION
Q&A

Weitere ähnliche Inhalte

Was ist angesagt?

Generating Executable Mappings from RDF Data Cube Data Structure Definitions
Generating Executable Mappings from RDF Data Cube Data Structure DefinitionsGenerating Executable Mappings from RDF Data Cube Data Structure Definitions
Generating Executable Mappings from RDF Data Cube Data Structure Definitions
Christophe Debruyne
 
towards_analytics_query_engine
towards_analytics_query_enginetowards_analytics_query_engine
towards_analytics_query_engine
Nantia Makrynioti
 

Was ist angesagt? (20)

parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence Generator
 
Generating Executable Mappings from RDF Data Cube Data Structure Definitions
Generating Executable Mappings from RDF Data Cube Data Structure DefinitionsGenerating Executable Mappings from RDF Data Cube Data Structure Definitions
Generating Executable Mappings from RDF Data Cube Data Structure Definitions
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ..."Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ...
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
ArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & RoadmapArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & Roadmap
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data Success
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero
 
IoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsIoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management Things
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
towards_analytics_query_engine
towards_analytics_query_enginetowards_analytics_query_engine
towards_analytics_query_engine
 
STAC, ZARR, COG, K8S and Data Cubes: The brave new world of satellite EO anal...
STAC, ZARR, COG, K8S and Data Cubes: The brave new world of satellite EO anal...STAC, ZARR, COG, K8S and Data Cubes: The brave new world of satellite EO anal...
STAC, ZARR, COG, K8S and Data Cubes: The brave new world of satellite EO anal...
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
 
Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020
 
Multidimensional Scientific Data in ArcGIS
Multidimensional Scientific Data in ArcGISMultidimensional Scientific Data in ArcGIS
Multidimensional Scientific Data in ArcGIS
 
Working with Scientific Data in MATLAB
Working with Scientific Data in MATLABWorking with Scientific Data in MATLAB
Working with Scientific Data in MATLAB
 

Ähnlich wie TPC-H analytics' scenarios and performances on Hadoop data clouds

Essential Prerequisites for Maximizing Success from Big Data
Essential Prerequisites for Maximizing Success from Big DataEssential Prerequisites for Maximizing Success from Big Data
Essential Prerequisites for Maximizing Success from Big Data
Society of Petroleum Engineers
 
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
confluent
 
Connecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud Platform
ConnectaDigital
 
how_graphs_eat_the_world
how_graphs_eat_the_worldhow_graphs_eat_the_world
how_graphs_eat_the_world
Ora Weinstein
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via Hadoop
Skillspeed
 
Hadoop Boosts Profits in Media and Telecom Industry
Hadoop Boosts Profits in Media and Telecom IndustryHadoop Boosts Profits in Media and Telecom Industry
Hadoop Boosts Profits in Media and Telecom Industry
DataWorks Summit
 

Ähnlich wie TPC-H analytics' scenarios and performances on Hadoop data clouds (20)

What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
 
Essential Prerequisites for Maximizing Success from Big Data
Essential Prerequisites for Maximizing Success from Big DataEssential Prerequisites for Maximizing Success from Big Data
Essential Prerequisites for Maximizing Success from Big Data
 
Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Making Big Data Analytics with Hadoop fast & easy (webinar slides)Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Making Big Data Analytics with Hadoop fast & easy (webinar slides)
 
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
 
Apps World Europe: Data Management panel.
Apps World Europe: Data Management panel.Apps World Europe: Data Management panel.
Apps World Europe: Data Management panel.
 
Telvent Big Data Approach and Case Studies
Telvent Big Data Approach and Case StudiesTelvent Big Data Approach and Case Studies
Telvent Big Data Approach and Case Studies
 
Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial Intelligence
 
MLConf Atlanta - Sept 2017
MLConf Atlanta - Sept 2017MLConf Atlanta - Sept 2017
MLConf Atlanta - Sept 2017
 
Towards Quality-Aware Development of Big Data Applications with DICE
Towards Quality-Aware Development of Big Data Applications with DICETowards Quality-Aware Development of Big Data Applications with DICE
Towards Quality-Aware Development of Big Data Applications with DICE
 
Connecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud Platform
 
how_graphs_eat_the_world
how_graphs_eat_the_worldhow_graphs_eat_the_world
how_graphs_eat_the_world
 
Talend 6.1 - What's New in Talend?
Talend 6.1 - What's New in Talend?Talend 6.1 - What's New in Talend?
Talend 6.1 - What's New in Talend?
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via Hadoop
 
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssen
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssenDatenstrategie der Zukunft - Technologietrends, die Sie kennen müssen
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssen
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
 
Amr Ghanem resume
Amr Ghanem resumeAmr Ghanem resume
Amr Ghanem resume
 
Hadoop Boosts Profits in Media and Telecom Industry
Hadoop Boosts Profits in Media and Telecom IndustryHadoop Boosts Profits in Media and Telecom Industry
Hadoop Boosts Profits in Media and Telecom Industry
 
Smart Document Processing-IQ+Alfresco-ver-22aug
Smart Document Processing-IQ+Alfresco-ver-22augSmart Document Processing-IQ+Alfresco-ver-22aug
Smart Document Processing-IQ+Alfresco-ver-22aug
 
Key Considerations While Rolling Out Denodo Platform
Key Considerations While Rolling Out Denodo PlatformKey Considerations While Rolling Out Denodo Platform
Key Considerations While Rolling Out Denodo Platform
 

Mehr von Rim Moussa (6)

polystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfpolystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdf
 
Big Data Projects
Big Data ProjectsBig Data Projects
Big Data Projects
 
EMR AWS Demo
EMR AWS DemoEMR AWS Demo
EMR AWS Demo
 
BICOD-2017
BICOD-2017BICOD-2017
BICOD-2017
 
Automation of MultiDimensional DB Design (poster)
Automation of MultiDimensional DB Design (poster)Automation of MultiDimensional DB Design (poster)
Automation of MultiDimensional DB Design (poster)
 
highly available distributed databases (poster)
highly available distributed databases (poster)highly available distributed databases (poster)
highly available distributed databases (poster)
 

Kürzlich hochgeladen

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 

Kürzlich hochgeladen (20)

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Mehran University Newsletter Vol-X, Issue-I, 2024
TPC-H analytics' scenarios and performances on Hadoop data clouds

  • 1. TPC-H ANALYTICS' SCENARIOS AND PERFORMANCES ON HADOOP DATA CLOUDS. Rim Moussa (rim.moussa@esti.rnu.tn), LaTICE, Univ. of Tunis, Tunisia. 24 April 2012, 4th International Conference on Networked Digital Technologies (NDT'2012), Dubai, UAE.
  • 2. LaTICE Cloud Analytics. OUTLINE. 1. Business Intelligence; 2. Motivation (data management issues, NoSQL, clouds, OLAP in the cloud); 3. Implementation of OLAP in the cloud (TPC-H benchmark, analytics scenarios, performance measurements); 4. Related Work; 5. Conclusion; 6. Future Work.
  • 3. BUSINESS INTELLIGENCE. Business intelligence aims to support better business decision-making. Common functions of business intelligence technologies are On-Line Analytical Processing, data mining, process mining, business performance management, text mining, and predictive analytics, … Market share: Gartner Research reports that BI market revenue hit $12.2 billion in 2011.
  • 4. MOTIVATION. Decision support systems: incessant data and complex workload; complex DB schema. Scalability issues: ideally, linear speed-up and linear scale-up; DBMSs do not scale linearly; OLAP technologies do not scale. Hardware: I/O bottleneck, I/O-bound data storage systems; Gilder's law: bandwidth triples every 3 years; Moore's law: computing and storage capacities double every 18 months (obsolete by 2017); vertical scaling cost >> horizontal scaling cost.
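The ideal behaviour named on this slide, linear speed-up and linear scale-up, can be made concrete with a small calculation. This is a minimal sketch using the usual textbook definitions; the function names and all timings are hypothetical and are not measurements from the talk.

```python
def speedup(time_on_small_cluster: float, time_on_large_cluster: float) -> float:
    """Same data, more nodes: ratio of elapsed times (ideal value = node ratio)."""
    return time_on_small_cluster / time_on_large_cluster


def scaleup(time_small_data_small_cluster: float,
            time_large_data_large_cluster: float) -> float:
    """Data and nodes grow by the same factor: the ideal value is 1.0."""
    return time_small_data_small_cluster / time_large_data_large_cluster


# Hypothetical elapsed times in seconds (illustration only).
t_2_workers = 120.0
t_4_workers = 80.0            # same data, twice as many workers
t_4_workers_2x_data = 150.0   # twice the data, twice as many workers

observed_speedup = speedup(t_2_workers, t_4_workers)          # linear would be 2.0
observed_scaleup = scaleup(t_2_workers, t_4_workers_2x_data)  # linear would be 1.0

print(f"speed-up: {observed_speedup:.2f} (linear would be 2.00)")
print(f"scale-up: {observed_scaleup:.2f} (linear would be 1.00)")
```

With these made-up numbers both ratios fall short of the linear ideal, which is the situation the slide describes for DBMSs and OLAP technologies.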
  • 5. NOSQL. Big challenges related to velocity: how fast can huge volumes of data be processed? NoSQL solutions: adopted by Google, Facebook, Amazon, …; dynamic horizontal scale-up (nodes are added without bringing the cluster down); shared-nothing architecture (independent computing and storage nodes interconnected via a high-speed network); distributed programming framework: MapReduce (Google).
  • 6. CLOUD COMPUTING. Cloud computing is a style of computing where scalable and elastic IT-enabled capabilities are provided "as a service" to external customers using Internet technologies. Broad network access; resource pooling (virtualization); self-provisioning; rapid elasticity; scalable analytics; measured service. Market share: Forrester Research expects the global cloud computing market to reach $241 billion in 2020, with the SaaS market in particular growing to $92.8 billion by 2016. Gartner Group expects the cloud computing market to reach US$150.1 billion in 2013, with a compound annual rate of 26.5%.
  • 7. OLAP IN THE CLOUD. OLAP constraints: big data analytics' obstacles; current systems and technologies do not scale. Cloud computing, big data analytics, NoSQL. Key benefits of cloud computing: performance (much faster data analysis; dynamic and up-to-date hardware infrastructure); more economical (organizations no longer need to expend capital upfront for hardware and software purchases; services are provided on a pay-per-use basis).
  • 8. TPC-H DECISION-SUPPORT SYSTEM BENCHMARK. Data: complex DB schema; scale factors 1, 10, …, 100,000 correspond respectively to 1 GB, 10 GB, …, 100 TB; 8 data files {lineitem, customer, …, region}.tbl; broad industry-wide relevance. Workload: 22 real-world business questions; high degree of complexity; star queries (complex joins); grouping; nested queries.
  • 9. TPC-H BENCHMARK. [Figure: TPC-H E/R schema]
  • 10. HADOOP/PIG LATIN. Apache Hadoop: framework for running applications on large clusters of commodity hardware; implements the MapReduce computational framework; HDFS (Hadoop Distributed File System) stores data on the compute nodes; replication and job resubmission for failure handling. Apache Pig Latin: high-level language for expressing data analysis programs (filter, projection, join, group, sort, union, …).
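To make the MapReduce model behind Hadoop and Pig more tangible, here is a single-process Python sketch of the map, shuffle and reduce steps applied to a group-and-count question (orders per priority). It only imitates the programming model: it is not Hadoop or Pig code, and the helper names and the sample orders are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key and sort the keys, as happens before the reduce phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical sample of an orders relation (illustration only).
orders = [
    {"orderkey": 1, "orderpriority": "1-URGENT"},
    {"orderkey": 2, "orderpriority": "3-MEDIUM"},
    {"orderkey": 3, "orderpriority": "1-URGENT"},
    {"orderkey": 4, "orderpriority": "5-LOW"},
]


def priority_mapper(order):
    yield order["orderpriority"], 1


def count_reducer(key, values):
    return sum(values)


counts = reduce_phase(shuffle(map_phase(orders, priority_mapper)), count_reducer)
print(counts)
```

A Pig Latin script expresses the same group-and-count at a higher level, and Pig compiles it into one or more MapReduce jobs.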
  • 11. PIG LATIN BENCHMARK: 5 TRANSLATION HINTS. 1. Load data for immediate processing: better memory management; conjunction/disjunction of predicates applied once. 2. Minimum relation scan. 3. Unary operations prior to binary operations: unary operations (projection, restriction) reduce data volume.
  • 12. PIG LATIN BENCHMARK: 5 TRANSLATION HINTS (CONTINUED). 4. Intra-operation parallelism: partitioned join. 5. Join: algorithm (hash join, merge join); star queries: join ordering.
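Hints 3 to 5 can be sketched in plain Python: restrict and project before joining, use a hash join, and optionally partition both inputs on the join key so that each partition can be joined independently. This is an illustrative sketch, not the Pig implementation; the function names and the tiny customer and orders relations are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Equi-join: build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    return [{**match, **row}
            for row in probe_rows
            for match in table.get(row[key], [])]


def partitioned_hash_join(build_rows, probe_rows, key, partitions):
    """Split both inputs on hash(key); each partition is joined on its own."""
    joined = []
    for p in range(partitions):
        build_part = [r for r in build_rows if hash(r[key]) % partitions == p]
        probe_part = [r for r in probe_rows if hash(r[key]) % partitions == p]
        joined.extend(hash_join(build_part, probe_part, key))
    return joined


# Hypothetical sample relations (illustration only).
customers = [
    {"custkey": 1, "name": "Customer#1", "mktsegment": "BUILDING"},
    {"custkey": 2, "name": "Customer#2", "mktsegment": "AUTOMOBILE"},
    {"custkey": 3, "name": "Customer#3", "mktsegment": "BUILDING"},
]
orders = [
    {"orderkey": 10, "custkey": 1, "totalprice": 100.0},
    {"orderkey": 11, "custkey": 2, "totalprice": 250.0},
    {"orderkey": 12, "custkey": 3, "totalprice": 75.0},
    {"orderkey": 13, "custkey": 1, "totalprice": 30.0},
]

# Hint 3: unary operations (restriction, then projection) before the binary join.
building = [{"custkey": c["custkey"]}
            for c in customers if c["mktsegment"] == "BUILDING"]

joined = hash_join(building, orders, "custkey")
joined_partitioned = partitioned_hash_join(building, orders, "custkey", partitions=2)

print(sorted(row["orderkey"] for row in joined))
```

Filtering customers first means the hash table holds two one-field rows instead of three full rows; on real TPC-H volumes that reduction is what the hint is after.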
  • 13. NOMINAL ANALYTICAL SCENARIO. [Figure labels: Pig Latin script (business question); response]
  • 14. NOMINAL ANALYTICAL SCENARIO. High cost: measured service, pay as you consume cloud resources (bandwidth, CPU, RAM). Performance issues: the same query (with the same or different parameters) is executed several times with no optimization. Discontinuity of service: network failure/congestion.
  • 15. COMPLEX!! How to reduce service cost? How to improve performance? How to prevent discontinuity of service? Workload study for tuning: materialized views? aggregated data replication; OLAP or not? Cost management: exploit the organization's hardware resources? cloud right size?
  • 16. TPC-H WORKLOAD NUMERICAL STUDY: TYPE A. Example: Order Priority Checking. Order date dimension × order priority dimension × count-of-orders measure; always 135. OLAP! + export to OLAP server + MV (materialized view).
  • 17. TPC-H WORKLOAD NUMERICAL STUDY: TYPE B. Example: Large Volume Orders. Order dimension × sum-of-line-quantity measure; SF × 1,500,000. Almost 3.8 ppm of orders have ∑ line qty > 300, for SF = 1. Not OLAP! + MV.
  • 18. TPC-H WORKLOAD NUMERICAL STUDY: TYPE C. Example: Minimum Cost Supplier. Supplier dimension × part dimension × minimum-supply-cost measure; MV storage cost: SF² × 2,000,000,000. Best supplier in each region for each part! OLAP!
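The sizes quoted in the Type B and Type C examples follow from the TPC-H table cardinalities, which grow linearly with the scale factor (orders: 1,500,000 × SF; supplier: 10,000 × SF; part: 200,000 × SF). The sketch below redoes that arithmetic; the function names are hypothetical, and treating every supplier × part combination as one cell is an assumption about how the slide's figure was obtained.

```python
# TPC-H base cardinalities per unit of scale factor (SF).
ORDERS_PER_SF = 1_500_000
SUPPLIERS_PER_SF = 10_000
PARTS_PER_SF = 200_000


def order_dimension_size(sf: int) -> int:
    """Type B: one cell per order, so the size grows linearly with SF."""
    return ORDERS_PER_SF * sf


def supplier_part_cells(sf: int) -> int:
    """Type C: supplier x part cells, so the size grows with SF squared."""
    return (SUPPLIERS_PER_SF * sf) * (PARTS_PER_SF * sf)


for sf in (1, 10):
    print(f"SF={sf}: orders={order_dimension_size(sf):,} "
          f"supplier x part cells={supplier_part_cells(sf):,}")
```

The quadratic growth is why materializing the full supplier × part cube becomes expensive as SF increases.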
  • 19. TPC-H WORKLOAD NUMERICAL STUDY. Type A: medium dimensionality; result is independent of the TPC-H scale factor; business questions (OLAP cube) Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13, Q14, Q16, Q19, Q22 (13 business questions). Type B: high dimensionality; few results, lots of empty cells; Q15, Q18 (2 business questions). Type C: high dimensionality; result is a percentage of the scale factor; Q2, Q9, Q10, Q11, Q17, Q20, Q21 (7 business questions).
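The classification on this slide can be written down as a small lookup table, which also makes it easy to check that the three types cover all 22 business questions exactly once. The dictionary name and layout are hypothetical; the query numbers are those listed on the slide.

```python
# Query numbers per workload type, as listed on the slide.
WORKLOAD_TYPES = {
    "A": [1, 3, 4, 5, 6, 7, 8, 12, 13, 14, 16, 19, 22],
    "B": [15, 18],
    "C": [2, 9, 10, 11, 17, 20, 21],
}

counts = {kind: len(queries) for kind, queries in WORKLOAD_TYPES.items()}
all_queries = sorted(q for queries in WORKLOAD_TYPES.values() for q in queries)

print(counts)
print("covers Q1..Q22 exactly once:", all_queries == list(range(1, 23)))
```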
  • 20. CLOUD COST MANAGEMENT. Measured service, pay as you go: CPU + memory + bandwidth. “When users understand the relationship between cost and consumption, everybody wins” (Ron Miller). Emerging need to understand, manage and proactively control costs across the cloud. Resource utilization monitoring. Right size w.r.t. both performance and cost (client and provider). Green cloud through energy saving.
  • 21. BETTER SCENARIO. [Figure labels: Pig Latin script (generalized business question); pre-aggregated data; import data into an on-site OLAP server; OLAP client; interaction]
  • 22. PERFORMANCE MEASUREMENTS. TPC-H benchmark: SF=1 (1.1 GB source files, 4.5 GB single big file); SF=10 (11 GB source files, 45 GB single big file). G5K, French grid platform: a large-scale nationwide infrastructure for grid research; Bordeaux site; Borderel: 24 GB RAM, 4 AMD CPUs, 2.27 GHz, 4 cores/CPU; Borderline: 32 GB RAM, 4 Intel Xeon CPUs, 2.6 GHz, 2 cores/CPU; 10 Gbps Ethernet. Pig/HDFS: Apache Hadoop 0.20; N = 3, 5 or 8 nodes (one Hadoop master and 2, 4 or 7 workers); Apache Pig 0.8.1.
  • 23. PERFORMANCE MEASUREMENTS. [Chart panels: Original TPC-H 1GB; Original TPC-H 11GB; Original 1.1 vs 11GB; Big File; Big File 4.5GB; Big File 45GB] Except for business questions which do not perform join operations, there is no improvement when the cluster size doubles.
  • 24. PERFORMANCE MEASUREMENTS. Improvement when the cluster size doubles: there is more data, so it helps to have more storage and computing nodes.
  • 25. PERFORMANCE MEASUREMENTS. Elapsed times for SF=10 (11 GB) are at most 5 times, and on average twice, the elapsed times obtained for SF=1 (1.1 GB).
  • 26. PERFORMANCE MEASUREMENTS. Joining partitioned files is complex! Combine all files into one file: SF=1 gives 4.5 GB; SF=10 gives 45 GB. Evaluation of Pig/MR without joins. Denormalization saves join cost but increases the required storage space (≈ ×4 for TPC-H).
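The trade-off stated here can be illustrated with a toy denormalization and with the file sizes reported in the talk. In this sketch each line item is merged with its parent order so that later queries need no join; the relation contents and variable names are hypothetical, while the sizes (1.1 GB vs 4.5 GB and 11 GB vs 45 GB) come from the slides.

```python
# Hypothetical normalized relations (illustration only).
orders = {
    100: {"orderdate": "1995-03-01", "orderpriority": "1-URGENT"},
}
lineitems = [
    {"orderkey": 100, "linenumber": 1, "quantity": 5},
    {"orderkey": 100, "linenumber": 2, "quantity": 7},
]

# Denormalize: copy the order attributes into every line item ("single big file").
big_file = [{**line, **orders[line["orderkey"]]} for line in lineitems]

fields_before = len(lineitems[0])
fields_after = len(big_file[0])

# Storage growth computed from the sizes reported in the talk.
ratio_sf1 = 4.5 / 1.1
ratio_sf10 = 45 / 11

print(f"fields per record: {fields_before} -> {fields_after}")
print(f"storage growth: SF=1 x{ratio_sf1:.2f}, SF=10 x{ratio_sf10:.2f}")
```

Every order attribute is repeated once per line item, which is where the roughly fourfold storage increase comes from.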
  • 27. PERFORMANCE MEASUREMENTS. Compared to (SF=1, 1.1 GB), improvements range from 10% to 80%. Performance degrades with more than 4 workers (N=5); this is due to the MR framework: before the reduce phase, data is grouped and sorted, which has a cost when more storage and computing nodes are involved.
  • 28. PERFORMANCE MEASUREMENTS. Compared to (SF=10, 11 GB), after affording more nodes we obtain similar results. Compared to (SF=1, 4.5 GB), elapsed times are less than 10× for the same cluster size. Performance degrades for queries which do not perform joins: with (SF=10, 11 GB), Q1 executes over a 7 GB lineitem file; now it executes over a 45 GB file.
  • 29. PERFORMANCE MEASUREMENTS. [Chart panels: Big File; Big File 4.5GB; Big File 45GB; OLAP; OLAP 4.5GB; OLAP 45GB] Aggregated data: TPC-H business questions of type A (SF independent and small result set) and of type B (very small result set); 15 of the 22 business questions. Trade-off between space and computation: TPC-H business questions of type C; add derived fields: Q2: check (true) minimum supplycost by supplier for a part in PartSupp; Q17: average_line_quantity field for each part; Q20: sum_lines_quantities_per_year for each supplier; Q21: number of waiting orders for each supplier; 7 of the 22 business questions.
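As an illustration of the derived-field idea for type C questions, the sketch below precomputes an average_line_quantity per part and stores it on each line item, so that a Q17-style predicate (quantity below 20% of the part's average quantity) becomes a plain filter with no nested query. The sample rows and variable names are hypothetical, and this is not the author's Pig script.

```python
from collections import defaultdict

# Hypothetical line items (illustration only).
lineitems = [
    {"partkey": 1, "quantity": 1, "extendedprice": 10.0},
    {"partkey": 1, "quantity": 9, "extendedprice": 90.0},
    {"partkey": 1, "quantity": 20, "extendedprice": 200.0},
    {"partkey": 2, "quantity": 5, "extendedprice": 50.0},
    {"partkey": 2, "quantity": 5, "extendedprice": 50.0},
]

# Done once: accumulate (sum of quantities, number of lines) per part.
totals = defaultdict(lambda: [0, 0])
for line in lineitems:
    totals[line["partkey"]][0] += line["quantity"]
    totals[line["partkey"]][1] += 1

average_line_quantity = {part: qty / n for part, (qty, n) in totals.items()}

# Store the derived field on every record (space traded for computation).
enriched = [
    {**line, "average_line_quantity": average_line_quantity[line["partkey"]]}
    for line in lineitems
]

# Q17-style predicate evaluated as a simple filter over the enriched records.
small_quantity_lines = [
    line for line in enriched
    if line["quantity"] < 0.2 * line["average_line_quantity"]
]

print(average_line_quantity)
print(small_quantity_lines)
```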
  • 30. PERFORMANCE MEASUREMENTS. Average degradation is 5%; done once, plus the time to retrieve data.
  • 31. PERFORMANCE MEASUREMENTS. Average degradation is 60%; done once, plus the time to retrieve data.
  • 32. RELATED WORK. Implementation of relational operations using the MR framework: Kim et al., MRBench, 2008 (nominal analytics scenario; TPC-H benchmarking for SF=1, 3). Translation from SQL to Pig Latin: Iu et al., Hadoop to SQL, 2010; Lee et al., YSmart, 2011. Other Pig Latin use cases: Schätzle et al., RDF data, 2011; Loebman et al., astrophysical data, 2009.
  • 33. CONCLUSION. TPC-H in-depth numerical study. OLAP in the cloud: scenarios; implementation (Pig / Hadoop Distributed File System). Performance: various cluster sizes; various data volumes; various schemas (with and without joins).
  • 34. FUTURE WORK: AQP. Approximate query processing in clouds. Most distributed file systems implement replication for high availability. MDS erasure codes outperform replication from two perspectives: (i) storage cost and (ii) minimal operation cost of redundant data management. New Hadoop release (Facebook). Generalized framework for approximate data analytics in the cloud, coping with node failures.
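The storage-cost argument for MDS erasure codes can be checked with a short calculation. Keeping r full copies costs r times the data and survives r − 1 lost copies; an (n, k) MDS code stores n fragments of which any k suffice, so it costs n/k times the data and survives n − k losses. The function names and the (6, 4) parameters below are hypothetical choices for illustration and are not taken from the talk.

```python
from fractions import Fraction


def replication(copies: int) -> dict:
    """Storage overhead and fault tolerance of keeping `copies` full replicas."""
    return {"storage_overhead": Fraction(copies),
            "tolerated_failures": copies - 1}


def mds_code(n: int, k: int) -> dict:
    """(n, k) MDS erasure code: n fragments stored, any k of them rebuild the data."""
    return {"storage_overhead": Fraction(n, k),
            "tolerated_failures": n - k}


three_way = replication(3)     # three full copies of each block
erasure_6_4 = mds_code(6, 4)   # hypothetical parameters, same fault tolerance

print("replication x3:", three_way)
print("MDS (6, 4):    ", erasure_6_4)
```

Both schemes survive two failures here, but the erasure code stores 1.5 times the data instead of 3 times.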
  • 35. FUTURE WORK: PIG LATIN++. Most TPC-H business question scripts are composed of jobs which execute sequentially, so some branches of the DAG are unnecessarily blocked, namely for Q1, Q3, Q4, Q9, Q10, Q11, Q12, Q13, Q14, Q16, Q17 and Q18. Pig Latin enhancements: investigate intra-operation parallelism for better performance of Pig scripts; investigate better job definition strategies in order to increase inter-job parallelism.
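The inter-job parallelism mentioned here can be sketched by grouping the jobs of a DAG into waves, where every job in a wave has all of its prerequisites finished and can therefore run concurrently with the others in that wave. The job names and dependency edges below are hypothetical (they are not the DAG of any TPC-H script), and the function is an illustration rather than a description of Pig's scheduler.

```python
def parallel_waves(dependencies):
    """Group jobs into waves; all jobs in a wave have their prerequisites done."""
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(job for job, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for job in ready:
            del remaining[job]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Hypothetical five-job script: two independent branches feeding a final join.
job_dependencies = {
    "load_a": [],
    "load_b": [],
    "filter_a": ["load_a"],
    "group_b": ["load_b"],
    "join_ab": ["filter_a", "group_b"],
}

waves = parallel_waves(job_dependencies)
print(f"sequential steps: {len(job_dependencies)}, parallel waves: {len(waves)}")
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Run strictly one after another, the five jobs take five steps; scheduled by waves, the two independent branches overlap and the script finishes in three.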
  • 36. FUTURE WORK: PIG LATIN++, Q16 EXAMPLE. [Figure: Q16's Pig script and Q16's DAG, with jobs J_0001, J_0002, J_0003, J_0004 and J_0005]
  • 37. TPC-H ANALYTICS' SCENARIOS AND PERFORMANCES ON HADOOP DATA CLOUDS. THANK YOU FOR YOUR ATTENTION. Q&A.