Dynamically Optimizing Queries over 
Large Scale Data Platforms 
[Work done at IBM Almaden Research Center] 
Konstantinos Karanasos♯, Andrey Balmin§, Marcel Kutsch♣, 
Fatma Özcan*, Vuk Ercegovac◊, Chunyang Xia♦, Jesse Jackson♦ 
♯Microsoft *IBM Research §Platfora ♣Apple ◊Google ♦IBM 
Inria Saclay 
November 26, 2014
The Big Data Landscape 
• Big Data Platforms: Hadoop, Hive/Stinger, Jaql, Spark, Stratosphere, Hadapt, Polybase, Drill, Impala, Dryad, HAWQ 
• Languages: HiveQL, DryadLINQ, Pig, Spark, SQL, Jaql, Stratosphere 
• Data: nested, relational, unstructured, semi-structured, structured, data streams 
Need for efficient Big Data management 
Challenging due to size and heterogeneity of data, variety of applications 
Query optimization is crucial
Query Optimization in Large Scale Data Platforms 
• Existing challenges 
• Exponential error propagation in joins 
• Correlations between predicates 
• “New” challenges 
• Prominent use of UDFs 
• Complex data types (arrays, maps, structs) 
• Poor statistics (do we own the data?) 
• Bad plans over Big data may be disastrous 
• Exploit cluster’s resources (parallel execution) 
Traditional static techniques are not sufficient 
We introduce dynamic techniques that are: 
• at least as good as, and 
• up to 2x (4x) faster than 
the best hand-written left-deep Jaql (Hive) plans
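A minimal Python sketch of why correlated predicates and error propagation defeat static estimation. The data is hypothetical: the status 'P' holds exactly for the orders inside the selected date range, so the two predicates are fully correlated. The row layout, the `day` field, and all function names are assumptions made for illustration and are not taken from the talk.

```python
def independence_estimate(rows, pred_a, pred_b):
    """Selectivity of (pred_a AND pred_b), assuming the predicates are independent."""
    n = len(rows)
    sel_a = sum(1 for r in rows if pred_a(r)) / n
    sel_b = sum(1 for r in rows if pred_b(r)) / n
    return sel_a * sel_b


def actual_selectivity(rows, pred_a, pred_b):
    """Selectivity measured by evaluating both predicates on every row."""
    n = len(rows)
    return sum(1 for r in rows if pred_a(r) and pred_b(r)) / n


def compounded_error(per_join_factor, num_joins):
    """Error growth if every join multiplies the estimation error by the same factor."""
    return per_join_factor ** num_joins


# Hypothetical orders: status is 'P' exactly for the first 10 of 100 days.
orders = [{"day": d, "status": "P" if d < 10 else "F"} for d in range(100)]


def in_range(row):
    return row["day"] < 10


def is_pending(row):
    return row["status"] == "P"


estimate = independence_estimate(orders, in_range, is_pending)
actual = actual_selectivity(orders, in_range, is_pending)
print(f"estimated selectivity {estimate:.3f}, actual {actual:.3f}")
print(f"error after 5 joins at 10x each: {compounded_error(10, 5)}x")
```

Under the independence assumption the estimate is the product of the two individual selectivities, which is an order of magnitude too low here; a 5-way join that repeats such an error at every step compounds it exponentially.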
Example: TPCH Q9’ 
SELECT <projection list> FROM ( 
  SELECT <projection list> 
  FROM "PART", "SUPPLIER", "LINEITEM", 
       "PARTSUPP", "ORDERS", "NATION"            -- 5-way join 
  WHERE <join conditions> 
    AND "PART"."p_name" LIKE '%green%' 
    AND "ORDERS"."o_orderdate" BETWEEN '1995-01-01' AND '1995-07-01'   -- correlated predicates 
    AND "ORDERS"."o_orderstatus" = 'P'                                 -- (date range and status) 
    AND udf("PARTSUPP"."ps_partkey") < 0.001     -- external UDFs 
    AND <udf list> 
) "PROFIT" 
GROUP BY "PROFIT"."NATION", "PROFIT"."order_YEAR" 
ORDER BY "PROFIT"."NATION" ASC, "PROFIT"."order_YEAR" DESC;
“SQL” Processing in Large Scale Platforms 
• Relational operators -> MapReduce jobs 
• Two join algorithms: 
• Repartition join (RJ) – 1 MR job (default) 
• Memory join (MJ) – map-only job 
• Optimizations based on rewrite rules and hints 
• RJ -> MJ 
• Chain MJs (multiple joins in one map job) 
• Left-deep plans 
• This is the picture for Jaql (and Hive)
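The two join algorithms can be sketched in plain Python as follows. This is a single-process illustration of the idea only, not Jaql's or Hive's implementation; the function names, the `num_reducers` parameter, and the dictionary-based tuples are assumptions.

```python
from collections import defaultdict


def repartition_join(left, right, key, num_reducers=4):
    """Repartition join (RJ): shuffle both inputs by join key, then join per partition."""
    # "Map" phase: tag each tuple with its side and route it by the hash of its key.
    partitions = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    for side, relation in enumerate((left, right)):
        for row in relation:
            partitions[hash(row[key]) % num_reducers][row[key]][side].append(row)
    # "Reduce" phase: within each partition, pair up tuples that share a key.
    result = []
    for partition in partitions:
        for left_rows, right_rows in partition.values():
            for l_row in left_rows:
                for r_row in right_rows:
                    result.append({**l_row, **r_row})
    return result


def memory_join(big, small, key):
    """Memory join (MJ): load the small input into a hash table and probe it
    while scanning the big input, as a map-only job would."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    return [{**b_row, **s_row} for b_row in big for s_row in table.get(b_row[key], [])]
```

Both functions return the same set of joined tuples; the difference is that RJ moves both inputs through a shuffle, while MJ requires the small input to fit in the memory of every task.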
Limitations 
• No selectivity estimation for predicates/UDFs 
• Conservative application of memory joins 
• No cost-based join enumeration 
• Rely on order of relations in FROM clause 
• Left-deep plans 
• Often close to optimal for centralized settings 
• Not sufficient for distributed query processing
TPCH Q9’: Execution Plans 
[Figure: two join trees over p, ps, l, o, s, n, built from repartition joins (RJ) and memory joins (MJ), with udf(p), udf(o), udf(ps) and udf(o,l) applied. Left: best left-deep hand-written Jaql plan. Right: best relational optimizer plan.]
Dynamic Optimization 
• Key idea: alter execution plan at runtime 
• Studied in the relational setting 
• Both centralized and distributed 
• Basic concern: when to break the pipeline? 
• No emphasis on UDFs and correlated predicates 
• Increasingly being used in large scale platforms 
(e.g., Scope, Shark, Hive) 
Goal: dynamic optimization techniques for large 
scale data platforms (implemented in Jaql)
IBM BigInsights Jaql 
Dataflows for conceptual JSON data 
Key differentiators 
• Functions: reusability + abstraction 
• Physical Transparency: precise control when needed 
• Data model: semi-structured based on JSON 
Stack (top to bottom) 
• Flexible scripting language 
• Scalable map-reduce runtime (Jaql running inside the map and reduce tasks) 
• Fault Tolerant DFS
Jaql Script: Example 
Group user reviews by place 
Pipeline: read -> transform -> group by -> write 
Query 
read(hdfs("reviews")) 
-> transform { pid: $.placeid, rev: sentAn($.review) } 
-> group by p = ($.pid) as r into { pid: p, revs: r.rev } 
-> write(hdfs("group-reviews")) 
Data 
[ 
{ pid: 12, revs: [ 3*, 4*, … ] }, 
{ pid: 19, revs: [ 2*, 1*, … ] } 
]
Jaql to MapReduce 
read(hdfs("reviews")) 
-> transform { pid: $.placeid, rev: sentAn($.review) } 
-> group by p = ($.pid) as r into { pid: p, revs: r.rev } 
-> write(hdfs("group-reviews")) 
Rewrite Engine 
mapReduce( 
  input: { type: hdfs, location: "reviews" }, 
  output: { type: hdfs, location: "group-reviews" }, 
  map: fn($mapIn) ( 
    $mapIn -> transform { pid: $.placeid, rev: sentAn($.review) } 
           -> transform [ $.pid, $.rev ] ), 
  reduce: fn($p, $r) ( [ { pid: $p, revs: $r } ] ) ) 
• Functions as parameters 
• Rewritten script is valid Jaql!
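For readers unfamiliar with Jaql, the same map and reduce functions can be mimicked in Python. This is only an in-memory sketch: `sent_an` is a hypothetical stand-in for the `sentAn` UDF (here it simply counts '*' characters), and `map_reduce` is a toy driver, not a Hadoop API.

```python
from collections import defaultdict


def sent_an(review):
    """Hypothetical stand-in for the sentAn UDF: counts '*' as a star rating."""
    return review.count("*")


def map_fn(record):
    """Emit (key, value) = (place id, sentiment), like the two transforms in the map."""
    return record["placeid"], sent_an(record["review"])


def reduce_fn(pid, revs):
    """Build one output record per place, like the reduce function."""
    return {"pid": pid, "revs": revs}


def map_reduce(records, map_fn, reduce_fn):
    """Toy driver: group the map output by key, then call reduce once per key."""
    groups = defaultdict(list)
    for record in records:
        key, value = map_fn(record)
        groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]


reviews = [
    {"placeid": 12, "review": "great ***"},
    {"placeid": 19, "review": "meh **"},
    {"placeid": 12, "review": "superb ****"},
]
grouped = map_reduce(reviews, map_fn, reduce_fn)
print(grouped)
```

As in the rewritten Jaql script, the map and reduce logic are passed to the driver as ordinary functions.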
Outline 
• Introduction 
• System Architecture 
• Pilot Runs 
• Adaptation of Execution Plans 
• Experiments 
• Conclusion
DynO Architecture 
[Figure: architecture diagram. Components: Jaql compiler, Optimizer (join enumeration), Jaql runtime, MapReduce, and the DFS holding the statistics. Labeled flows, numbered 1-8 in the figure: query, join query blocks, pilot runs, best plan, Jaql plan, execute part of the plan, remaining plan, query result.]
Pilot Runs 
• PilR algorithm: 
• Push-down selections/UDFs 
• Get leaf expressions (scans + local predicates) 
• Transform them to map-only jobs 
• Execute them over random splits of each relation 
• Until k tuples are output 
• Collect statistics during execution 
• Parallel execution of pilot runs (~4.5x speedup) 
• Approx. 3% overhead to the execution 
• Performance speedup of up to 2x (4x) for Jaql (Hive)
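The pilot-run steps above can be sketched for a single leaf expression as follows. This is a simplified, single-process illustration, not the PilR implementation: the function name, the `seed` parameter, stopping at split granularity, and the assumption that the total input cardinality is known are all choices made for this example.

```python
import random


def pilot_run(splits, predicate, column, k, seed=0):
    """Scan randomly chosen splits until at least k tuples pass the local
    predicate (or UDF), collecting statistics along the way."""
    rng = random.Random(seed)
    order = list(range(len(splits)))
    rng.shuffle(order)  # visit the splits in random order

    scanned, output = 0, []
    for idx in order:
        for row in splits[idx]:
            scanned += 1
            if predicate(row):
                output.append(row)
        if len(output) >= k:  # checked after each split in this sketch
            break

    total_rows = sum(len(s) for s in splits)  # assumption: input size is known
    selectivity = len(output) / scanned if scanned else 0.0
    values = [row[column] for row in output]
    return {
        "scanned": scanned,
        "selectivity": selectivity,
        "estimated_tuples": selectivity * total_rows,  # extrapolated #tuples
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "distinct_in_sample": len(set(values)),  # not extrapolated
    }
```

Because the predicate is actually executed on a sample, the estimate needs no prior knowledge of what a UDF does, which is what makes the approach applicable to queries with many UDFs.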
TPCH Q9’: Impact of Pilot Runs 
[Figure: three join trees over p, ps, l, o, s, n. Best left-deep hand-written Jaql plan; best relational optimizer plan; DynO plan, in which all five joins appear as memory joins (MJ).] 
Up to 2x speedup (4x when applied to Hive)
Pilot Runs: Details 
• Collected statistics: 
• #tuples, min/max, #distinct values 
• add more if the optimizer can support them 
• Statistics reusability 
• Optimization for selective (and expensive) predicates 
• Shortcomings: 
• Non-local predicates 
• Non primary/foreign key joins 
• Join correlations 
Runtime adaptation of execution plans
Adaptation of Execution Plans 
• Cost-based optimizer 
• Based on Columbia (top-down) optimizer 
• Focuses on join enumeration 
• Accurate statistics from pilot runs and/or previous executions 
• Bushy plans (intra-query parallelization) 
• Online statistics collection 
• Re-optimization points (natural in MR) 
• Execution strategies: choosing leaf jobs 
• Degree of parallelization, cost/uncertainty of jobs
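To illustrate what cost-based join enumeration with bushy plans buys, here is a small bottom-up dynamic-programming sketch in Python. It is not the Columbia-based (top-down) optimizer used in the talk: the cost model (sum of intermediate result sizes), the independence assumption for join selectivities, the exclusion of cross products, and all names are assumptions chosen to keep the example short.

```python
from itertools import combinations


def best_plan(card, sel, bushy=True):
    """card: {relation: rows}; sel: {frozenset({a, b}): join selectivity}.
    Returns (cost, plan, rows) for joining all relations.
    Assumes the join graph is connected."""
    rels = sorted(card)
    best = {frozenset([r]): (0.0, r, float(card[r])) for r in rels}
    for size in range(2, len(rels) + 1):
        # Left-deep enumeration only allows a single relation on one side.
        max_left = size // 2 if bushy else 1
        for subset in combinations(rels, size):
            s = frozenset(subset)
            for left_size in range(1, max_left + 1):
                for left in combinations(subset, left_size):
                    l, r = frozenset(left), s - frozenset(left)
                    if l not in best or r not in best:
                        continue
                    preds = [sel[frozenset((a, b))] for a in l for b in r
                             if frozenset((a, b)) in sel]
                    if not preds:
                        continue  # skip cross products
                    rows = best[l][2] * best[r][2]
                    for p in preds:
                        rows *= p
                    cost = best[l][0] + best[r][0] + rows
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, (best[l][1], best[r][1]), rows)
    return best[frozenset(rels)]


# Hypothetical chain query A - B - C - D with two selective joins at the ends.
card = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
sel = {frozenset("AB"): 1e-4, frozenset("BC"): 0.1, frozenset("CD"): 1e-4}
bushy_cost, bushy_tree, _ = best_plan(card, sel)
left_deep_cost, _, _ = best_plan(card, sel, bushy=False)
print(bushy_tree, bushy_cost, left_deep_cost)
```

In this made-up example the bushy plan joins A with B and C with D first and is roughly 9x cheaper under the assumed cost model than the best left-deep plan; its two subtrees are also independent, which is the opportunity for intra-query parallelization.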
TPCH Q8’: Impact of Execution Plan Adaptation 
[Figure: two join trees over p, s, l, o, c, n1, n2, r with udf(o,c). Best left-deep hand-written Jaql plan; best relational optimizer plan.] 
[Figure: DynO execution shown as successive plans, in which the intermediate results t1, t2 and t3 replace the subtrees that have already been executed.] 
Speedup up to 2x without any initial statistics 
(despite the added overhead)
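A toy Python sketch of the effect of re-optimization points: once a job has materialized its result, the observed size replaces the guess, and the algorithm for the next join can be chosen again. The row-count memory budget, the conservative default estimate, and both function names are assumptions for illustration; the actual system re-runs the optimizer on the remaining plan rather than applying a single threshold.

```python
def choose_join(build_rows, memory_budget_rows):
    """Pick a memory join (MJ) if the build input fits in memory, else a repartition join (RJ)."""
    return "MJ" if build_rows <= memory_budget_rows else "RJ"


def static_choices(num_joins, default_estimate, memory_budget_rows):
    """Without statistics, every join is planned up front from the same default guess."""
    return [choose_join(default_estimate, memory_budget_rows) for _ in range(num_joins)]


def adaptive_choices(observed_rows, memory_budget_rows):
    """At each re-optimization point the size of the materialized input is known."""
    return [choose_join(rows, memory_budget_rows) for rows in observed_rows]


# Hypothetical sizes of three materialized intermediate results.
observed = [500, 5000, 50]
print(static_choices(3, default_estimate=1_000_000, memory_budget_rows=1000))
print(adaptive_choices(observed, memory_budget_rows=1000))
```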
Experimental Setup 
• 15-node cluster, 10 GbE 
• Each machine: 
• 12 cores, 96 GB RAM (2 GB per MR slot), 12 x 2 TB disks 
• 10 map/8 reduce slots 
• Hadoop 1.1.1 
• ZooKeeper for coordination (in statistics collection) 
• TPCH data, SF = {100, 300, 1000} 
• TPCH queries (with additional UDFs)
Execution times comparison 
• At least as good as the best left-deep hand-written plans 
• Benefits from bushy plans (Q2) 
• Benefits from pilot runs due to many UDFs (Q9’) 
• Benefits from re-optimization due to UDF on join result (Q8’) 
• Biggest benefit is brought by the pilot runs
Benefits of our Approach on Hive 
• Similar performance trends with Jaql 
• Bigger speedup (up to 4x) due to implementation of broadcast joins (Hive 0.12 exploits DistributedCache)
Overhead of Dynamic Optimization 
• Pilot runs overhead: 2.5-6.5% 
• Stats collection overhead: 0.1-2.8% 
• Overall overhead: 7-10%
Conclusion 
• Pilot runs to account for UDFs 
• Dynamic adaptation of execution plans 
• Traditional optimizer for join ordering (bushy plans) 
• Online statistics collection (no need for initial statistics) 
• Execution strategies 
• At least as good plans as the left-deep hand-written ones 
• Up to 2x faster (4x for Hive) 
• Applicability to other systems (e.g., Hive)
Perspectives 
• Broader range of applications (e.g., ML) 
• Other runtimes (e.g., Tez) 
• Adaptive operators 
• Extend optimizer to support grouping, ordering
Thank you!

More Related Content

What's hot

Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Databricks
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyData
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Databricks
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizationsscottcrespo
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Studies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning PerspectivesStudies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning PerspectivesHPCC Systems
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaJosef Niedermeier
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduceHassan A-j
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 

What's hot (20)

Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Apache Giraph
Apache GiraphApache Giraph
Apache Giraph
 
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
Automobile Route Matching with Dynamic Time Warping Using PySpark with Cather...
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Studies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning PerspectivesStudies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning Perspectives
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 

Viewers also liked

Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionYunyao Li
 
Exploring Linked Data content through network analysis
Exploring Linked Data content through network analysisExploring Linked Data content through network analysis
Exploring Linked Data content through network analysisChristophe Guéret
 
Linked Data: What’s the Story?
Linked Data: What’s the Story?Linked Data: What’s the Story?
Linked Data: What’s the Story?WiLS
 
A Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyA Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyTimm Heuss
 
On the Reproducibility of the TAGME entity linking system
On the Reproducibility of the TAGME entity linking systemOn the Reproducibility of the TAGME entity linking system
On the Reproducibility of the TAGME entity linking systemFaegheh Hasibi
 
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...Guy De Pauw
 
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Olivier Grisel
 
Understanding Named-Entity Recognition (NER)
Understanding Named-Entity Recognition (NER) Understanding Named-Entity Recognition (NER)
Understanding Named-Entity Recognition (NER) Stephen Shellman
 
QER : query entity recognition
QER : query entity recognitionQER : query entity recognition
QER : query entity recognitionDhwaj Raj
 
The named entity recognition (ner)2
The named entity recognition (ner)2The named entity recognition (ner)2
The named entity recognition (ner)2Arabic_NLP_ImamU2013
 
Named Entity Recognition - ACL 2011 Presentation
Named Entity Recognition - ACL 2011 PresentationNamed Entity Recognition - ACL 2011 Presentation
Named Entity Recognition - ACL 2011 PresentationRichard Littauer
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataDave Lewis
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataEUCLID project
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked DataEUCLID project
 
Enhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER ModelsEnhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER ModelsJulien PLU
 

Viewers also liked (20)

Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity Detection
 
Exploring Linked Data content through network analysis
Exploring Linked Data content through network analysisExploring Linked Data content through network analysis
Exploring Linked Data content through network analysis
 
Entity Search Engine
Entity Search Engine Entity Search Engine
Entity Search Engine
 
Linked Data: What’s the Story?
Linked Data: What’s the Story?Linked Data: What’s the Story?
Linked Data: What’s the Story?
 
A Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyA Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
 
On the Reproducibility of the TAGME entity linking system
On the Reproducibility of the TAGME entity linking systemOn the Reproducibility of the TAGME entity linking system
On the Reproducibility of the TAGME entity linking system
 
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages suc...
 
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
 
Multlingual Linked Data Patterns
Multlingual Linked Data PatternsMultlingual Linked Data Patterns
Multlingual Linked Data Patterns
 
Understanding Named-Entity Recognition (NER)
Understanding Named-Entity Recognition (NER) Understanding Named-Entity Recognition (NER)
Understanding Named-Entity Recognition (NER)
 
QER : query entity recognition
QER : query entity recognitionQER : query entity recognition
QER : query entity recognition
 
The named entity recognition (ner)2
The named entity recognition (ner)2The named entity recognition (ner)2
The named entity recognition (ner)2
 
Text mining
Text miningText mining
Text mining
 
Named Entity Recognition - ACL 2011 Presentation
Named Entity Recognition - ACL 2011 PresentationNamed Entity Recognition - ACL 2011 Presentation
Named Entity Recognition - ACL 2011 Presentation
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization data
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Discoverers of Surface Analysis
Discoverers of Surface AnalysisDiscoverers of Surface Analysis
Discoverers of Surface Analysis
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
LR Parsing
LR ParsingLR Parsing
LR Parsing
 
Enhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER ModelsEnhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER Models
 

Similar to Dynamically Optimizing Queries over Large Scale Data Platforms

Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at ScaleSascha Dittmann
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...TigerGraph
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoopColin Su
 

Similar to Dynamically Optimizing Queries over Large Scale Data Platforms (20)

Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Disco workshop
Disco workshopDisco workshop
Disco workshop
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 

More from INRIA-OAK

Change Management in the Traditional and Semantic Web
Change Management in the Traditional and Semantic WebChange Management in the Traditional and Semantic Web
Change Management in the Traditional and Semantic WebINRIA-OAK
 
A Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaA Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaINRIA-OAK
 
Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...INRIA-OAK
 
Querying incomplete data
Querying incomplete dataQuerying incomplete data
Querying incomplete dataINRIA-OAK
 
ANGIE in wonderland
ANGIE in wonderlandANGIE in wonderland
ANGIE in wonderlandINRIA-OAK
 
On building more human query answering systems
On building more human query answering systemsOn building more human query answering systems
On building more human query answering systemsINRIA-OAK
 
Web Data Management in RDF Age
Web Data Management in RDF AgeWeb Data Management in RDF Age
Web Data Management in RDF AgeINRIA-OAK
 
Oak meeting 18/09/2014
Oak meeting 18/09/2014Oak meeting 18/09/2014
Oak meeting 18/09/2014INRIA-OAK
 
Rdf saturator
Rdf saturatorRdf saturator
Rdf saturatorINRIA-OAK
 
Rdf generator
Rdf generatorRdf generator
Rdf generatorINRIA-OAK
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationINRIA-OAK
 
rdf query reformulation
rdf query reformulationrdf query reformulation
rdf query reformulationINRIA-OAK
 
postgres loader
postgres loaderpostgres loader
postgres loaderINRIA-OAK
 
Conjunctive queries
Conjunctive queriesConjunctive queries
Conjunctive queriesINRIA-OAK
 

Dynamically Optimizing Queries over Large Scale Data Platforms

  • 1. Dynamically Optimizing Queries over Large Scale Data Platforms
    [Work done at IBM Almaden Research Center]
    Konstantinos Karanasos♯, Andrey Balmin§, Marcel Kutsch♣, Fatma Özcan*, Vuk Ercegovac◊, Chunyang Xia♦, Jesse Jackson♦
    ♯Microsoft *IBM Research §Platfora ♣Apple ◊Google ♦IBM
    Inria Saclay, November 26, 2014
  • 2. The Big Data Landscape
    [Figure: the Big Data landscape. Platforms: Hadoop, Hive/Stinger, Jaql, Spark, Stratosphere, Hadapt, Polybase, Drill, Impala, Dryad, HAWQ. Languages: HiveQL, DryadLINQ, Pig, Spark SQL, Jaql, Stratosphere. Data: nested, relational, unstructured, semi-structured, structured, data streams]
    Need for efficient Big Data management
    Challenging due to size and heterogeneity of data, variety of applications
    Query optimization is crucial
  • 3. Query Optimization in Large Scale Data Platforms
    • Existing challenges
      • Exponential error propagation in joins
      • Correlations between predicates
    • “New” challenges
      • Prominent use of UDFs
      • Complex data types (arrays, maps, structs)
      • Poor statistics (do we own the data?)
      • Bad plans over Big data may be disastrous
      • Exploit cluster’s resources (parallel execution)
    Traditional static techniques are not sufficient
    We introduce dynamic techniques that are at least as good as, and up to 2x (4x) better than, the best hand-written left-deep Jaql (Hive) plans
  • 4. Example: TPCH Q9’
    SELECT <projection list> FROM (
      SELECT <projection list>
      FROM "PART", "SUPPLIER", "LINEITEM", "PARTSUPP", "ORDERS", "NATION"
      WHERE <join conditions>
        AND "PART"."p_name" LIKE '%green%'
        AND "ORDERS"."o_orderdate" BETWEEN '1995-01-01' AND '1995-07-01'
        AND "ORDERS"."o_orderstatus"='P'
        AND udf("PARTSUPP"."ps_partkey") < 0.001
        AND <udf list>
    ) "PROFIT"
    GROUP BY "PROFIT"."NATION", "PROFIT"."order_YEAR"
    ORDER BY "PROFIT"."NATION" ASC, "PROFIT"."order_YEAR" DESC;
    Callouts: 5-way join (the FROM clause); correlated predicates (the two "ORDERS" predicates); external UDFs (udf(...) and <udf list>)
  • 5. “SQL” Processing in Large Scale Platforms
    • Relational operators -> MapReduce jobs
    • Two join algorithms:
      • Repartition join (RJ) – 1 MR job (default)
      • Memory join (MJ) – map-only job
    • Optimizations based on rewrite rules and hints
      • RJ -> MJ
      • Chain MJs (multiple joins in one map job)
    • Left-deep plans
    • This is the picture for Jaql (and Hive)
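As a rough illustration of the two join algorithms named on slide 5, the following Python sketch simulates them in a single process. It is not Jaql or Hive code: the function names, the `key` parameter, and the partition count are assumptions made for the example. A repartition join hashes both inputs on the join key so that matching tuples meet in the same partition (the shuffle of one MapReduce job), whereas a memory join loads the smaller input into a hash table and streams the larger one past it, which needs no shuffle (a map-only job).

```python
from collections import defaultdict


def repartition_join(left, right, key, num_partitions=4):
    """Simulated repartition join: shuffle both inputs by key, then join per partition."""
    # "Map + shuffle": route every row of both inputs to a partition by key hash.
    partitions = [([], []) for _ in range(num_partitions)]
    for row in left:
        partitions[hash(row[key]) % num_partitions][0].append(row)
    for row in right:
        partitions[hash(row[key]) % num_partitions][1].append(row)

    # "Reduce": join the two sides inside each partition independently.
    out = []
    for left_part, right_part in partitions:
        index = defaultdict(list)
        for row in left_part:
            index[row[key]].append(row)
        for row in right_part:
            for match in index.get(row[key], []):
                out.append({**match, **row})
    return out


def memory_join(small, large, key):
    """Simulated memory join: hash the small input once, stream the large one (map-only)."""
    index = defaultdict(list)
    for row in small:
        index[row[key]].append(row)
    out = []
    for row in large:
        for match in index.get(row[key], []):
            out.append({**match, **row})
    return out
```

Both functions return the same set of joined rows; they differ only in how data moves, which is why rewriting RJ -> MJ pays off when one input is small enough to fit in memory.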
  • 6. Limitations
    • No selectivity estimation for predicates/UDFs
    • Conservative application of memory joins
    • No cost-based join enumeration
      • Rely on order of relations in FROM clause
    • Left-deep plans
      • Often close to optimal for centralized settings
      • Not sufficient for distributed query processing
  • 7. TPCH Q9’: Execution Plans
    [Figure: two join trees over p, ps, l, o, s and n with UDF predicates udf(p), udf(o), udf(ps) and udf(o,l): the best left-deep hand-written Jaql plan and the best relational optimizer plan, built from repartition joins (RJ) and memory joins (MJ)]
  • 8. Dynamic Optimization
    • Key idea: alter execution plan at runtime
    • Studied in the relational setting
      • Both centralized and distributed
      • Basic concern: when to break the pipeline?
      • No emphasis on UDFs and correlated predicates
    • Increasingly being used in large scale platforms (e.g., Scope, Shark, Hive)
    Goal: dynamic optimization techniques for large scale data platforms (implemented in Jaql)
  • 9. IBM BigInsights Jaql
    Dataflows for conceptual JSON data
    Key differentiators
    • Functions: reusability + abstraction
    • Physical Transparency: precise control when needed
    • Data model: semi-structured based on JSON
    [Figure: stack of a flexible scripting language, a scalable map-reduce runtime (Jaql Map and Jaql Reduce tasks) and a fault-tolerant DFS]
  • 10. Jaql Script: Example (group user reviews by place)
    Query (read -> transform -> group by -> write):
      read(hdfs("reviews"))
      -> transform { pid: $.placeid, rev: sentAn($.review) }
      -> group by p = ($.pid) as r into { pid: p, revs: r.rev }
      -> write(hdfs("group-reviews"))
    Data:
      [ { pid: 12, revs: [ 3*, 4*, … ] },
        { pid: 19, revs: [ 2*, 1*, … ] } ]
  • 11. Jaql to MapReduce
    Original script:
      read(hdfs("reviews"))
      -> transform { pid: $.placeid, rev: sentAn($.review) }
      -> group by p = ($.pid) as r into { pid: p, revs: r.rev }
      -> write(hdfs("group-reviews"))
    Output of the Rewrite Engine:
      mapReduce(
        input: { type: hdfs, location: "reviews" },
        output: { type: hdfs, location: "group-reviews" },
        map: fn($mapIn) (
          $mapIn
          -> transform { pid: $.placeid, rev: sentAn($.review) }
          -> transform [ $.placeid, $.rev ]
        ),
        reduce: fn($p, $r) ( [ pid: $p, revs: $r ] )
      )
    • Functions as parameters
    • Rewritten script is valid Jaql!
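To make the rewrite on slide 11 concrete, here is a small Python analogue of the same map and reduce functions, run in-process rather than on Hadoop. It is only a sketch: `sent_an` is a hypothetical stand-in for the `sentAn` UDF (the slides do not define it), and `run_map_reduce` is an illustrative helper, not a Jaql or Hadoop API.

```python
from collections import defaultdict


def sent_an(review):
    """Hypothetical stand-in for the sentAn UDF: treats the number of '*' as the score."""
    return review.count("*")


def map_fn(record):
    # Emit one (key, value) pair per input record: place id and review score.
    yield record["placeid"], sent_an(record["review"])


def reduce_fn(pid, revs):
    # Collect all scores of one place into a single output record.
    return {"pid": pid, "revs": list(revs)}


def run_map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group values by key, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]
```

As on the slide, the map and reduce steps are ordinary functions passed as parameters to the driver.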
  • 12. Outline
    • Introduction
    • System Architecture
    • Pilot Runs
    • Adaptation of Execution Plans
    • Experiments
    • Conclusion
  • 13. DynO Architecture
    [Figure: DynO architecture with numbered steps 1-8. Components: Jaql compiler, optimizer (join enumeration), Jaql runtime, MapReduce, DFS and statistics. Edge labels: query, join query blocks, pilot runs, best plan, Jaql plan, execute part of the plan, remaining plan, query result]
  • 14. Pilot Runs
    • PilR algorithm:
      • Push-down selections/UDFs
      • Get leaf expressions (scans + local predicates)
      • Transform them to map-only jobs
      • Execute them over random splits of each relation
        • Until k tuples are output
      • Collect statistics during execution
    • Parallel execution of pilot runs (~4.5x speedup)
    • Approx. 3% overhead to the execution
    • Performance speedup of up to 2x (4x) for Jaql (Hive)
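The PilR steps on slide 14 might be sketched as follows in Python. This is a single-process approximation under stated assumptions, not the DynO implementation: `pilot_run`, its parameters, and the extrapolation of the output cardinality from the sampled selectivity are illustrative choices. The slides only state that leaf expressions are executed over random splits until k tuples are output, with statistics collected during execution.

```python
import random


def _rows_in_random_split_order(splits, rng):
    """Yield rows split by split, visiting the splits in random order."""
    order = list(range(len(splits)))
    rng.shuffle(order)
    for i in order:
        yield from splits[i]


def pilot_run(splits, predicate, column, k, seed=0):
    """Run one leaf expression (scan + local predicate) until k tuples are output.

    Assumption: the total row count of the relation is known (here: computed
    from the splits) so that the sampled selectivity can be scaled up.
    """
    rng = random.Random(seed)
    scanned = 0
    passed = []  # values of `column` for tuples that satisfied the predicate
    for row in _rows_in_random_split_order(splits, rng):
        scanned += 1
        if predicate(row):
            passed.append(row[column])
            if len(passed) >= k:
                break  # k output tuples reached: stop the pilot run

    total_rows = sum(len(split) for split in splits)
    selectivity = len(passed) / scanned if scanned else 0.0
    return {
        "scanned": scanned,
        "selectivity": selectivity,
        "est_tuples": round(selectivity * total_rows),
        "min": min(passed) if passed else None,
        "max": max(passed) if passed else None,
        "distinct": len(set(passed)),
    }
```

The returned dictionary mirrors the statistics listed on slide 16 (#tuples, min/max, #distinct values); because the predicate (or UDF) is actually executed on real data, no a priori selectivity model is needed.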
  • 15. TPCH Q9’: Impact of Pilot Runs
    [Figure: three join trees for Q9’: the best left-deep hand-written Jaql plan, the best relational optimizer plan, and the DynO plan, whose joins are all memory joins (MJ)]
    Up to 2x speedup (4x when applied to Hive)
  • 16. Pilot Runs: Details
    • Collected statistics:
      • #tuples, min/max, #distinct values
      • add more if the optimizer can support them
    • Statistics reusability
    • Optimization for selective (and expensive) predicates
    • Shortcomings:
      • Non-local predicates
      • Non primary/foreign key joins
      • Join correlations
    Runtime adaptation of execution plans
  • 17. Adaptation of Execution Plans
    • Cost-based optimizer
      • Based on Columbia (top-down) optimizer
      • Focuses on join enumeration
      • Accurate statistics from pilot runs and/or previous executions
      • Bushy plans (intra-query parallelization)
    • Online statistics collection
      • Re-optimization points (natural in MR)
    • Execution strategies: choosing leaf jobs
      • Degree of parallelization, cost/uncertainty of jobs
  • 18. TPCH Q8’: Impact of Execution Plan Adaptation
    [Figure: two join trees for Q8’ over r, n1, n2, c, o, l, p and s with predicate udf(o,c): the best left-deep hand-written Jaql plan and the best relational optimizer plan, each mixing repartition joins (RJ) and memory joins (MJ)]
  • 19. TPCH Q8’: Impact of Execution Plan Adaptation
    [Figure: plans for Q8’ across successive re-optimization steps, with intermediate results labeled t1, t2 and t3]
    Speedup up to 2x without any initial statistics (despite the added overhead)
  • 20. Outline
    • Introduction
    • System Architecture
    • Pilot Runs
    • Adaptation of Execution Plans
    • Experiments
    • Conclusion
  • 21. Experimental Setup
    • 15-node cluster, 10 GbE
    • Each machine:
      • 12 cores, 96 GB RAM (2 GB to each MR slot), 12 × 2 TB disks
      • 10 map/8 reduce slots
    • Hadoop 1.1.1
    • ZooKeeper for coordination (in statistics collection)
    • TPCH data, SF = {100, 300, 1000}
    • TPCH queries (with additional UDFs)
  • 22. Execution times comparison
    • At least as good as the best left-deep hand-written plans
    • Benefits from bushy plans (Q2)
    • Benefits from pilot runs due to many UDFs (Q9’)
    • Benefits from re-optimization due to UDF on join result (Q8’)
    • Biggest benefit is brought by the pilot runs
  • 23. Benefits of our Approach on Hive
    • Similar performance trends with Jaql
    • Bigger speedup (up to 4x) due to implementation of broadcast joins (Hive 0.12 exploits DistributedCache)
  • 24. Overhead of Dynamic Optimization
    • Pilot runs overhead 2.5-6.5%
    • Stats collection overhead 0.1-2.8%
    • Overall overhead 7-10%
  • 25. Conclusion
    • Pilot runs to account for UDFs
    • Dynamic adaptation of execution plans
      • Traditional optimizer for join ordering (bushy plans)
      • Online statistics collection (no need for initial statistics)
      • Execution strategies
    • At least as good plans as the left-deep hand-written ones
      • Up to 2x faster (4x for Hive)
    • Applicability to other systems (e.g., Hive)
  • 26. Perspectives
    • Broader range of applications (e.g., ML)
    • Other runtimes (e.g., Tez)
    • Adaptive operators
    • Extend optimizer to support grouping, ordering