SlideShare ist ein Scribd-Unternehmen logo
1 von 41
QuickTime™ and a
decompressor
are needed to see this picture.
HaLoop: Efficient Iterative Data
Processing On Large Scale Clusters
Yingyi Bu, UC Irvine
Bill Howe, UW
Magda Balazinska, UW
Michael Ernst, UW
http://clue.cs.washington.edu/
Award IIS 0844572
Cluster Exploratory (CluE)
QuickTime™ and a
decompressor
are needed to see this picture.
http://escience.washington.edu/
VLDB 2010, Singapore
Horizon
01/30/15Bill Howe, UW 2QuickTime™ and a
decompressor
are needed to see this picture.
Thesis in one slide
 Observation: MapReduce has proven successful as a
common runtime for non-recursive declarative languages
 HIVE (SQL)
 Pig (RA with nested types)
 Observation: Many people roll their own loops
 Graphs, clustering, mining, recursive queries
 iteration managed by external script
 Thesis: With minimal extensions, we can provide an efficient
common runtime for recursive languages
 Map, Reduce, Fixpoint
01/30/15Bill Howe, UW 3QuickTime™ and a
decompressor
are needed to see this picture.
Related Work: Twister [Ekanayake HPDC 2010]
 Redesigned evaluation engine using pub/sub
 Termination condition evaluated by main()
13. while(!complete){
14. monitor = driver.runMapReduceBCast(cData);
15. monitor.monitorTillCompletion();
16. DoubleVectorData newCData = ((KMeansCombiner) driver
.getCurrentCombiner()).getResults();
17. totalError = getError(cData, newCData);
18. cData = newCData;
19. if (totalError < THRESHOLD) {
20. complete = true;
21. break;
22. }
23. }
O(k)
01/30/15Bill Howe, UW 4QuickTime™ and a
decompressor
are needed to see this picture.
In Detail: PageRank (Twister)
while (!complete) {
// start the pagerank map reduce process
monitor = driver.runMapReduceBCast(new
BytesValue(tmpCompressedDvd.getBytes()));
monitor.monitorTillCompletion();
// get the result of process
newCompressedDvd = ((PageRankCombiner)
driver.getCurrentCombiner()).getResults();
// decompress the compressed pagerank values
newDvd = decompress(newCompressedDvd);
tmpDvd = decompress(tmpCompressedDvd);
totalError = getError(tmpDvd, newDvd);
// get the difference between new and old pagerank values
if (totalError < tolerance) {
complete = true;
}
tmpCompressedDvd = newCompressedDvd;
}
O(N) in the size
of the graph
run MR
term.
cond.
01/30/15Bill Howe, UW 5QuickTime™ and a
decompressor
are needed to see this picture.
Related Work: Spark [Zaharia HotCloud 2010]
 Reduction output collected at driver program

“…does not currently support a grouped reduce
operation as in MapReduce”
val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 10000, 10)) {
val x = Math.random * 2 - 1
val y = Math.random * 2 - 1
if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 10000.0)
all output sent
to driver.
01/30/15Bill Howe, UW 6QuickTime™ and a
decompressor
are needed to see this picture.
Related Work: Pregel [Malewicz PODC 2009]
 Graphs only
 clustering: k-means, canopy, DBScan
 Assumes each vertex has access to outgoing edges
 So an edge representation …
 …requires offline preprocessing
 perhaps using MapReduce
Edge(from, to)
01/30/15Bill Howe, UW 7QuickTime™ and a
decompressor
are needed to see this picture.
Related Work: Piccolo [Power OSDI 2010]
 Partitioned table data model, with user-
defined partitioning
 Programming model:
 message-passing with global synchronization
barriers
 User can give locality hints
 Worth exploring a direct comparison
GroupTables(curr, next, graph)
01/30/15Bill Howe, UW 8QuickTime™ and a
decompressor
are needed to see this picture.
Related Work: BOOM [c.f. Alvaro EuroSys 10]
 Distributed computing based on Overlog
(Datalog + temporal logic + more)
 Recursion supported naturally
 app: API-compliant implementation of MR
 Worth exploring a direct comparison
01/30/15Bill Howe, UW 9QuickTime™ and a
decompressor
are needed to see this picture.
Details
 Architecture
 Programming Model
 Caching (and Indexing)
 Scheduling
01/30/15Bill Howe, UW 10QuickTime™ and a
decompressor
are needed to see this picture.
Example 1: PageRank
url rank
www.a.com 1.0
www.b.com 1.0
www.c.com 1.0
www.d.com 1.0
www.e.com 1.0
url_src url_dest
www.a.com www.b.com
www.a.com www.c.com
www.c.com www.a.com
www.e.com www.c.com
www.d.com www.b.com
www.c.com www.e.com
www.e.com www.c.om
www.a.com www.d.com
Rank Table R0
Linkage Table L
url rank
www.a.com 2.13
www.b.com 3.89
www.c.com 2.60
www.d.com 2.60
www.e.com 2.13
Rank Table R3
Ri L
Ri.rank = Ri.rank/γurlCOUNT(url_dest)
Ri.url = L.url_src
π(url_dest, γurl_destSUM(rank))
Ri+1
01/30/15Bill Howe, UW 11QuickTime™ and a
decompressor
are needed to see this picture.
A MapReduce Implementation
M
M
M
M
M
r
r
Ri
L-split1
L-split0
M
M
r
r
i=i+1 Converged?
Join & compute rank
Aggregate fixpoint evaluation
Client
done
r
r
01/30/15Bill Howe, UW 12QuickTime™ and a
decompressor
are needed to see this picture.
What’s the problem?
1. L is loaded on each iteration
2. L is shuffled on each iteration
3. Fixpoint evaluated as a separate MapReduce job per iteration
m
m
m
Ri
L-split1
L-split0
M
M
r
r
1.
2.
3.
L is loop invariant, but
plus
r
r M
M
r
r
01/30/15Bill Howe, UW 13QuickTime™ and a
decompressor
are needed to see this picture.
Example 2: Transitive Closure
Friend Find all transitive friends of Eric
{Eric, Elisa}
{Eric, Tom
Eric, Harry}
{}
R1
R0 {Eric, Eric}
R2
R3
(semi-naïve evaluation)
01/30/15Bill Howe, UW 14QuickTime™ and a
decompressor
are needed to see this picture.
Example 2 in MapReduce
M
M
M
M
M
r
r
Si
Friend1
Friend0
i=i+1
Anything new?
Join
Dupe-elim
Client
done
r
r
(compute next generation of friends)
(remove the ones
we’ve already seen)
01/30/15Bill Howe, UW 15QuickTime™ and a
decompressor
are needed to see this picture.
What’s the problem?
1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
Friend is loop invariant, but
M
M
M
M
M
r
r
Si
Friend1
Friend0
Join
Dupe-elim
r
r
(compute next generation of friends)
(remove the ones
we’ve already seen)
1.
2.
01/30/15Bill Howe, UW 16QuickTime™ and a
decompressor
are needed to see this picture.
Example 3: k-means
M
M
M
P0
i=i+1
ki - ki+1 < threshold?
Client
done
r
r
P1
P2
= k centroids at iteration iki
ki
ki
ki
ki+1
01/30/15Bill Howe, UW 17QuickTime™ and a
decompressor
are needed to see this picture.
What’s the problem?
M
M
M
P0
i=i+1
ki - ki+1 < threshold?
Client
done
r
r
P1
P2
= k centroids at iteration ikiki
ki
ki
ki+1
1. P is loaded on each iteration
P is loop invariant, but
1.
01/30/15Bill Howe, UW 18QuickTime™ and a
decompressor
are needed to see this picture.
Approach: Inter-iteration caching
Mapper input cache (MI)
Mapper output cache (MO)
Reducer input cache (RI)
Reducer output cache (RO)
M
M
M
r
r
…
Loop body
01/30/15Bill Howe, UW 19QuickTime™ and a
decompressor
are needed to see this picture.
RI: Reducer Input Cache
 Provides:
 Access to loop invariant data without
map/shuffle
 Used By:
 Reducer function
 Assumes:
1. Mapper output for a given table constant
across iterations
2. Static partitioning (implies: no new nodes)
 PageRank
 Avoid shuffling the network at every step
 Transitive Closure
 Avoid shuffling the graph at every step
 K-means
 No help
…
01/30/15Bill Howe, UW 20QuickTime™ and a
decompressor
are needed to see this picture.
Reducer Input Cache Benefit
Transitive Closure
Billion Triples Dataset (120GB)
90 small instances on EC2
Overall run time
01/30/15Bill Howe, UW 21QuickTime™ and a
decompressor
are needed to see this picture.
Reducer Input Cache Benefit
Transitive Closure
Billion Triples Dataset (120GB)
90 small instances on EC2
Join step only
Livejournal, 12GB
01/30/15Bill Howe, UW 22QuickTime™ and a
decompressor
are needed to see this picture.
Reducer Input Cache Benefit
Transitive Closure
Billion Triples Dataset (120GB)
90 small instances on EC2
Reduce and Shuffle of Join Step
Livejournal, 12GB
01/30/15Bill Howe, UW 23QuickTime™ and a
decompressor
are needed to see this picture.
Join & compute rank
M
M
M
M
M
r
r
Ri
L-split1
L-split0
M
M
r
r
Aggregate fixpoint evaluation
r
r
Total
01/30/15Bill Howe, UW 24QuickTime™ and a
decompressor
are needed to see this picture.
RO: Reducer Output Cache
 Provides:
 Distributed access to output of previous
iterations
 Used By:
 Fixpoint evaluation
 Assumes:
1. Partitioning constant across iterations
2. Reducer output key functionally
determines Reducer input key
 PageRank
 Allows distributed fixpoint evaluation
 Obviates extra MapReduce job
 Transitive Closure
 No help
 K-means
 No help
…
01/30/15Bill Howe, UW 25QuickTime™ and a
decompressor
are needed to see this picture.
Reducer Output Cache BenefitFixpointevaluation(s)
Iteration # Iteration #
Livejournal dataset
50 EC2 small instances
Freebase dataset
90 EC2 small instances
01/30/15Bill Howe, UW 26QuickTime™ and a
decompressor
are needed to see this picture.
MI: Mapper Input Cache
 Provides:
 Access to non-local mapper input on later
iterations
 Used:
 During scheduling of map tasks
 Assumes:
1. Mapper input does not change
 PageRank
 Subsumed by use of Reducer Input Cache
 Transitive Closure
 Subsumed by use of Reducer Input Cache
 K-means
 Avoids non-local data reads on iterations > 0
…
01/30/15Bill Howe, UW 27QuickTime™ and a
decompressor
are needed to see this picture.
Mapper Input Cache Benefit
5% non-local data reads;
~5% improvement
01/30/15Bill Howe, UW 28QuickTime™ and a
decompressor
are needed to see this picture.
Conclusions (last slide)
 Relatively simple changes to MapReduce/Hadoop can
support arbitrary recursive programs
 TaskTracker (Cache management)
 Scheduler (Cache awareness)
 Programming model (multi-step loop bodies, cache control)
 Optimizations
 Caching loop invariant data realizes largest gain
 Good to eliminate extra MapReduce step for termination checks
 Mapper input cache benefit inconclusive; need a busier cluster
 Future Work
 Analyze expressiveness of Map Reduce Fixpoint
 Consider a model of Map (Reduce+
) Fixpoint
01/30/15Bill Howe, UW 29QuickTime™ and a
decompressor
are needed to see this picture.
Data-Intensive
Scalable Science
http://clue.cs.washington.edu
http://escience.washington.edu
Award IIS 0844572
Cluster Exploratory (CluE)
01/30/15Bill Howe, UW 30QuickTime™ and a
decompressor
are needed to see this picture.
Motivation in One Slide
 MapReduce can’t express recursion/iteration
 Lots of interesting programs need loops
 graph algorithms
 clustering
 machine learning
 recursive queries (CTEs, datalog, WITH clause)
 Dominant solution: Use a driver program outside
of mapreduce
 Hypothesis: making MapReduce loop-aware
affords optimization
 …and lays a foundation for scalable implementations of
recursive languages
01/30/15Bill Howe, UW 31QuickTime™ and a
decompressor
are needed to see this picture.
Experiments
 Amazon EC2
 20, 50, 90 default small instances
 Datasets
 Billions of Triples (120GB) [1.5B nodes 1.6B edges]
 Freebase (12GB) [7M ndoes 154M edges]
 Livejournal social network (18GB) [4.8M nodes, 67M edges]
 Queries
 Transitive Closure
 PageRank
 k-means
[VLDB 2010]
01/30/15Bill Howe, UW 32QuickTime™ and a
decompressor
are needed to see this picture.
HaLoop Architecture
01/30/15Bill Howe, UW 33QuickTime™ and a
decompressor
are needed to see this picture.
Scheduling Algorithm
Input: Node node
Global variable: HashMap<Node, List<Parition>> last, HashMaph<Node, List<Partition>> current
1: if (iteration ==0) {
2: Partition part = StandardMapReduceSchedule(node);
3: current.add(node, part);
4: }else{
5: if (node.hasFullLoad()) {
6: Node substitution = findNearbyNode(node);
7: last.get(substitution).addAll(last.remove(node));
8: return;
9: }
10: if (last.get(node).size()>0) {
11: Partition part = last.get(node).get(0);
12: schedule(part, node);
13: current.get(node).add(part);
14: list.remove(part);
15: }
16: }
The same as MapReduce
Find a substitution
Iteration-local Schedule
01/30/15Bill Howe, UW 34QuickTime™ and a
decompressor
are needed to see this picture.
Programming Interface
Job job = new Job();
job.AddMap(Map Rank, 1);
job.AddReduce(Reduce Rank, 1);
job.AddMap(Map Aggregate, 2);
job.AddReduce(Reduce Aggregate, 2);
job.AddInvariantTable(#1);
job.SetInput(IterationInput);
job.SetFixedPointThreshold(0.1);
job.SetDistanceMeasure(ResultDistance);
job.SetMaxNumOfIterations(10);
job.SetReducerInputCache(true);
job.SetReducerOutputCache(true);
job.Submit();
define loop body
Turn on caches
Declare an input as invariant
Specify loop body input,
parameterized by iteration #
Termination condition
01/30/15Bill Howe, UW 35QuickTime™ and a
decompressor
are needed to see this picture.
Cache Infrastructure Details
 Programmer control
 Architecture for cache management
 Scheduling for inter-iteration locality
 Indexing the values in the cache
01/30/15Bill Howe, UW 36QuickTime™ and a
decompressor
are needed to see this picture.
Other Extensions and Experiments
 Distributed databases and Pig/Hadoop for Astronomy [IASDS 09]
 Efficient “Friends of Friends” in Dryad [SSDBM 2010]
 SkewReduce: Automated skew handling [SOCC 2010]
 Image Stacking and Mosaicing with Hadoop [Hadoop Summit 2010]
 HaLoop: Efficient iterative processing with Hadoop [VLDB2010]
01/30/15Bill Howe, UW 37QuickTime™ and a
decompressor
are needed to see this picture.
MapReduce Broadly Applicable
 Biology
 [Schatz 08, 09]
 Astronomy
 [IASDS 09, SSDBM 10, SOCC 10, PASP 10]
 Oceanography
 [UltraVis 09]
 Visualization
 [UltraVis 09, EuroVis 10]
01/30/15Bill Howe, UW 38QuickTime™ and a
decompressor
are needed to see this picture.
Key idea
 When the loop output is large…
 transitive closure
 connected components
 PageRank (with a convergence test as the
termination condition)
 …need a distributed fixpoint operator
 typically implemented as yet another
MapReduce job -- on every iteration
01/30/15Bill Howe, UW 39QuickTime™ and a
decompressor
are needed to see this picture.
Background
 Why is MapReduce popular?
 Because it’s fast?
 Because it scales to 1000s of commodity
nodes?
 Because it’s fault tolerant?
 Witness
 MapReduce on GPUs
 MapReduce on MPI
 MapReduce in main memory
 MapReduce on <10 nodes
01/30/15Bill Howe, UW 40QuickTime™ and a
decompressor
are needed to see this picture.
So why is MapReduce popular?
 The programming model
 Two serial functions, parallelism for free
 Easy and expressive
 Compare this with MPI
 70+ operations
 But it can’t express recursion
 graph algorithms
 clustering
 machine learning
 recursive queries (CTEs, datalog, WITH clause)
01/30/15Bill Howe, UW 41QuickTime™ and a
decompressor
are needed to see this picture.
Fixpoint
 A fixpoint of a function f is a value x such that
f(x) = x
 The fixpoint queries FIX can be expressed with
the relational algebra plus a fixpoint operator
 Map - Reduce - Fixpoint
 hypothesis: sufficient model for all recursive queries

Weitere ähnliche Inhalte

Was ist angesagt?

GPU Accelerated Domain Decomposition
GPU Accelerated Domain DecompositionGPU Accelerated Domain Decomposition
GPU Accelerated Domain DecompositionRichard Southern
 
131107 foss4 g_osaka_grass7_presentation
131107 foss4 g_osaka_grass7_presentation131107 foss4 g_osaka_grass7_presentation
131107 foss4 g_osaka_grass7_presentationTakayuki Nuimura
 
An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...
An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...
An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...Rahul Jain
 
Parametric surface visualization in Directx 11 and C++
Parametric surface visualization in Directx 11 and C++Parametric surface visualization in Directx 11 and C++
Parametric surface visualization in Directx 11 and C++Alejandro Cosin Ayerbe
 
2 transformation computer graphics
2 transformation computer graphics2 transformation computer graphics
2 transformation computer graphicscairo university
 
6.3.2 CLIMADA model demo
6.3.2 CLIMADA model demo6.3.2 CLIMADA model demo
6.3.2 CLIMADA model demoNAP Events
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
Fast and Scalable NUMA-based Thread Parallel Breadth-first Search
Fast and Scalable NUMA-based Thread Parallel Breadth-first SearchFast and Scalable NUMA-based Thread Parallel Breadth-first Search
Fast and Scalable NUMA-based Thread Parallel Breadth-first SearchYuichiro Yasui
 
Chap5 - ADSP 21K Manual
Chap5 - ADSP 21K ManualChap5 - ADSP 21K Manual
Chap5 - ADSP 21K ManualSethCopeland
 

Was ist angesagt? (11)

GPU Accelerated Domain Decomposition
GPU Accelerated Domain DecompositionGPU Accelerated Domain Decomposition
GPU Accelerated Domain Decomposition
 
131107 foss4 g_osaka_grass7_presentation
131107 foss4 g_osaka_grass7_presentation131107 foss4 g_osaka_grass7_presentation
131107 foss4 g_osaka_grass7_presentation
 
An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...
An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...
An Efficient Pipelined VLSI Architecture for Lifting-Based 2D-Discrete Wavele...
 
Parametric surface visualization in Directx 11 and C++
Parametric surface visualization in Directx 11 and C++Parametric surface visualization in Directx 11 and C++
Parametric surface visualization in Directx 11 and C++
 
2 transformation computer graphics
2 transformation computer graphics2 transformation computer graphics
2 transformation computer graphics
 
6.3.2 CLIMADA model demo
6.3.2 CLIMADA model demo6.3.2 CLIMADA model demo
6.3.2 CLIMADA model demo
 
matlab_simulink_for_control082p.pdf
matlab_simulink_for_control082p.pdfmatlab_simulink_for_control082p.pdf
matlab_simulink_for_control082p.pdf
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
Fast and Scalable NUMA-based Thread Parallel Breadth-first Search
Fast and Scalable NUMA-based Thread Parallel Breadth-first SearchFast and Scalable NUMA-based Thread Parallel Breadth-first Search
Fast and Scalable NUMA-based Thread Parallel Breadth-first Search
 
Survey onhpcs languages
Survey onhpcs languagesSurvey onhpcs languages
Survey onhpcs languages
 
Chap5 - ADSP 21K Manual
Chap5 - ADSP 21K ManualChap5 - ADSP 21K Manual
Chap5 - ADSP 21K Manual
 

Ähnlich wie HaLoop: Efficient Iterative Processing on Large-Scale Clusters

Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryDataWorks Summit
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesStreaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesC4Media
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingMark Kilgard
 
Portofolio Control Version SN
Portofolio Control Version SNPortofolio Control Version SN
Portofolio Control Version SNSamuel Narcisse
 
Build Your Own 3D Scanner: 3D Scanning with Structured Lighting
Build Your Own 3D Scanner: 3D Scanning with Structured LightingBuild Your Own 3D Scanner: 3D Scanning with Structured Lighting
Build Your Own 3D Scanner: 3D Scanning with Structured LightingDouglas Lanman
 
Reactive programming at scale
Reactive programming at scale Reactive programming at scale
Reactive programming at scale John McClean
 
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABFAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABJournal For Research
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Project report on design & implementation of high speed carry select adder
Project report on design & implementation of high speed carry select adderProject report on design & implementation of high speed carry select adder
Project report on design & implementation of high speed carry select adderssingh7603
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
A Multicore Parallelization of Continuous Skyline Queries on Data Streams
A Multicore Parallelization of Continuous Skyline Queries on Data StreamsA Multicore Parallelization of Continuous Skyline Queries on Data Streams
A Multicore Parallelization of Continuous Skyline Queries on Data StreamsTiziano De Matteis
 
Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)Matthias Trapp
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Christian Peel
 
Comp5704-Final Presentation
Comp5704-Final PresentationComp5704-Final Presentation
Comp5704-Final PresentationAli Davoudian
 

Ähnlich wie HaLoop: Efficient Iterative Processing on Large-Scale Clusters (20)

Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theory
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)
 
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesStreaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+Tables
 
CS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and CullingCS 354 Transformation, Clipping, and Culling
CS 354 Transformation, Clipping, and Culling
 
Portofolio Control Version SN
Portofolio Control Version SNPortofolio Control Version SN
Portofolio Control Version SN
 
Build Your Own 3D Scanner: 3D Scanning with Structured Lighting
Build Your Own 3D Scanner: 3D Scanning with Structured LightingBuild Your Own 3D Scanner: 3D Scanning with Structured Lighting
Build Your Own 3D Scanner: 3D Scanning with Structured Lighting
 
Reactive programming at scale
Reactive programming at scale Reactive programming at scale
Reactive programming at scale
 
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLABFAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Log polar coordinates
Log polar coordinatesLog polar coordinates
Log polar coordinates
 
Project report on design & implementation of high speed carry select adder
Project report on design & implementation of high speed carry select adderProject report on design & implementation of high speed carry select adder
Project report on design & implementation of high speed carry select adder
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Srushti_M.E_PPT.ppt
Srushti_M.E_PPT.pptSrushti_M.E_PPT.ppt
Srushti_M.E_PPT.ppt
 
A Multicore Parallelization of Continuous Skyline Queries on Data Streams
A Multicore Parallelization of Continuous Skyline Queries on Data StreamsA Multicore Parallelization of Continuous Skyline Queries on Data Streams
A Multicore Parallelization of Continuous Skyline Queries on Data Streams
 
Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)Rendering of Complex 3D Treemaps (GRAPP 2013)
Rendering of Complex 3D Treemaps (GRAPP 2013)
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
Comp5704-Final Presentation
Comp5704-Final PresentationComp5704-Final Presentation
Comp5704-Final Presentation
 

Mehr von University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 

Mehr von University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 

Kürzlich hochgeladen

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

HaLoop: Efficient Iterative Processing on Large-Scale Clusters

  • 1. QuickTime™ and a decompressor are needed to see this picture. HaLoop: Efficient Iterative Data Processing On Large Scale Clusters Yingyi Bu, UC Irvine Bill Howe, UW Magda Balazinska, UW Michael Ernst, UW http://clue.cs.washington.edu/ Award IIS 0844572 Cluster Exploratory (CluE) QuickTime™ and a decompressor are needed to see this picture. http://escience.washington.edu/ VLDB 2010, Singapore Horizon
  • 2. 01/30/15Bill Howe, UW 2QuickTime™ and a decompressor are needed to see this picture. Thesis in one slide  Observation: MapReduce has proven successful as a common runtime for non-recursive declarative languages  HIVE (SQL)  Pig (RA with nested types)  Observation: Many people roll their own loops  Graphs, clustering, mining, recursive queries  iteration managed by external script  Thesis: With minimal extensions, we can provide an efficient common runtime for recursive languages  Map, Reduce, Fixpoint
  • 3. 01/30/15Bill Howe, UW 3QuickTime™ and a decompressor are needed to see this picture. Related Work: Twister [Ekanayake HPDC 2010]  Redesigned evaluation engine using pub/sub  Termination condition evaluated by main() 13. while(!complete){ 14. monitor = driver.runMapReduceBCast(cData); 15. monitor.monitorTillCompletion(); 16. DoubleVectorData newCData = ((KMeansCombiner) driver .getCurrentCombiner()).getResults(); 17. totalError = getError(cData, newCData); 18. cData = newCData; 19. if (totalError < THRESHOLD) { 20. complete = true; 21. break; 22. } 23. } O(k)
  • 4. 01/30/15Bill Howe, UW 4QuickTime™ and a decompressor are needed to see this picture. In Detail: PageRank (Twister) while (!complete) { // start the pagerank map reduce process monitor = driver.runMapReduceBCast(new BytesValue(tmpCompressedDvd.getBytes())); monitor.monitorTillCompletion(); // get the result of process newCompressedDvd = ((PageRankCombiner) driver.getCurrentCombiner()).getResults(); // decompress the compressed pagerank values newDvd = decompress(newCompressedDvd); tmpDvd = decompress(tmpCompressedDvd); totalError = getError(tmpDvd, newDvd); // get the difference between new and old pagerank values if (totalError < tolerance) { complete = true; } tmpCompressedDvd = newCompressedDvd; } O(N) in the size of the graph run MR term. cond.
  • 5. 01/30/15Bill Howe, UW 5QuickTime™ and a decompressor are needed to see this picture. Related Work: Spark [Zaharia HotCloud 2010]  Reduction output collected at driver program  “…does not currently support a grouped reduce operation as in MapReduce” val spark = new SparkContext(<Mesos master>) var count = spark.accumulator(0) for (i <- spark.parallelize(1 to 10000, 10)) { val x = Math.random * 2 - 1 val y = Math.random * 2 - 1 if (x*x + y*y < 1) count += 1 } println("Pi is roughly " + 4 * count.value / 10000.0) all output sent to driver.
  • 6. 01/30/15Bill Howe, UW 6QuickTime™ and a decompressor are needed to see this picture. Related Work: Pregel [Malewicz PODC 2009]  Graphs only  clustering: k-means, canopy, DBScan  Assumes each vertex has access to outgoing edges  So an edge representation …  …requires offline preprocessing  perhaps using MapReduce Edge(from, to)
  • 7. 01/30/15Bill Howe, UW 7QuickTime™ and a decompressor are needed to see this picture. Related Work: Piccolo [Power OSDI 2010]  Partitioned table data model, with user- defined partitioning  Programming model:  message-passing with global synchronization barriers  User can give locality hints  Worth exploring a direct comparison GroupTables(curr, next, graph)
  • 8. 01/30/15Bill Howe, UW 8QuickTime™ and a decompressor are needed to see this picture. Related Work: BOOM [c.f. Alvaro EuroSys 10]  Distributed computing based on Overlog (Datalog + temporal logic + more)  Recursion supported naturally  app: API-compliant implementation of MR  Worth exploring a direct comparison
  • 9. 01/30/15Bill Howe, UW 9QuickTime™ and a decompressor are needed to see this picture. Details  Architecture  Programming Model  Caching (and Indexing)  Scheduling
  • 10. 01/30/15Bill Howe, UW 10QuickTime™ and a decompressor are needed to see this picture. Example 1: PageRank url rank www.a.com 1.0 www.b.com 1.0 www.c.com 1.0 www.d.com 1.0 www.e.com 1.0 url_src url_dest www.a.com www.b.com www.a.com www.c.com www.c.com www.a.com www.e.com www.c.com www.d.com www.b.com www.c.com www.e.com www.e.com www.c.om www.a.com www.d.com Rank Table R0 Linkage Table L url rank www.a.com 2.13 www.b.com 3.89 www.c.com 2.60 www.d.com 2.60 www.e.com 2.13 Rank Table R3 Ri L Ri.rank = Ri.rank/γurlCOUNT(url_dest) Ri.url = L.url_src π(url_dest, γurl_destSUM(rank)) Ri+1
  • 11. 01/30/15Bill Howe, UW 11QuickTime™ and a decompressor are needed to see this picture. A MapReduce Implementation M M M M M r r Ri L-split1 L-split0 M M r r i=i+1 Converged? Join & compute rank Aggregate fixpoint evaluation Client done r r
  • 12. 01/30/15Bill Howe, UW 12QuickTime™ and a decompressor are needed to see this picture. What’s the problem? 1. L is loaded on each iteration 2. L is shuffled on each iteration 3. Fixpoint evaluated as a separate MapReduce job per iteration m m m Ri L-split1 L-split0 M M r r 1. 2. 3. L is loop invariant, but plus r r M M r r
  • 13. 01/30/15Bill Howe, UW 13QuickTime™ and a decompressor are needed to see this picture. Example 2: Transitive Closure Friend Find all transitive friends of Eric {Eric, Elisa} {Eric, Tom Eric, Harry} {} R1 R0 {Eric, Eric} R2 R3 (semi-naïve evaluation)
  • 14. 01/30/15Bill Howe, UW 14QuickTime™ and a decompressor are needed to see this picture. Example 2 in MapReduce M M M M M r r Si Friend1 Friend0 i=i+1 Anything new? Join Dupe-elim Client done r r (compute next generation of friends) (remove the ones we’ve already seen)
  • 15. 01/30/15Bill Howe, UW 15QuickTime™ and a decompressor are needed to see this picture. What’s the problem? 1. Friend is loaded on each iteration 2. Friend is shuffled on each iteration Friend is loop invariant, but M M M M M r r Si Friend1 Friend0 Join Dupe-elim r r (compute next generation of friends) (remove the ones we’ve already seen) 1. 2.
  • 16. 01/30/15Bill Howe, UW 16QuickTime™ and a decompressor are needed to see this picture. Example 3: k-means M M M P0 i=i+1 ki - ki+1 < threshold? Client done r r P1 P2 = k centroids at iteration iki ki ki ki ki+1
  • 17. 01/30/15Bill Howe, UW 17QuickTime™ and a decompressor are needed to see this picture. What’s the problem? M M M P0 i=i+1 ki - ki+1 < threshold? Client done r r P1 P2 = k centroids at iteration ikiki ki ki ki+1 1. P is loaded on each iteration P is loop invariant, but 1.
  • 18. 01/30/15Bill Howe, UW 18QuickTime™ and a decompressor are needed to see this picture. Approach: Inter-iteration caching Mapper input cache (MI) Mapper output cache (MO) Reducer input cache (RI) Reducer output cache (RO) M M M r r … Loop body
  • 19. 01/30/15Bill Howe, UW 19QuickTime™ and a decompressor are needed to see this picture. RI: Reducer Input Cache  Provides:  Access to loop invariant data without map/shuffle  Used By:  Reducer function  Assumes: 1. Mapper output for a given table constant across iterations 2. Static partitioning (implies: no new nodes)  PageRank  Avoid shuffling the network at every step  Transitive Closure  Avoid shuffling the graph at every step  K-means  No help …
  • 20. 01/30/15Bill Howe, UW 20QuickTime™ and a decompressor are needed to see this picture. Reducer Input Cache Benefit Transitive Closure Billion Triples Dataset (120GB) 90 small instances on EC2 Overall run time
  • 21. 01/30/15Bill Howe, UW 21QuickTime™ and a decompressor are needed to see this picture. Reducer Input Cache Benefit Transitive Closure Billion Triples Dataset (120GB) 90 small instances on EC2 Join step only Livejournal, 12GB
  • 22. 01/30/15Bill Howe, UW 22QuickTime™ and a decompressor are needed to see this picture. Reducer Input Cache Benefit Transitive Closure Billion Triples Dataset (120GB) 90 small instances on EC2 Reduce and Shuffle of Join Step Livejournal, 12GB
  • 23. 01/30/15Bill Howe, UW 23QuickTime™ and a decompressor are needed to see this picture. Join & compute rank M M M M M r r Ri L-split1 L-split0 M M r r Aggregate fixpoint evaluation r r Total
  • 24. 01/30/15Bill Howe, UW 24QuickTime™ and a decompressor are needed to see this picture. RO: Reducer Output Cache  Provides:  Distributed access to output of previous iterations  Used By:  Fixpoint evaluation  Assumes: 1. Partitioning constant across iterations 2. Reducer output key functionally determines Reducer input key  PageRank  Allows distributed fixpoint evaluation  Obviates extra MapReduce job  Transitive Closure  No help  K-means  No help …
  • 25. 01/30/15Bill Howe, UW 25QuickTime™ and a decompressor are needed to see this picture. Reducer Output Cache BenefitFixpointevaluation(s) Iteration # Iteration # Livejournal dataset 50 EC2 small instances Freebase dataset 90 EC2 small instances
  • 26. 01/30/15Bill Howe, UW 26QuickTime™ and a decompressor are needed to see this picture. MI: Mapper Input Cache  Provides:  Access to non-local mapper input on later iterations  Used:  During scheduling of map tasks  Assumes: 1. Mapper input does not change  PageRank  Subsumed by use of Reducer Input Cache  Transitive Closure  Subsumed by use of Reducer Input Cache  K-means  Avoids non-local data reads on iterations > 0 …
  • 27. 01/30/15Bill Howe, UW 27QuickTime™ and a decompressor are needed to see this picture. Mapper Input Cache Benefit 5% non-local data reads; ~5% improvement
  • 28. 01/30/15Bill Howe, UW 28QuickTime™ and a decompressor are needed to see this picture. Conclusions (last slide)  Relatively simple changes to MapReduce/Hadoop can support arbitrary recursive programs  TaskTracker (Cache management)  Scheduler (Cache awareness)  Programming model (multi-step loop bodies, cache control)  Optimizations  Caching loop invariant data realizes largest gain  Good to eliminate extra MapReduce step for termination checks  Mapper input cache benefit inconclusive; need a busier cluster  Future Work  Analyze expressiveness of Map Reduce Fixpoint  Consider a model of Map (Reduce+ ) Fixpoint
  • 29. 01/30/15Bill Howe, UW 29QuickTime™ and a decompressor are needed to see this picture. Data-Intensive Scalable Science http://clue.cs.washington.edu http://escience.washington.edu Award IIS 0844572 Cluster Exploratory (CluE)
  • 30. 01/30/15Bill Howe, UW 30QuickTime™ and a decompressor are needed to see this picture. Motivation in One Slide  MapReduce can’t express recursion/iteration  Lots of interesting programs need loops  graph algorithms  clustering  machine learning  recursive queries (CTEs, datalog, WITH clause)  Dominant solution: Use a driver program outside of mapreduce  Hypothesis: making MapReduce loop-aware affords optimization  …and lays a foundation for scalable implementations of recursive languages
  • 31. 01/30/15Bill Howe, UW 31QuickTime™ and a decompressor are needed to see this picture. Experiments  Amazon EC2  20, 50, 90 default small instances  Datasets  Billions of Triples (120GB) [1.5B nodes 1.6B edges]  Freebase (12GB) [7M ndoes 154M edges]  Livejournal social network (18GB) [4.8M nodes, 67M edges]  Queries  Transitive Closure  PageRank  k-means [VLDB 2010]
  • 32. 01/30/15Bill Howe, UW 32QuickTime™ and a decompressor are needed to see this picture. HaLoop Architecture
  • 33. 01/30/15Bill Howe, UW 33QuickTime™ and a decompressor are needed to see this picture. Scheduling Algorithm Input: Node node Global variable: HashMap<Node, List<Parition>> last, HashMaph<Node, List<Partition>> current 1: if (iteration ==0) { 2: Partition part = StandardMapReduceSchedule(node); 3: current.add(node, part); 4: }else{ 5: if (node.hasFullLoad()) { 6: Node substitution = findNearbyNode(node); 7: last.get(substitution).addAll(last.remove(node)); 8: return; 9: } 10: if (last.get(node).size()>0) { 11: Partition part = last.get(node).get(0); 12: schedule(part, node); 13: current.get(node).add(part); 14: list.remove(part); 15: } 16: } The same as MapReduce Find a substitution Iteration-local Schedule
  • 34. 01/30/15Bill Howe, UW 34QuickTime™ and a decompressor are needed to see this picture. Programming Interface Job job = new Job(); job.AddMap(Map Rank, 1); job.AddReduce(Reduce Rank, 1); job.AddMap(Map Aggregate, 2); job.AddReduce(Reduce Aggregate, 2); job.AddInvariantTable(#1); job.SetInput(IterationInput); job.SetFixedPointThreshold(0.1); job.SetDistanceMeasure(ResultDistance); job.SetMaxNumOfIterations(10); job.SetReducerInputCache(true); job.SetReducerOutputCache(true); job.Submit(); define loop body Turn on caches Declare an input as invariant Specify loop body input, parameterized by iteration # Termination condition
  • 35. 01/30/15Bill Howe, UW 35QuickTime™ and a decompressor are needed to see this picture. Cache Infrastructure Details  Programmer control  Architecture for cache management  Scheduling for inter-iteration locality  Indexing the values in the cache
  • 36. 01/30/15Bill Howe, UW 36QuickTime™ and a decompressor are needed to see this picture. Other Extensions and Experiments  Distributed databases and Pig/Hadoop for Astronomy [IASDS 09]  Efficient “Friends of Friends” in Dryad [SSDBM 2010]  SkewReduce: Automated skew handling [SOCC 2010]  Image Stacking and Mosaicing with Hadoop [Hadoop Summit 2010]  HaLoop: Efficient iterative processing with Hadoop [VLDB2010]
  • 37. 01/30/15Bill Howe, UW 37QuickTime™ and a decompressor are needed to see this picture. MapReduce Broadly Applicable  Biology  [Schatz 08, 09]  Astronomy  [IASDS 09, SSDBM 10, SOCC 10, PASP 10]  Oceanography  [UltraVis 09]  Visualization  [UltraVis 09, EuroVis 10]
  • 38. 01/30/15Bill Howe, UW 38QuickTime™ and a decompressor are needed to see this picture. Key idea  When the loop output is large…  transitive closure  connected components  PageRank (with a convergence test as the termination condition)  …need a distributed fixpoint operator  typically implemented as yet another MapReduce job -- on every iteration
  • 39. 01/30/15Bill Howe, UW 39QuickTime™ and a decompressor are needed to see this picture. Background  Why is MapReduce popular?  Because it’s fast?  Because it scales to 1000s of commodity nodes?  Because it’s fault tolerant?  Witness  MapReduce on GPUs  MapReduce on MPI  MapReduce in main memory  MapReduce on <10 nodes
  • 40. 01/30/15Bill Howe, UW 40QuickTime™ and a decompressor are needed to see this picture. So why is MapReduce popular?  The programming model  Two serial functions, parallelism for free  Easy and expressive  Compare this with MPI  70+ operations  But it can’t express recursion  graph algorithms  clustering  machine learning  recursive queries (CTEs, datalog, WITH clause)
  • 41. 01/30/15Bill Howe, UW 41QuickTime™ and a decompressor are needed to see this picture. Fixpoint  A fixpoint of a function f is a value x such that f(x) = x  The fixpoint queries FIX can be expressed with the relational algebra plus a fixpoint operator  Map - Reduce - Fixpoint  hypothesis: sufficient model for all recursive queries

Hinweis der Redaktion

  1. Programming Model
  2. Programming Model
  3. User writes two serial functions, and MapReduce create a parallel program with fairly clear semantics and good scalability.