1. Mapreduce Algorithms
O'Reilly Strata Conference,
London UK, October 1st 2012
Amund Tveit
amund@atbrox.com - twitter.com/atveit
http://atbrox.com/about/ - twitter.com/atbrox
2. Background
● Been blogging about Mapreduce Algorithms in Academic Papers since Oct 2009 (1st Hadoop World)
1. http://atbrox.com/2009/10/01/mapreduce-and-hadoop-academic-
papers/
2. http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-
academic-papers-updated/
3. http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-
academic-papers-may-2010-update/
4. http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-
academic-papers-4th-update-may-2011/
5. http://atbrox.com/2011/11/09/mapreduce-hadoop-algorithms-in-
academic-papers-5th-update-%E2%80%93-nov-2011/
● Atbrox works on IR-related Hadoop and cloud projects
● My prior experience: Google (software infrastructure and
mobile news), PhD in Computer Science
3. TOC
1. Brief introduction to Mapreduce Algorithms
2. Overview of a few Recent Mapreduce Algorithms in Papers
3. In-Depth look at a Mapreduce Algorithm
4. Recommendations for Designing Mapreduce Algorithms
5. Appendix - 6th (partial) list of Mapreduce and Hadoop Algorithms in Academic Papers
5. 1.1 So What is Mapreduce?
Mapreduce is a concept, method, and software for typically batch-based, large-scale parallelization. It is inspired by functional programming's map() and reduce() functions.
Nice features of mapreduce systems include:
● reliable processing of jobs even when machines die (vs. MPI, BSP)
● massive parallelization, e.g. thousands of machines for terasort and petasort
Mapreduce was invented by Google Fellows Jeff Dean and Sanjay Ghemawat.
6. 1.2 Mapper function
Processes one key and value pair at a time, e.g.
● word count
○ map(key: uri, value: text):
■ for word in tokenize(value):
■ emit(word, 1) # found 1 occurrence of word
● inverted index
○ map(key: uri, value: text):
■ for word in tokenize(value):
■ emit(word, key) # word and uri pair
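The pseudocode above translates almost directly to plain Python. Below is a minimal sketch, assuming a naive tokenize() and the convention that mappers emit by yielding pairs (tokenize and the generator convention are illustrative assumptions, not any specific framework's API):

import re

def tokenize(text):
    # naive word tokenizer - an illustrative assumption, not from the slides
    return re.findall(r"\w+", text.lower())

def wordcount_map(key, value):
    # key: uri, value: document text; emits (word, 1) per occurrence
    for word in tokenize(value):
        yield (word, 1)

def inverted_index_map(key, value):
    # key: uri, value: document text; emits (word, uri) pairs
    for word in tokenize(value):
        yield (word, key)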
7. 1.3 Reducer function
A reducer processes one key and all the values that belong to it (as received and aggregated from the map function), e.g.
● word count
○ reduce(key: word type, value: list of 1s):
■ emit(key, sum(value))
● inverted index
○ reduce(key: word type, value: list of URIs):
■ # perhaps a transformation of value, e.g. encoding
■ emit(key, value) # e.g. to a distributed hash table
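A matching reducer sketch, plus a simulate_shuffle() helper mimicking the framework's sort-and-group step between map and reduce (the helper is an assumption for local testing, not part of any mapreduce API):

from itertools import groupby
from operator import itemgetter

def wordcount_reduce(key, values):
    # key: word, values: the list of 1s emitted for that word
    yield (key, sum(values))

def inverted_index_reduce(key, values):
    # key: word, values: URIs; deduplicate before emitting,
    # e.g. to a distributed hash table
    yield (key, sorted(set(values)))

def simulate_shuffle(mapped_pairs):
    # sort by key, then group all values per key - what the
    # framework does between map() and reduce()
    for key, group in groupby(sorted(mapped_pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]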
11. 1.6 Pattern 3 - Data Increase
● Decompression
● Annotation, e.g. traditional indexing pipeline
12. 2. Examples of recently published use and development of Mapreduce Algorithms
13. 2.1 Machine Learning - ILP
● Problem: Automatically find (induce) rules from examples and a knowledge base
● Paper:
○ Data and Task Parallelism in ILP using Mapreduce (IBM Research India et al.)
This follows Pattern 1 - Data Reduction: the output is a set of rules induced from a (typically larger) set of examples and a knowledge base.
15. 2.2 Finance - Trading
Problem: Optimize algorithmic trading
Paper:
○ Optimizing Parameters of Algorithm Trading Strategies using Mapreduce (EMC-Greenplum Research China et al.)
This follows Pattern 1 - Data Reduction: the output is the set of best parameter sets for algorithmic trading. Note that during the map phase there is an increase in data, i.e. the creation of permutations of possible parameters (see the sketch below).
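As a hedged sketch of that map-phase data increase: one strategy record fans out into the cross product of its parameter ranges. The parameter names and ranges below are hypothetical placeholders, not taken from the paper:

from itertools import product

def trading_map(key, value):
    # key: strategy id, value: strategy spec; each input record
    # fans out to len(windows) * len(thresholds) output records
    windows = [5, 10, 20, 50]        # hypothetical moving-average windows
    thresholds = [0.01, 0.02, 0.05]  # hypothetical entry thresholds
    for window, threshold in product(windows, thresholds):
        yield ((key, window, threshold), value)

A reducer would then keep only the best-performing parameter sets, giving the overall data reduction.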
16. 2.3 Software Engineering
Problem: Automatically generate unit-test code to increase test coverage and offload developers
Paper:
○ A Parallel Genetic Algorithm Based on Hadoop Mapreduce for the Automatic Generation of JUnit Test Suites (University of Salerno, Italy)
This probably follows Patterns 1, 2 and 3: presumably a fixed number of chromosomes (i.e. transformation), a collection of unit tests being evolved, and the combined length of the evolved unit tests may increase or decrease compared to the original input.
17. 2.3 Software Engineering - II
Figure from "EvoTest: Test Case Generation using
Genetic Programming and Software Analysis"
19. 3.1 The Challenge
● Task:
○ Build a low-latency key-value store for disk or SSD
● Features:
○ Low startup time
■ i.e. no/little pre-loading of (large) caches into memory
○ Prefix search
■ i.e. support searching both for all prefixes of a key and for the entire key
○ Low latency
■ i.e. reduce the number of disk/SSD seeks, e.g. by increasing the probability of disk-cache hits
○ Static/immutable data - write once, read many
20. 3.2 A few Possible Ways
1. Binary search or interpolation search within a file of sorted keys, then look up the value (see the sketch below)
~ O(lg N) or O(lg lg N)
2. Prefix algorithms mapped to a file, e.g.
1. Trie
2. Ternary search tree
3. Patricia tree
~ O(k), where k is the key length
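A minimal sketch of option 1, assuming a binary file of sorted, fixed-width, space-padded keys (the record layout is an illustrative assumption):

import os

KEY_WIDTH = 16  # fixed record width in bytes - an assumption

def file_binary_search(f, key):
    # f: binary file of sorted, fixed-width keys; ~lg(N) seeks
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // KEY_WIDTH
    target = key.ljust(KEY_WIDTH).encode()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * KEY_WIDTH)
        current = f.read(KEY_WIDTH)
        if current < target:
            lo = mid + 1
        elif current > target:
            hi = mid
        else:
            return mid  # record index; the value is looked up separately
    return None  # key not present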
21. 3.3 Overall Approach
1. Scale - divide the key/value data into shards
2. Build a patricia tree per shard and store all key/value pairs for later
3. Prepare the trees to have a placeholder (short) value for each key
4. Flatten each patricia tree to a disk-friendly, byte-aligned format fit for random access
5. Recalculate the file addresses in each patricia tree to be able to store the actual values
6. Create the final patricia tree with values on disk
22. 3.4 Split data with mapper
1. Scale - divide the key/value data into shards
map(key, value):
    # e.g. simple: hash(first char), or use a classifier for
    # personalization etc. (see the shard_function sketch below)
    shard_key = shard_function(key, value)
    out_value = (key, value)
    emit(shard_key, out_value)
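A minimal shard_function sketch, assuming a fixed shard count and a stable hash (md5 here) so shard assignment is reproducible across machines and runs; NUM_SHARDS is an assumed job parameter:

import hashlib

NUM_SHARDS = 64  # assumed to be fixed at job-setup time

def shard_function(key, value):
    # stable hash of the key; value is available for smarter schemes,
    # e.g. a classifier for personalization as the slide suggests
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS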
23. 3.5 Init and run reduce()
2. Build one patricia tree per (reduce) shard
reduce_init(): # called once per reducer before it starts
    self.patricia = Patricia()
    self.tempkeyvaluestore = TempKeyValueStore()

reduce(shard_key, key_value_pairs):
    for (key, value) in key_value_pairs:
        self.tempkeyvaluestore[key] = value
24. 3.6 Reducer cont.
3. Prepare the trees to have placeholder values (=key) for each key
reduce_final(): # called once per reducer after all reduce() calls
    for key, value in self.tempkeyvaluestore.items():
        self.patricia.add(key, key) # key == value for now
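The slides use Patricia() without giving its internals; below is a minimal radix-tree sketch with the add() used above (the node representation is an assumption - prefix search then amounts to walking matching edge labels):

class Patricia:
    # minimal radix tree: each node maps an edge label to a child,
    # and shared prefixes are split on insert
    def __init__(self):
        self.children = {}  # edge label -> Patricia node
        self.value = None

    def add(self, key, value):
        if key == "":
            self.value = value
            return
        for label, child in list(self.children.items()):
            prefix = self._common_prefix(label, key)
            if not prefix:
                continue
            if prefix == label:
                # whole edge matched: recurse with the remaining suffix
                child.add(key[len(prefix):], value)
            else:
                # partial match: split the edge at the shared prefix
                split = Patricia()
                split.children[label[len(prefix):]] = child
                del self.children[label]
                self.children[prefix] = split
                split.add(key[len(prefix):], value)
            return
        # no edge shares a prefix with key: add a new leaf
        leaf = Patricia()
        leaf.value = value
        self.children[key] = leaf

    @staticmethod
    def _common_prefix(a, b):
        i = 0
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        return a[:i]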
25. 3.7 Flatten patricia tree for disk
4. Flatten each patricia tree to a disk-friendly, byte-aligned format fit for random access
reduce_final(): # continued from 3.
    # the number of 0s below constrains the addressable size of the shard file
    self.firstblockaddress = "00000000000000"
    # create mapping from dict of dicts to a linear file
    self.flatten_patricia(self.patricia, parent=self.firstblockaddress)
    self.recalculate_patricia_tree_for_actual_values()
    self.output_patricia_tree_with_actual_values()
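A hedged sketch of steps 4-5 combined: pass 1 fixes a breadth-first node order and computes each node's file offset using fixed-width placeholder addresses (record sizes don't change when placeholders are replaced by real addresses), pass 2 writes the records with the recalculated addresses. The dict-of-dicts node shape ({"children": ..., "value": ...}) and the json record format are illustrative assumptions:

import json
from collections import deque

ADDRESS_WIDTH = 14  # matches the "00000000000000" placeholder above

def serialize(node, child_addresses):
    # one byte-aligned record per node: edge label -> child file offset,
    # plus the node's (placeholder) value
    record = {"children": child_addresses, "value": node.get("value")}
    return (json.dumps(record) + "\n").encode("utf-8")

def flatten_patricia(root):
    placeholder = "0" * ADDRESS_WIDTH
    nodes, offsets, offset = [], {}, 0
    queue = deque([root])
    while queue:  # pass 1: assign file offsets in BFS order
        node = queue.popleft()
        nodes.append(node)
        offsets[id(node)] = offset
        children = node.get("children", {})
        offset += len(serialize(node, {c: placeholder for c in children}))
        queue.extend(children.values())
    out = bytearray()
    for node in nodes:  # pass 2: write records with real addresses
        children = node.get("children", {})
        addresses = {label: str(offsets[id(child)]).zfill(ADDRESS_WIDTH)
                     for label, child in children.items()}
        out += serialize(node, addresses)
    return bytes(out)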
28. Mapreduce Patterns
Map() and Reduce() methods typically follow patterns. A recommended way of representing such patterns is to extract and generalize code skeleton fingerprints based on:
1. loops: e.g. "do-while", "while", "for", "repeat-until" => "loop"
2. conditions: e.g. "if", "exception" and "switch" => "condition"
3. emits: e.g. outputs from map() => reduce() or IO => "emit"
4. emit data types: e.g. string, number, list (if known)
map(key, value):
    loop  # over tokenized value
        emit  # key=word, value=1 or uri

reduce(key, values):
    emit  # key=word, value=sum(values) or list of URIs
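A hedged sketch of extracting such a fingerprint automatically with Python's ast module (the mapping of AST node types to the slide's tokens is an assumption):

import ast
import inspect

def skeleton(func):
    # generalize a function body into the slide's tokens:
    # loops => "loop", branches/exceptions => "condition", yields => "emit"
    tokens = []
    for node in ast.walk(ast.parse(inspect.getsource(func))):
        if isinstance(node, (ast.For, ast.While)):
            tokens.append("loop")
        elif isinstance(node, (ast.If, ast.Try)):
            tokens.append("condition")
        elif isinstance(node, (ast.Yield, ast.YieldFrom)):
            tokens.append("emit")
    return tokens

def wordcount_map(key, value):
    for word in value.split():
        yield (word, 1)

print(skeleton(wordcount_map))  # ['loop', 'emit']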
29. General Mapreduce Advice
Performance
1. IO/moving data is expensive - use compression and aggregation
2. Use combiners, i.e. "reducer afterburners" for mappers (see the sketch below)
3. Look out for skewness in the key distribution, e.g. Zipf's law
4. Use the right programming language for the task
5. Balance the work between mappers and reducers - http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
Cost, Readability & Maintainability
6. Is Mapreduce the right tool? (sequential/parallel/iterative/realtime)
7. E.g. Crunch, Pig, Hive instead of full Mapreduce code?
8. Split the job into a sequence of mapreduce jobs, e.g. with cascading, mrjob etc.
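A word-count combiner sketch for advice 2. A combiner runs on each mapper's local output before the shuffle, so its operation must be associative and commutative; the function shapes below are illustrative, not a specific framework's API:

def combine(key, values):
    # map-side partial aggregation: collapses many (word, 1) pairs
    # into one (word, partial_count) before data crosses the network
    yield (key, sum(values))

def wordcount_reduce(key, values):
    # unchanged: summing partial counts equals summing the raw 1s,
    # which is exactly why a combiner is safe for word count
    yield (key, sum(values))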
30. The End
● Mapreduce paper trends (from 2009 => 2012), roughly:
○ Increased use of mapreduce jobflows, i.e. more than one mapreduce in a sequence, and also various types of iterations
■ e.g. the algorithmic trading example earlier
○ Increased number of papers related to the semantic web (e.g. RDF) and AI reasoning/inference
○ Decreased (relative) number of IR and Ads papers
31. APPENDIX
List of Mapreduce and Hadoop Algorithms in Academic Papers - 6th version (partial subset of a forthcoming blogpost)
32. AI: Reasoning & Semantic Web
1. Reasoning with Fuzzy-EL+ Ontologies Using Mapreduce
2. WebPIE: A Web-scale parallel inference engine using
Mapreduce
3. Towards Scalable Reasoning over Annotated RDF Data
Using Mapreduce
4. Reasoning with Large Scale Ontologies in Fuzzy pD* Using
Mapreduce
5. Scalable RDF Compression with Mapreduce
6. Towards Parallel Nonmonotonic Reasoning with Billions of
Facts
33. Biology & Medicine
1. A Mapreduce-based Algorithm for Motif Search
2. A MapReduce Approach for Ridge Regression in
Neuroimaging Genetic Studies
3. Fractal Mapreduce decomposition of sequence alignment
4. Cloud-enabling Sequence Alignment with Hadoop
Mapreduce: A Performance Analysis
AI Misc.
A MapReduce based Ant Colony Optimization approach to
combinatorial optimization problems
34. Machine Learning
1. An efficient Mapreduce Algorithm for Parallelizing Large-
Scale Graph Clustering
2. Accelerating Bayesian Network Parameter Learning Using
Hadoop and Mapreduce
3. The Performance Improvements of SPRINT Algorithm
Based on the Hadoop Platform
Graphs & Graph Theory
4. Large-Scale Graph Biconnectivity in MapReduce
5. Parallel Tree Reduction on MapReduce
35. Datacubes & Joins
1. Data Cube Materialization and Mining Over Mapreduce
2. Fuzzy joins using Mapreduce
3. Efficient Distributed Parallel Top-Down Computation of
ROLAP Data Cube Using Mapreduce
4. V-smart-join: A scalable MapReduce Framework for all-pair
similarity joins of multisets and vectors
5. Data Cube Materialization and Mining over MapReduce
Finance & Business
6. Optimizing Parameters of Algorithm Trading Strategies
using Mapreduce
7. Using Mapreduce to scale events correlation discovery for
business processes mining
8. Computational Finance with Map-Reduce in Scala
36. Mathematics & Statistics
1. GigaTensor: scaling tensor analysis up by 100 times -
algorithms and discoveries
2. Fast Parallel Algorithms for Blocked Dense Matrix
Multiplication on Shared Memory Architectures
3. Mr. LDA: A Flexible Large Scale Topic Modeling Package
using Variational Inference in MapReduce
4. Matrix chain multiplication via multi-way algorithms in
MapReduce