1. An Introduction to MapReduce
Presented by Frane Bandov
at the Operating Complex IT-Systems seminar
Berlin, 1/26/2010
2. Outline
• Introduction
• Google MapReduce
  - Idea
  - Overview
  - Fault Tolerance
  - GFS: Google File System
  - Job Example
• Alternative Implementations
• Reception and Criticism
• Trends and Future Development
• Conclusion
2/16/10 An Introduction to MapReduce 2
4. Introduction - Problem
Sometimes we have to deal with huge amounts
of data
[Bar chart: data volume in TBytes (axis from 0 to 250) for You, Facebook, Yahoo! Groups and the German Climate Computing Centre]
5. Introduction - Problem
The data needs to be processed, but how?
Can't process all of this data on one machine
→ Distribute the processing to many machines
6. Introduction - Approach
Distributed computing is the solution
"Let's write our own distributed computing
software as a solution to our problem"
Checklist
✓ design protocols
✓ design data structures
✓ write the code
✓ ensure fault tolerance
→ Development takes a long time
→ Expensive: Cost-benefit ratio?
Build complex software for simple computations?
8. Google MapReduce - Idea
A framework for distributed computing
No need to care about protocols, fault tolerance, etc.
Just write your simple computation
9. Google MapReduce - Idea
MapReduce Paradigm
Map: Apply a function to all elements of a list
square x = x * x;
map square [1, 2, 3, 4, 5];
→ [1, 4, 9, 16, 25]
Reduce: Combine all elements of a list
reduce (+) [1, 2, 3, 4, 5];
→ 15
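The two list operations on the slide above can be tried out directly. This is a minimal sketch in Python rather than the slide's Haskell-like notation; Python's built-in `map` and `functools.reduce` stand in for the slide's `map` and `reduce`.

```python
from functools import reduce
import operator

def square(x):
    """Return x multiplied by itself."""
    return x * x

numbers = [1, 2, 3, 4, 5]

# Map: apply the function to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list with a binary operator.
total = reduce(operator.add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```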
10. Google MapReduce - Idea
Basic functioning
Input → Map → Reduce → Output
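The flow from input to output can be sketched on a single machine. This is only an illustration, assuming the key/value formulation of Google's MapReduce: map emits key/value pairs, the framework groups them by key, and reduce combines the values of each key. The word-count task and the names `map_fn`, `reduce_fn` and `run_job` are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce step: combine all values that share the same key."""
    return word, sum(counts)

def run_job(lines):
    """Run Input -> Map -> (group by key) -> Reduce -> Output in one process."""
    groups = defaultdict(list)
    for line in lines:                      # Input
        for key, value in map_fn(line):     # Map
            groups[key].append(value)       # grouping, normally done by the framework
    return dict(reduce_fn(k, v) for k, v in groups.items())  # Reduce -> Output

result = run_job(["to be or not to be", "to map and to reduce"])
print(result["to"])  # 4
```

In a real cluster the map calls and the reduce calls would run on different machines; only the two user-supplied functions stay the same.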
12. MapReduce - Fault Tolerance
• Workers are periodically pinged by the master
• No answer within a certain time → worker is considered failed
Mapper fails:
  - Reset its map job to idle
  - Even a completed job must be re-run → its intermediate files are
    inaccessible
  - Notify reducers where to get the new intermediate file
Reducer fails:
  - Reset its job to idle
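The failure handling described on the slide above can be sketched as a toy master. The class `Master` and all of its methods are hypothetical names for this illustration, and time is passed in explicitly to keep the example deterministic. Two points are assumptions beyond the slide: completed reduce jobs are left alone (as in Google's published description, where reduce output is written to the global file system), and the step of notifying reducers about new intermediate files is omitted.

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"

class Master:
    """Toy master that tracks ping answers and task states (hypothetical API)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds without an answer before a worker counts as failed
        self.last_seen = {}      # worker -> time of its last ping answer
        self.tasks = {}          # task id -> {"kind", "worker", "state"}

    def assign(self, task_id, kind, worker, now):
        self.tasks[task_id] = {"kind": kind, "worker": worker, "state": IN_PROGRESS}
        self.last_seen.setdefault(worker, now)

    def ping_answered(self, worker, now):
        self.last_seen[worker] = now

    def complete(self, task_id):
        self.tasks[task_id]["state"] = COMPLETED

    def check_workers(self, now):
        """Reset the tasks of every worker that has not answered in time."""
        failed = {w for w, t in self.last_seen.items() if now - t > self.timeout}
        reset = []
        for task_id, task in self.tasks.items():
            if task["worker"] not in failed:
                continue
            # Map tasks are reset even when completed, because their
            # intermediate files are inaccessible on the failed machine.
            # Assumption: reduce tasks are reset only if still running.
            if task["kind"] == "map" or task["state"] == IN_PROGRESS:
                task["state"], task["worker"] = IDLE, None
                reset.append(task_id)
        for w in failed:
            del self.last_seen[w]
        return sorted(reset)

master = Master(timeout=10)
master.assign("map-0", "map", "worker-a", now=0)
master.assign("reduce-0", "reduce", "worker-b", now=0)
master.complete("map-0")
master.ping_answered("worker-b", now=12)
reset = master.check_workers(now=15)
print(reset)  # ['map-0']
```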
13. MapReduce - Fault Tolerance
Master fails:
  - The master periodically writes checkpoints
  - In case of failure the MapReduce operation is aborted
  - The operation can be restarted from the last checkpoint
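Restarting from a checkpoint can be illustrated with a very small sketch. The slides do not say what a checkpoint contains; here it is assumed to be the master's table of task states, serialized as JSON. The function names are hypothetical, and treating in-progress tasks as idle after a restart is a choice made for this example only.

```python
import json

def save_checkpoint(task_states):
    """Serialize the task table to a string (stands in for a file on disk)."""
    return json.dumps(task_states, sort_keys=True)

def restore_checkpoint(blob):
    """Rebuild the task table from a checkpoint.

    Assumption: work that was in progress when the checkpoint was taken
    is scheduled again, so it is marked idle.
    """
    saved = json.loads(blob)
    return {task: ("idle" if state == "in_progress" else state)
            for task, state in saved.items()}

checkpoint = save_checkpoint(
    {"map-0": "completed", "map-1": "in_progress", "reduce-0": "idle"}
)
restored = restore_checkpoint(checkpoint)
print(restored["map-0"], restored["map-1"])  # completed idle
```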
14. Google MapReduce - GFS
Google File System
• In-house distributed file system at Google
• Stores all input and output files
• Stores files…
  - divided into 64 MB blocks
  - on at least 3 different machines
• Machines running GFS also
run MapReduce
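The two storage rules above (64 MB blocks, at least 3 copies on different machines) amount to simple arithmetic, sketched below. The function `place_blocks` and the round-robin choice of machines are assumptions for this illustration; the slides do not describe how GFS actually picks machines.

```python
import math

BLOCK_SIZE_MB = 64   # block size stated on the slide
REPLICAS = 3         # minimum number of copies stated on the slide

def place_blocks(file_size_mb, machines):
    """Split a file into 64 MB blocks and pick 3 distinct machines per block.

    Round-robin placement is an assumption made for this sketch only.
    """
    if len(machines) < REPLICAS:
        raise ValueError("need at least 3 machines")
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = []
    for block in range(n_blocks):
        holders = [machines[(block + r) % len(machines)] for r in range(REPLICAS)]
        placement.append(holders)
    return placement

layout = place_blocks(200, ["m0", "m1", "m2", "m3"])
print(len(layout))  # 4 blocks: 64 + 64 + 64 + 8 MB
print(layout[0])    # ['m0', 'm1', 'm2']
```

Because the same machines also run MapReduce, a map job can often be scheduled on a machine that already holds a copy of its input block.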
20. Alternative Implementations
Apache Hadoop
• Open-source implementation in Java
• Jobs can be written in C++, Java, Python, etc.
• Used by Yahoo!, Facebook, Amazon and others
• Most commonly used implementation
• HDFS as an open-source implementation of GFS
• Can also use Amazon S3, HTTP(S) or FTP
• Extensions: Hive, Pig, HBase
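Hadoop jobs in languages other than Java are commonly run through Hadoop Streaming, where mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. Below is a sketch of a word count in that style. It assumes the reducer's input arrives sorted by key, which Hadoop's shuffle phase provides. The function names are hypothetical, and they take any iterable of lines so the example can be tried without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines sharing a key (input must be sorted)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the cluster: map, sort by key, reduce.
mapped = sorted(mapper(["a b a", "b a"]))
counts = list(reducer(mapped))
print(counts)  # ['a\t3', 'b\t2']
```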
21. Alternative Implementations
Mars
MapReduce implementation for NVIDIA GPUs
using the CUDA framework
MapReduce-Cell
Implementation for the Cell multi-core
processor
Qizmt
MySpace's implementation of MapReduce in C#
22. Alternative Implementations
There are many other open- and closed-source
implementations of MapReduce!
24. Reception and Criticism
• Yahoo!: Hadoop on a 10,000-server cluster
• Facebook analyses its daily logs (25 TB) on
a 1,000-server cluster
• Amazon Elastic MapReduce: Hadoop
clusters for rent on EC2 and S3
• IBM and Google: Support university
courses in distributed programming
• UC Berkeley announced that it will teach
freshmen MapReduce programming
26. Reception and Criticism
• Criticism mainly by RDBMS experts
DeWitt and Stonebraker
• MapReduce
  - is a step backwards in database access
  - is a poor implementation
  - is not novel
  - is missing features that are routinely provided
    by modern DBMSs
  - is incompatible with the DBMS tools
27. Reception and Criticism
Response to criticism
MapReduce is not an RDBMS
It is well suited for processing and structuring huge
amounts of unstructured data
MapReduce's big innovation is that it enables
distributing data processing across a network of
cheap and possibly unreliable computers
29. Trends and Future Development
Trend of utilizing MapReduce/Hadoop as
a parallel database
• Hive: Query language for Hadoop
• HBase: Column-oriented distributed database
(modeled after Google's BigTable)
• Map-Reduce-Merge: Adding merge to the
paradigm allows implementing features of
relational algebra
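As a rough illustration of the merge idea: once two separate MapReduce jobs have each produced a keyed result, a merge step can combine them on their shared keys, which corresponds to a relational join. The function `merge` and the sample data are hypothetical, and this is a single-machine sketch, not the Map-Reduce-Merge system itself.

```python
def merge(left, right):
    """Merge step: join two reduced outputs on their shared keys (an equi-join)."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}

# Hypothetical outputs of two separate MapReduce jobs, keyed by department id.
bonus_by_dept = {"d1": 500, "d2": 300}
name_by_dept = {"d1": "Research", "d3": "Sales"}

joined = merge(bonus_by_dept, name_by_dept)
print(joined)  # {'d1': (500, 'Research')}
```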
30. Trends and Future Development
Trend to use the MapReduce paradigm to
better utilize multi-core CPUs
• Qt Concurrent
  - Simplified C++ version of MapReduce for distributing
    tasks between multiple processor cores
• Mars
• MapReduce-Cell
32. Conclusion
MapReduce
• provides an easy solution for processing
large amounts of data
• brings a paradigm shift in programming
• changed the world,
i.e. it made data processing more efficient and
cheaper, and it is the foundation of many other
approaches and solutions