[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
1. Introduction to Hadoop
Zak Stone <zak@eecs.harvard.edu>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)
3. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
5. Why should you care? - Lots of Data
Lots of data, everywhere.
9. Why Hadoop for big data?
• Most credible open-source toolset for large-scale, general-purpose computing
• Backed by Yahoo! and a large open-source community
• Used by Yahoo!, Facebook, and many others
• Increasing support from web services
• Hadoop closely imitates infrastructure developed by Google (MapReduce and the Google File System)
• Hadoop processes petabytes daily, right now
11. DISCLAIMER
• Don’t use Hadoop if your data and computation fit on one machine
• Getting easier to use, but still complicated
13. What exactly is Hadoop?
• Actually a growing collection of subprojects; the focus here is on two of them: Map-Reduce and HDFS
16. An overview of Hadoop Map-Reduce
Traditional computing: one computer
Hadoop: many computers
17. An overview of Hadoop Map-Reduce
(Actually more like this: many computers, little communication, stragglers and failures)
19. Map-Reduce: Map phase
• You only specify operations on key-value pairs!
• Each input (key, value) pair produces zero or more output (key, value) pairs
• Each “elephant” works on one input pair and doesn’t know the other elephants exist
28. Map-Reduce: The main advantage
With Hadoop, this very same code could run on the entire Web! (In theory, at least)

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)
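Those two functions are enough to specify a whole job because the framework supplies the map, shuffle, and reduce machinery. A minimal in-memory sketch of that machinery, runnable on one machine (the `run_local` driver is illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(input_pairs):
    # Map phase: apply the mapper to every input pair independently.
    mapped = [pair for k, v in input_pairs for pair in mapper(k, v)]
    # Shuffle phase: group all mapped values by key (Hadoop does this for you).
    mapped.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in group])
               for k, group in groupby(mapped, key=itemgetter(0)))
    # Reduce phase: one reducer call per distinct key.
    return [pair for k, vs in grouped for pair in reducer(k, vs)]

print(run_local([(0, "the quick brown fox"), (1, "the lazy dog")]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

On a cluster, the map calls run in parallel on many machines and the shuffle moves data over the network; the mapper and reducer bodies stay the same.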
29. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
30. HDFS: Hadoop Distributed File System
... (chunks of data
on computers)
Data ... (each chunk
replicated more
than once for
reliability)
...
...
31. HDFS: Hadoop Distributed File System
• Key-value pairs (key1, value1), (key2, value2), ... are processed independently in parallel
• Computation is local to the data
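The chunking and replication described above can be sketched as a toy placement policy. Block size, replication factor, and node names below are illustrative (HDFS at the time defaulted to 64 MB blocks and a replication factor of 3, and the real NameNode also accounts for racks and free space):

```python
import itertools

BLOCK_SIZE = 8    # bytes per chunk in this toy; real HDFS used 64 MB blocks
REPLICATION = 3   # copies of each chunk; the HDFS default at the time

def place_blocks(data, nodes):
    """Split data into fixed-size blocks and assign each block to
    REPLICATION nodes, round-robin (a toy stand-in for the NameNode's
    placement policy)."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    node_cycle = itertools.cycle(nodes)
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [next(node_cycle) for _ in range(REPLICATION)]
    return blocks, placement

blocks, placement = place_blocks(b"x" * 20, ["node1", "node2", "node3", "node4"])
# 20 bytes -> 3 blocks (8, 8, 4 bytes), each stored on 3 of the 4 nodes
```

If any single node dies, every block it held still has two live replicas elsewhere, which is what lets Hadoop restart failed tasks on other machines.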
34. Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation
• Computation local to data avoids network overload
• Tasks are independent
• Easy to handle partial failures - entire nodes can fail and restart
• Avoid crawling horrors of failure-tolerant synchronous distributed systems
• Speculative execution to work around stragglers
• Linear scaling in the ideal case
• Designed for cheap, commodity hardware
• Simple programming model
• The “end-user” programmer only writes map-reduce tasks
35. Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development
• e.g. HDFS only recently added support for append operations
• Programming model is very restrictive
• Lack of central data can be frustrating
• “Joins” of multiple datasets are tricky and slow
• No indices! Often, entire dataset gets copied in the process
• Cluster management is hard (debugging, distributing software, collecting logs...)
• Still single master, which requires care and may limit scaling
• Managing job flow isn’t trivial when intermediate data should be kept
• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
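To see why joins are tricky, here is a toy reduce-side join: every record is tagged with its source dataset, records sharing a join key meet in one reducer call, and both datasets travel through the shuffle in full, which is why map-reduce joins without indices are slow. (The tables, fields, and function names are made up for illustration.)

```python
from collections import defaultdict

users  = [("u1", "alice"), ("u2", "bob")]
visits = [("u1", "page_a"), ("u1", "page_b"), ("u2", "page_c")]

def join_mapper(record, source):
    key, value = record
    yield key, (source, value)   # tag the value with the dataset it came from

def join_reducer(key, tagged_values):
    names = [v for s, v in tagged_values if s == "users"]
    pages = [v for s, v in tagged_values if s == "visits"]
    for name in names:          # cross-product of matching rows per key
        for page in pages:
            yield key, (name, page)

# Simulate the shuffle: group every tagged value by join key.
grouped = defaultdict(list)
for rec in users:
    for k, v in join_mapper(rec, "users"):
        grouped[k].append(v)
for rec in visits:
    for k, v in join_mapper(rec, "visits"):
        grouped[k].append(v)

joined = [pair for k in sorted(grouped) for pair in join_reducer(k, grouped[k])]
# [('u1', ('alice', 'page_a')), ('u1', ('alice', 'page_b')), ('u2', ('bob', 'page_c'))]
```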
37. Getting started: Installation options
• Cloudera virtual machine
• Your own virtual machine (install Ubuntu in VirtualBox, which is free)
• Elastic MapReduce on EC2
• StarCluster with Hadoop on EC2
• Cloudera’s distribution of Hadoop on EC2
• Install Cloudera’s distribution of Hadoop on your own machine
• Available as RPM and Debian packages
• Or download Hadoop directly from http://hadoop.apache.org/
38. Getting started: Language choices
• Hadoop is written in Java
• However, Hadoop Streaming allows mappers and reducers in any language!
• Binary data is a little tricky with Hadoop Streaming
• Could use base64 encoding, but TypedBytes are much better
• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo
• The Python word-count example and others come with Dumbo
• Dumbo makes binary data with TypedBytes easy
• Also consider Hadoopy: https://github.com/bwhite/hadoopy
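To illustrate the Streaming contract: a mapper is just a program that reads lines on stdin and writes tab-separated key/value lines to stdout, and Hadoop sorts the mapper output by key before the reducer sees it. A word-count sketch that can be tested locally by emulating `cat input | mapper | sort | reducer` (function names are illustrative; in a real job each function would read `sys.stdin` and write `sys.stdout`):

```python
from io import StringIO

def streaming_mapper(stdin, stdout):
    # Hadoop Streaming feeds one input line at a time on stdin and expects
    # "key<TAB>value" lines on stdout.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def streaming_reducer(stdin, stdout):
    # Hadoop sorts mapper output by key, so all lines for one word arrive
    # consecutively; accumulate counts until the key changes.
    current, count = None, 0
    for line in stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{count}\n")
            current, count = word, 0
        count += int(n)
    if current is not None:
        stdout.write(f"{current}\t{count}\n")

# Local emulation of the pipeline: map, sort by key, reduce.
mapped = StringIO()
streaming_mapper(StringIO("the cat\nthe dog\n"), mapped)
sorted_in = StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = StringIO()
streaming_reducer(sorted_in, reduced)
print(reduced.getvalue())  # cat 1, dog 1, the 2 (tab-separated, one per line)
```

Because the contract is just lines on stdin/stdout, the same two programs could be written in any language and handed to the Streaming jar unchanged.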
40. Useful resources and tips
• The Hadoop homepage: http://hadoop.apache.org/
• Cloudera: http://cloudera.com/
• Dumbo: http://wiki.github.com/klbostee/dumbo
• Hadoopy: https://github.com/bwhite/hadoopy
• Amazon Elastic Compute Cloud Getting Started Guide:
• http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
• Always test locally on a tiny dataset before running on a cluster!