2. Big Data Explosion
• 90% of today's data was created in the last 2 years
• Data volume doubles roughly every 18 months (an analogue of Moore's law)
• YouTube: 13 million hours of video and 700 billion views in 2010
• Facebook: 20TB/day (compressed)
• CERN/LHC: 40TB/day (15PB/year)
• Many more examples
4. Challenges!
• How to assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate, incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers failed?
5. History
• 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
• 2002: Apache Nutch: distributed, scalable open-source web crawler
• 2004: Google publishes the GFS and MapReduce papers
• 2006: Apache Hadoop: open-source Java implementation of GFS and MapReduce to solve Nutch's scaling problem; later becomes a standalone project
6. What is MapReduce?
• A programming model for distributing a task across multiple nodes
• Used to develop solutions that process large amounts of data in parallel on clusters of computing nodes
• Original MapReduce paper by Google
• Features of MapReduce:
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
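To make the "clean abstraction" concrete, here is a minimal single-process sketch of the model in Python (Hadoop itself is Java; this is an illustration, not the framework's API). The user supplies only `map_fn` and `reduce_fn`; the "framework" here is just two loops and an in-memory shuffle, whereas a real cluster distributes these calls across many nodes and handles fault tolerance.

```python
# Minimal sketch of the MapReduce programming model (single process).
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (key, value) record yields zero or more (k2, v2) pairs.
    grouped = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)  # shuffle: group values by key
    # Reduce phase: aggregate all values observed for each key.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# Example: word count expressed in the model.
lines = [(0, "a b a"), (1, "b c")]
counts = run_mapreduce(
    lines,
    map_fn=lambda _, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda _, ones: sum(ones),
)
# counts == {"a": 2, "b": 2, "c": 1}
```

Note that the framework, not the programmer, decides where each map and reduce call runs; that separation is what makes the abstraction clean.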
10. HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive amounts of data
• Uses commodity hardware
11. HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times
• Replicas stored on different data nodes
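A quick back-of-the-envelope sketch of what these numbers mean for storage, using the 64 MB block size and a replication factor of 3 from the slides (these are illustrative defaults, not values read from a real cluster):

```python
# Sketch: how many HDFS blocks a file occupies, and how much raw disk
# space its replicas consume. Block size and replication factor are the
# defaults mentioned on the slides.
import math

BLOCK_SIZE_MB = 64
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # Files are split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every byte is stored REPLICATION times across different DataNodes.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

# A 200 MB file occupies 4 blocks (64 + 64 + 64 + 8 MB)
# and consumes 600 MB of raw storage across the cluster.
```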
13. Master Node
• NameNode
• only 1 per cluster
• metadata server and database
• SecondaryNameNode helps with some housekeeping
• JobTracker
• only 1 per cluster
• job scheduler
14. Slave Nodes
• DataNodes
• 1-4000 per cluster
• block data storage
• TaskTrackers
• 1-4000 per cluster
• task execution
15. NameNode
• A single NameNode stores all metadata and manages block replication and read/write access to files
• Filenames, block locations on DataNodes, owner, group, etc.
• All metadata is kept in RAM for fast lookup
16. Secondary NameNode
• Performs memory-intensive housekeeping on behalf of the NameNode; it is not a standby NameNode
• Should run on a separate machine
17. Data Node
• DataNodes store file contents
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
18. Word Count Example
• Input
• Text files
• Output
• Single file containing (Word <TAB> Count)
• Map Phase
• Generates (Word, Count) pairs
• [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
• Reduce Phase
• For each word, calculates aggregate
• [{a,7}, {b,5}, {c,6}]
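The shuffle and reduce steps above can be sketched directly from the slide's numbers (again in Python for illustration; a real Hadoop job would implement a Java `Reducer`):

```python
# Sketch of the shuffle + reduce steps for the word count example,
# using the three mapper outputs shown on the slide.
from collections import defaultdict

mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

# Shuffle: gather all counts for each word across every mapper's output.
grouped = defaultdict(list)
for output in mapper_outputs:
    for word, count in output:
        grouped[word].append(count)

# Reduce: sum the counts per word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
# word_counts == {"a": 7, "b": 5, "c": 6}, matching the slide.
```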
19. Typical Cluster
• 3-4000 commodity servers
• Each server
• 2x quad-core CPUs
• 16-24 GB RAM
• 4-12 TB disk space
• 20-30 servers per rack
20. When Should I Use It?
Good choice for work that can be broken into parallel jobs:
• Indexing/analysis of log files
• Sorting of large data sets
• Image processing/machine learning
Bad choice for serial or low-latency jobs:
• Real-time processing
• Processing-intensive tasks with little data
• Replacing MySQL