4. MapReduce Paradigm
• Splits input files into blocks (typically 64 MB
each)
• Operates on key/value pairs
• Mappers filter & transform input data
• Reducers aggregate the mappers' output
• Efficient way to process data on a cluster:
– Move code to data
– Run code on all machines
5. • Map (hash function)
– (k1, v1) → list(k2, v2)
• Reduce (aggregate function)
– (k2, list(v2)) → list(k3, v3)
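The two signatures above can be illustrated with a small in-memory word count. This is only a sketch in plain Python, not Hadoop code; the names map_fn, reduce_fn and run_job are made up for illustration, and the grouping step is a stand-in for the framework's shuffle and sort.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (k1, v1) -> list of (k2, v2). Here k1 is a line number, v1 the line text."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: (k2, list(v2)) -> list of (k3, v3). Here it sums the counts for one word."""
    return [(key, sum(values))]


def run_job(records):
    """Run map, group by key, then reduce, all in one process."""
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Group values by intermediate key (stand-in for shuffle and sort).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: call reduce_fn once per distinct key, in sorted key order.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


if __name__ == "__main__":
    lines = [(0, "move code to data"), (1, "run code on all machines")]
    print(run_job(lines))
```

Only map_fn and reduce_fn are specific to word counting; a different job would swap those two functions and leave run_job unchanged.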
6. Advanced MapReduce
• Hadoop Streaming
– Lets you run mappers and reducers written in
other languages such as Python or Ruby
• Chaining MapReduce jobs
• Joining data
• Bloom filters
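As a rough illustration of the Hadoop Streaming bullet: Streaming passes records to the mapper and reducer as text lines, with key and value separated by a tab by default, and the reducer sees its input sorted by key. The sketch below assumes that convention and simulates it locally; the function names mapper and reducer are illustrative, not part of any Hadoop API.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. The input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual Streaming job, each function would live in its own script that reads sys.stdin and prints to standard output; the sorting between the two stages is done by the framework rather than by a call to sorted.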
7. Hadoop
• Open Source Implementation of MapReduce by
Apache Software Foundation.
• Created by Doug Cutting.
• Derived from Google's MapReduce and Google
File System (GFS) papers.
• Apache Hadoop is a software framework that
supports data-intensive distributed applications
under a free license
• It enables applications to work with thousands of
computationally independent computers and
petabytes of data.
8. Hadoop Architecture
• Hadoop MapReduce
– Single master node, many worker nodes
– Client submits a job to master node
– Master splits each job into map and reduce
tasks, and assigns them to worker nodes
• Hadoop Distributed File System (HDFS)
– Single name node, many data nodes
– Files stored as large, fixed-size (e.g. 64MB) blocks
– HDFS typically holds map input and reduce output
10. Job Scheduling in Hadoop
• One map task for each block of the input file
– Applies user-defined map function to each record in
the block
– Record = <key, value>
• User-defined number of reduce tasks
– Each reduce task is assigned a set of record groups
– For each group, apply user-defined reduce function to
the record values in that group
• Reduce tasks read from every map task
– Each read returns the record groups for that reduce
task
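One way to picture how record groups are assigned to reduce tasks is hash partitioning: every key is hashed, and the hash modulo the number of reduce tasks picks the task. The sketch below assumes that scheme; zlib.crc32 is used only because it is stable across Python runs, and the names partition and shuffle are illustrative.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reduce task for a key. crc32 is a stand-in for the real hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Collect the records of every map task, grouped by reduce task and then by key.

    map_outputs: one list of (key, value) pairs per map task.
    Returns {reducer_id: {key: [values, ...]}}.
    """
    per_reducer = defaultdict(lambda: defaultdict(list))
    for task_output in map_outputs:
        for key, value in task_output:
            per_reducer[partition(key, num_reducers)][key].append(value)
    return {r: dict(groups) for r, groups in per_reducer.items()}
```

Because the partition depends only on the key, all values for a given key end up in the same reduce task no matter which map task produced them, which is why each reduce task has to read from every map task.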
11. Dataflow in Hadoop
• Map tasks write their output to local disk
– Output available after map task has completed
• Reduce tasks write their output to HDFS
– Once job is finished, next job’s map tasks can be
scheduled, and will read input from HDFS
• Therefore, fault tolerance is simple: just re-run
tasks on failure
– No consumers see partial operator output
16. HDFS
• Data is distributed and replicated over
multiple machines.
• Files are not stored contiguously on servers;
they are broken up into blocks.
• Designed for large files (large means GB or TB)
• Block Oriented
• Linux-style commands (e.g. ls, cp, mkdir, mv)
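The block-oriented, replicated layout can be sketched as follows. This is a simplification under stated assumptions: the 64 MB block size comes from the slides, the replication factor of 3 is the common HDFS default rather than something stated here, and the round-robin placement is a toy policy (real HDFS placement also considers racks). The function names are hypothetical.

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return (offset, length) for each block; the last block may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin (toy policy)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement
```

With this layout, no single data node needs to hold a whole file, and losing one node still leaves other copies of each of its blocks elsewhere in the cluster.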
18. Hadoop Applicability by Workflow [MTAGS11]
Score meaning:
• Score 0: Hadoop is easily adaptable to the workflow
• Score 0.5: Hadoop is moderately adaptable to the
workflow
• Score 1: one of the potential workflow areas
where Hadoop needs improvement
19. Relative Merits and Demerits of
Hadoop Over DBMS
Pros
• Fault tolerance
• Self-healing: rebalances files across the cluster
• Highly scalable
• Highly flexible, as it does not depend on any
data model or schema
Cons
• No high-level language like SQL in DBMS
• No schema and no index
• Low efficiency
• Very young (since 2004) compared to over 40 years
of DBMS
Hadoop | Relational
Scale out (add more machines) | Scaling is difficult
Key/value pairs | Tables
Say how to process the data | Say what you want (SQL)
Offline/batch | Online/real-time
20. Conclusions and Future Work
• MapReduce is easy to program
• Hadoop=HDFS+MapReduce
• Distributed, Parallel processing
• Designed for fault tolerance and high scalability
• MapReduce is unlikely to replace DBMS in
data warehousing; instead, we expect them to
complement each other and help in analyzing
patterns in scientific data
• Finally, efficiency, and especially I/O cost, needs
to be addressed for MapReduce to be applied successfully
21. References
[LLCCM12] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn
Chung, and Bongki Moon, “Parallel data processing with MapReduce:
a survey,” SIGMOD, January 2012, pp. 11-20.
[MTAGS11] Elif Dede, Madhusudhan Govindaraju, Daniel Gunter, and
Lavanya Ramakrishnan, “Riding the Elephant: Managing Ensembles
with Hadoop,” Proceedings of the 2011 ACM international workshop
on Many task computing on grids and supercomputers, ACM, New
York, NY, USA, pp. 49-58.
[DG08] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: simplified
data processing on large clusters,” Communications of the ACM,
January 2008, pp. 107-113.
[CAHER10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M.
Hellerstein, Khaled Elmeleegy, and Russell Sears, “MapReduce online,”
Proceedings of the 7th USENIX conference on Networked systems
design and implementation (NSDI'10), USENIX Association,
Berkeley, CA, USA, 2010, pp. 21-37.
If distributed computing is so hard, do we need it?
Run code on the machines that hold the data, unlike conventional systems, where we move data to the code, process it, and then store the results back.
- Out of the scope of papers
The master (JobTracker) is responsible for accepting jobs from clients, splitting them into tasks, and assigning those tasks to worker nodes. Each worker runs a TaskTracker process that manages the execution of the tasks currently assigned to that node. Each TaskTracker has a fixed number of slots for executing tasks. Each map task is assigned a portion of the input file called a split. By default, a split contains a single HDFS block, so the total number of file blocks determines the number of map tasks.
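The arithmetic behind this note can be made concrete. The sketch below assumes one split per 64 MB block, as described above; the idea of counting "waves" of map tasks is an added simplification that assumes all tasks take about the same time and every slot is kept busy. Both function names are hypothetical.

```python
def num_map_tasks(file_size, block_size=64 * 1024 * 1024):
    """One map task per block: ceiling of file_size / block_size."""
    return -(-file_size // block_size)


def map_waves(tasks, workers, map_slots_per_worker):
    """Rounds needed to run all map tasks if every slot runs one task per round."""
    total_slots = workers * map_slots_per_worker
    return -(-tasks // total_slots)
```

For example, under these assumptions a 10 GB input yields 160 map tasks, and a cluster of 20 workers with 2 map slots each would need 4 rounds to run them all.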
With pipelining, as in MapReduce Online [CAHER10], reducers begin processing data as soon as it is produced by mappers, so they can generate and refine an approximation of their final answer during the course of execution. MapReduce jobs can also run continuously, accepting new data as it arrives and analyzing it immediately. This allows MapReduce to be used for applications such as event monitoring and stream processing.
DataNode: Stores the actual file blocks on disk; it does not store entire files. Reports block information to the NameNode and receives instructions from it.
Secondary NameNode: Keeps a snapshot of the NameNode. It is not a failover server for the NameNode, but it helps minimize downtime and data loss if the NameNode fails.
JobTracker: Partitions tasks across the cluster, tracks MapReduce tasks, and restarts failed tasks on different nodes.
TaskTracker: Does the task processing and logs every event.
The input to a job is specified as a set of key-value pairs. Each job consists of two stages: first, a user-defined map function is applied to each input record to produce a list of intermediate key-value pairs. Second, a user-defined reduce function is called once for each distinct key in the map output and passed the list of intermediate values associated with that key. A reduce task runs in three phases. In the shuffle phase, each reduce task is assigned a partition of the key range produced by the map step, so it must fetch the content of this partition from every map task’s output. The sort phase groups records with the same key. Finally, the user-defined reduce function is applied to each group.
The buffer content is written to the local file system as an index file and a data file. The index file is used for indexing into the data file, and the data file contains only the records, which are sorted by key within each partition segment. A reduce task fetches data from each map task by issuing HTTP requests to a configurable number of TaskTrackers at once (5 by default). The JobTracker relays the location of every TaskTracker that hosts map output to every TaskTracker that is executing a reduce task. A reduce task cannot fetch the output of a map task until the map has finished executing and committed its final output to disk.
The map phase reads the task’s split from HDFS, parses it into records (key/value pairs), and applies the map function to each record. After the map function has been applied to each input record, the commit phase registers the final output with the TaskTracker, which then informs the JobTracker that the task has finished executing.
In this design, the output of both map and reduce tasks is written to disk before it can be consumed. This is particularly expensive for reduce tasks, because their output is written to HDFS. Output materialization simplifies fault tolerance, because it reduces the amount of state that must be restored to consistency after a node failure. If any task (either map or reduce) fails, the JobTracker simply schedules a new task to perform the same work as the failed task.
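The recovery rule in this note, rescheduling a failed task from scratch, can be sketched as a simple retry loop. This is an illustration only, not the JobTracker's actual logic: the function name run_with_retries, the callable-based task interface, and the limit of four attempts are all assumptions made for the example.

```python
def run_with_retries(task, workers, max_attempts=4):
    """Try `task` on successive workers until one attempt succeeds.

    `task` is a callable that takes a worker name and returns the task output;
    it raises an exception to signal that the attempt failed.
    """
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]
        try:
            # Output is returned (made visible) only when the attempt completes.
            return task(worker)
        except Exception as err:
            # Anything the failed attempt produced is simply discarded.
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

Because a task's output only becomes visible once the task has completed, a retry never has to undo partial results; it just repeats the same work on another node.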
It was possible to implement all patterns in the framework, but the level of difficulty varied. This evaluation helps identify whether an application's workflow is suitable to run in the MapReduce framework.
Fault tolerance: when a node fails, data remains available because of high data replication. Scalability: just by adding nodes, we can process as much data as we want. Low efficiency: with fault tolerance and scalability as its primary goals, MapReduce operations are not always optimized for I/O efficiency. Also, map and reduce are blocking operations.
- Easy, since it hides the implementation details of parallelization, fault tolerance, locality optimization, and load balancing. Horizontal scale-out lets us process as much data as we want simply by adding nodes.