4. • Hadoop is:
An open-source software framework.
A scalable, fault-tolerant system that is simple and accessible and
supports distributed applications.
• Hadoop is built on (provides) two main parts:
A shared storage: HDFS, and a processing framework: MR.
• The Hadoop project includes: HDFS, MR, YARN.
• Other Hadoop-related projects at Apache.
Hadoop was created by Doug Cutting.
12/05/14 4
5. • A software framework.
• The name derives from the map() and reduce() functions.
• It splits the input data set into independent chunks, which are processed
by the map() and reduce() tasks.
• Typically, the compute nodes and the storage nodes are the same (MRF + DFS on the same machines).
• HDFS:
o A shared storage layer.
o Stores data directly on the compute nodes.
o HDFS has a master (NameNode)/slave (DataNodes)
architecture.
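The split/map/reduce flow described above can be sketched in pure Python as a toy, single-process simulation of the model (my own illustration, not Hadoop's actual Java API):

```python
from collections import defaultdict

def map_fn(chunk):
    # map(): emit a (word, 1) pair for every word in one input chunk
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # reduce(): sum all counts emitted for one word
    return (word, sum(counts))

def mapreduce(chunks):
    # Shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for chunk in chunks:                  # on a cluster, chunks run in parallel
        for key, value in map_fn(chunk):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

# The input data set, split into independent chunks
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
print(mapreduce(chunks))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

On a real cluster the chunks would live on different DataNodes and the map tasks would run where the data already is.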
6. [Architecture diagram] One master node running the NameNode (DFS layer) and the JobTracker (MRF layer); multiple slave nodes, each running a DataNode (DFS layer) and a TaskTracker (MRF layer).
10. Running an algorithm on the cluster:
• Put the input on HDFS (NameNode / DataNodes).
• Run the algorithm: Map & Reduce.
• Monitor progress in the web UIs:
NameNode: http://localhost:50070
JobTracker: http://localhost:50030
• Reliability: replication of data.
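The steps above might look like this on a Hadoop 1.x installation (the paths, jar name, and class name are placeholders, not from the original slides):

```shell
# Copy the input data set onto HDFS
hadoop fs -mkdir /user/hadoop/input
hadoop fs -put local_data.txt /user/hadoop/input

# Run the MapReduce job (jar and class names are hypothetical)
hadoop jar myjob.jar MyJob /user/hadoop/input /user/hadoop/output

# Inspect the result; watch progress in the web UIs:
#   NameNode:   http://localhost:50070
#   JobTracker: http://localhost:50030
hadoop fs -cat /user/hadoop/output/part-00000
```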
12. References
[1] http://ieeexplore.ieee.org
[2] http://hadoop.apache.org/
[3] https://wiki.cloudera.com
[4] http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html
[5] Books:
• Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, University of Maryland, College Park.
• Tom White, Hadoop: The Definitive Guide.
1 PB (petabyte) = 1,000,000,000,000,000 B = 1000^5 B = 10^15 B = 1 million gigabytes = 1 thousand terabytes.
Hadoop runs in different modes:
Local (standalone) mode:
The standalone mode is the default mode for Hadoop.
Pseudo-distributed mode:
The pseudo-distributed mode is running Hadoop in a “cluster of one” with all daemons running on a single machine.
Fully distributed mode:
An actual Hadoop cluster runs in the fully distributed mode.
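For the pseudo-distributed mode, a minimal Hadoop 1.x configuration might look like the sketch below (ports chosen to match the era of the web UIs on the demo slide; the values are illustrative, not from the original):

```xml
<!-- conf/core-site.xml: HDFS as the default filesystem -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: single node, so one replica is enough -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: JobTracker address -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```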
Scalable: enables applications to work with thousands of independent computers and petabytes of data.
Fault-tolerant: both MapReduce and HDFS are designed to handle node failures automatically.
Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
The model is inspired by the map and reduce functions commonly used in functional programming.
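That functional-programming inspiration can be seen directly in Python's built-in map() and functools.reduce():

```python
from functools import reduce

nums = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, nums))   # map: apply a function to each element
total = reduce(lambda a, b: a + b, squares)  # reduce: fold the results into one value
print(squares, total)  # [1, 4, 9, 16] 30
```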
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware).
Here, the Hadoop architecture is drawn based on reading some IEEE papers and some books.
Try to explain the terms Combiner and Partitioner; for more details, read the books and see the last slide.
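As a starting point for that explanation, here is a toy Python sketch (my own illustration, not Hadoop code): a combiner does a local, map-side reduce to cut the volume of data shuffled over the network, and a partitioner decides which reducer receives each key:

```python
from collections import Counter

NUM_REDUCERS = 2  # illustrative cluster size

def partitioner(key, num_reducers=NUM_REDUCERS):
    # Mirrors Hadoop's default HashPartitioner: hash(key) mod numReduceTasks
    return hash(key) % num_reducers

def map_with_combiner(chunk):
    # Combiner: pre-aggregate (word, 1) pairs locally before the shuffle,
    # so "the the the" sends one pair ('the', 3) instead of three pairs
    return Counter(chunk.split())

def shuffle(mapped_chunks):
    # Route each combined pair to its reducer via the partitioner;
    # each reducer then sums the partial counts it receives
    buckets = [dict() for _ in range(NUM_REDUCERS)]
    for counts in mapped_chunks:
        for word, n in counts.items():
            b = buckets[partitioner(word)]
            b[word] = b.get(word, 0) + n
    return buckets

buckets = shuffle([map_with_combiner("the the the fox"),
                   map_with_combiner("the fox")])
# Every occurrence of a given word ends up in exactly one reducer's bucket
print(buckets)
```

Because the partitioner is a pure function of the key, all partial counts for one word always land on the same reducer, which is what makes the final sums correct.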