2. Agenda
• Problems with traditional large-scale systems
• Requirements for new approaches
• What is Hadoop?
• Why Hadoop?
• Overview of Hadoop
• HDFS
• MapReduce
• Applications
• Conclusion
4. Problems with traditional large-scale systems
Data volumes grow day by day
Network failures
Server failures
Loss of data
High cost
Distributed computing requires manual handling
5. Requirements for new approaches
Data should be stored in a distributed manner and processed in parallel
High performance at low cost
Should be scalable
Should be simple to access and process
Should be fault tolerant
9. Overview of Hadoop
It handles 3 types of data:
Structured
Semi-structured
Unstructured
Analyzes and processes large amounts of data (petabyte scale)
10. Comparison with traditional databases
RDBMS
• Stores gigabytes of data
• Supports both batch and interactive processing
• Allows updates
• Schemas must be defined
• Only structured data
HADOOP
• Stores petabytes of data
• Batch processing only
• Does not allow in-place updates; it follows WORM (write once, read many)
• Schemas not required
• Supports all 3 types of data
13. Components
Hadoop can be divided into 2 parts
1. HDFS – Hadoop Distributed File System
2. MapReduce Programming model
14. Hadoop Distributed File System
It is a distributed file system
Runs on commodity hardware
Provides high-throughput access to application data
Suitable for applications that have large data sets
Designed to store very large amounts of data (terabytes or petabytes)
16. Core Architectural Goal of HDFS
An HDFS instance may consist of thousands of server machines, so hardware failure is the norm rather than the exception.
Goal: detect faults and recover from them quickly and automatically
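The idea of automated fault detection and recovery can be illustrated with a toy model. This is only a sketch under assumptions, not how HDFS is implemented: the `Node` class, the `find_failed_nodes` and `reassign` helpers, and the 30-second timeout are all hypothetical names and values chosen for illustration. A node is treated as failed when it has not reported in for longer than the timeout, and its work is handed to the remaining healthy nodes.

```python
from dataclasses import dataclass

# Hypothetical timeout for this sketch; not an HDFS setting.
HEARTBEAT_TIMEOUT = 30.0  # seconds


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time of the last report, in seconds


def find_failed_nodes(nodes, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the names of nodes whose last report is older than timeout."""
    return [n.name for n in nodes if now - n.last_heartbeat > timeout]


def reassign(assignments, failed, healthy):
    """Move the tasks of failed nodes onto healthy nodes, round-robin."""
    orphaned = [t for name in failed for t in assignments.get(name, [])]
    if orphaned and not healthy:
        raise RuntimeError("no healthy nodes left to take over the work")
    # Keep the existing work of every node that has not failed.
    result = {name: list(tasks) for name, tasks in assignments.items()
              if name not in failed}
    for name in healthy:
        result.setdefault(name, [])
    for i, task in enumerate(orphaned):
        result[healthy[i % len(healthy)]].append(task)
    return result


nodes = [Node("a", 100.0), Node("b", 60.0)]
failed = find_failed_nodes(nodes, now=105.0)
plan = reassign({"a": ["t1"], "b": ["t2", "t3"]}, failed, ["a"])
```

Here node `b` is 45 seconds overdue, so it is marked as failed and its two tasks move to node `a` without any manual step.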
17. MapReduce Programming Model
MapReduce applies a divide-and-conquer approach to the data.
Schedules execution across a set of machines
Manages inter-process communication
The reducer processes the output from all mappers and produces the final output
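The divide-and-conquer idea can be sketched on a single machine. The example below is an illustration under assumptions, not Hadoop code: a thread pool stands in for the set of machines, and the helper names `split`, `process_split`, and `run_job` are hypothetical. The input is divided into splits, each split is processed independently, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_splits):
    """Divide data into at most num_splits contiguous chunks."""
    size = max(1, -(-len(data) // num_splits))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_split(chunk):
    """Work done independently on one split (here: a partial sum)."""
    return sum(chunk)


def run_job(data, num_workers=4):
    """Process all splits in parallel, then combine the partial results."""
    splits = split(data, num_workers)
    # The pool plays the role of the cluster: it schedules each split
    # onto a free worker and collects the results in order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_split, splits))
    return sum(partials)


total = run_job(list(range(1, 101)))
```

In a real cluster the framework also moves data between machines and reruns failed tasks; this sketch leaves both out.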
18. MapReduce Programming Model
– MAP
• map() function that processes a key/value pair to generate a set of intermediate key/value pairs
– REDUCE
• reduce() function that merges all intermediate values associated with the same intermediate key
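The two functions can be illustrated with a word count, written here as a minimal in-memory sketch in plain Python rather than with the Hadoop API. The names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample documents are assumptions made for this example; the grouping step in the middle does, on one machine, what the framework does between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair, emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all values for one intermediate key into a total."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group intermediate values by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each intermediate key is reduced exactly once, in sorted key order.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [
    ("doc1", "hadoop stores data"),
    ("doc2", "hadoop processes data"),
]
counts = run_mapreduce(docs, map_fn, reduce_fn)
# counts == {"data": 2, "hadoop": 2, "processes": 1, "stores": 1}
```

Because each key is reduced independently of the others, the reduce calls could run on different machines.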