This document provides an overview of Hadoop, a tool for processing large datasets across clusters of computers. It discusses why data volumes have grown so large, driven by exponential growth in data generated by the internet and by machines. It describes how Hadoop uses HDFS for reliable storage across nodes and MapReduce for parallel processing. The document traces the history of Hadoop from its origins in Google's GFS file system and MapReduce framework, and gives brief, high-level explanations of how HDFS and MapReduce work.
2. Why and What is Hadoop ?
A tool to process big data
3. What is BIG Data ?
Facebook, Google+ etc.
Whatever we do gets stored in the form of data or in the form of logs.
Machines too generate lots of data:
cameras, mobiles, software like STAAD Pro, automated machines in industries etc.
We are having an online discussion right now; certainly your reading of this presentation is recorded as data.
4. What is BIG Data ? ..continued
Exponential growth of data posed challenges to Google, Yahoo, Microsoft and Amazon.
They needed to go through TBs and PBs of data:
Which websites and books were popular ?
What kind of ads appeal to users ?
Existing tools became inadequate for processing such large data sets.
5. Why is the data so BIG ?
Until a couple of decades back: floppy disks.
From then on: CD/DVD drives.
Half a decade back: hard drives (500 GB).
Now: hard drives (1 TB) are available in abundance.
6. Why is the data so BIG ?
So WHAT ?
Even the technology to read the data has taken a leap.
7. Why is the data so BIG ?
Year   Device             Data Volume   Transfer Speed   Time to Process
1990   Optical drive      1370 MB       4.4 MB/s         5 minutes
2012   1 TB SATA drive    1 TB          100 MB/s         2.5 hrs
8. How to handle such BIG ?
One BIG elephant ?
Or numerous small chickens ?
9. How to handle such BIG ?
The concept of torrents:
reduce the time to read by reading the data from multiple sources simultaneously.
Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in less than two minutes (see the arithmetic below).
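A rough back-of-the-envelope check, using the 2012 figures from the table above (1 TB of data, 100 MB/s per drive); the numbers are illustrative:

  One drive :  1,000,000 MB / (100 MB/s)          = 10,000 s  (a few hours)
  100 drives:  (1,000,000 MB / 100) / (100 MB/s)  =    100 s  (under two minutes)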
10. How to handle such BIG ? -- Issues
How do we handle a system's ups and downs ?
How do we combine the data from all the systems ?
11. Problem1 : System’s Ups and Downs
Commodity hardware is used for data storage and analysis.
Chances of failure are very high.
So, keep a redundant copy of the same data across several machines (a rough illustration of why this helps follows below).
If one machine fails, you still have the others.
Google came up with a file system, GFS (Google File System), which implemented all these details.
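A rough illustration of why redundant copies help; the failure probability and copy count here are assumed purely for illustration, and copies are assumed to fail independently:

  P(all copies of a block are lost) = p ^ r
  e.g. p = 0.01 (1% chance a given copy is lost), r = 3 copies
  =>  0.01 ^ 3 = 0.000001  (about one in a million)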
12. GFS
Divides data into chunks and stores them in the file system.
Can store data in the range of PBs as well.
13. Problem 2 : How to combine the data ?
We analyze data across different machines, but how do we merge the results to get a meaningful outcome ?
Yes, all (or at least some) of the data has to travel across the network; only then can the merging of the data occur.
Doing this is notoriously challenging.
Again Google: MapReduce.
14. Map Reduce
Provides a programming model that abstracts the problem of disk reads and writes, transforming it into a computation over keys and values.
Two phases (a quick word-count illustration follows below):
Map
Reduce
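A quick illustration of the key/value flow using word count, the standard example (not taken from the original slides):

  Input line :  "the cat sat on the mat"
  Map        :  ("the",1) ("cat",1) ("sat",1) ("on",1) ("the",1) ("mat",1)
  Shuffle    :  group by key -> ("the",[1,1]) ("cat",[1]) ("sat",[1]) ("on",[1]) ("mat",[1])
  Reduce     :  sum each group -> ("the",2) ("cat",1) ("sat",1) ("on",1) ("mat",1)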
15. So what is Hadoop ?
An operating system ?
Provides:
1. A reliable shared storage system (HDFS)
2. An analysis system (MapReduce)
16. History of Hadoop
Google was the first to build GFS and MapReduce.
They published a paper in 2004, announcing a brand new technology to the world.
This technology was already well proven inside Google by 2004.
The MapReduce paper by Google.
17. History of Hadoop
Doug Cutting saw an opportunity and led the charge to develop an open-source version of this MapReduce system, called Hadoop.
Soon after, Yahoo and others rallied around to support this effort.
Now Hadoop is a core part of:
Facebook, Yahoo, LinkedIn, Twitter …
19. HDFS -- A Brief
Design: streaming very large files on a commodity cluster.
1. Very Large Files
MBs to PBs
2. Streaming
Write-once, read-many approach (a minimal usage sketch follows this slide).
After huge data has been placed, we tend to use the data, not modify it.
The time to read the whole dataset is what matters most.
3. Commodity Cluster
No high-end servers.
Yes, there is a high chance of failure (but HDFS is tolerant enough).
Replication is done.
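A minimal sketch of the write-once / read-many pattern using Hadoop's Java FileSystem API; the file path and contents are made up for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();   // picks up the cluster configuration
          FileSystem fs = FileSystem.get(conf);       // connect to the configured file system
          Path file = new Path("/data/example.txt");  // hypothetical path, for illustration only

          // Write once: create the file and stream data into it
          try (FSDataOutputStream out = fs.create(file)) {
              out.writeUTF("hello hdfs");
          }

          // Read many: open and stream the data back (typically done by many jobs later)
          try (FSDataInputStream in = fs.open(file)) {
              System.out.println(in.readUTF());
          }
      }
  }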
20. MapReduce -- A Brief
Large-scale data processing in parallel.
MapReduce provides:
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
Two phases in MapReduce
Map
Reduce
21. MapReduce -- A Brief
Map phase
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair.
Produces a set of intermediate key/value pairs.
Reduce phase
reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key.
Produces a set of merged output values (usually just one).
A word-count sketch in Java follows.
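A minimal word-count sketch of the two phases using Hadoop's Java MapReduce API; it follows the standard word-count example rather than anything specific in the slides, and the class names are illustrative:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map phase: (offset, line) -> list of (word, 1)
  class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);   // emit an intermediate (word, 1) pair
              }
          }
      }
  }

  // Reduce phase: (word, [1, 1, ...]) -> (word, count)
  class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();                 // combine all intermediate values for this key
          }
          context.write(key, new IntWritable(sum));  // usually just one merged output value
      }
  }

In a real job these classes would be wired together by a driver that sets the input and output paths and submits the job to the cluster.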