This is the next tech talk in our series of deep dives into the Apache Hadoop framework. Hadoop is undoubtedly the current industry leader in Big Data implementations. This tech talk covers core Hadoop and how it works. This is Part 1, which explains HDFS; the next tech talk will be Part 2, explaining MapReduce.
Big Data Architecture and Hadoop Distributed File System
1. Debarchan Sarkar
Sunil Kumar Chakrapani
The call will start soon; please stay on mute.
Thanks for your time and patience.
2. Recap: What is Big Data?
Problems introduced
Traditional architecture
Cluster architecture
Where did it all start?
How does it work? A 50,000-foot overview
How does it work? Parts 1 & 2
Hadoop distributed architecture
HDFS architecture
3. [Diagram: the expanding universe of Big Data sources]
ERP/CRM: sales pipeline, payables, payroll, inventory, contacts, deal tracking
Web 2.0/Mobile: advertising, collaboration, eCommerce, digital marketing, search marketing, web logs, recommendations
Internet of Things: audio/video, log files, text/images, social sentiment, data market feeds, eGov feeds, weather, wikis/blogs, clickstream, sensors/RFID/devices, spatial & GPS coordinates
Data grows from gigabytes (10^9) to terabytes (10^12) to petabytes (10^15) to exabytes (10^18), along the dimensions of volume, velocity, variety, and variability.
Storage cost per GB: 1980: $190,000 | 1990: $9,000 | 2000: $15 | 2010: $0.07
4. Disk capacity vs. transfer rate, 1990 vs. 2010:
1990: a typical drive stores 1,370 MB and reads at a 4.4 MB/s transfer rate; reading the whole drive takes about 5 minutes.
2010: 1 TB is the norm, read at a 100 MB/s transfer rate; reading the whole drive takes about 2.5 hours.
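The arithmetic behind those read times is easy to check (figures taken from the slide; note the 2010 result actually works out to roughly 2.8 hours, which the slide rounds down to 2.5):

```javascript
// Time to read an entire drive = capacity / transfer rate.
var mb1990 = 1370;        // MB stored on a typical 1990 drive
var rate1990 = 4.4;       // MB/s transfer rate
var mb2010 = 1000000;     // 1 TB ~ 1,000,000 MB
var rate2010 = 100;       // MB/s transfer rate

var minutes1990 = mb1990 / rate1990 / 60;  // full-drive read time in minutes
var hours2010 = mb2010 / rate2010 / 3600;  // full-drive read time in hours

console.log(minutes1990.toFixed(1) + " minutes");  // ~5.2 minutes
console.log(hours2010.toFixed(1) + " hours");      // ~2.8 hours
```

The takeaway the slide is driving at: capacity grew about 700x while transfer rate grew only about 20x, so reading a whole disk got dramatically slower in relative terms, which is exactly why Hadoop reads many disks in parallel.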
9. Google's designs and their Hadoop counterparts:
Google File System → HDFS: Hadoop Distributed File System
MapReduce → MapReduce
11. // MapReduce functions in JavaScript (word count)
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
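To see these two functions in action, here is a minimal stand-alone harness; the `context` object and the `hasNext()`/`next()` iterator are stand-ins I am assuming for illustration, not the real Hadoop JavaScript shim:

```javascript
// Word-count map and reduce, as on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") { context.write(words[i].toLowerCase(), 1); }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) { sum += parseInt(values.next(), 10); }
    context.write(key, sum);
};

// Map phase: collect (word, 1) pairs from one input line.
var pairs = [];
map(null, "Hadoop is big. Big data needs Hadoop",
    { write: function (k, v) { pairs.push([k, v]); } });

// Shuffle phase: group the values by key, as the framework would.
var groups = {};
pairs.forEach(function (p) {
    (groups[p[0]] = groups[p[0]] || []).push(p[1]);
});

// Reduce phase: sum each key's values via an iterator-style object.
var counts = {};
Object.keys(groups).forEach(function (k) {
    var i = 0, vals = groups[k];
    reduce(k,
        { hasNext: function () { return i < vals.length; },
          next: function () { return vals[i++]; } },
        { write: function (key, v) { counts[key] = v; } });
});

console.log(counts);  // { hadoop: 2, is: 1, big: 2, data: 1, needs: 1 }
```

The middle "shuffle" step is what Hadoop performs between the map and reduce phases: all values for the same key are brought together before a single reduce call sees them.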
15. NameNode
• Reads the fsimage and edits files
• Transactions in edits are merged with fsimage, and edits is emptied
• When a client application creates a new file in HDFS, the NameNode logs that transaction in the edits file

Secondary NameNode (Checkpoint)
• Periodically creates checkpoints of the namespace
• Downloads fsimage and edits from the active NameNode
• Merges fsimage and edits locally
• Uploads the new image back to the active NameNode
• Checkpoint frequency is controlled by fs.checkpoint.period and fs.checkpoint.size
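The two checkpoint triggers above are ordinary Hadoop 1.x configuration properties. A sketch of what they look like in configuration follows; the values shown are the commonly documented defaults (an hourly checkpoint, or whenever edits reaches 64 MB), so treat them as an assumption to verify against your Hadoop version:

```xml
<!-- Checkpoint tuning for the Secondary NameNode (Hadoop 1.x names) -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>      <!-- seconds between checkpoints -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>  <!-- edits size in bytes that forces a checkpoint -->
</property>
```

Whichever condition fires first triggers the checkpoint, so a write-heavy cluster checkpoints on size while a quiet one checkpoints on time.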
16. During start-up the NameNode loads the file system state from the fsimage and the
edits log file, then waits for the DataNodes to report their blocks.
During this time the NameNode stays in Safemode.
Safemode is essentially a read-only mode for the HDFS cluster: the NameNode
does not allow any modifications to the file system or blocks.
Normally the NameNode leaves Safemode automatically once the DataNodes have reported
that most file system blocks are available.
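Safemode can also be inspected and controlled from the command line with the `dfsadmin` tool (these subcommands exist in the Hadoop 1.x CLI shown here; later releases use the equivalent `hdfs dfsadmin` form):

```shell
# Report whether the NameNode is currently in Safemode
hadoop dfsadmin -safemode get

# Block until the NameNode leaves Safemode on its own
hadoop dfsadmin -safemode wait

# Manually force the NameNode out of, or into, Safemode
hadoop dfsadmin -safemode leave
hadoop dfsadmin -safemode enter
```

`-safemode wait` is handy in start-up scripts that must not write to HDFS before the block reports are in.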
17. [Diagram: staged file creation in HDFS, Steps 1–5, involving the HDFS client, the NameNode, and the DataNodes]
Step 1: The HDFS client caches the file data into a temporary local file.