This document introduces Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses HDFS for scalable storage and MapReduce for distributed processing. Key components are introduced, including how HDFS stores data in replicated blocks and how MapReduce executes jobs by splitting data, mapping tasks, shuffling, and reducing results. A word count example demonstrates the MapReduce process.
2. Agenda
• Big data - big issues
• Hadoop to the rescue
• Storage - HDFS
• Processing - MapReduce
• Hadoop ecosystem
3. Big Data - Big Issues
● Volume, Velocity, Variety
● Lots of data - logs, sensors, social, pictures, video, etc.
● May not fit on a single machine
● Access to data is slow
● Hardware may fail
● Network errors happen
4. Hadoop to the rescue
• Distributed “operating system”
• Scalable - many servers of commodity hardware with lots of cores and disks
• Reliable - detects failures, redundant storage
• Fault-tolerant - auto-retry, self-healing
• Simple - use many servers as one really big computer
• Suitable for batch processing (throughput over latency)
5. Storage - HDFS
• Hadoop Distributed File System
• Replicated (3 replicas by default), fixed-size blocks (64 MB by default)
• Runs on large clusters of commodity machines
• Optimized for write-once, read-many throughput on large files
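To make the block math concrete, here is a small stand-alone sketch in plain Java (no Hadoop dependency). The 64 MB block size and replication factor of 3 are the defaults from the bullets above; the 200 MB file size is a hypothetical example:

```java
// Sketch: how many blocks and block replicas HDFS would create for a file,
// using the defaults from this slide (64 MB blocks, 3 replicas).
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default
    static final int REPLICATION = 3;                  // default replication factor

    // Number of fixed-size blocks needed for fileSize bytes (last block may be partial).
    static long blocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a hypothetical 200 MB file
        long b = blocks(fileSize);
        // 200 MB needs 4 blocks (3 full + 1 partial), so 12 replicas cluster-wide.
        System.out.println(b + " blocks, " + (b * REPLICATION) + " replicas stored");
    }
}
```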
7. Useful HDFS commands
• hdfs dfs -get <file name> - copy a file from HDFS to local
• hdfs dfs -put <file name> [destination] - copy a local file into HDFS at the specified destination
• hdfs dfs -cat <file name> - print a file to stdout
• hdfs dfs -ls <dir name> - list all files under the specified directory
• hdfs dfs -mv <file name> <changed name> - rename (move) a file
• hdfs dfs -rm <file name> - remove a file
• hdfs dfs -rm -r <directory name> - remove a directory
• hdfs dfs -mkdir <dir name> - create a directory
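A typical session chaining the commands above might look like the following. This is only a sketch: the file and directory names are hypothetical, and it assumes a running HDFS cluster is reachable.

```shell
# Create a directory and copy a local log file into HDFS
hdfs dfs -mkdir /logs
hdfs dfs -put access.log /logs
# Inspect what landed there
hdfs dfs -ls /logs
hdfs dfs -cat /logs/access.log
# Rename the file, then pull a copy back to the local filesystem
hdfs dfs -mv /logs/access.log /logs/access-old.log
hdfs dfs -get /logs/access-old.log
# Clean up the file and the directory
hdfs dfs -rm /logs/access-old.log
hdfs dfs -rm -r /logs
```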
8. Processing - MapReduce
• A distributed data-processing model and execution environment that runs on large clusters of commodity machines
• Responsible for running a job in parallel on many servers
• Handles retrying failed tasks and validating completed results
• Computation is moved to the data
10-14. MapReduce Sample - Word Count

input:
Ini Mini Miny Mo Mo Miny Ini Mo Mini

splitting (one split per line):
Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

mapping (each word emits (word, 1)):
(Ini, 1) (Mini, 1) (Miny, 1)
(Mo, 1) (Mo, 1) (Miny, 1)
(Ini, 1) (Mo, 1) (Mini, 1)

shuffling (group values by key):
Ini, [1, 1]
Mini, [1, 1]
Miny, [1, 1]
Mo, [1, 1, 1]

reducing (sum the values per key) - final result:
Ini, 2
Mini, 2
Miny, 2
Mo, 3
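The stages above can be simulated in plain Java (no Hadoop required) to check the final counts. This is only an illustration of the data flow, not Hadoop's actual execution engine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
    // Runs the map -> shuffle -> reduce stages over the given input splits.
    static Map<String, Integer> run(List<String> splits) {
        // mapping: each word emits (word, 1); shuffling: group values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // reducing: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("Ini Mini Miny", "Mo Mo Miny", "Ini Mo Mini");
        System.out.println(run(splits)); // {Ini=2, Mini=2, Miny=2, Mo=3}
    }
}
```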
22. Word Count Mapper

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
23. Word Count Reducer

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
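The Mapper and Reducer above still need a driver class to configure and submit the job. Using the same old-style (org.apache.hadoop.mapred) API the slides use, a minimal sketch looks like this; it requires the Hadoop libraries on the classpath, and the input/output paths come from the command line:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // Key/value types emitted by both the mapper and the reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);       // the Map class from slide 22
        conf.setReducerClass(Reduce.class);   // the Reduce class from slide 23
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // submit the job and wait for completion
    }
}
```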
24. Hadoop Ecosystem
• Hive - SQL-like language over big data using MapReduce
• HBase - distributed, column-oriented database
• ZooKeeper - coordination service
• Avro - cross-language serialization
• Pig - language for exploring big data
• Impala - SQL-like queries directly over HDFS
• Sqoop - tool for moving data from databases to HDFS
• Mahout - machine learning and data mining library
25. Some resources
• Motivation about Hadoop and where it’s going - video and whitepaper
• HDFS Architecture Guide
• How MapReduce Works With Hadoop
• HDFS shell commands
• VM
• MapReduce tutorial