This document introduces Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses HDFS for scalable storage and MapReduce for distributed processing. Key components are introduced, including how HDFS stores data in replicated blocks and how MapReduce executes jobs by splitting data, mapping tasks, shuffling, and reducing results. A word count example demonstrates the MapReduce process.
2. Agenda
• Big data - big issues
• Hadoop to the rescue
• Storage - HDFS
• Processing - MapReduce
• Hadoop ecosystem
3. Big Data - Big Issues
● Volume, Velocity, Variety
● Lots of data - logs, sensors, social, pictures, video, etc.
● May not fit on a single machine
● Access to data is slow
● Hardware may fail
● Network errors happen
4. Hadoop to the rescue
• Distributed “operating system”
• Scalable - many servers of commodity hardware with lots of cores and disks
• Reliable - detects failures, redundant storage
• Fault-tolerant - auto-retry, self-healing
• Simple - use many servers as one really big computer
• Suitable for batch processing (throughput over latency)
5. Storage - HDFS
• Hadoop Distributed File System
• Replicated (3 replicas by default), fixed-size blocks (64 MB by default)
• Runs on large clusters of commodity machines
• Optimized for write-once, read-many throughput on large files
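To make the block math concrete, here is a small stand-alone sketch in plain Java (no Hadoop dependency). The 64 MB block size and replication factor of 3 are the defaults from the bullets above; the 200 MB file size is a hypothetical example:

```java
// Sketch: how many blocks and block replicas HDFS would create for a file,
// using the defaults from this slide (64 MB blocks, 3 replicas).
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default
    static final int REPLICATION = 3;                  // default replication factor

    // Number of fixed-size blocks needed for fileSize bytes (last block may be partial).
    static long blocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a hypothetical 200 MB file
        long b = blocks(fileSize);
        // 200 MB needs 4 blocks (3 full + 1 partial), so 12 replicas cluster-wide.
        System.out.println(b + " blocks, " + (b * REPLICATION) + " replicas stored");
    }
}
```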
7. Useful HDFS commands
• hdfs dfs -get <file name> - copy a file from HDFS to local
• hdfs dfs -put <file name> [destination] - copy a local file into HDFS at the specified destination
• hdfs dfs -cat <file name> - print a file to stdout
• hdfs dfs -ls <dir name> - list all files under the specified directory
• hdfs dfs -mv <file name> <changed name> - rename (move) a file
• hdfs dfs -rm <file name> - remove a file
• hdfs dfs -rm -r <directory name> - remove a directory
• hdfs dfs -mkdir <dir name> - create a directory
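A typical session chaining the commands above might look like the following. This is only a sketch: the file and directory names are hypothetical, and it assumes a running HDFS cluster is reachable.

```shell
# Create a directory and copy a local log file into HDFS
hdfs dfs -mkdir /logs
hdfs dfs -put access.log /logs
# Inspect what landed there
hdfs dfs -ls /logs
hdfs dfs -cat /logs/access.log
# Rename the file, then pull a copy back to the local filesystem
hdfs dfs -mv /logs/access.log /logs/access-old.log
hdfs dfs -get /logs/access-old.log
# Clean up the file and the directory
hdfs dfs -rm /logs/access-old.log
hdfs dfs -rm -r /logs
```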
8. Processing - MapReduce
• A distributed data-processing model and execution environment that runs on large clusters of commodity machines
• Responsible for running a job in parallel on many servers
• Handles retrying failed tasks and validating completed results
• Computation is moved to the data
10-14. MapReduce Sample - Word Count

input:
Ini Mini Miny Mo Mo Miny Ini Mo Mini

splitting (one split per line):
Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

mapping (each word emits (word, 1)):
(Ini, 1) (Mini, 1) (Miny, 1)
(Mo, 1) (Mo, 1) (Miny, 1)
(Ini, 1) (Mo, 1) (Mini, 1)

shuffling (group values by key):
Ini, [1, 1]
Mini, [1, 1]
Miny, [1, 1]
Mo, [1, 1, 1]

reducing (sum the values per key) - final result:
Ini, 2
Mini, 2
Miny, 2
Mo, 3
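The stages above can be simulated in plain Java (no Hadoop required) to check the final counts. This is only an illustration of the data flow, not Hadoop's actual execution engine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
    // Runs the map -> shuffle -> reduce stages over the given input splits.
    static Map<String, Integer> run(List<String> splits) {
        // mapping: each word emits (word, 1); shuffling: group values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // reducing: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("Ini Mini Miny", "Mo Mo Miny", "Ini Mo Mini");
        System.out.println(run(splits)); // {Ini=2, Mini=2, Miny=2, Mo=3}
    }
}
```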
22. Word Count Mapper

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
23. Word Count Reducer

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
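The Mapper and Reducer above still need a driver class to configure and submit the job. Using the same old-style (org.apache.hadoop.mapred) API the slides use, a minimal sketch looks like this; it requires the Hadoop libraries on the classpath, and the input/output paths come from the command line:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // Key/value types emitted by both the mapper and the reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);       // the Map class from slide 22
        conf.setReducerClass(Reduce.class);   // the Reduce class from slide 23
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // submit the job and wait for completion
    }
}
```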
24. Hadoop Ecosystem
• Hive - SQL-like language over big data using MapReduce
• HBase - distributed, column-oriented database
• ZooKeeper - coordination service
• Avro - cross-language serialization
• Pig - language for exploring big data
• Impala - SQL-like queries directly over HDFS
• Sqoop - tool for moving data from databases to HDFS
• Mahout - machine learning and data mining library
25. Some resources
• Motivation about Hadoop and where it’s going - video and whitepaper
• HDFS Architecture Guide
• How MapReduce Works With Hadoop
• HDFS shell commands
• VM
• MapReduce tutorial