Hadoop Streaming Tutorial With Python

Tutorial: Streaming Jobs (& Non-Java Hadoop)

/*

Joe Stein, Chief Architect
http://www.medialets.com
Twitter: @allthingshadoop

*/

Sample Code
https://github.com/joestein/amaunet

Overview
• Intro
• Sample Dataset
• Options
• Deep Dive

http://allthingshadoop.com/2010/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/

Medialets
• Largest deployment of rich media ads for mobile devices
• Installed on hundreds of millions of devices
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
– 35% JVM (20% Scala & 10% Java)
– 30% Ruby
– 20% C/C++
– 13% Python
– 2% Bash

MapReduce 101

Why and How It Works

Sample Dataset

The requirement: for each country, count how many customers
of each type it has, and show the full country name from
countries.dat in the final result (not the two-letter
country code).

To do this you need to:

1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results
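
For a handful of records the four steps can be sketched in memory. This is only an illustration of the expected result, not the streaming approach the rest of the talk builds. It assumes the pipe-delimited layouts implied by the mapper later in the deck (countryName|country2digit and personName|personType|country2digit); the function name count_types_by_country is hypothetical.

```python
from collections import Counter

# A few records in the layouts the mapper's split logic implies (assumed).
countries = ["Canada|CA", "United States|US"]
customers = ["Jon Sneed|valued|CA", "Yo Yo Ma|not so good|CA", "Henry Bob|not bad|US"]


def count_types_by_country(country_lines, customer_lines):
    # 1) Join: build a lookup from country code to country name.
    names = {}
    for line in country_lines:
        name, code = line.split("|")
        names[code] = name
    # 2) + 3) Key on country and count each customer type.
    counts = Counter()
    for line in customer_lines:
        _person, ptype, code = line.split("|")
        country = names.get(code, "%s - Unknown Country" % code)
        counts[(country, ptype)] += 1
    return counts


# 4) Output the results, tab separated.
for (country, ptype), n in sorted(count_types_by_country(countries, customers).items()):
    print("%s\t%s\t%d" % (country, ptype, n))
```

This holds every country and every counter in memory, which is fine for a toy input and exactly what the streaming version avoids.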

So many ways to MapReduce

• Java
• Hive
• Pig
• Datameer
• Cascading
–Cascalog
–Scalding
• Streaming with a framework
–Wukong
–Dumbo
–mrjob
• Streaming without a framework
–You can even do it with bash scripts, but don’t

Why and When
There are two types of jobs in Hadoop
1) data transformation 2) queries
• Java
– Faster? Maybe not, because you might not know how to
optimize it as well as the Pig and Hive committers do. It's
Java, so ... It does not work outside of Hadoop without
other Apache projects to let it do so.
• Hive & Pig
– Definitely a possibility but maybe better after you have
created your data set. Does not work outside of Hadoop.
• Datameer
– WICKED cool front end, seriously!!!
• Streaming
– With a framework – one more thing to learn
– Without a framework – MapReduce with and without
Hadoop, huh? really? Yeah!!!

How does streaming work
stdin & stdout

• Hadoop actually opens a process, writes to its stdin and reads from its stdout
• Is this efficient? Yeah it is when you look at it
• You can read/write to your process without Hadoop – score!!!
• Why would you do this?
– You should not put things into Hadoop that don’t belong
there. Prototype and go live without the overhead!
– You can have your MapReduce program run outside of
Hadoop until it is ready and NEEDS to be running there
– Really great dev lifecycles
– Did I mention the great dev lifecycles?
– You can write a script in 5 minutes, seriously, and then
interrogate TERABYTES of data without a fuss
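
The stdin/stdout contract means the whole flow can be imitated in a single process. The following is a minimal sketch, assuming the mapper and reducer are plain functions that take lines and yield lines; run_local, type_mapper and count_reducer are hypothetical names, and the in-memory sort only stands in for Hadoop's shuffle and sort.

```python
def run_local(mapper, reducer, lines):
    """Mimic `cat input | mapper | sort | reducer` inside one process."""
    mapped = list(mapper(lines))   # map phase: lines in, lines out
    mapped.sort()                  # stand-in for Hadoop's shuffle and sort
    return list(reducer(mapped))   # reduce phase sees equal keys side by side


def type_mapper(lines):
    # Emit the customer type as the key for each "name|type|country" record.
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) == 3:
            yield fields[1]


def count_reducer(sorted_keys):
    # Count runs of identical keys, holding only the current key in memory.
    current, count = None, 0
    for key in sorted_keys:
        if key != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = key, 0
        count += 1
    if current is not None:
        yield "%s\t%d" % (current, count)


records = ["Jon Sneed|valued|CA", "Sam Sneed|valued|CA", "Yo Yo Ma|not so good|CA"]
for out in run_local(type_mapper, count_reducer, records):
    print(out)
```

Because the two functions only ever see lines, the same logic can later be wrapped around sys.stdin and print and handed to Hadoop unchanged.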

Blah blah blah
Where's the beef?

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try:  # bad data can raise errors; deal with lint and bad data however you like

        personName = "-1"  # default sorted as first
        personType = "-1"  # default sorted as first
        countryName = "-1"  # default sorted as first
        country2digit = "-1"  # default sorted as first

        # remove leading and trailing whitespace
        line = line.strip()

        splits = line.split("|")

        if len(splits) == 2:  # country data
            countryName = splits[0]
            country2digit = splits[1]
        else:  # people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]

        # country code first, so the sort groups each country's lines together
        print('%s^%s^%s^%s' % (country2digit, personType, personName, countryName))
    except Exception:  # an uncaught error would fail the job, which you may or may not want
        pass

Here is the output of that

CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1

Padding is your friend
All sorts are not created equal

Josephs-MacBook-Pro:~ josephstein$ cat test
1,,2
1,1,2
Josephs-MacBook-Pro:~ josephstein$ cat test |sort
1,,2
1,1,2

[root@megatron joestein]# cat test
1,,2
1,1,2
[root@megatron joestein]# cat test|sort
1,1,2
1,,2
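
The difference between the two machines is most likely locale collation: a locale-aware sort may skip punctuation when comparing, while a byte-order sort does not (`LC_ALL=C sort` forces byte order). The sketch below uses Python's sorted(), which compares strings by code point, to show why the "-1" padding puts the country line ahead of the people lines under byte ordering. The helper name byte_order_sort is hypothetical.

```python
def byte_order_sort(lines):
    # sorted() compares str values by code point, like `LC_ALL=C sort` for ASCII.
    return sorted(lines)


# ',' (0x2C) sorts before '1' (0x31), so the empty field comes first.
print(byte_order_sort(["1,1,2", "1,,2"]))

# The mapper pads missing fields with "-1"; '-' (0x2D) sorts before digits and
# letters, so the country mapping line lands ahead of that country's people.
keys = ["CA^valued^Jon Sneed^-1", "CA^-1^-1^Canada", "CA^not so good^Yo Yo Ma^-1"]
print(byte_order_sort(keys)[0])
```

If the sort on your machine disagrees with this ordering, the reducer will see people before their country line and report them under an unknown country.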

And the reducer
#!/usr/bin/env python

import sys

# running state for the (country name, person type) key currently being counted
foundKey = ""
isFirst = 1
currentCount = 0
currentCountry2digit = "-1"
currentCountryName = "-1"
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from mapper.py
        country2digit, personType, personName, countryName = line.split('^')

        # the first line should be a mapping line, otherwise we need to set the currentCountryName to not known
        if personName == "-1":  # this is a new country which may or may not have people in it
            currentCountryName = countryName
            currentCountry2digit = country2digit
            isCountryMappingLine = True
        else:
            isCountryMappingLine = False  # this is a person we want to count

        if not isCountryMappingLine:  # we only want to count people but use the country line to get the right name

            # first check to see if the 2digit country info matches up, might be unknown country
            if currentCountry2digit != country2digit:
                currentCountry2digit = country2digit
                currentCountryName = '%s - Unknown Country' % currentCountry2digit

            currentKey = '%s\t%s' % (currentCountryName, personType)

            if foundKey != currentKey:  # new combo of keys to count
                if isFirst == 0:
                    print('%s\t%s' % (foundKey, currentCount))
                    currentCount = 0  # reset the count
                else:
                    isFirst = 0

                foundKey = currentKey  # remember the key so the next loop knows whether to increment or print

            currentCount += 1  # we increment anything not in the map list
    except Exception:  # skip malformed lines instead of failing the job
        pass

# flush the last key, but only if at least one person was counted
if isFirst == 0:
    print('%s\t%s' % (foundKey, currentCount))
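
The same last-key pattern can be written more compactly with itertools.groupby, which also walks the sorted stream one record at a time. This is an alternative sketch, not the code from the sample repository; reduce_stream is a hypothetical name, and it assumes a person's type is never the padding value "-1".

```python
from itertools import groupby


def reduce_stream(lines):
    """Yield 'countryName<TAB>personType<TAB>count' from sorted mapper output."""
    rows = (line.strip().split("^") for line in lines)
    rows = (row for row in rows if len(row) == 4)  # skip malformed lines
    for code, country_rows in groupby(rows, key=lambda row: row[0]):
        name = "%s - Unknown Country" % code
        for ptype, type_rows in groupby(country_rows, key=lambda row: row[1]):
            if ptype == "-1":
                # The padded mapping line sorts first and carries the real name.
                name = next(type_rows)[3]
                continue
            yield "%s\t%s\t%d" % (name, ptype, sum(1 for _ in type_rows))


sample = [
    "CA^-1^-1^Canada",
    "CA^not so good^Yo Yo Ma^-1",
    "CA^valued^Jon Sneed^-1",
    "CA^valued^Jon York^-1",
    "CA^valued^Sam Sneed^-1",
    "JA^not so bad^Jim Davis^-1",
]
for out in reduce_stream(sample):
    print(out)
```

In a streaming job the generator would be fed sys.stdin instead of the sample list. Like the reducer above, it depends on the input arriving sorted with the country line first.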

How to run it

• cat customers.dat countries.dat|./smplMapper.py|sort|./smplReducer.py
• su hadoop -c "hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.89-streaming.jar \
    -D mapred.map.tasks=75 -D mapred.reduce.tasks=42 \
    -file ./smplMapper.py -mapper ./smplMapper.py \
    -file ./smplReducer.py -reducer ./smplReducer.py \
    -input $1 -output $2 \
    -inputformat SequenceFileAsTextInputFormat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=^ \
    -jobconf stream.num.map.output.key.fields=4 \
    -jobconf map.output.key.field.separator=^ \
    -jobconf num.key.fields.for.partition=1"

Breaking down the Hadoop job

• -partitioner
org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
– This is how you partition on part of the key instead of the
whole key
• -jobconf stream.map.output.field.separator=^
– Tells Hadoop how to split your mapper output into fields so
it can key on them
• -jobconf stream.num.map.output.key.fields=4
– How many leading fields make up the key (here all four)
• -jobconf map.output.key.field.separator=^
– The separator inside the key, so you can key on your map
fields separately
• -jobconf num.key.fields.for.partition=1
– How many of those fields are your partitioning “key”; the
rest are used only for the sort
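
The effect of these settings can be sketched as follows: the partition is chosen from the first key field only, so every record for a country reaches the same reducer, and within that reducer the records are ordered on all four fields. The partition function and the use of zlib.crc32 are assumptions for illustration; Hadoop uses its own hash, so real partition numbers will differ.

```python
import zlib


def partition(key, num_reducers, num_partition_fields=1, sep="^"):
    """Pick a reducer from the first `num_partition_fields` fields of the key."""
    prefix = sep.join(key.split(sep)[:num_partition_fields])
    # crc32 is only a deterministic stand-in for Hadoop's own hash function.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


keys = ["CA^valued^Jon Sneed^-1", "CA^-1^-1^Canada", "US^not bad^Henry Bob^-1"]
buckets = {}
for key in keys:
    buckets.setdefault(partition(key, 42), []).append(key)

for reducer_id, bucket in sorted(buckets.items()):
    # Within one reducer, records arrive sorted on the full four-field key.
    print(reducer_id, sorted(bucket))
```

Partitioning on the whole key instead would scatter a country's people and its mapping line across reducers, and the join in the reducer would break.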

Some tips

• chmod a+x your py files; they need to be executable on the nodes because they are
LITERALLY run as a process
• NEVER hold too much in memory; it is better to use the last-variable method
than to hold, say, a hashmap
• It is OK to have multiple jobs. DON’T put too much into each one; it is
better to make several passes over the data. Transform, then query and calculate.
Creating data sets for your data lets others interrogate the data too
• To join smaller data sets use -file and open the file in the script
• http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
• For Ruby streaming check out the podcast
http://allthingshadoop.com/2010/05/20/ruby-streaming-wukong-hadoop-flip-kromer-infochimps/

• Sample Code for this talk https://github.com/joestein/amaunet
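
The -file tip amounts to a map-side join: the small data set is shipped with the job, lands in the task's working directory, and is loaded once per mapper. A minimal sketch, assuming countries.dat uses the countryName|country2digit layout from the mapper above; load_countries and map_customers are hypothetical names.

```python
def load_countries(path="countries.dat"):
    """Read 'countryName|country2digit' lines into a code -> name dict."""
    names = {}
    with open(path) as handle:
        for line in handle:
            fields = line.strip().split("|")
            if len(fields) == 2:
                names[fields[1]] = fields[0]
    return names


def map_customers(lines, names):
    """Emit 'countryName<TAB>personType' for each 'name|type|code' line."""
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) == 3:
            code = fields[2]
            country = names.get(code, "%s - Unknown Country" % code)
            yield "%s\t%s" % (country, fields[1])


# In-memory demo; in a job the lookup would come from load_countries().
demo_names = {"CA": "Canada"}
for out in map_customers(["Jon Sneed|valued|CA", "Jim Davis|not so bad|JA"], demo_names):
    print(out)
```

In a streaming mapper the call would be map_customers(sys.stdin, load_countries()), with countries.dat added to the job through an extra -file option. This only suits lookup data that fits comfortably in memory on every node.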

We are hiring!

Medialets
The rich media ad
platform for mobile.
connect@medialets.com
www.medialets.com/showcase
