Tutorial: Streaming Jobs (& Non-Java Hadoop)


               /*

                    Joe Stein, Chief Architect
                    http://www.medialets.com
                    Twitter: @allthingshadoop

               */




               Sample Code
   https://github.com/joestein/amaunet

Overview
• Intro
• Sample Dataset
• Options
• Deep Dive

http://allthingshadoop.com/2010/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/

Medialets
•   Largest deployment of rich media ads for mobile devices
•   Installed on hundreds of millions of devices
•   3-4 TB of new data every day
•   Thousands of services in production
•   Hundreds of thousands of events received every second
•   Response times are measured in microseconds
•   Languages
     – 35% JVM (20% Scala & 10% Java)
     – 30% Ruby
     – 20% C/C++
     – 13% Python
     – 2% Bash


MapReduce 101

Why and How It Works




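The slide itself is just the headline, so here is the contract in miniature (a sketch added for this writeup, not from the deck): map emits key/value pairs, the framework sorts and groups by key, and reduce folds each group.

from itertools import groupby

# map -> shuffle/sort -> reduce, in miniature
records = ['US|a', 'CA|b', 'US|c']
mapped = [r.split('|') for r in records]                  # map: emit (key, value) pairs
mapped.sort()                                             # shuffle/sort: bring equal keys together
for key, group in groupby(mapped, key=lambda kv: kv[0]):  # reduce: fold each group
    print '%s\t%d' % (key, len(list(group)))              # CA 1 / US 2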
Sample Dataset

Data set 1: countries.dat

name|key

United States|US
Canada|CA
United Kingdom|UK
Italy|IT




Sample Dataset

Data set 2: customers.dat

name|type|country
Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA


Sample Dataset

The requirement: grouped by customer type, find how many customers of each
type are in each country, with the full country name from countries.dat in
the final result (not the two-digit country code).

To do this you need to:

1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results



Sample Dataset
United States|US           Alice Bob|not bad|US
Canada|CA                  Sam Sneed|valued|CA
United Kingdom|UK          Jon Sneed|valued|CA
Italy|IT                   Arnold Wesise|not so good|UK
                           Henry Bob|not bad|US
                           Yo Yo Ma|not so good|CA
                           Jon York|valued|CA
                           Alex Ball|valued|UK
                           Jim Davis|not so bad|JA



   Canada not so good 1
   Canada valued 3
   JA - Unknown Country not so bad 1
   United Kingdom not so good 1
   United Kingdom valued 1
   United States not bad 2


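To pin down what the job should produce, here is a plain-Python sketch of the same computation over the sample rows (a local reference added for this writeup, not part of the MapReduce flow; the shortened customer list mirrors customers.dat):

from collections import Counter

# plain-Python reference for the tiny sample, to pin down what the job computes
countries = {'US': 'United States', 'CA': 'Canada',
             'UK': 'United Kingdom', 'IT': 'Italy'}
customers = [
    ('Alice Bob', 'not bad', 'US'), ('Sam Sneed', 'valued', 'CA'),
    ('Jon Sneed', 'valued', 'CA'), ('Arnold Wesise', 'not so good', 'UK'),
    ('Henry Bob', 'not bad', 'US'), ('Yo Yo Ma', 'not so good', 'CA'),
    ('Jon York', 'valued', 'CA'), ('Alex Ball', 'valued', 'UK'),
    ('Jim Davis', 'not so bad', 'JA'),
]

counts = Counter(
    (countries.get(code, '%s - Unknown Country' % code), ctype)
    for _, ctype, code in customers
)
for (country, ctype), n in sorted(counts.items()):
    print '%s %s %s' % (country, ctype, n)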
So many ways to MapReduce

• Java
• Hive
• Pig
• Datameer
• Cascading
   –Cascalog
   –Scalding
• Streaming with a framework
   –Wukong
   –Dumbo
   –mrjob
• Streaming without a framework
   –You can even do it with bash scripts, but don’t

Why and When
              There are two types of jobs in Hadoop:
                1) data transformation 2) queries
• Java
   – Faster? Maybe not; you might not optimize it as well as the Pig and
     Hive committers do, and it’s Java … so … it does not work outside of
     Hadoop without other Apache projects to let it do so.
• Hive & Pig
   – Definitely a possibility, but maybe better after you have created
     your data set. Does not work outside of Hadoop.
• Datameer
   – WICKED cool front end, seriously!!!
• Streaming
   – With a framework – one more thing to learn
   – Without a framework – MapReduce with and without Hadoop, huh?
     Really? Yeah!!!
How does streaming work
                           stdin & stdout

•   Hadoop opens a process, writes to its stdin, and reads from its stdout
•   Is this efficient? Yeah, it is when you look at it
•   You can read/write to your process without Hadoop – score!!!
•   Why would you do this?
     – You should not put things into Hadoop that don’t belong there.
       Prototype and go live without the overhead!
     – You can have your MapReduce program run outside of Hadoop until
       it is ready and NEEDS to be running there
     – Really great dev lifecycles
     – Did I mention the great dev lifecycles?
     – You can write a script in 5 minutes, seriously, and then
       interrogate TERABYTES of data without a fuss
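To make that concrete, a minimal sketch of what the streaming harness does (an illustration, not Hadoop's actual code): launch the script as a child process, write records to its stdin, and read key/value lines back from its stdout. It assumes smplMapper.py (shown on the next slide) is executable in the current directory.

import subprocess

# launch the mapper as a child process and speak to it over stdin/stdout,
# which is essentially the streaming contract
proc = subprocess.Popen(['./smplMapper.py'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate('United States|US\nAlice Bob|not bad|US\n')
print out   # US^-1^-1^United States  /  US^not bad^Alice Bob^-1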
Blah blah blah
Where's the beef?

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try: # bad data can cause errors; deal with lint and bad records however you like

        personName = "-1"    # default, sorts first
        personType = "-1"    # default, sorts first
        countryName = "-1"   # default, sorts first
        country2digit = "-1" # default, sorts first

        # remove leading and trailing whitespace
        line = line.strip()

        splits = line.split("|")

        if len(splits) == 2: # country data
            countryName = splits[0]
            country2digit = splits[1]
        else: # people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]

        print '%s^%s^%s^%s' % (country2digit, personType, personName, countryName)
    except: # errors will make your job fail, which you may or may not want
        pass




Here is the output of that (the mapper output, piped through sort)


CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1


Padding is your friend
                  All sorts are not created equal

Josephs-MacBook-Pro:~ josephstein$ cat test
1,,2
1,1,2
Josephs-MacBook-Pro:~ josephstein$ cat test |sort
1,,2
1,1,2

[root@megatron joestein]# cat test
1,,2
1,1,2
[root@megatron joestein]# cat test|sort
1,1,2
1,,2

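The difference above is locale collation: GNU sort on the Linux box ignores the punctuation, while the Mac compares bytes (LC_ALL=C forces byte order). Hadoop's shuffle sorts keys as bytes, which is why the "-1" defaults in the mapper pad every field and guarantee the country-mapping line arrives first. A small sketch of both points, using Python's byte-wise string comparison as a stand-in for the shuffle sort:

# Python compares strings byte by byte, like sort under LC_ALL=C
print sorted(['CA^valued^Sam Sneed^-1', 'CA^-1^-1^Canada'])
# ['CA^-1^-1^Canada', 'CA^valued^Sam Sneed^-1'] -- '-' (0x2d) sorts before
# letters, so the mapping line reaches the reducer before any person

# and why fixed-width padding matters for numeric fields
print sorted(['9', '10'])                      # ['10', '9'] -- lexicographic
print sorted(n.zfill(2) for n in ['9', '10'])  # ['09', '10'] -- padded, as intended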
And the reducer
#!/usr/bin/env python

import sys

# running state for the last key seen (the "last variable" method)
foundKey = ""
isFirst = 1
currentCount = 0
currentCountry2digit = "-1"
currentCountryName = "-1"
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from the mapper
        country2digit, personType, personName, countryName = line.split('^')

        # the first line for each country should be its mapping line;
        # otherwise we fall back to an unknown-country name below
        if personName == "-1": # a new country, which may or may not have people in it
            currentCountryName = countryName
            currentCountry2digit = country2digit
            isCountryMappingLine = True
        else:
            isCountryMappingLine = False # this is a person we want to count

        if not isCountryMappingLine: # only count people; the country line just supplies the name

            # check that the 2-digit code matches up; it might be an unknown country
            if currentCountry2digit != country2digit:
                currentCountry2digit = country2digit
                currentCountryName = '%s - Unknown Country' % currentCountry2digit

            currentKey = '%s\t%s' % (currentCountryName, personType)

            if foundKey != currentKey: # new combination of keys to count
                if isFirst == 0:
                    print '%s\t%s' % (foundKey, currentCount)
                    currentCount = 0 # reset the count
                else:
                    isFirst = 0

                foundKey = currentKey # remember the key so the next loop knows to increment or print

            currentCount += 1 # increment for every person line
    except: # errors would fail the job; swallow bad lines instead
        pass

try:
    print '%s\t%s' % (foundKey, currentCount) # flush the final key
except:
    pass
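To see the last-variable method in action, here is a hand trace over the CA block of the sorted mapper output (an annotation added for this writeup, not from the deck):

# line processed               -> state afterwards
# CA^-1^-1^Canada              -> mapping line: currentCountryName = 'Canada'
# CA^not so good^Yo Yo Ma^-1   -> foundKey = 'Canada\tnot so good', currentCount = 1
# CA^valued^Jon Sneed^-1       -> key changed: print 'Canada\tnot so good\t1'; currentCount = 1
# CA^valued^Jon York^-1        -> currentCount = 2
# CA^valued^Sam Sneed^-1       -> currentCount = 3
# IT^-1^-1^Italy               -> mapping line; 'Canada\tvalued\t3' prints on the next key change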
How to run it


• cat customers.dat countries.dat | ./smplMapper.py | sort | ./smplReducer.py
• su hadoop -c "hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.89-streaming.jar \
    -D mapred.map.tasks=75 -D mapred.reduce.tasks=42 \
    -file ./smplMapper.py -mapper ./smplMapper.py \
    -file ./smplReducer.py -reducer ./smplReducer.py \
    -input $1 -output $2 \
    -inputformat SequenceFileAsTextInputFormat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=^ \
    -jobconf stream.num.map.output.key.fields=4 \
    -jobconf map.output.key.field.separator=^ \
    -jobconf num.key.fields.for.partition=1"

Breaking down the Hadoop job


• -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
   – This is how you handle keying on values
• -jobconf stream.map.output.field.separator=^
   – Tells Hadoop how to parse your output so it can key on it
• -jobconf stream.num.map.output.key.fields=4
   – How many fields there are in total
• -jobconf map.output.key.field.separator=^
   – Lets you key on your map fields separately
• -jobconf num.key.fields.for.partition=1
   – How many of those fields are your “key”; the rest are used for sorting
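Put together: all four ^-separated fields form the sort key, but only the first decides which reducer gets the record, so every line for a country lands on the same reducer. A rough local sketch of that split, using Python's hash as a stand-in for the partitioner's byte-level hash:

# all 4 fields control sort order; only field 1 picks the reducer
key = 'CA^valued^Jon Sneed^-1'
num_reducers = 42
partition = hash(key.split('^', 1)[0]) % num_reducers  # stand-in for KeyFieldBasedPartitioner
print 'record goes to reducer %d' % partition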
Some tips


• chmod a+x your .py files; they need to execute on the nodes as they are
  LITERALLY a process that is run
• NEVER hold too much in memory; it is better to use the last-variable method
  than holding, say, a hashmap
• It is OK to have multiple jobs. DON’T put too much into each of them; it is
  better to make multiple passes over the data. Transform, then query and
  calculate. Creating data sets for your data lets others also interrogate the data
• To join smaller data sets, use -file and open the file in the script
  (see the sketch after this list)
• http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
• For Ruby streaming check out the podcast
  http://allthingshadoop.com/2010/05/20/ruby-streaming-wukong-hadoop-flip-kromer-infochimps/

• Sample code for this talk: https://github.com/joestein/amaunet
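A hedged sketch of that -file join tip: assuming the job also passes -file ./countries.dat, the file shows up in each task's working directory and the mapper can load it up front (the names below mirror the sample data; error handling is minimal on purpose):

#!/usr/bin/env python
# sketch: map-side join against a small side file shipped via -file
import sys

# -file ./countries.dat drops the file into the task's working directory
countries = {}
for row in open('countries.dat'):
    parts = row.strip().split('|')
    if len(parts) == 2: # name|code rows
        countries[parts[1]] = parts[0]

for line in sys.stdin:
    fields = line.strip().split('|')
    if len(fields) == 3: # customer rows: name|type|country
        personName, personType, code = fields
        countryName = countries.get(code, '%s - Unknown Country' % code)
        print '%s\t%s' % (countryName, personType)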
We are hiring!
 /*

      Joe Stein, Chief Architect
      http://www.medialets.com
      Twitter: @allthingshadoop

 */


 Medialets
 The rich media ad platform for mobile.
                      connect@medialets.com
                      www.medialets.com/showcase




