Meet-up: Tackling “Big
Data” with Hadoop and
Sam Kamin, VP Data Engineering
NYC Data Science Academy
NYC Data Science Academy
● We’re a company that does training and
consulting in the Data Science area.
● I’m Sam Kamin. I just joined NYCDSA as
VP of Data Engineering (a new area for us).
I was formerly a professor at the U. of Illinois
(CS) and a Software Engineer at Google.
What this meet-up is about
● Wikipedia: “Data Science is the extraction of
knowledge from large volumes of data.”
● My goal tonight: Show you how you can
handle large volumes of data with simple
Python programming, using the Hadoop
streaming interface.
Outline of talk
● Brief overview of Hadoop
● Introduction to parallelism via MapReduce
● Examples of applying MapReduce
● Implementing MapReduce in Python
You can do some programming at the end if you want!
Big Data: What’s the problem?
Too much data!
o The web contains about 5 billion web pages. According
to Wikipedia, its total size is about 4 zettabytes
(1 zettabyte = 10^21 bytes), or four thousand billion
billion bytes.
o Google’s datacenters store about 15 exabytes (15 × 10^18 bytes).
Big Data: What’s the solution?
● Parallel computing: Use multiple,
cooperating computers.
● Parallelism = dividing up a problem so that
multiple computers can all work on it:
o Break the data into pieces
o Send the pieces to different computers for processing
o Send the results back and process the combination
to get the final result.
Cloud computing
● Amazon, Google, Microsoft, and many other
companies operate huge clusters: Racks of
(basically) off-the-shelf computers with
(basically) standard network connections.
● The computers in these clusters run Linux -
use them like any other computer...
Cloud computing
● But: getting them to work together is really hard:
o Management: machine/disk failure; efficient data
placement; debugging, monitoring, logging, auditing.
o Algorithms: decomposing your problem so it can be
solved in parallel can be hard.
That’s what Hadoop is here to help with.
● A collection of services in a cluster:
o Distributed, reliable file system (HDFS)
o Scheduler to run jobs in correct order, monitor,
restart on failure, etc.
o MapReduce to help you decompose your problem
for parallel execution
o A variety of other components (mostly based on
MapReduce), e.g. databases, application-focused
How to use Hadoop
● Hadoop is open source (free!)
● It is hosted on Apache:
● Download it and run it standalone (for
● Buy a cluster or rent time on one, e.g. AWS,
GCE, Azure. (All offer some free time for
new users.)
● The main, and original, parallel-processing
system of Hadoop.
● Developed by Google to simplify parallel
processing. Hadoop started as an open-
source implementation of Google’s idea.
● With Hadoop’s streaming interface, it’s really
easy to use MapReduce in Python.
MapReduce - The Big Idea
● Calculations on large data sets often have
this form: Start by aggregating the data
(possibly in a different order from the
“natural order”), then perform a summarizing
calculation on the aggregated groups.
● The idea of MapReduce: If your calculation
is explicitly structured like this, it can be
automatically parallelized.
Computing with MapReduce
A MapReduce computation has three stages:
Map: A function called map is applied to each record in
your input. It produces zero or more records as output,
each with a key and value. Keys may be repeated.
Shuffle: The output from step 1 is sorted and combined: All
records with the same key are combined into one.
Reduce: A function called reduce is applied to each record
(key + values) from step 2 to produce the final output.
As the programmer, you only write map and reduce.
Computing with MapReduce
[Diagram: Input → map → shuffle → reduce → Output]
Output of map:
A, 7
C, 5
B, 23
B, 12
A, 18
Output of shuffle:
A, [18, 7]
B, [23, 12]
C, [5]
Note: map is record-oriented, meaning the output of the
map stage is strictly a combination of the outputs from
each record. That allows us to calculate in parallel...
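The three stages can be sketched in memory with plain Python. This is only an illustration, not how Hadoop is implemented: `run_mapreduce`, `parse_pair`, and `total` are hypothetical names, and summing the values is an assumed choice of reduce (the slide does not say what reduce computes).

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce in memory on a list of records."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))

    # Shuffle: collect all values that share a key into one list.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: one call per (key, list of values), in key order.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]


def parse_pair(line):
    # Hypothetical map: turn "A, 7" into the pair ("A", 7).
    key, value = line.split(",")
    return [(key.strip(), int(value))]


def total(key, values):
    # Hypothetical reduce (an assumption): sum the values for one key.
    return (key, sum(values))


lines = ["A, 7", "C, 5", "B, 23", "B, 12", "A, 18"]
result = run_mapreduce(lines, parse_pair, total)
print(result)
```

Only `parse_pair` and `total` are specific to the problem; the driver function stays the same for any map and reduce.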
Parallelism via MapReduce
Because map and reduce are record-oriented, MR can
divide inputs into arbitrary chunks:
[Diagram: Input → map → shuffled records A, [18, 7]; B, [23, 12]; C, [5] → reduce]
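A small sketch of why chunking is safe: mapping each chunk separately and concatenating the results gives the same output as mapping the whole input. Threads stand in for separate machines here, and `map_record`, `map_chunk`, and `split_into_chunks` are hypothetical helpers, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def map_record(line):
    # Hypothetical map: turn "A, 7" into the pair ("A", 7).
    key, value = line.split(",")
    return [(key.strip(), int(value))]


def map_chunk(chunk):
    # Apply map to every record of one chunk, independently of other chunks.
    pairs = []
    for line in chunk:
        pairs.extend(map_record(line))
    return pairs


def split_into_chunks(records, n_chunks):
    # Split a list into at most n_chunks contiguous pieces.
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


lines = ["A, 7", "C, 5", "B, 23", "B, 12", "A, 18"]
chunks = split_into_chunks(lines, 3)

# Each chunk is mapped by a separate worker; threads stand in for machines.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_chunk = list(pool.map(map_chunk, chunks))

chunked_output = [pair for pairs in per_chunk for pair in pairs]
whole_output = map_chunk(lines)
print(chunks)
print(chunked_output == whole_output)
```

The equality holds because `map_record` looks at one record at a time and keeps no state between records.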
MapReduce example: Stock prices
● Input: list of daily opening and closing prices for
thousands of stocks over thousands of days.
● Desired output: The biggest-ever one-day
percentage price increase for each stock.
● Solution using MR:
o map: (stock, open, close) =>
(stock, (close - open) / open) (emitted only if the change is positive)
o reduce: (stock, [%c0, %c1, …]) =>
(stock, max [%c0, %c1, …]).
MapReduce example - map
Input (stock, open, close):
Goog, 230, 240
Apple, 100, 98
MS, 300, 250
MS, 250, 260
MS, 270, 280
Goog, 220, 215
Goog, 300, 350
IBM, 80, 90
IBM, 90, 85
Output of map (stock, % increase):
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%
You supply map: Output stock with % increase, or
nothing if decrease.
MapReduce example - shuffle/sort
Input (output of map):
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%
Output of shuffle/sort:
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
MapReduce supplies shuffle/sort: Combine all
records for each stock.
MapReduce example - reduce
Input (output of shuffle/sort):
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
Output of reduce:
Goog, 16.6%
IBM, 12.5%
MS, 4%
You supply reduce: Output max of percentages for
each input record.
Wait, why did that help?
I could have just written a loop to read every
line and put the percentages in a table!
● Suppose you have a terabyte of data, and
1000 computers in your cluster.
● MapReduce can automatically split the data
into 1000 1GB chunks. You write two simple
functions and get a 1000x speed-up!
Modelling problems using MR
● We’re going to look at a variety of problems
and see how we can fit them into the MR
● The question for each problem is: What are
the types of map and reduce, and what do
they do?
Example: Word count
Input: Lines of text.
Desired output: # of occurrences of each
word (i.e. each sequence of non-space chars)
E.g. Input: Roses are red, violets are blue
Output: are, 2
blue, 1
red, 1 etc.
Example: Word count
● map: “w1 w2 … wk” → (w1, 1), (w2, 1), …, (wk, 1)
● reduce: (w, [1, 1, …, 1]) → (w, n), where the list holds n 1’s
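As a rough in-memory sketch of this map and reduce (the function names are hypothetical, and sorting the pairs stands in for the shuffle stage):

```python
from itertools import groupby
from operator import itemgetter


def word_count_map(line):
    # Emit (word, 1) for each sequence of non-space characters.
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, ones):
    # The list holds one 1 per occurrence, so its sum is the count.
    return (word, sum(ones))


def word_count(lines):
    pairs = [pair for line in lines for pair in word_count_map(line)]
    pairs.sort(key=itemgetter(0))  # stand-in for the shuffle stage
    return [
        word_count_reduce(word, [one for _, one in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    ]


counts = word_count(["Roses are red, violets are blue"])
print(counts)
```

Because a word is defined as any run of non-space characters, the token from the sample line is "red," with its comma attached.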
Example: Word count frequency
Input: Output of word count
Desired output: For any number of
occurrences c, the number of different words
that occur c times.
E.g. Input: Roses are red, violets are blue
Output: 1, 4
2, 1
Example: Word count frequency
● map: (w, c) → (c, 1)
● reduce: (c, [1, 1, …, 1]) → (c, n), where the list holds n 1’s
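A minimal sketch of this second job, taking the word counts for the sample line as its input. The helper names are hypothetical, and the grouping dictionary stands in for the shuffle stage.

```python
from collections import defaultdict


def frequency_map(word, count):
    # The word itself is dropped; only its count matters as the new key.
    return [(count, 1)]


def frequency_reduce(count, ones):
    # Number of different words that occur `count` times.
    return (count, sum(ones))


def count_frequencies(word_counts):
    groups = defaultdict(list)  # stand-in for the shuffle stage
    for word, count in word_counts:
        for key, value in frequency_map(word, count):
            groups[key].append(value)
    return [frequency_reduce(c, ones) for c, ones in sorted(groups.items())]


# Word counts for "Roses are red, violets are blue".
word_counts = [("Roses", 1), ("are", 2), ("blue", 1), ("red,", 1), ("violets", 1)]
frequencies = count_frequencies(word_counts)
print(frequencies)
```

The output of one MapReduce job is simply the input of the next, which is how longer computations are chained together.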
Example: Page Rank
● Famous algorithm used by Google to rank
pages. (Comes down to matrix-vector
multiplication, as we’ll see…)
● Based on two ideas:
o Importance of a page depends upon how many
pages link to it.
o However, if a page has lots of links going out, the
value of each link is reduced.
Example: Page Rank
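With those two ideas, calculate the rank of a page p:
\[ \mathrm{pagerank}(p) = \sum_{q \to p} \frac{\mathrm{pagerank}(q)}{\text{out-degree}(q)} \]
where the sum is over all pages q that link to p.
Note: Because the web has cycles - page p can
have a link to page q, which has a link to p -
this formula requires an iterative solution.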
Example: Page Rank
Consider pages and their links as a graph
(page A has links to B, C, and D, etc.):
pr(A) = pr(B)/2 + pr(D)/2
pr(B) = pr(A)/3 + pr(D)/2
pr(C) = pr(A)/3 + pr(B)/2
pr(D) = pr(A)/3 + pr(C)
Example: Page Rank
● Represent the graph as a weighted
adjacency matrix:
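\[
M = \begin{pmatrix}
0 & 1/2 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0 \\
1/3 & 0 & 1 & 0
\end{pmatrix}
\]
(rows: the page a link goes to; columns: the page a link comes from; pages in the order A, B, C, D)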
Example: Page Rank
● Now, if we put the page rank of each page in
a vector v, then multiplying M by v calculates
the pagerank formula for all nodes:
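\[
\begin{pmatrix}
0 & 1/2 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0 \\
1/3 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
\mathrm{pr}(A) \\ \mathrm{pr}(B) \\ \mathrm{pr}(C) \\ \mathrm{pr}(D)
\end{pmatrix}
=
\begin{pmatrix}
\mathrm{pr}(B)/2 + \mathrm{pr}(D)/2 \\
\mathrm{pr}(A)/3 + \mathrm{pr}(D)/2 \\
\mathrm{pr}(A)/3 + \mathrm{pr}(B)/2 \\
\mathrm{pr}(A)/3 + \mathrm{pr}(C)
\end{pmatrix}
\]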
Example: Page Rank
● So, to calculate page ranks, start with an
initial guess of all page ranks and multiply.
● After one multiplication:
[Figure: M multiplied by the initial guess gives the page ranks after one multiplication]
Example: Page Rank
● After two multiplications:
[Figure: M multiplied by the previous result gives the page ranks after two multiplications]
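The repeated multiplication can be sketched with exact fractions. The matrix is the one from the four-page example; the uniform starting vector (1/4 for every page) is an assumption made for this sketch, and `multiply` is a hypothetical helper.

```python
from fractions import Fraction as F

# Weighted adjacency matrix M for pages A, B, C, D (rows: links to).
M = [
    [F(0), F(1, 2), F(0), F(1, 2)],
    [F(1, 3), F(0), F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
    [F(1, 3), F(0), F(1), F(0)],
]


def multiply(matrix, vector):
    # Plain matrix-vector product: one dot product per row.
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


v0 = [F(1, 4)] * 4       # assumed initial guess: equal rank for every page
v1 = multiply(M, v0)     # after one multiplication
v2 = multiply(M, v1)     # after two multiplications
print([str(x) for x in v1])
print([str(x) for x in v2])
```

Each column of M sums to 1, so the ranks keep summing to 1 after every multiplication.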
Example: Page Rank
● Thus, page rank = matrix-vector product.
● Can we express matrix-vector multiplication
as a MapReduce?
o Assume v is copied (magically) to each node.
o M, being much bigger, needs to be partitioned, i.e. M
is the main input file.
o How shall we represent M and define map and reduce?
Example: Page Rank
● A solution:
o Represent M using one record for each link:
(p, q, out-degree(p)) for every link p→q.
o map: (p, q, d) ↦ (q, v[p]/d)
o reduce: (q, [c1, c2, …]) ↦ (q, c1 + c2 + …)
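One multiplication step in this link-record form might look as follows. The link list encodes the four-page example graph; the uniform vector v is an assumed starting guess, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# One record (p, q, out-degree(p)) for every link p -> q in the example graph.
links = [
    ("A", "B", 3), ("A", "C", 3), ("A", "D", 3),
    ("B", "A", 2), ("B", "C", 2),
    ("C", "D", 1),
    ("D", "A", 2), ("D", "B", 2),
]

# Current rank vector, available to every mapper (assumed uniform here).
v = {page: Fraction(1, 4) for page in "ABCD"}


def rank_map(record):
    # Page p passes v[p] / d to each page q that it links to.
    p, q, d = record
    return [(q, v[p] / d)]


def rank_reduce(q, contributions):
    # The new rank of q is the sum of the contributions it received.
    return (q, sum(contributions))


def multiply_step(records):
    groups = defaultdict(list)  # stand-in for the shuffle stage
    for record in records:
        for q, c in rank_map(record):
            groups[q].append(c)
    return dict(rank_reduce(q, cs) for q, cs in sorted(groups.items()))


new_v = multiply_step(links)
print({page: str(rank) for page, rank in new_v.items()})
```

A page with no incoming links produces no key at all in this sketch, so a fuller version would have to give such pages a rank of 0 explicitly.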
MapReduce: Summary
● Nowadays, MapReduce powers the internet:
o Google, Amazon, and Facebook use it extensively for
everything from page ranking to error log analysis.
o NIH uses it to analyze gene sequences.
o NASA uses it to analyze data from probes.
o etc., etc.
● Next question: How can we implement a
Writing map and reduce in Python
● Easy using the streaming interface:
o map and reduce : stdin → stdout. Each should
iterate over stdin and output result for each line.
o Inputs and outputs are text files. In map and reduce
output, tab character separates key from value.
o Shuffle just sorts the files on the key.
▪ Instead of a line with a key and list of values, we
get consecutive lines with the same key.
Example: stock prices
● Recall the output of the shuffle stage:
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
● The only difference is this becomes:
Goog 4.3%
Goog 16.6%
IBM 12.5%
MS 3.7%
MS 4%
Example: stock prices
● On the next two slides, we show the map
and reduce functions in Python.
● Both of them are just stand-alone programs
that read stdin and write stdout.
● In fact, we can test our pipeline without using Hadoop:
cat input-file | ./ | sort |
Example: stock prices -
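#!/usr/bin/env python
import sys

# Mapper: read "stock, open, close" lines from stdin and write
# "stock<TAB>fractional increase" for every day the price went up.
for line in sys.stdin:
    record = line.strip().split(",")
    if len(record) < 3:
        continue  # skip blank or malformed lines
    opening = float(record[1])
    closing = float(record[2])
    if closing > opening:
        change = (closing - opening) / opening
        print("%s\t%s" % (record[0].strip(), change))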
Example: stock prices -
#!/usr/bin/env python
import sys

# Reducer: input is sorted by stock, so all lines for one stock are consecutive.
stock = None
max_increase = 0.0
for line in sys.stdin:
    next_stock, increase = line.rstrip("\n").split("\t")
    increase = float(increase)
    if next_stock == stock:  # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:  # new stock; output result for previous stock
        if stock is not None:  # only false on the very first line of input
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase
# print the result for the last stock (if there was any input)
if stock is not None:
    print("%s\t%f" % (stock, max_increase))
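The whole map, sort, reduce pipeline can also be tried in memory before involving Hadoop at all. The sketch below is a variant, not the scripts from the slides: `mapper`, `reducer`, and `key_of` are hypothetical functions that take an iterable of lines, and the reducer uses `itertools.groupby` instead of tracking the previous key by hand.

```python
from itertools import groupby


def mapper(lines):
    # "stock, open, close" -> "stock<TAB>fractional increase" (increases only)
    for line in lines:
        stock, opening, closing = [field.strip() for field in line.split(",")]
        opening, closing = float(opening), float(closing)
        if closing > opening:
            yield "%s\t%s" % (stock, (closing - opening) / opening)


def key_of(line):
    # The key is everything before the first tab.
    return line.split("\t", 1)[0]


def reducer(sorted_lines):
    # Lines with the same key are consecutive, so groupby sees each stock once.
    for stock, group in groupby(sorted_lines, key=key_of):
        increases = [float(line.split("\t", 1)[1]) for line in group]
        yield "%s\t%f" % (stock, max(increases))


input_lines = [
    "Goog, 230, 240", "Apple, 100, 98", "MS, 300, 250",
    "MS, 250, 260", "MS, 270, 280", "Goog, 220, 215",
    "Goog, 300, 350", "IBM, 80, 90", "IBM, 90, 85",
]
# In-memory equivalent of: map, then sort on the key, then reduce.
output = list(reducer(sorted(mapper(input_lines), key=key_of)))
print("\n".join(output))
```

In a real streaming script the same functions would be fed `sys.stdin` and their results printed line by line.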
Invoking Hadoop
● Now we just have to run Hadoop. (Here we
are running locally. To run in a cluster, you
need to move the data into HDFS first.)
If you want to run code on our servers, I’ll
give instructions at the end of the talk.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt -output output \
    -mapper -reducer
Brief history of Hadoop
● 2004: Two engineers from Google published
a paper on MapReduce
o Doug Cutting was working on an open-source web
crawler; saw that MapReduce solved his biggest
problem: coordinating lots of computers; decided to
implement an open-source version of MR.
o Yahoo hired Cutting and continued and expanded
the Hadoop project.
Brief history of Hadoop (cont.)
● Today: Hadoop includes its own scheduler,
lock mechanism, many database systems,
MapReduce, a non-MapReduce parallelism
system called Spark, and more.
● Demand for “data engineers” who can
manage huge datasets using Hadoop keeps growing.
Summary
● We discussed the easiest way (that I know)
to use Hadoop to process large datasets.
● Hadoop provides MapReduce, which can
exploit massive parallelism by automatically
breaking up inputs and processing the
pieces separately, as long as the user
supplies map and reduce functions.
Summary (cont.)
● Your problem as a programmer is to figure
out how to write map and reduce functions
that will solve your problem. This is
sometimes really easy.
● Using Python streaming, map and reduce
are just Python scripts that read from stdin
and write to stdout - no need to learn special
Hadoop APIs or anything!
So is that all there is to MapReduce?
● If only! For more complex cases and for
higher efficiency:
o Use Java for higher efficiency
o Store data in the cluster, for capacity, reliability, and
o Tune your application for higher efficiency, e.g.
placing computations near data
o Use some of many Hadoop components that can
make programs easier to write and more efficient
Next steps
● If you want to learn more, there are many books and
online tutorials.
o Hadoop: The Definitive Guide, by Tom White, is the
definitive guide. (You’ll need to know Java.)
● We’ll be giving a five-Saturday lecture/lab class
expanding on this meet-up starting this Saturday, and a
twelve-evening class starting August 3.
● We’ll be giving a six-week, full-time bootcamp on
Hadoop+Python starting in late August.
Running examples
● For those of you who want to run examples:
o Log in to the server per the given instructions
o Directory streaming-examples has code for stock
prices, wordcount, and word frequencies.
o In each directory, enter: source
o Output in output/part-00000 should match file
o If you want to edit and re-run, you need to delete
output directories: rm -r output (and rm -r output0 in
Running examples (cont.)
● Please let us know if you want to continue
working on this tomorrow; we’ll leave the
accounts live until Friday if you request it.
● Some suggestions:
o Word count variants
▪ Ignore case
▪ Ignore punctuation
▪ Find number of words of each length
▪ Create sorted list of words of each length
Running examples (cont.)
● Some suggestions:
o Stock prices
▪ Produce both max and min increases
o Matrix-vector multiplication - you’ll be starting from
scratch on this one.
▪ Implement the method we described.
▪ Suppose the input is in the form p, q1, q2, …, qn,
i.e. a page and all of its outgoing links.
● Obvious source of inefficiency in wordcount:
Suppose a word occurs twice on one line;
we should output one line of ‘w, 2’ instead of
two lines of ‘w, 1’.
● In fact, this applies to the entire file: Instead
of ‘w, 1’ for each occurrence of a word,
output ‘w, n’ if w occurs n times.
● Or, to put this differently: We should apply
reduce to each file before the shuffle stage.
● Can do this by specifying a combiner
function (which in this case is just reduce).
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt \
    -output output \
    -reducer -combiner
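The effect of a combiner can be sketched in memory: summing each input split locally before the shuffle leaves the final counts unchanged while reducing the number of pairs that have to be moved. The two input splits and the helper names below are made up for illustration.

```python
from collections import Counter


def word_map(line):
    # Emit (word, 1) for each sequence of non-space characters.
    return [(word, 1) for word in line.split()]


def sum_by_key(pairs):
    # Used both as the combiner (per input split) and as the reducer.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return sorted(totals.items())


# Two hypothetical input splits, each handled by its own mapper.
splits = [
    ["roses are red", "violets are blue"],
    ["sugar is sweet", "and so are you"],
]

mapped = [[pair for line in split for pair in word_map(line)] for split in splits]

# Without a combiner, every (word, 1) pair goes through the shuffle.
shuffled_plain = [pair for pairs in mapped for pair in pairs]
# With a combiner, each split is summed locally first.
shuffled_combined = [pair for pairs in mapped for pair in sum_by_key(pairs)]

result_plain = sum_by_key(shuffled_plain)
result_combined = sum_by_key(shuffled_combined)
print(len(shuffled_plain), len(shuffled_combined))
print(result_plain == result_combined)
```

Reusing reduce as the combiner works here because addition is associative and commutative; a reduce that computed an average, for example, could not be reused this way unchanged.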

Data Science Academy Student Demo day--Chang Wang, dogs breeds in nyc
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Richard Sheng, kinvolved school attend...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
Data Science Academy Student Demo day--Shelby Ahern, An Exploration of Non-Mi...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 laila restaurant sanitation report(NYC Data Science Academy, Data Scienc...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...
R003 jiten south park episode popularity analysis(NYC Data Science Academy, D...

Kürzlich hochgeladen

Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1GloryAnnCastre1
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George

Kürzlich hochgeladen (20)

Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database

Streaming Python on Hadoop

  • 2. Meet-up: Tackling “Big Data” with Hadoop and Python
      Sam Kamin, VP Data Engineering, NYC Data Science Academy
  • 3. NYC Data Science Academy
      ● We’re a company that does training and consulting in the Data Science area.
      ● I’m Sam Kamin. I just joined NYCDSA as VP of Data Engineering (a new area for us). I was formerly a professor at the U. of Illinois (CS) and a Software Engineer at Google.
  • 4. What this meet-up is about
      ● Wikipedia: “Data Science is the extraction of knowledge from large volumes of data.”
      ● My goal tonight: Show you how you can handle large volumes of data with simple Python programming, using the Hadoop streaming interface.
  • 5. Outline of talk
      ● Brief overview of Hadoop
      ● Introduction to parallelism via MapReduce
      ● Examples of applying MapReduce
      ● Implementing MapReduce in Python
      You can do some programming at the end if you want!
  • 6. Big Data: What’s the problem?
      Too much data!
      o Web contains about 5 billion web pages. According to Wikipedia, its total size in bytes is about 4 zettabytes - that’s 4 x 10^21 bytes, or four thousand billion gigabytes.
      o Google’s datacenters store about 15 exabytes (15 x 10^18 bytes).
  • 7. Big Data: What’s the solution?
      ● Parallel computing: Use multiple, cooperating computers.
  • 8. Parallelism
      ● Parallelism = dividing up a problem so that multiple computers can all work on it:
          o Break the data into pieces.
          o Send the pieces to different computers for processing.
          o Send the results back and process the combination to get the final result.
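The three steps above can be sketched on a single machine. This is only an illustration, assuming a thread pool stands in for the separate computers and that the per-piece work is a simple sum; `chunk`, `process_piece`, and `parallel_sum` are hypothetical names, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_pieces):
    """Break the data into at most n_pieces roughly equal slices."""
    size = max(1, -(-len(data) // n_pieces))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_piece(piece):
    """Work done independently on each piece (here: a partial sum)."""
    return sum(piece)


def parallel_sum(data, n_workers=4):
    pieces = chunk(data, n_workers)
    # Threads stand in for the cluster's computers in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(process_piece, pieces))
    # Combine the partial results to get the final answer.
    return sum(partial_results)
```

In CPython, threads will not speed up CPU-bound work like this; the point is only the shape of the computation: split, process independently, combine.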
  • 9. Cloud computing
      ● Amazon, Google, Microsoft, and many other companies operate huge clusters: Racks of (basically) off-the-shelf computers with (basically) standard network connections.
      ● The computers in these clusters run Linux - use them like any other computer...
  • 10. Cloud computing
      ● But: getting them to work together is really hard:
          o Management: machine/disk failure; efficient data placement; debugging, monitoring, logging, auditing.
          o Algorithms: decomposing your problem so it can be solved in parallel can be hard.
      That’s what Hadoop is here to help with.
  • 11. Hadoop
      ● A collection of services in a cluster:
          o Distributed, reliable file system (HDFS)
          o Scheduler to run jobs in correct order, monitor, restart on failure, etc.
          o MapReduce to help you decompose your problem for parallel execution
          o A variety of other components (mostly based on MapReduce), e.g. databases, application-focused libraries
  • 12. How to use Hadoop
      ● Hadoop is open source (free!)
      ● It is hosted by Apache.
      ● Download it and run it standalone (for debugging)
      ● Buy a cluster or rent time on one, e.g. AWS, GCE, Azure. (All offer some free time for new users.)
  • 13. MapReduce
      ● The main, and original, parallel-processing system of Hadoop.
      ● Developed by Google to simplify parallel processing. Hadoop started as an open-source implementation of Google’s idea.
      ● With Hadoop’s streaming interface, it’s really easy to use MapReduce in Python.
  • 14. MapReduce - The Big Idea
      ● Calculations on large data sets often have this form: Start by aggregating the data (possibly in a different order from the “natural order”), then perform a summarizing calculation on the aggregated groups.
      ● The idea of MapReduce: If your calculation is explicitly structured like this, it can be automatically parallelized.
  • 15. Computing with MapReduce
      A MapReduce computation has three stages:
      1. Map: A function called map is applied to each record in your input. It produces zero or more records as output, each with a key and value. Keys may be repeated.
      2. Shuffle: The output from step 1 is sorted and combined: All records with the same key are combined into one.
      3. Reduce: A function called reduce is applied to each record (key + values) from step 2 to produce the final output.
      As the programmer, you only write map and reduce.
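To make the three stages concrete, here is a minimal in-memory sketch. It is not how Hadoop implements them; `map_reduce` is a hypothetical helper, and it assumes `map_fn` returns an iterable of (key, value) pairs and `reduce_fn` takes a key and its list of values.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]

    # Shuffle: collect all values that share a key into one list.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: one call per key, visiting keys in sorted order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Example input: records that are already (key, value) pairs.
RECORDS = [("A", 7), ("C", 5), ("B", 23), ("B", 12), ("A", 18)]
```

For instance, `map_reduce(RECORDS, lambda r: [r], lambda k, vs: (k, sum(vs)))` passes each record through unchanged in the map stage and sums the values per key in the reduce stage.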
  • 16. Computing with MapReduce
      [Diagram] Input → map → (A, 7), (C, 5), (B, 23), (B, 12), (A, 18) → shuffle → A, [18, 7]; B, [23, 12]; C, [5] → reduce → Output
      Note: map is record-oriented, meaning the output of the map stage is strictly a combination of the outputs from each record. That allows us to calculate in parallel...
  • 17. Parallelism via MapReduce
      Because map and reduce are record-oriented, MR can divide inputs into arbitrary chunks:
      [Diagram] Input → distribute data → map, map, map → combine/shuffle → A, [18, 7]; B, [23, 12]; C, [5] → distribute data → reduce, reduce, reduce → Output
  • 18. MapReduce example: Stock prices
      ● Input: list of daily opening and closing prices for thousands of stocks over thousands of days.
      ● Desired output: The biggest-ever one-day percentage price increase for each stock.
      ● Solution using MR:
          o map: (stock, open, close) => (stock, (close - open) / open) (if positive)
          o reduce: (stock, [%c0, %c1, …]) => (stock, max [%c0, %c1, …]).
  • 19. MapReduce example - map
      Input (stock, open, close):
          Goog, 230, 240
          Apple, 100, 98
          MS, 300, 250
          MS, 250, 260
          MS, 270, 280
          Goog, 220, 215
          Goog, 300, 350
          IBM, 80, 90
          IBM, 90, 85
      Output of map:
          Goog, 4.3%
          MS, 4%
          MS, 3.7%
          Goog, 16.6%
          IBM, 12.5%
      You supply map: Output stock with % increase, or nothing if decrease.
  • 20. MapReduce example - shuffle/sort
      Input to shuffle/sort:
          Goog, 4.3%
          MS, 4%
          MS, 3.7%
          Goog, 16.6%
          IBM, 12.5%
      Output of shuffle/sort:
          Goog, [4.3%, 16.6%]
          IBM, [12.5%]
          MS, [3.7%, 4%]
      MapReduce supplies shuffle/sort: Combine all records for each stock.
  • 21. MapReduce example - reduce
      Input to reduce:
          Goog, [4.3%, 16.6%]
          IBM, [12.5%]
          MS, [3.7%, 4%]
      Output of reduce:
          Goog, 16.6%
          IBM, 12.5%
          MS, 4%
      You supply reduce: Output max of percentages for each input record.
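The stock-price walk-through above can be reproduced in plain Python. This is a sketch under the assumption that records are (stock, open, close) tuples held in memory; `stock_map`, `stock_reduce`, and `biggest_increases` are illustrative names, and increases are kept as fractions (0.125) rather than percentages (12.5%).

```python
from collections import defaultdict

# The nine input records from the map slide.
PRICES = [
    ("Goog", 230, 240), ("Apple", 100, 98), ("MS", 300, 250),
    ("MS", 250, 260), ("MS", 270, 280), ("Goog", 220, 215),
    ("Goog", 300, 350), ("IBM", 80, 90), ("IBM", 90, 85),
]


def stock_map(record):
    """Emit (stock, fractional increase), or nothing on a decrease."""
    stock, opening, closing = record
    if closing > opening:
        yield stock, (closing - opening) / opening


def stock_reduce(stock, increases):
    """Keep the largest increase seen for this stock."""
    return stock, max(increases)


def biggest_increases(records):
    groups = defaultdict(list)  # stands in for the shuffle/sort stage
    for record in records:
        for stock, increase in stock_map(record):
            groups[stock].append(increase)
    return dict(stock_reduce(s, groups[s]) for s in sorted(groups))
```

Apple never rises in this data set, so the map stage emits nothing for it and it is absent from the result, just as on the slides.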
  • 22. Wait, why did that help?
      I could have just written a loop to read every line and put the percentages in a table!
      ● Suppose you have a terabyte of data, and 1000 computers in your cluster.
      ● MapReduce can automatically split the data into 1000 1GB chunks. You write two simple functions and get up to a 1000x speed-up!
  • 23. Modelling problems using MR
      ● We’re going to look at a variety of problems and see how we can fit them into the MR structure.
      ● The question for each problem is: What are the types of map and reduce, and what do they do?
  • 24. Example: Word count
      Input: Lines of text.
      Desired output: # of occurrences of each word (i.e. each sequence of non-space chars)
      E.g. Input: Roses are red, violets are blue
      Output:
          are, 2
          blue, 1
          red, 1
          etc.
  • 25. Example: Word count
      Solution:
      ● map: “w1 w2 … wk” → (w1, 1), (w2, 1), ..., (wk, 1)
      ● reduce: (w, [1, 1, …]) → (w, n), where the list contains n 1’s
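A minimal sketch of this word count, assuming the shuffle can be imitated by sorting the pairs and grouping consecutive equal keys; `wc_map`, `wc_reduce`, and `word_count` are hypothetical names. Because a word is any sequence of non-space characters, punctuation stays attached ("red," is counted as its own word).

```python
from itertools import groupby
from operator import itemgetter


def wc_map(line):
    """Emit (word, 1) for every whitespace-separated word on the line."""
    for word in line.split():
        yield word, 1


def wc_reduce(word, ones):
    """Sum the 1's collected for a word."""
    return word, sum(ones)


def word_count(lines):
    # Sorting puts equal words next to each other, like the shuffle stage.
    pairs = sorted(kv for line in lines for kv in wc_map(line))
    return [
        wc_reduce(word, [one for _, one in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    ]
```

The sort is case-sensitive, so capitalized words such as "Roses" come before lowercase ones in the result.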
  • 26. Example: Word count frequency
      Input: Output of word count
      Desired output: For any number of occurrences c, the number of different words that occur c times.
      E.g. Input: Roses are red, violets are blue
      Output:
          1, 4
          2, 1
  • 27. Example: Word count frequency
      Solution:
      ● map: (w, c) → (c, 1)
      ● reduce: (c, [1, 1, …]) → (c, n), where the list contains n 1’s
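The same pattern applied to the output of word count, as a sketch: the input is assumed to be a list of (word, count) pairs, and `freq_map`, `freq_reduce`, and `count_frequencies` are illustrative names.

```python
def freq_map(word, count):
    """The word itself is dropped; only its count matters as the key."""
    yield count, 1


def freq_reduce(count, ones):
    """Number of distinct words that occurred `count` times."""
    return count, sum(ones)


def count_frequencies(word_counts):
    grouped = {}  # stands in for the shuffle stage
    for word, count in word_counts:
        for key, one in freq_map(word, count):
            grouped.setdefault(key, []).append(one)
    return [freq_reduce(c, grouped[c]) for c in sorted(grouped)]


# Word counts for "Roses are red, violets are blue", punctuation ignored.
WORD_COUNTS = [("Roses", 1), ("are", 2), ("blue", 1), ("red", 1), ("violets", 1)]
```

Calling `count_frequencies(WORD_COUNTS)` gives the pairs shown on the slide: four words occur once and one word occurs twice.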
  • 28. Example: Page Rank
      ● Famous algorithm used by Google to rank pages. (Comes down to matrix-vector multiplication, as we’ll see…)
      ● Based on two ideas:
          o Importance of a page depends upon how many pages link to it.
          o However, if a page has lots of links going out, the value of each link is reduced.
  • 29. Example: Page Rank
      With those two ideas, calculate rank of page:
      $$\mathrm{pagerank}(p) = \sum_{q \to p} \frac{\mathrm{pagerank}(q)}{\text{out-degree}(q)}$$
      Note: Because the web has cycles - page p can have a link to page q, which has a link to p - this formula requires an iterative solution.
  • 30. Example: Page Rank
      Consider pages and their links as a graph (page A has links to B, C, and D, etc.):
          pr(A) = pr(B)/2 + pr(D)/2
          pr(B) = pr(A)/3 + pr(D)/2
          pr(C) = pr(A)/3 + pr(B)/2
          pr(D) = pr(A)/3 + pr(C)
  • 31. Example: Page Rank
      ● Represent the graph as a weighted adjacency matrix (rows: links to A, B, C, D; columns: links from A, B, C, D):
      $$M = \begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \\ 1/3 & 0 & 1 & 0 \end{pmatrix}$$
  • 32. Example: Page Rank
      ● Now, if we put the page rank of each page in a vector v, then multiplying M by v calculates the pagerank formula for all nodes:
      $$\begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \\ 1/3 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} pr(A) \\ pr(B) \\ pr(C) \\ pr(D) \end{pmatrix} = \begin{pmatrix} pr(B)/2 + pr(D)/2 \\ pr(A)/3 + pr(D)/2 \\ pr(A)/3 + pr(B)/2 \\ pr(A)/3 + pr(C) \end{pmatrix}$$
  • 33. Example: Page Rank
      ● So, to calculate page ranks, start with an initial guess of all page ranks and multiply.
      ● After one multiplication:
      $$M \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix} = \begin{pmatrix} 1/4 \\ 5/24 \\ 5/24 \\ 1/3 \end{pmatrix}$$
  • 34. Example: Page Rank
      ● After two multiplications:
      $$M \begin{pmatrix} 1/4 \\ 5/24 \\ 5/24 \\ 1/3 \end{pmatrix} = \begin{pmatrix} .27 \\ .25 \\ .188 \\ .29 \end{pmatrix}$$
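The two multiplications above can be checked with exact fractions. This is a small sketch using the matrix M from the slides; `multiply` and `iterate` are hypothetical helper names.

```python
from fractions import Fraction as F

# Weighted adjacency matrix M: rows are "links to", columns are "links from".
M = [
    [F(0), F(1, 2), F(0), F(1, 2)],
    [F(1, 3), F(0), F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
    [F(1, 3), F(0), F(1), F(0)],
]


def multiply(matrix, vector):
    """Matrix-vector product: one dot product per row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def iterate(matrix, vector, steps):
    """Apply the matrix to the vector `steps` times."""
    for _ in range(steps):
        vector = multiply(matrix, vector)
    return vector


INITIAL_GUESS = [F(1, 4)] * 4  # every page starts with rank 1/4
```

With this setup, `iterate(M, INITIAL_GUESS, 1)` gives 1/4, 5/24, 5/24, 1/3, and `iterate(M, INITIAL_GUESS, 2)` gives 13/48, 1/4, 3/16, 7/24, which round to .27, .25, .188, and .29.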
  • 35. Example: Page Rank
      ● Thus, page rank = matrix-vector product.
      ● Can we express matrix-vector multiplication as a MapReduce?
          o Assume v is copied (magically) to each node.
          o M, being much bigger, needs to be partitioned, i.e. M is the main input file.
          o How shall we represent M and define map and reduce?
  • 36. Example: Page Rank
      ● A solution:
          o Represent M using one record for each link: (p, q, out-degree(p)) for every link p→q.
          o map: (p, q, d) ↦ (q, v[p]/d)
          o reduce: (p, [c1, c2, …]) ↦ (p, c1+c2+...)
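One multiplication expressed with this map and reduce, as a sketch: the link records are written out by hand from the example graph, v is assumed to be a dictionary available to every map call, and `pr_map`, `pr_reduce`, and `pagerank_step` are illustrative names.

```python
from collections import defaultdict
from fractions import Fraction as F

# One record (p, q, out-degree(p)) per link p -> q in the example graph.
LINKS = [
    ("A", "B", 3), ("A", "C", 3), ("A", "D", 3),
    ("B", "A", 2), ("B", "C", 2),
    ("C", "D", 1),
    ("D", "A", 2), ("D", "B", 2),
]


def pr_map(record, v):
    """Page p passes an equal share of its rank to each page q it links to."""
    p, q, d = record
    yield q, v[p] / d


def pr_reduce(page, contributions):
    """A page's new rank is the sum of the shares it received."""
    return page, sum(contributions)


def pagerank_step(links, v):
    groups = defaultdict(list)  # stands in for the shuffle stage
    for record in links:
        for q, share in pr_map(record, v):
            groups[q].append(share)
    return dict(pr_reduce(q, groups[q]) for q in sorted(groups))


V0 = {page: F(1, 4) for page in "ABCD"}
```

`pagerank_step(LINKS, V0)` reproduces the result of the first multiplication. A page with no incoming links would receive no records and so would be missing from the output; this sketch does not handle that case.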
  • 37. MapReduce: Summary
      ● Nowadays, MapReduce powers the internet:
          o Google, Amazon, and Facebook use it extensively for everything from page ranking to error log analysis.
          o NIH uses it to analyze gene sequences.
          o NASA uses it to analyze data from probes.
          o etc., etc.
      ● Next question: How can we implement a MapReduce?
  • 38. Writing map and reduce in Python
      ● Easy using the streaming interface:
          o map and reduce : stdin → stdout. Each should iterate over stdin and output result for each line.
          o Inputs and outputs are text files. In map and reduce output, tab character separates key from value.
          o Shuffle just sorts the files on the key. Instead of a line with a key and list of values, we get consecutive lines with the same key.
  • 39. Example: stock prices
      ● Recall the output of the shuffle stage:
          Goog, [4.3%, 16.6%]
          IBM, [12.5%]
          MS, [3.7%, 4%]
      ● The only difference is this becomes (key and value separated by a tab):
          Goog	4.3%
          Goog	16.6%
          IBM	12.5%
          MS	3.7%
          MS	4%
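Because the sorted input arrives as consecutive lines with the same key, a streaming reducer can rebuild the groups itself. One possible way to do that, sketched here with `itertools.groupby`; `parse` and `reduce_sorted_lines` are hypothetical names, and the function takes any iterable of lines (such as `sys.stdin`).

```python
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, value


def reduce_sorted_lines(lines, reduce_fn):
    """Group consecutive lines by key and apply reduce_fn to each group's values."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        values = [value for _, value in group]
        yield "%s\t%s" % (key, reduce_fn(values))


def max_as_float(values):
    """Example reduce function: largest value, parsed as a float."""
    return max(float(v) for v in values)


SORTED_LINES = ["Goog\t4.3\n", "Goog\t16.6\n", "IBM\t12.5\n", "MS\t3.7\n", "MS\t4\n"]
```

This only works when the input really is sorted by key, since `groupby` groups consecutive items only.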
  • 40. Example: stock prices
      ● On the next two slides, we show the map and reduce functions in Python.
      ● Both of them are just stand-alone programs that read stdin and write stdout.
      ● In fact, we can test our pipeline without using MapReduce:
          cat input-file | ./ | sort | ./
  • 41. Example: stock prices -
      #!/usr/bin/env python
      import sys

      for line in sys.stdin:
          record = line.split(",")  # fields: stock, open, close
          opening = int(record[1])
          closing = int(record[2])
          if closing > opening:  # emit only price increases
              change = float(closing - opening) / opening
              # a tab separates the key (stock) from the value (increase)
              print('%s\t%s' % (record[0], change))
  • 42. Example: stock prices -
      #!/usr/bin/env python
      import sys

      stock = None
      max_increase = 0
      for line in sys.stdin:
          next_stock, increase = line.split('\t')
          increase = float(increase)
          if next_stock == stock:  # another line for the same stock
              if increase > max_increase:
                  max_increase = increase
          else:  # new stock; output result for previous stock
              if stock:  # only false on the very first line of input
                  print("%s\t%f" % (stock, max_increase))
              stock = next_stock
              max_increase = increase
      # print the last stock (nothing to print if the input was empty)
      if stock:
          print("%s\t%f" % (stock, max_increase))
  • 43. Invoking Hadoop
      ● Now we just have to run Hadoop. (Here we are running locally. To run in a cluster, you need to move the data into HDFS first.)
          hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input input.txt -output output -mapper -reducer
      If you want to run code on our servers, I’ll give instructions at the end of the talk.
  • 44. Brief history of Hadoop
      ● 2004: Two engineers from Google published a paper on MapReduce.
          o Doug Cutting was working on an open-source web crawler; saw that MapReduce solved his biggest problem: coordinating lots of computers; decided to implement an open-source version of MR.
          o Yahoo hired Cutting and continued and expanded the Hadoop project.
  • 45. Brief history of Hadoop (cont.)
      ● Today: Hadoop includes its own scheduler, lock mechanism, many database systems, MapReduce, a non-MapReduce parallelism system called Spark, and more.
      ● Demand for “data engineers” who can manage huge datasets using Hadoop keeps increasing.
  • 46. Summary
      ● We discussed the easiest way (that I know) to use Hadoop to process large datasets.
      ● Hadoop provides MapReduce, which can exploit massive parallelism by automatically breaking up inputs and processing the pieces separately, as long as the user supplies map and reduce functions.
  • 47. Summary (cont.)
      ● Your problem as a programmer is to figure out how to write map and reduce functions that will solve your problem. This is sometimes really easy.
      ● Using Python streaming, map and reduce are just Python scripts that read from stdin and write to stdout - no need to learn special Hadoop APIs or anything!
  • 48. So is that all there is to MapReduce?
      ● If only! For more complex cases and for higher efficiency:
          o Use Java for higher efficiency
          o Store data in the cluster, for capacity, reliability, and efficiency
          o Tune your application for higher efficiency, e.g. placing computations near data
          o Use some of the many Hadoop components that can make programs easier to write and more efficient
  • 49. Next steps
      ● If you want to learn more, there are many books and online tutorials.
          o Hadoop: The Definitive Guide, by Tom White, is the definitive guide. (You’ll need to know Java.)
      ● We’ll be giving a five-Saturday lecture/lab class expanding on this meet-up starting this Saturday, and a twelve-evening class starting August 3.
      ● We’ll be giving a six-week, full-time bootcamp on Hadoop+Python starting in late August.
  • 50. Running examples
      ● For those of you who want to run examples:
          o Login to server per given instructions
          o Directory streaming-examples has code for stock prices, wordcount, and word frequencies.
          o In each directory, enter: source
          o Output in output/part-00000 should match file expected-output.
          o If you want to edit and re-run, you need to delete output directories: rm -r output (and rm -r output0 in count-freq).
  • 51. Running examples (cont.)
      ● Please let us know if you want to continue working on this tomorrow; we’ll leave the accounts live until Friday if you request it.
      ● Some suggestions:
          o Word count variants
              - Ignore case
              - Ignore punctuation
              - Find number of words of each length
              - Create sorted list of words of each length
  • 52. Running examples (cont.)
      ● Some suggestions:
          o Stock prices
              - Produce both max and min increases
          o Matrix-vector multiplication - you’ll be starting from scratch on this one.
              - Implement the method we described.
              - Suppose the input is in the form p, q1, q2, …, qn, i.e. a page and all of its outgoing links.
  • 53. Combiners
      ● Obvious source of inefficiency in wordcount: Suppose a word occurs twice on one line; we should output one line of ‘w, 2’ instead of two lines of ‘w, 1’.
      ● In fact, this applies to the entire file: Instead of ‘w, 1’ for each occurrence of a word, output ‘w, n’ if w occurs n times.
  • 54. Combiners
      ● Or, to put this differently: We should apply reduce to each file before the shuffle stage.
      ● Can do this by specifying a combiner function (which in this case is just reduce).
          hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input input.txt -output output -mapper -reducer -combiner
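The effect of a combiner can be imitated in memory. This sketch assumes each chunk of lines plays the role of one input file; `map_chunk`, `combine`, and `reduce_all` are hypothetical names. It shows that combining each chunk locally shrinks what has to be shuffled without changing the final counts.

```python
from collections import Counter


def map_chunk(lines):
    """Plain word-count map: one (word, 1) pair per occurrence."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Local reduce on one chunk: one (word, n) pair per distinct word."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return sorted(local.items())


def reduce_all(chunks_of_pairs):
    """Final reduce over everything that was shuffled."""
    total = Counter()
    for pairs in chunks_of_pairs:
        for word, n in pairs:
            total[word] += n
    return dict(total)


CHUNKS = [["a b a", "a"], ["b b"]]
MAPPED = [map_chunk(chunk) for chunk in CHUNKS]
COMBINED = [combine(pairs) for pairs in MAPPED]
```

Here the map stage emits six pairs, while the combined output has only three, and both reduce to the same totals. Reusing reduce as the combiner works for word count because addition gives the same result however the 1's are grouped; that is not true of every reduce function.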