Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin, who will present an Introduction to Hadoop for Python Programmers as well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor and worked at Google until taking his current position as VP of Data Engineering at NYC Data Science Academy.
2. Meet-up: Tackling “Big
Data” with Hadoop and
Python
Sam Kamin, VP Data Engineering
NYC Data Science Academy
sam.kamin@nycdatascience.com
3. NYC Data Science Academy
● We’re a company that does training and
consulting in the Data Science area.
● I’m Sam Kamin. I just joined NYCDSA as
VP of Data Engineering (a new area for us).
I was formerly a professor at the U. of Illinois
(CS) and a Software Engineer at Google.
4. What this meet-up is about
● Wikipedia: “Data Science is the extraction of
knowledge from large volumes of data.”
● My goal tonight: Show you how you can
handle large volumes of data with simple
Python programming, using the Hadoop
streaming interface.
5. Outline of talk
● Brief overview of Hadoop
● Introduction to parallelism via MapReduce
● Examples of applying MapReduce
● Implementing MapReduce in Python
You can do some programming at the end if you want!
6. Big Data: What’s the problem?
Too much data!
o The web contains about 5 billion web pages. According
to Wikipedia, its total size in bytes is about 4
zettabytes - that’s 4 x 10^21, or four thousand billion
gigabytes.
o Google’s datacenters store about 15 exabytes (15 x
10^18 bytes).
7. Big Data: What’s the solution?
● Parallel computing: Use multiple,
cooperating computers.
8. Parallelism
● Parallelism = dividing up a problem so that
multiple computers can all work on it:
o Break the data into pieces
o Send the pieces to different computers for
processing.
o Send the results back and process the combination
to get the final result.
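The three steps above can be sketched on a single machine. This is only an illustration under stated assumptions: the problem (summing numbers) is a toy, threads from the standard library stand in for separate computers, and the names `split_into_chunks` and `parallel_sum` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Break the data into at most n_chunks pieces of roughly equal size."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, n_workers=4):
    """Sum a list by giving each worker one chunk, then combining."""
    chunks = split_into_chunks(data, n_workers)
    # Threads stand in for the "different computers" here (an assumption
    # made so the sketch runs on one machine).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Combine the partial results to get the final result.
    return sum(partial_sums)


print(parallel_sum(list(range(1, 101))))  # 5050
```

On one machine this gives no real speed-up; the point is only the shape of the computation: split, process the pieces independently, combine.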
9. Cloud computing
● Amazon, Google, Microsoft, and many other
companies operate huge clusters: Racks of
(basically) off-the-shelf computers with
(basically) standard network connections.
● The computers in these clusters run Linux -
use them like any other computer...
10. Cloud computing
● But: getting them to work together is really
hard:
o Management: machine/disk failure; efficient data
placement; debugging, monitoring, logging, auditing.
o Algorithms: decomposing your problem so it can be
solved in parallel can be hard.
That’s what Hadoop is here to help with.
11. Hadoop
● A collection of services in a cluster:
o Distributed, reliable file system (HDFS)
o Scheduler to run jobs in correct order, monitor,
restart on failure, etc.
o MapReduce to help you decompose your problem
for parallel execution
o A variety of other components (mostly based on
MapReduce), e.g. databases, application-focused
libraries
12. How to use Hadoop
● Hadoop is open source (free!)
● It is hosted on Apache: hadoop.apache.org
● Download it and run it standalone (for
debugging)
● Buy a cluster or rent time on one, e.g. AWS,
GCE, Azure. (All offer some free time for
new users.)
13. MapReduce
● The main, and original, parallel-processing
system of Hadoop.
● Developed by Google to simplify parallel
processing. Hadoop started as an open-source
implementation of Google’s idea.
● With Hadoop’s streaming interface, it’s really
easy to use MapReduce in Python.
14. MapReduce - The Big Idea
● Calculations on large data sets often have
this form: Start by aggregating the data
(possibly in a different order from the
“natural order”), then perform a summarizing
calculation on the aggregated groups.
● The idea of MapReduce: If your calculation
is explicitly structured like this, it can be
automatically parallelized.
15. Computing with MapReduce
A MapReduce computation has three stages:
Map: A function called map is applied to each record in
your input. It produces zero or more records as output,
each with a key and value. Keys may be repeated.
Shuffle: The output from step 1 is sorted and combined: All
records with the same key are combined into one.
Reduce: A function called reduce is applied to each record
(key + values) from step 2 to produce the final output.
As the programmer, you only write map and reduce.
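To make the three stages concrete, here is a minimal in-memory sketch. It is not how Hadoop implements them; `run_mapreduce`, `identity_map`, and `sum_reduce` are hypothetical names chosen for this illustration, and the toy job simply totals the values for each key.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce in memory on a list of records."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))

    # Shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: one call per (key, list of values), in key order.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]


# Toy job (made up for illustration): total the values for each key.
def identity_map(record):
    key, value = record
    return [(key, value)]


def sum_reduce(key, values):
    return (key, sum(values))


data = [("A", 7), ("C", 5), ("B", 23), ("B", 12), ("A", 18)]
print(run_mapreduce(data, identity_map, sum_reduce))
# [('A', 25), ('B', 35), ('C', 5)]
```

Only `identity_map` and `sum_reduce` are specific to the job; everything in `run_mapreduce` is the part a framework would supply.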
16. Computing with MapReduce
[Diagram: Input → map → shuffle → reduce → Output]
Records after map (before shuffle):
A, 7
C, 5
B, 23
B, 12
A, 18
Records after shuffle:
A, [18, 7]
B, [23, 12]
C, [5]
Note: map is record-oriented, meaning the output of the
map stage is strictly a combination of the outputs from
each record. That allows us to calculate in parallel...
17. Parallelism via MapReduce
Because map and reduce are record-oriented, MR can
divide inputs into arbitrary chunks:
[Diagram: Input → distribute data → several map tasks running in parallel → combine/shuffle → distribute data → several reduce tasks running in parallel → one Output per reduce task. Example records after the shuffle: A, [18, 7]; B, [23, 12]; C, [5]]
18. MapReduce example: Stock prices
● Input: list of daily opening and closing prices for
thousands of stocks over thousands of days.
● Desired output: The biggest-ever one-day
percentage price increase for each stock.
● Solution using MR:
o map: (stock, open, close) =>
(stock, (close - open) / open) (if pos)
o reduce: (stock, [%c0, %c1, …]) =>
(stock, max [%c0, %c1, …]).
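The map and reduce functions above can be written as ordinary Python functions. This is a sketch only: the names `stock_map` and `stock_reduce` and the "XYZ" prices are invented for illustration, and the increase is kept as a fraction (0.2 means 20%).

```python
def stock_map(record):
    """(stock, open, close) -> [(stock, fractional increase)] if positive."""
    stock, opening, closing = record
    if closing > opening:
        return [(stock, (closing - opening) / opening)]
    return []  # no output for days without an increase


def stock_reduce(stock, increases):
    """(stock, [increases]) -> (stock, biggest increase)."""
    return (stock, max(increases))


# Made-up prices, for illustration only.
days = [("XYZ", 100.0, 110.0), ("XYZ", 110.0, 99.0), ("XYZ", 50.0, 60.0)]
pairs = [pair for day in days for pair in stock_map(day)]
print(stock_reduce("XYZ", [value for _, value in pairs]))  # ('XYZ', 0.2)
```

Note that `stock_map` returns an empty list for the losing day, which is the "zero or more records" behaviour of map.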
20. MapReduce example - shuffle/sort
MapReduce supplies shuffle/sort: Combine all
records for each stock.
Before shuffle/sort:
Goog, 4.3%
MS, 4%
MS, 3.7%
Goog, 16.6%
IBM, 12.5%
After shuffle/sort:
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
21. MapReduce example - reduce
You supply reduce: Output max of percentages for
each input record.
Input to reduce:
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
Output of reduce:
Goog, 16.6%
IBM, 12.5%
MS, 4%
22. Wait, why did that help?
I could have just written a loop to read every
line and put the percentages in a table!
● Suppose you have a terabyte of data, and
1000 computers in your cluster.
● MapReduce can automatically split the data
into 1000 1GB chunks. You write two simple
functions and get a 1000x speed-up!
23. Modelling problems using MR
● We’re going to look at a variety of problems
and see how we can fit them into the MR
structure.
● The question for each problem is: What are
the types of map and reduce, and what do
they do?
24. Example: Word count
Input: Lines of text.
Desired output: # of occurrences of each
word (i.e. each sequence of non-space chars)
E.g. Input: Roses are red, violets are blue
Output: are, 2
blue, 1
red, 1 etc.
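One possible way to fit word count into the MR structure is sketched here. This is an assumption about how the problem might be modelled, not necessarily the formulation used in the talk; `wordcount_map` and `wordcount_reduce` are hypothetical names, and the dictionary stands in for the shuffle stage.

```python
def wordcount_map(line):
    """line of text -> [(word, 1), ...], one pair per occurrence."""
    return [(word, 1) for word in line.split()]


def wordcount_reduce(word, ones):
    """(word, [1, 1, ...]) -> (word, number of occurrences)."""
    return (word, sum(ones))


# In-memory shuffle, standing in for what MapReduce would do.
groups = {}
for word, one in wordcount_map("Roses are red, violets are blue"):
    groups.setdefault(word, []).append(one)

for word in sorted(groups):
    print(wordcount_reduce(word, groups[word]))
```

Because a word is defined here as any sequence of non-space characters, `red,` keeps its comma in this sketch.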
26. Example: Word count frequency
Input: Output of word count
Desired output: For any number of
occurrences c, the number of different words
that occur c times.
E.g. Input: Roses are red, violets are blue
Output: 1, 4
2, 1
27. Example: Word count frequency
Solution:
● map: (w, c) → (c, 1)
● reduce: (c, [1, 1, …]) → (c, n), where the list
contains n 1’s
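The same solution as a Python sketch. `freq_map` and `freq_reduce` are hypothetical names, the input is the word-count output for the example sentence, and the dictionary again stands in for the shuffle stage.

```python
def freq_map(record):
    """(word, count) -> [(count, 1)]; the word itself is dropped."""
    word, count = record
    return [(count, 1)]


def freq_reduce(count, ones):
    """(count, [1, 1, ...]) -> (count, number of words with that count)."""
    return (count, len(ones))


# Word-count output for "Roses are red, violets are blue".
word_counts = [("Roses", 1), ("are", 2), ("blue", 1), ("red,", 1), ("violets", 1)]

# In-memory stand-in for the shuffle stage.
groups = {}
for rec in word_counts:
    for count, one in freq_map(rec):
        groups.setdefault(count, []).append(one)

print([freq_reduce(c, groups[c]) for c in sorted(groups)])
# [(1, 4), (2, 1)]
```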
28. Example: Page Rank
● Famous algorithm used by Google to rank
pages. (Comes down to matrix-vector
multiplication, as we’ll see…)
● Based on two ideas:
o Importance of a page depends upon how many
pages link to it.
o However, if a page has lots of links going out, the
value of each link is reduced.
29. Example: Page Rank
With those two ideas, calculate rank of page:
$$\mathrm{pagerank}(p) = \sum_{q \to p} \frac{\mathrm{pagerank}(q)}{\text{out-degree}(q)}$$
Note: Because the web has cycles - page p can
have a link to page q, which has a link to p -
this formula requires an iterative solution.
30. Example: Page Rank
Consider pages and their links as a graph
(page A has links to B, C, and D, etc.):
pr(A) = pr(B)/2 + pr(D)/2
pr(B) = pr(A)/3 + pr(D)/2
pr(C) = pr(A)/3 + pr(B)/2
pr(D) = pr(A)/3 + pr(C)
31. Example: Page Rank
● Represent the graph as a weighted
adjacency matrix (rows: links to A, B, C, D;
columns: links from A, B, C, D):
$$M = \begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \\ 1/3 & 0 & 1 & 0 \end{pmatrix}$$
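32. Example: Page Rank
● Now, if we put the page rank of each page in
a vector v, then multiplying M by v calculates
the pagerank formula for all nodes:
$$\begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \\ 1/3 & 0 & 1 & 0 \end{pmatrix} \times \begin{pmatrix} \mathrm{pr}(A) \\ \mathrm{pr}(B) \\ \mathrm{pr}(C) \\ \mathrm{pr}(D) \end{pmatrix} = \begin{pmatrix} \mathrm{pr}(B)/2 + \mathrm{pr}(D)/2 \\ \mathrm{pr}(A)/3 + \mathrm{pr}(D)/2 \\ \mathrm{pr}(A)/3 + \mathrm{pr}(B)/2 \\ \mathrm{pr}(A)/3 + \mathrm{pr}(C) \end{pmatrix}$$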
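33. Example: Page Rank
● So, to calculate page ranks, start with an
initial guess of all page ranks and multiply.
● After one multiplication:
$$\begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \\ 1/3 & 0 & 1 & 0 \end{pmatrix} \times \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix} = \begin{pmatrix} 1/4 \\ 5/24 \\ 5/24 \\ 1/3 \end{pmatrix}$$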
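The multiplication can be checked with a few lines of Python. This sketch uses exact fractions from the standard library so the result matches the slide digit for digit; the helper name `multiply` is made up for this example.

```python
from fractions import Fraction as F

# Weighted adjacency matrix M for the four-page example
# (rows: links to A, B, C, D; columns: links from A, B, C, D).
M = [
    [F(0),    F(1, 2), F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
    [F(1, 3), F(0),    F(1), F(0)],
]


def multiply(matrix, vector):
    """Return the matrix-vector product as a list."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


v = [F(1, 4)] * 4           # initial guess: all pages equally ranked
v = multiply(M, v)
print([str(x) for x in v])  # ['1/4', '5/24', '5/24', '1/3']
```

Calling `multiply(M, v)` again gives the next iteration. Each column of M sums to 1, so the ranks keep summing to 1 from one iteration to the next.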
35. Example: Page Rank
● Thus, page rank = matrix-vector product.
● Can we express matrix-vector multiplication
as a MapReduce?
o Assume v is copied (magically) to each node.
o M, being much bigger, needs to be partitioned, i.e. M
is the main input file.
o How shall we represent M and define map and
reduce?
36. Example: Page Rank
● A solution:
o Represent M using one record for each link:
(p, q, out-degree(p)) for every link p→q.
o map: (p, q, d) ↦ (q, v[p]/d)
o reduce: (q, [c1, c2, …]) ↦ (q, c1+c2+...)
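Here is a sketch of that solution on the four-page example graph. The link records are derived from the example matrix; `pr_map`, `pr_reduce`, and the dictionary used for the shuffle are illustrative names and stand-ins, not part of any Hadoop API.

```python
from collections import defaultdict
from fractions import Fraction as F

# One record (p, q, out-degree(p)) for every link p -> q in the
# four-page example graph.
links = [
    ("A", "B", 3), ("A", "C", 3), ("A", "D", 3),
    ("B", "A", 2), ("B", "C", 2),
    ("C", "D", 1),
    ("D", "A", 2), ("D", "B", 2),
]

# Current rank vector v, assumed to be available on every node.
v = {page: F(1, 4) for page in "ABCD"}


def pr_map(record):
    """(p, q, d) -> (q, v[p]/d): the share of p's rank that flows to q."""
    p, q, d = record
    return (q, v[p] / d)


def pr_reduce(page, contributions):
    """(page, [c1, c2, ...]) -> (page, c1 + c2 + ...)."""
    return (page, sum(contributions))


# In-memory stand-in for the shuffle stage.
groups = defaultdict(list)
for key, value in map(pr_map, links):
    groups[key].append(value)

new_v = dict(pr_reduce(page, cs) for page, cs in groups.items())
print({page: str(new_v[page]) for page in sorted(new_v)})
# {'A': '1/4', 'B': '5/24', 'C': '5/24', 'D': '1/3'}
```

The result agrees with the matrix-vector product computed directly from M and the initial guess of 1/4 for every page.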
37. MapReduce: Summary
● Nowadays, MapReduce powers the internet:
o Google, Amazon, Facebook, use it extensively for
everything from page ranking to error log analysis.
o NIH uses it to analyze gene sequences.
o NASA uses it to analyze data from probes.
o etc., etc.
● Next question: How can we implement a
MapReduce?
38. Writing map and reduce in Python
● Easy using the streaming interface:
o map and reduce : stdin → stdout. Each should
iterate over stdin and output result for each line.
o Inputs and outputs are text files. In map and reduce
output, tab character separates key from value.
o Shuffle just sorts the files on the key.
Instead of a line with a key and list of values, we
get consecutive lines with the same key.
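Since a streaming reducer sees runs of consecutive lines with the same key, one convenient way to rebuild the "key plus list of values" view is `itertools.groupby` from the standard library. This is a sketch; the helper names `parse` and `grouped` are made up, and a real reducer would pass `sys.stdin` instead of the sample list.

```python
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>value' line into (key, value)."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, value


def grouped(lines):
    """Yield (key, [values]) for each run of consecutive equal keys."""
    pairs = (parse(line) for line in lines)
    for key, run in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in run]


# Sorted mapper output, as a streaming reducer would see it on stdin.
sorted_lines = ["Goog\t4.3\n", "Goog\t16.6\n", "IBM\t12.5\n", "MS\t3.7\n", "MS\t4\n"]
for key, values in grouped(sorted_lines):
    print(key, values)
```

This only works because the input is sorted by key: `groupby` groups adjacent items and does not sort.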
39. Example: stock prices
● Recall the output of the shuffle stage:
Goog, [4.3%, 16.6%]
IBM, [12.5%]
MS, [3.7%, 4%]
● The only difference is this becomes:
Goog 4.3%
Goog 16.6%
IBM 12.5%
MS 3.7%
MS 4%
40. Example: stock prices
● On the next two slides, we show the map
and reduce functions in Python.
● Both of them are just stand-alone programs
that read stdin and write stdout.
● In fact, we can test our pipeline without using
MapReduce:
cat input-file | ./map.py | sort | ./reduce.py
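41. Example: stock prices - map.py
#!/usr/bin/env python
import sys

# Each input line has the form: stock,open,close
for line in sys.stdin:
    record = line.strip().split(",")
    # float() accepts both whole-number and decimal prices
    opening = float(record[1])
    closing = float(record[2])
    if closing > opening:
        # fractional one-day increase, e.g. 0.043 for 4.3%
        change = (closing - opening) / opening
        # a tab separates the key (stock) from the value (change)
        print('%s\t%s' % (record[0], change))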
42. Example: stock prices - reduce.py
#!/usr/bin/env python
import sys

stock = None
max_increase = 0
for line in sys.stdin:
    next_stock, increase = line.rstrip('\n').split('\t')
    increase = float(increase)
    if next_stock == stock:  # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:  # new stock; output result for previous stock
        if stock:  # only false on the very first line of input
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase
# print the result for the last stock (skipped if there was no input)
if stock:
    print("%s\t%f" % (stock, max_increase))
43. Invoking Hadoop
● Now we just have to run Hadoop. (Here we
are running locally. To run in a cluster, you
need to move the data into HDFS first.)
If you want to run code on our servers, I’ll
give instructions at the end of the talk.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt -output output \
    -mapper map.py -reducer reduce.py
44. Brief history of Hadoop
● 2004: Two engineers from Google published
a paper on MapReduce
o Doug Cutting was working on an open-source web
crawler; saw that MapReduce solved his biggest
problem: coordinating lots of computers; decided to
implement an open-source version of MR.
o Yahoo hired Cutting and continued and expanded
the Hadoop project.
45. Brief history of Hadoop (cont.)
● Today: The Hadoop ecosystem includes its own
scheduler, lock mechanism, many database systems,
MapReduce, a non-MapReduce parallelism
system called Spark, and more.
● Demand for “data engineers” who can
manage huge datasets using Hadoop keeps
increasing.
46. Summary
● We discussed the easiest way (that I know)
to use Hadoop to process large datasets.
● Hadoop provides MapReduce, which can
exploit massive parallelism by automatically
breaking up inputs and processing the
pieces separately, as long as the user
supplies map and reduce functions.
47. Summary (cont.)
● Your problem as a programmer is to figure
out how to write map and reduce functions
that will solve your problem. This is
sometimes really easy.
● Using Python streaming, map and reduce
are just Python scripts that read from stdin
and write to stdout - no need to learn special
Hadoop APIs or anything!
48. So is that all there is to MapReduce?
● If only! For more complex cases and for
higher efficiency:
o Use Java for higher efficiency
o Store data in the cluster, for capacity, reliability, and
efficiency
o Tune your application for higher efficiency, e.g.
placing computations near data
o Use some of many Hadoop components that can
make programs easier to write and more efficient
49. Next steps
● If you want to learn more, there are many books and
online tutorials.
o Hadoop: The Definitive Guide, by Tom White, is the
definitive guide. (You’ll need to know Java.)
● We’ll be giving a five-Saturday lecture/lab class
expanding on this meet-up starting this Saturday, and a
twelve-evening class starting August 3.
● We’ll be giving a six-week, full-time bootcamp on
Hadoop+Python starting in late August.
50. Running examples
● For those of you who want to run examples:
o Log in to the server per the given instructions
o Directory streaming-examples has code for stock
prices, wordcount, and word frequencies.
o In each directory, enter: source run-hadoop.sh
o Output in output/part-00000 should match file
expected-output.
o If you want to edit and re-run, you need to delete
output directories: rm -r output (and rm -r output0 in
count-freq).
51. Running examples (cont.)
● Please let us know if you want to continue
working on this tomorrow; we’ll leave the
accounts live until Friday if you request it.
● Some suggestions:
o Word count variants
Ignore case
Ignore punctuation
Find number of words of each length
Create sorted list of words of each length
52. Running examples (cont.)
● Some suggestions:
o Stock prices
Produce both max and min increases
o Matrix-vector multiplication - you’ll be starting from
scratch on this one.
Implement the method we described.
Suppose the input is in the form p, q1, q2, …, qn,
i.e. a page and all of its outgoing links.
53. Combiners
● Obvious source of inefficiency in wordcount:
Suppose a word occurs twice on one line;
we should output one line of ‘w, 2’ instead of
two lines of ‘w, 1’.
● In fact, this applies to the entire file: Instead
of ‘w, 1’ for each occurrence of a word,
output ‘w, n’ if w occurs n times.
54. Combiners
● Or, to put this differently: We should apply
reduce to each file before the shuffle stage.
● Can do this by specifying a combiner
function (which in this case is just reduce).
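The effect of a combiner can be simulated in memory. In this sketch the two input splits are made up, `wc_map` and `wc_reduce` are hypothetical names, and the same `wc_reduce` function plays both the combiner and the reducer role, as the slide suggests for word count.

```python
from collections import Counter


def wc_map(lines):
    """Emit (word, 1) for every word occurrence in this split."""
    return [(word, 1) for line in lines for word in line.split()]


def wc_reduce(pairs):
    """Sum the counts for each word; usable as combiner and as reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Two made-up input splits, each handled by its own mapper.
splits = [["a rose is a rose"], ["a rose is red"]]

# Records that would be sent through the shuffle in each case.
without_combiner = [pair for s in splits for pair in wc_map(s)]
with_combiner = [pair for s in splits for pair in wc_reduce(wc_map(s))]

print(len(without_combiner), len(with_combiner))                # 9 7
print(wc_reduce(without_combiner) == wc_reduce(with_combiner))  # True
```

Fewer records cross the shuffle, and the final counts are unchanged. That holds here because adding up partial sums gives the same total; a reduce function without that property cannot simply be reused as a combiner.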
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input.txt \
    -output output \
    -mapper map.py \
    -reducer reduce.py -combiner reduce.py