This document outlines a workshop on using Hadoop and R for big data analytics. It provides an introduction to Hadoop, describing its file system and MapReduce framework. It also introduces R, noting it can be used with Hadoop's MapReduce approach. The document details various techniques that can be used with Hadoop and R, including counting, graphics, modeling, scoring, sampling and simulating large datasets. Specific modeling techniques covered are linear regression, logistic regression, and trees/random forests.
2. Big Data Analytics
R & Hadoop
Carlos J. Gil Bellosta
cgb@datanalytics.com
November 2013
3. Table of Contents
1 Intro to Hadoop & R
All about Hadoop
Hadoop FS
Hadoop & mapreduce
All about R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
4. File system: manages everything about files
• Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
• Combination of hardware and software that hides tedious chores from users:
  • Finding space to write the files
  • Reading/writing files
  • Managing fragmentation
  • Etc.
• How many devices per FS?
  • 1-to-1: diskettes, CD-ROMs, HDDs,...
  • n-to-1: partitioned HDDs,...
  • 1-to-n: RAIDs, Hadoop
5. Hadoop goodies (as a FS)
• Splits (large) files into chunks spread across machines
• Replicates chunks (3 copies by default)
• Balances data
• Robust to hardware failures
• It is rack aware
Obviously, it requires some system to keep track of:
• Which servers/racks are up/down
• Where each chunk is located
• ...
6. How to work with data in Hadoop?
• Provides a shell (ls, cp, etc.)
• You can put/get data between your local FS and the Hadoop FS
• That is:
  • You can dump your data to your local machine
  • You can run your programs on your local machine
  • You can put the results back into Hadoop
• But what if the file is too large?
Solution
Rather than bringing the data to the code, why not move the code to the data?
One of the ways to move code to data is known as mapreduce.
7. Mapreduce
• Two-step process:
  • Map: run your code on the chunks, wherever they are stored
  • Reduce: reshape the output into the desired format
• Hadoop manages issues such as:
  • System failures
  • Threads that do not return
  • And all (?) that made the life of OpenMP, MPI, etc. users miserable
• Slotted approach: mapreduce provides slots where you put the mapper/reducer code
• The code is for you to provide!
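The two steps can be mimicked on a single machine in plain R. The sketch below is only a local illustration and does not talk to Hadoop: the toy data, the number of chunks and the helper names (`map_chunk`, `reduce_partials`) are all made up for this example.

```r
# Toy data (hypothetical): ages with a group label
set.seed(1)
dat <- data.frame(group = sample(c("a", "b", "c"), 1000, replace = TRUE),
                  age   = rnorm(1000, mean = 40, sd = 10),
                  stringsAsFactors = FALSE)

# Pretend the file is stored as 4 chunks on different machines
chunks <- split(dat, rep(1:4, length.out = nrow(dat)))

# Map: runs on each chunk independently; emits (key, partial sum, count)
map_chunk <- function(chunk) {
  s <- tapply(chunk$age, chunk$group, sum)
  n <- tapply(chunk$age, chunk$group, length)
  data.frame(key = names(s), sum = as.vector(s), n = as.vector(n))
}

# Reduce: gathers the partial results by key and reshapes them into means
reduce_partials <- function(partials) {
  stacked <- do.call(rbind, partials)
  s <- tapply(stacked$sum, stacked$key, sum)
  n <- tapply(stacked$n, stacked$key, sum)
  data.frame(key = names(s), mean = as.vector(s / n))
}

result <- reduce_partials(lapply(chunks, map_chunk))
print(result)
```

The mapper never sees more than one chunk, and the reducer only sees the small per-chunk summaries; that is the whole point of the slotted approach.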
8. What is R?
• R is a
  • software package?
  • programming language?
  • environment?
  for data analysis and graphics.
• R users are (should be?) used to the mapreduce approach:
    library(plyr)   # provides ddply
    # split dfx by (group, sex), summarize each piece, combine the results
    ddply(dfx, .(group, sex), summarize,
          mean = mean(age),
          sd   = sd(age))
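The same split/apply/combine pattern can be spelled out in base R, which makes the analogy with map and reduce explicit. This is a sketch under stated assumptions: the contents of `dfx` are invented here (the slides do not show the data), and `split()` plus `lapply()` stand in for what `ddply` does internally.

```r
# Hypothetical data frame with the columns used in the ddply call
set.seed(42)
dfx <- data.frame(group = sample(c("A", "B"), 200, replace = TRUE),
                  sex   = sample(c("F", "M"), 200, replace = TRUE),
                  age   = round(runif(200, 18, 80)),
                  stringsAsFactors = FALSE)

# "Map" side: route every row to the piece identified by its key (group, sex)
pieces <- split(dfx, list(dfx$group, dfx$sex), drop = TRUE)

# "Reduce" side: summarize each piece and bind the small results together
summ <- do.call(rbind, lapply(pieces, function(p) {
  data.frame(group = p$group[1], sex = p$sex[1],
             mean = mean(p$age), sd = sd(p$age))
}))
print(summ)
```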
9. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
Graphics & big data
Let’s count... hexagons
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
10. Visualizing a million
11. Fluctuation plot
12. Table plot
13. Let’s count... hexagons
• Non-trivial counting exercise (no, we are not counting words today!)
• Good visualization features for big datasets
• Fits in the mapreduce framework:
  • Map: assigns points to hexagons
  • Reduce: aggregates counts on hexagons
• The output is small and can be plotted locally
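A local sketch of the hexagon count, in base R. It assumes one standard way of assigning points to hexagons (a regular hexagonal grid seen as two rectangular lattices shifted by half a cell, picking the nearest centre); the function name `hex_key`, the cell width `w` and the simulated points are illustrative, and the chunks are ordinary in-memory data frames rather than Hadoop blocks.

```r
# Key of the hexagon containing each point. Lattice A has centres at
# (i * w, j * h); lattice B is shifted by half a cell in both directions.
hex_key <- function(x, y, w = 1) {
  h  <- w * sqrt(3)
  ia <- round(x / w); ja <- round(y / h)   # nearest centre on lattice A
  ib <- floor(x / w); jb <- floor(y / h)   # nearest centre on lattice B
  da <- (x - ia * w)^2 + (y - ja * h)^2
  db <- (x - (ib + 0.5) * w)^2 + (y - (jb + 0.5) * h)^2
  use_a <- da <= db
  paste(ifelse(use_a, "A", "B"),
        ifelse(use_a, ia, ib),
        ifelse(use_a, ja, jb), sep = ":")
}

set.seed(7)
n      <- 10000
pts    <- data.frame(x = rnorm(n), y = rnorm(n))
chunks <- split(pts, rep(1:5, length.out = n))

# Map: each chunk emits (hexagon key, count)
mapped <- lapply(chunks, function(ch) table(hex_key(ch$x, ch$y, w = 0.5)))

# Reduce: add up the counts that share a key
keys       <- unlist(lapply(mapped, names), use.names = FALSE)
counts     <- unlist(lapply(mapped, as.vector), use.names = FALSE)
hex_counts <- tapply(counts, keys, sum)

length(hex_counts)   # a few hundred cells instead of 10000 points
```

The reduced table has one row per occupied hexagon, so it stays small however many points go in and can be plotted on a laptop.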
14. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
15. What you see: input/output, map, reduce
• input:
  • Type: text, csv, R object,...
  • Options: separator,...
• output: similar to input
• map & reduce:
  • Functions with a (k, v) argument (k, key; v, value)
  • They return a list of (k, v) pairs
  • Thus, mapreduce jobs can be chained together (the output of the first one is the input for the second)
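Chaining can be illustrated with a tiny local stand-in for a mapreduce engine. Everything in this sketch is hypothetical and written for the example (`keyval`, `local_mapreduce`, the toy input); it is not the API of any Hadoop package, it only mirrors the contract described above: map and reduce take a key and a value and return (k, v) pairs.

```r
# A (k, v) pair is represented as a two-element list
keyval <- function(k, v) list(k = k, v = v)

# Hypothetical local engine: 'input' and the result are lists of (k, v) pairs
local_mapreduce <- function(input, map, reduce) {
  # Map phase: every pair may emit a list of new pairs
  mapped <- do.call(c, lapply(input, function(p) map(p$k, p$v)))
  # Shuffle: group the emitted values by key
  keys    <- vapply(mapped, function(p) as.character(p$k), character(1))
  grouped <- split(lapply(mapped, function(p) p$v), keys)
  # Reduce phase: one call per key, each returning a single pair
  unname(Map(function(k, vs) reduce(k, vs), names(grouped), grouped))
}

input <- lapply(c(1, 1, 2, 3, 3, 3), function(x) keyval(NULL, x))

# Job 1: how many times does each value occur?
counts <- local_mapreduce(input,
  map    = function(k, v) list(keyval(v, 1)),
  reduce = function(k, vs) keyval(k, sum(unlist(vs))))

# Job 2 takes the output of job 1 as its input: what is the largest count?
top <- local_mapreduce(counts,
  map    = function(k, v) list(keyval("max", v)),
  reduce = function(k, vs) keyval(k, max(unlist(vs))))

str(top)
```

Because both jobs consume and produce the same kind of object, the second call needs no glue code.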
16. What you don’t see
$HADOOP jar $HADOOP_STREAMING \
    -D stream.map.input=typedbytes \
    -D stream.map.output=typedbytes \
    -D stream.reduce.input=typedbytes \
    -D stream.reduce.output=typedbytes \
    -D mapred.reduce.tasks=0 \
    -input /tmp/RtmpUUrNMj/file68c0185e60c \
    -output /tmp/RtmpUUrNMj/file68c04c25d5f0 \
    -mapper "Rscript rmr-streaming-map68c018acf680 " \
    -file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a \
    -file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080 \
    -file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680 \
    -inputformat org.apache.hadoop.streaming.AutoInputFormat \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputForm
17. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
18. Scoring
• External consultants build a model (using R and small data)
• Models in R should have a predict method
• You can then score your huge database (in batch)
• No need to rewrite the model into your systems!
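Batch scoring is a map-only job: every chunk is scored on its own with `predict()` and nothing needs to be reduced. The sketch below runs locally; the simulated data, the model formula and the chunking are assumptions made for the example, with in-memory data frames standing in for the chunks of a huge table.

```r
set.seed(123)
# "Small data": the sample on which the model is fitted
small   <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
small$y <- rbinom(500, 1, plogis(-0.5 + small$x1 - 2 * small$x2))
model   <- glm(y ~ x1 + x2, data = small, family = binomial)

# "Big data": stands in for a table far too large for one machine
big    <- data.frame(x1 = rnorm(1e5), x2 = rnorm(1e5))
chunks <- split(big, rep(1:10, length.out = nrow(big)))

# Map-only job: each chunk is scored independently with the fitted model
scored <- lapply(chunks, function(ch) {
  ch$score <- predict(model, newdata = ch, type = "response")
  ch
})

summary(scored[[1]]$score)
```

Only the fitted `model` object has to travel to the nodes; the model itself is never re-implemented.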
19. The case for sampling
• Sampling works!
• Sampled datasets can be used to build small data models
• You can use R (& mapreduce) to sample data, but you had better not
20. Running simulations on Hadoop
• Some (many?) people say it is not the right tool
• Mapreduce needs input data, but simulations often have none
• You want to control the number of mappers (which run your simulations)
• Still, mapreduce is nice for simulations...
• ... so let an old dog try its dirty trick!
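The slides do not spell the trick out. One common workaround, assumed here, is to feed mapreduce a small dummy input (for instance, one random seed per batch) so that every map call runs one batch of simulations and the reducer pools the results. The sketch is a local base-R illustration; `run_batch`, the number of seeds and the Monte Carlo estimate of pi are all invented for the example.

```r
# Dummy input: one seed per simulation batch (the assumed "trick").
# On a cluster, the number of input records would drive the number of map calls.
seeds <- 1:20

# Map: run one batch of the simulation; here, a Monte Carlo estimate of pi
run_batch <- function(seed, n = 10000) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  c(hits = sum(x^2 + y^2 <= 1), n = n)
}
partials <- lapply(seeds, run_batch)

# Reduce: pool the batches into a single estimate
totals <- Reduce(`+`, partials)
pi_hat <- 4 * totals[["hits"]] / totals[["n"]]
pi_hat
```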
21. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
Linear Regression
Logistic Regression
Trees & Random Forests
6 Final remarks
22. Linear regression can be parallelized
Simple linear regression: y ∼ α + βx
\[
\hat{\beta}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}
\]
All the sums run case by case, so each chunk can compute its own share!
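Since the estimator only depends on a handful of sums, each chunk can return its own partial sums and the reducer just adds them. A minimal local sketch, with simulated data and an arbitrary number of chunks (both assumptions of the example); the intercept uses the standard identity $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, which is not on the slide.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
chunks <- split(data.frame(x, y), rep(1:8, length.out = n))

# Map: each chunk returns n, sum(x), sum(y), sum(x*y) and sum(x^2)
partial <- lapply(chunks, function(ch)
  c(n = nrow(ch), sx = sum(ch$x), sy = sum(ch$y),
    sxy = sum(ch$x * ch$y), sxx = sum(ch$x^2)))

# Reduce: add the partial sums and plug them into the formula
s <- Reduce(`+`, partial)
beta_hat  <- (s[["sxy"]] - s[["sx"]] * s[["sy"]] / s[["n"]]) /
             (s[["sxx"]] - s[["sx"]]^2 / s[["n"]])
alpha_hat <- (s[["sy"]] - beta_hat * s[["sx"]]) / s[["n"]]

c(alpha = alpha_hat, beta = beta_hat)
```

The result can be checked against `coef(lm(y ~ x))` on the full data.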
23. Multiple linear regression
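• Based on $X^\top X$ and $X^\top y$:
\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]
• If $X = [X_1 | \dots | X_n]$ (by blocks of rows), then $X^\top X = \sum_i X_i^\top X_i$ and, likewise, $X^\top y = \sum_i X_i^\top y_i$.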
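The block decomposition translates directly into a map step that computes small cross-product matrices and a reduce step that sums them. The sketch below is a local illustration in base R; the simulated design matrix, the true coefficients and the number of blocks are assumptions of the example.

```r
set.seed(2)
n <- 5000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))       # intercept + 3 covariates
y <- drop(X %*% c(1, -2, 0.5, 3) + rnorm(n))
idx <- split(seq_len(n), rep(1:10, length.out = n))   # row blocks

# Map: each block contributes X_i' X_i (p x p) and X_i' y_i (p x 1)
partial <- lapply(idx, function(i)
  list(xtx = crossprod(X[i, , drop = FALSE]),
       xty = crossprod(X[i, , drop = FALSE], y[i])))

# Reduce: sum the small matrices, then solve the normal equations
xtx <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partial, `[[`, "xty"))
beta_hat <- drop(solve(xtx, xty))
beta_hat
```

Whatever the number of rows, only p x p and p x 1 matrices travel from the mappers to the reducer.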
24. Can logistic regression be parallelized? Yes and no.
• Fitting logistic regression models is iterative, and the iterations cannot run in parallel (each one needs the result of the previous one).
• However, the work within each iteration can be parallelized (it is not unlike fitting linear models as before)
• We will explore two big data alternatives:
  • Parallelize each iteration using mapreduce (see http://goo.gl/ftx36r)
  • Split your data meaningfully and do standard logistic regression in the nodes
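The first alternative can be sketched locally with Newton-Raphson (equivalently, iteratively reweighted least squares): inside each iteration the gradient and the Hessian are sums over rows, so they can be computed block by block and added up, while the loop over iterations stays sequential. This is an illustration under assumptions (simulated data, 8 blocks, a fixed cap of 25 iterations) and not necessarily the approach of the linked article.

```r
set.seed(3)
n <- 4000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -1))))
idx <- split(seq_len(n), rep(1:8, length.out = n))   # row blocks

beta <- rep(0, ncol(X))
for (iter in 1:25) {                 # iterations run one after another
  # Map: each block computes its share of the gradient and the Hessian
  partial <- lapply(idx, function(i) {
    Xi <- X[i, , drop = FALSE]
    mu <- plogis(drop(Xi %*% beta))
    list(grad = crossprod(Xi, y[i] - mu),
         hess = crossprod(Xi, Xi * (mu * (1 - mu))))
  })
  # Reduce: sum the pieces and take one Newton step
  grad <- Reduce(`+`, lapply(partial, `[[`, "grad"))
  hess <- Reduce(`+`, lapply(partial, `[[`, "hess"))
  step <- drop(solve(hess, grad))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
beta
```

Each pass of the loop corresponds to one mapreduce job, so the cost on a cluster is roughly the number of iterations times the overhead of launching a job.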
25. How many bytes make knowledge? (aka the fractal nature of big data)
26. Split logistic regression
27. Viable alternatives to logistic models
• Trees
  • High interpretability
  • But unstable and tend to miss details
• Random forests
  • Black boxes
  • Superb performance
  • These are collections of trees that can be built in parallel
• Both can be parallelized in different ways:
  • Similar to partitioned logistic models above
  • Within training
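The "partitioned" route can be sketched as follows: every chunk grows a few trees on bootstrap resamples of its own rows, the ensemble is the concatenation of all the trees, and predictions are made by majority vote. Strictly speaking this is bagging of trees rather than a full random forest (no random selection of variables at each split). It assumes the `rpart` package is installed (it ships with standard R distributions) and uses `iris` as a stand-in dataset; `grow` and `predict_forest` are names made up for the example.

```r
library(rpart)   # assumed to be available

set.seed(4)
dat    <- iris[sample(nrow(iris)), ]
chunks <- split(dat, rep(1:3, length.out = nrow(dat)))

# Map: each chunk grows a few trees on bootstrap resamples of its own rows
grow <- function(ch, n_trees = 5) {
  lapply(seq_len(n_trees), function(b) {
    boot <- ch[sample(nrow(ch), replace = TRUE), ]
    rpart(Species ~ ., data = boot, method = "class")
  })
}

# Reduce: the ensemble is just the concatenation of all the trees
forest <- do.call(c, unname(lapply(chunks, grow)))

# Scoring: every tree votes, the most voted class wins
predict_forest <- function(forest, newdata) {
  votes <- sapply(forest, function(tr)
    as.character(predict(tr, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

pred <- predict_forest(forest, dat)
mean(pred == dat$Species)   # in-sample accuracy of the pooled ensemble
```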
28. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
29. Forget most of what you learned today, seriously
• People strive to extend small data models to big data (as we did today)...
• ... but is it the way to go?
• Beware: microlocal structure
  • Small data people know microlocal structure as outliers
  • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
  • But the promises of big data lie precisely there
  • (Otherwise, just sample and you will be fine)
• Areas to watch for insights on big data modelling:
  • SNA (social network analysis)
  • Text analysis
30. Thank you very much and...
... questions?