“I want to die on Mars but not on impact” 
— Elon Musk, interview with Chris Anderson 
"There are no facts, only interpretations." - Friedrich Nietzsche 
“The shrewd guess, the fertile hypothesis, the courageous leap to a 
tentative conclusion – these are the most valuable coin of the thinker at 
work” -- Jerome Seymour Bruner 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
@ksankar // doubleclix.wordpress.com 
http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI
October 29, 2014
Agenda
Oct 29, 2:00-3:15 (75 min) and 3:45-5:00 (75 min) = 150 min
o Spark & Data Science DevOps [20 min, 2:00 – 2:20]
• Spark, Python & Machine Learning
• Goals/non-goals
• Intro to Spark
• Stack, Mechanisms – RDD
• Datasets : SOTU, Titanic, Frequent Flier
• Statistical Toolbox
• Summary, Correlations
o “Mood Of the Union” [30 min, 2:20 – 2:50]
• State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
• Map reduce, parse text
o Clustering [25 min, 2:50 – 3:15]
• K-means for Gallactic Hoppers!
o Break [3:15 – 3:45]
o Predicting Survivors with Classification [30 min, 3:45 – 4:15]
• Decision Trees
• NaiveBayes (Titanic data set)
o Linear Regression [10 min, 4:15 – 4:25]
o Recommendation Engine [20 min, 4:25 – 4:45]
• Collab Filtering w/movie lens
o Discussions/Slack [15 min, 4:45 – 5:00]
Goals & non-goals
Goals
¤ Understand how to program Machine Learning with Spark & Python
¤ Focus on programming & ML application
¤ Give you a focused time to work thru examples
§ Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
§ The programs can be optimized
Non-goals
¡ Go deep into the algorithms
• We don’t have sufficient time. The topic can easily be a 5-day tutorial!
¡ Dive into Spark internals
• That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
• Paco does a good job explaining them
¡ A passive talk
• Nope. Interactive & hands-on
About Me 
o Chief Data Scientist at BlackArrow.tv 
o Have been speaking at OSCON, PyCon, Pydata et al 
The Nuthead band !
o Reviewing Packt Book “Machine Learning with Spark” 
o Picked up co-authorship Second Edition of “Fast Data Processing with Spark” 
o Have done lots of things: 
• Big Data (Retail, Bioinformatics, Financial, AdTech), 
• Written Books (Web 2.0, Wireless, Java,…) 
• Standards, some work in AI, 
• Guest Lecturer at Naval PG School,… 
• Planning MS-CFinance or Statistics or Computational Math 
o Volunteer as Robotics Judge at First Lego league World Competitions 
o @ksankar, doubleclix.wordpress.com
Spark & Data Science DevOps
2:00
Close Encounters 
— 1st 
◦ This Tutorial 
— 2nd 
◦ Do More Hands-on Walkthrough 
— 3rd
◦ Listen To Lectures 
◦ More competitions …
Spark Installation 
o Install Spark 1.1.0 in local Machine 
o https://spark.apache.org/downloads.html 
• Pre-built For Hadoop 2.4 is fine 
o Download & uncompress
o Remember the path & use it wherever you see /usr/local/spark/
o I have downloaded it into /usr/local & have a softlink spark to the latest version
Tutorial Materials 
o Github : https://github.com/xsankar/cloaked-ironman 
o Clone or download zip 
o Open terminal 
o cd ~/cloaked-ironman 
o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark
o Note : 
o I have a soft link “spark” in my /usr/local that points to the spark version that I 
use. For example ln -s spark-1.1.0/ spark 
o Click on ipython dashboard 
o Just look thru the ipython notebooks
Data Science - Context 
Reason Model Deploy
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
Collect Store Transform
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set
Data Management
Data Science
Visualize Recommend Predict Explore
o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
Velocity 
Volume 
Variety 
Data Science - Context 
Context 
Connectedness
Interface 
Inference 
Intelligence 
“Data of unusual size” 
that can't be brute forced 
o Three Amigos 
o Interface = Cognition 
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
Day in the life of a (super) Model 
Intelligence
Inference
Data Representation
Interface
Algorithms
Attributes
Parameters
Data (Scoring)
Model Selection
Reason & Learn
Models
Visualize, Recommend, Explore
Model Assessment
Dimensionality Reduction
Feature Selection
Data Science Maturity Model & Spark
Isolated Analytics
Integrated Analytics
Aggregated Analytics
Automated Analytics
Data
Small Data
Larger Data set
Big Data
Big Data Factory
Model
Context
Local
Domain
Cross-domain + External
Cross domain + External
Model, Reason & Deploy
• Single set of boxes, usually owned by the Model Builders
• Departmental
• Deploy - Central Analytics Infrastructure
• Models still owned & operated by Modelers
• Partly Enterprise-wide
• Central Analytics Infrastructure
• Model & Reason – by Model Builders
• Deploy, Operate – by ops
• Residuals and other metrics monitored by modelers
• Enterprise-wide
• Distributed Analytics Infrastructure
• AI Augmented models
• Model & Reason – by Model Builders
• Deploy, Operate – by ops
• Data as a monetized service, extending to eco system partners
• Reports
• Dashboards
• Dashboards + some APIs
• Dashboards + Well defined APIs + programming models
Type
• Descriptive & Reactive
• + Predictive
• + Adaptive
• Adaptive
Datasets
• All in the same box
• Fixed data sets, usually in temp data spaces
• Flexible Data & Attribute sets
• Dynamic datasets with well-defined refresh policies
Workload
• Skunk works
• Business relevant apps with approx SLAs
• High performance appliance clusters
• Appliances and clusters for multiple workloads including real time apps
• Infrastructure for emerging technologies
Strategy
• Informal definitions
• Data definitions buried in the analytics models
• Some data definitions
• Data catalogue, metadata & Annotations
• Big Data MDM Strategy
The Sense & Sensibility of a DataScientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
Spark-The Stack
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html
RDD – The workhorse of Spark 
o Resilient Distributed Datasets 
• Collection that can be operated in parallel 
o Transformations – create RDDs 
• Map, Filter,… 
o Actions – Get values 
• Collect, Take,… 
o We will apply these operations during this tutorial
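To make the transformation/action split concrete, here is a plain-Python sketch. It is not the Spark API: generators stand in for lazy transformations, and `list` / `islice` stand in for the Collect and Take actions. The helper names (`transform_map`, `transform_filter`, `action_collect`, `action_take`) are hypothetical and exist only for this illustration.

```python
from itertools import islice

def transform_map(func, data):
    """Lazy stand-in for a Map transformation: builds a recipe, runs nothing yet."""
    return (func(x) for x in data)

def transform_filter(pred, data):
    """Lazy stand-in for a Filter transformation."""
    return (x for x in data if pred(x))

def action_collect(data):
    """Stand-in for Collect: forces evaluation and returns every value."""
    return list(data)

def action_take(data, n):
    """Stand-in for Take: evaluates only as much as is needed for n values."""
    return list(islice(data, n))

calls = []  # records when the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

# Chain two transformations; no element has been squared at this point.
pipeline = transform_filter(lambda v: v % 2 == 0, transform_map(square, range(1, 7)))
calls_before_action = len(calls)

# The action triggers the work, and only for the elements it needs.
first_two = action_take(pipeline, 2)
```

The point of the sketch is the ordering: building the pipeline costs nothing, and the work happens only when an action asks for values.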
Algorithm spectrum 
o Regression 
o Logit 
o CART 
o Ensemble : Random Forest
o Clustering
o KNN
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning
Machine Learning
Cute Math
Artificial Intelligence
Not all MLlib APIs are available in Python (as of 1.1.0)
API                            Spark 1.1.0 Java/Scala   Spark 1.1.0 Python   Spark 1.2.0
Basic Statistics               ✔                        ✔
Linear Models                  ✔                        ✔
Decision Trees                 ✔                        ✔
Random Forest                  ✖                        ✖
Collab Filtering               ✔                        ✔
Clustering-KMeans              ✔                        ✔
Clustering-Hierarchical        ✖                        ✖
SVD                            ✔                        ✖
PCA                            ✔                        ✖
Standard Scaler, Normalizer    ✔                        ✖
Model Evaluation-PR/ROC
Spark 1.2 MLlib JIRA : http://bit.ly/1ywotkm
Statistical Toolbox 
o Sample data : Car mileage data 
https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py
“Mood Of the Union” 
with TF-IDF 
2:20
Scenario – Mood Of the Union 
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & Python ?
o Is it different from Hadoop-MapReduce ?
o Is it better ?
POA (Plan Of Action) 
o Collect State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect top n words to see if they reflect the sentiment of the time
o Compute set difference and see how new words have cropped up 
o Compute TF-IDF (homework!)
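The plan above can be sketched in plain Python before moving to RDDs. This is only an illustration of the steps (count words, drop common words, take the top n, set difference); it does not use Spark, the two sample texts are made up and are not real SOTU excerpts, and `COMMON_WORDS` is a hypothetical, very short stop list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of common words to drop.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "our", "we", "is", "that"}

def word_counts(text):
    """Lower-case the text, split it into words, drop common words, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in COMMON_WORDS)

def top_n(counts, n):
    """Sort by descending count (ties broken alphabetically) and keep n entries."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def new_words(later, earlier):
    """Words that appear in the later speech but not in the earlier one."""
    return sorted(set(later) - set(earlier))

# Made-up sample texts standing in for two speeches.
speech_a = "The union of the states is strong and the union endures"
speech_b = "We invest in energy and jobs, and the union is strong"

counts_a = word_counts(speech_a)
counts_b = word_counts(speech_b)
top_a = top_n(counts_a, 2)
cropped_up = new_words(counts_b, counts_a)
```

In the RDD version each of these steps becomes a transformation over (word, count) pairs; the logic inside the functions stays the same.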
Lookout for these interesting Spark features 
o RDD Map-Reduce 
o How to parse input 
o Removing common words 
o Sort rdd by value
iPython notebook at https://github.com/xsankar/cloaked-ironman 
Read & Create word vector
Remove Common Words – 1 of 3 
Remove Common Words – 2 of 3
Remove Common Words – 3 of 3
FDR vs. Barack Obama as reflected by SOTU
Barack Obama vs. Bill Clinton
GWB vs Abe Lincoln as reflected by SOTU
Epilogue 
o Interesting Exercise 
o Highlights 
• Map-reduce in a couple of lines ! 
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog post by Sean Owen, URL below)
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter) 
o To Explore as homework: 
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf 
• Haven’t seen it in python for 1.1. 
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
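For the TF-IDF homework, a small plain-Python sketch may help fix the idea before trying the MLlib feature-extraction API. It is not Spark code. The smoothed IDF form log((m + 1) / (df + 1)) is an assumption chosen for this illustration, and the three token lists are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    m = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))          # count each term once per document
    # Assumed smoothed IDF: terms present in every document get weight 0.
    idf = {t: math.log((m + 1.0) / (df + 1.0)) for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in docs
    ]

# Made-up token lists standing in for three speeches.
docs = [
    ["union", "war", "union"],
    ["union", "economy"],
    ["union", "jobs", "economy"],
]
weights = tf_idf(docs)
```

A word such as "union" that appears in every document ends up with weight zero, which is the effect the common-word removal step approximates by hand.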
Clustering 
2:50
Scenario – Clustering with Spark 
o InterGallactic Airlines has the GallacticHoppers frequent flyer program & has data about the customers who participate in the program.
o The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for its frequent flier program.
o Can they just have one type of promotion ? 
o Should they have different types of incentives ? 
o Who exactly are the customers in their GallacticHoppers program ? 
o Recently they have deployed an infrastructure with Spark 
o Can Spark help in this business problem ?
Clustering - Theory 
o Clustering is unsupervised learning 
o While computers can dissect a dataset into “similar” clusters, they still need human direction & domain knowledge to interpret & guide
o Two types: 
• Centroid based clustering – k-means clustering 
• Tree based Clustering – hierarchical clustering 
o Spark implements the Scalable Kmeans++ 
• Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Lookout for these interesting Spark features 
o Application of Statistics toolbox 
o Center & Scale RDD
o Filter RDDs
Clustering - API 
o from pyspark.mllib.clustering import KMeans 
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default=2
o initializationMode = The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
o Maps a point to a cluster
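What train and predict do can be sketched in a few lines of plain Python. This is not the MLlib implementation: it is ordinary Lloyd-style k-means with hand-picked starting centers instead of the k-means|| initialization, and the function names (`closest`, `kmeans`, `wsse`) and the four sample points are hypothetical.

```python
def closest(point, centers):
    """Index of the nearest center by squared Euclidean distance (the 'predict' step)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    return dists.index(min(dists))

def kmeans(points, centers, max_iterations=100):
    """Alternate between assigning points and moving centers until nothing changes."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        groups = [[] for _ in centers]
        for pt in points:
            groups[closest(pt, centers)].append(pt)
        new_centers = [
            [sum(col) / len(grp) for col in zip(*grp)] if grp else ctr
            for grp, ctr in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def wsse(points, centers):
    """Within-set sum of squared errors: squared distance of each point to its center."""
    total = 0.0
    for pt in points:
        ctr = centers[closest(pt, centers)]
        total += sum((p - c) ** 2 for p, c in zip(pt, ctr))
    return total

# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
labels = [closest(pt, centers) for pt in points]
error = wsse(points, centers)
```

WSSE always falls as k grows, so it is a guide for comparing runs rather than a number to minimize blindly.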
Data 
Read Data & Create RDD
Train & Predict
Calculate error
But Data is not even
So let us center & scale the data and try again
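Centering and scaling means subtracting each column's mean and dividing by its standard deviation, so that a column measured in thousands does not dominate the distance calculation. A minimal plain-Python sketch follows; it uses the population standard deviation (an assumption, the sample version works equally well here), and the function name and sample rows are hypothetical.

```python
from statistics import mean, pstdev

def center_and_scale(rows):
    """Return rows with every column shifted to mean 0 and scaled to unit std dev."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Made-up rows: the first column is in the thousands, the second in single digits.
rows = [(1000.0, 1.0), (2000.0, 2.0), (3000.0, 3.0)]
scaled = center_and_scale(rows)
```

After scaling, both columns carry the same weight, which is why the clusters change when k-means is run again.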
Let us try with 5 clusters 
Looks Good
Let us map the cluster to our data
And interpret them
We have multiple cluster types:
• 1 : Very Active – Give them the most attention
• 3 : Very Active on-line, few flights – Give them on-line coupons
• 4 : Relatively new customers, not that active – Give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates
Note :
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 is more suited to create targeted promotions
Epilogue 
o KMeans in Spark has enough controls 
o It does a decent job 
o We were able to control the clusters based on our experience (2 clusters are too few, 10 are too many, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (Homework : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit the business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.
Break 
3:15
Predicting Survivors 
with Classification 
3:45
Titanic Passenger Metadata
• Small
• 3 Predictors
• Class
• Sex
• Age
• Survived?
Classification - Scenario 
o This is a knowledge exercise 
o Classify survival from the titanic data 
o Gives us a quick dataset to run & test classification
Classifying Classifiers 
Statistical
Structural
Regression
Naïve Bayes
Bayesian Networks
Rule-based
Distance-based
Neural Networks
Production Rules
Decision Trees
Multi-layer Perceptron
Functional
Ensemble
Nearest Neighbor
Linear
Spectral
Wavelet
kNN
Random Forests
Learning Vector Quantization
Logistic Regression (1)
Boosting
SVM
(1) Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
Regression
Classifiers
Continuous Variables
Categorical Variables
Decision Trees
k-NN (Nearest Neighbors)
Bias
Variance
Model Complexity
Over-fitting
Bagging
Boosting
CART
Classification - Spark API 
o Logistic Regression 
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
• Larger = Higher Accuracy
• Smaller = less communication (as # of bins = number of instances)
o data adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o intelligent framework - need this for scale
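The two impurity choices measure how mixed the labels in a node are. The sketch below computes both in plain Python using the textbook definitions (Gini = 1 - Σp², entropy = -Σp·log₂p); it says nothing about how MLlib computes them internally, and the function names and label lists are hypothetical.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fraction of the node's rows that belong to each class."""
    counts = Counter(labels)
    n = float(len(labels))
    return [c / n for c in counts.values()]

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class split."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for an even two-class split."""
    return -sum(p * math.log2(p) for p in class_fractions(labels))

mixed = [0, 0, 1, 1]   # half survived, half did not
pure = [1, 1, 1, 1]    # everyone in this node survived

gini_mixed, entropy_mixed = gini(mixed), entropy(mixed)
gini_pure, entropy_pure = gini(pure), entropy(pure)
```

A tree picks the split that reduces the chosen impurity the most; both measures agree on what a pure node is and usually lead to similar trees.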
Lookout for these interesting Spark features 
o Concept of Labeled Point & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
Read data & extract features
Create the model
Extract labels & features
Calculate Accuracy & MSE
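Both metrics are simple aggregates over (label, prediction) pairs. The plain-Python sketch below shows the arithmetic on a made-up list of pairs; with RDDs the same expressions would run over a distributed collection of pairs, but the function names and data here are hypothetical.

```python
def accuracy(pairs):
    """Share of pairs whose prediction equals the label."""
    pairs = list(pairs)
    correct = sum(1 for label, predicted in pairs if label == predicted)
    return correct / float(len(pairs))

def mean_squared_error(pairs):
    """Average of the squared difference between label and prediction."""
    pairs = list(pairs)
    return sum((label - predicted) ** 2 for label, predicted in pairs) / float(len(pairs))

# Made-up (label, prediction) pairs: 1.0 = survived, 0.0 = did not survive.
labels_and_predictions = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

acc = accuracy(labels_and_predictions)
mse = mean_squared_error(labels_and_predictions)
```

For 0/1 labels every wrong prediction contributes exactly 1 to the squared error, so MSE is simply 1 minus accuracy.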
Use NaiveBayes Algorithm
Decision Tree – Best Practices 
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
maxDepth : Tune with Data/Model Selection
maxBins : Set low, monitor communications, increase if needed
# RDD partitions : Set to # of cores
• The usual recommendation is to over-partition RDDs, i.e. “more partitions than cores”: tasks take different times, we need to utilize the compute power, and in the end the differences average out
• But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
• Joe Bradley's talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
Future … 
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy 
o Leave it as Homework 
o In 1.2 … 
o Random Forest 
• Bagging 
• PR for Random Forest 
o Boosting 
o Alpine lab sequoia Forest: coordinating merge 
o Model Selection Pipeline ; Design Doc
Boosting 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
◦ “Output of weak classifiers into a powerful committee”
◦ Final Prediction = weighted majority vote
◦ Later classifiers get misclassified points
– With higher weight,
– So they are forced
– To concentrate on them
◦ AdaBoost (AdaptiveBoosting)
◦ Boosting vs Bagging
– Bagging – independent trees - Spark shines here
– Boosting – successively weighted
Random Forests+ 
— Goal 
◦ Model Complexity (-) 
◦ Variance (-) 
◦ Prediction Accuracy (+) 
◦ Builds large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to over fitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77 ! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed 
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Ensemble Methods
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Two Step
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms,
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble method
Ref: Machine Learning In Action
Random Forests 
o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two tuning parameters – no. of predictors (typically √k) & no. of trees (500 for a large dataset, 150 for a smaller one)
o Error prediction
• For each iteration, predict for the data that is not in the sample (OOB data)
• Aggregate OOB predictions
• Calculate Prediction Error for the aggregate, which is basically the OOB estimate of error rate
• Can use this to search for optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
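The out-of-bag (OOB) idea rests on how a bootstrap sample is drawn: rows are sampled with replacement, so each tree never sees some rows, and those rows can be used to test it. A minimal plain-Python sketch of one such draw follows; the function name, the row count of 1000, and the seed are arbitrary choices for illustration.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement; the rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(42)                 # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_fraction = len(out_of_bag) / 1000.0
```

Each row is missed by a given draw with probability (1 - 1/n)^n, which approaches 1/e, so roughly 37% of the rows are out-of-bag for each tree. Aggregating each row's predictions over the trees that did not train on it gives the OOB error estimate.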
Linear Regression 
4:15
Linear Regression - API 
LabeledPoint : The features and labels of a data point
LinearModel : weights, intercept
LinearRegressionModelBase : predict()
LinearRegressionModel
LinearRegressionWithSGD : train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel : Least-squares fit with an l_1 penalty term.
LassoWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel : Least-squares fit with an l_2 penalty term.
RidgeRegressionWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
Basic Linear Regression
Use LR model for prediction & calculate MSE
Step size is important, the model can diverge !
Interesting step size
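Why the step size matters can be seen on a one-parameter problem. The sketch below is plain batch gradient descent for y = w·x in ordinary Python; it is not the SGD implementation MLlib uses, and the function name, data, and step values are made up for illustration.

```python
def fit_slope(xs, ys, step, iterations=100):
    """Batch gradient descent for y ~ w * x (no intercept), minimising mean squared error."""
    w = 0.0
    n = float(len(xs))
    for _ in range(iterations):
        # Gradient of mean((w*x - y)^2) with respect to w.
        gradient = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * gradient
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # exactly y = 2x, so the best slope is 2

w_small = fit_slope(xs, ys, step=0.01)   # settles close to 2
w_large = fit_slope(xs, ys, step=1.0)    # overshoots further on every iteration
```

For this data the error in w is multiplied by (1 - 15·step) on each iteration, so any step above 2/15 makes the estimate grow without bound instead of settling. Real datasets have a different threshold, which is why the step has to be tuned rather than left at a default.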
Recommendation 
Engine 
4:25
Recommendation & Personalization - Spark
Automated Analytics - Let Data tell story
Feature Learning, AI, Deep Learning
Learning Models - fit parameters as it gets more data
Dynamic Models – model selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
o Item Based
o User Based
o Latent Factor based
o User Rating
o Purchased
o Looked/Not purchased
Spark (in 1.1.0) implements the user based ALS collaborative filtering
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu ; AT&T Labs., Florham Park, NJ ; Koren, Y. ; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
Spark Collaborative Filtering API 
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1) 
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, 
alpha=0.01) 
o MatrixFactorizationModel.predict(self, user, product) 
o MatrixFactorizationModel.predictAll(self, usersProducts)
Read & Parse
Split & Train
Evaluate
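Evaluating a recommender comes down to matching each predicted rating with the actual rating for the same (user, product) key and measuring the gap. The slides do not show the metric used in the notebook, so the choice of RMSE here is an assumption. The sketch is plain Python with dictionaries standing in for keyed RDDs, and the ratings are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over the (user, product) keys present in both dicts."""
    shared = [key for key in actual if key in predicted]    # inner join on the key
    squared = [(actual[key] - predicted[key]) ** 2 for key in shared]
    return math.sqrt(sum(squared) / len(squared))

# Made-up ratings keyed by (user, product).
actual = {(1, 10): 4.0, (1, 20): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 2.0, (2, 10): 4.0, (2, 20): 1.0}

error = rmse(actual, predicted)
```

The prediction for (2, 20) has no matching actual rating, so the join drops it; only pairs present on both sides contribute to the error.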
Epilogue 
o We explored interesting APIs in Spark 
o ALS-Collab Filtering 
o RDD Operations 
• Join (HashJoin) 
• In memory, Grace, Recursive hash join 
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
Questions ? 
4:45
Reference 
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
Essential Reading List 
o A few useful things to know about machine learning - by Pedro Domingos 
• http://dl.acm.org/citation.cfm?id=2347755 
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert 
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmo
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
References: 
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas 
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning 
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel 
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn 
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
The Beginning As The End
How did we do ? 
4:45
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Weitere ähnliche Inhalte

Was ist angesagt?

Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search SystemTrey Grainger
 
Distributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonDistributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonClare Corthell
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchTrey Grainger
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AITrey Grainger
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceTrey Grainger
 
The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Natural Language Search with Knowledge Graphs (Activate 2019)
Natural Language Search with Knowledge Graphs (Activate 2019)Trey Grainger
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
 
Balancing the Dimensions of User Intent
Balancing the Dimensions of User IntentBalancing the Dimensions of User Intent
Balancing the Dimensions of User IntentTrey Grainger
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRoku
 
Smart Data Webinar: Advances in Natural Language Processing
Smart Data Webinar: Advances in Natural Language ProcessingSmart Data Webinar: Advances in Natural Language Processing
Smart Data Webinar: Advances in Natural Language ProcessingDATAVERSITY
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Linked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsLinked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsVito Ostuni
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionKrist Wongsuphasawat
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
 

Was ist angesagt? (20)

Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4) Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4)
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

  • 1. “I want to die on Mars but not on impact” - Elon Musk, interview with Chris Anderson "There are no facts, only interpretations." - Friedrich Nietzsche “The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” - Jerome Seymour Bruner The Hitchhiker's Guide to Machine Learning with Python & Apache Spark @ksankar // doubleclix.wordpress.com http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI October 29, 2014
  • 2. Agenda o Spark & Data Science DevOps • Spark, Python & Machine Learning • Goals/non-goals • Intro to Spark • Stack, Mechanisms – RDD • Datasets : SOTU, Titanic, Frequent Flier • Statistical Toolbox • Summary, Correlations o “Mood Of the Union” • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama • Map reduce, parse text o Clustering • K-means for Gallactic Hoppers! o Break [3:15-3:45) o Predicting Survivors with Classification • Decision Trees • NaiveBayes (Titanic data set) o Linear Regression o Recommendation Engine • Collab Filtering w/movie lens o Discussions/Slack Oct 29 2-3:15 (75 min), 3:45-5:00 (75 min) = 150 min [20] 2:00 – 2:20 [30] 2:20 – 2:50 [25] 2:50 – 3:15 [30] 3:45 – 4:15 [10] 4:15 – 4:25 [20] 4:25 – 4:45 [15] 4:45 – 5:00
  • 3. Goals & non-goals Goals ¤ Understand how to program Machine Learning with Spark & Python ¤ Focus on programming & ML application ¤ Give you a focused time to work through examples § Work with me. I will wait if you want to catch up ¤ Less theory, more usage - let us see if this works ¤ As straightforward as possible § The programs can be optimized Non-goals ¡ Go deep into the algorithms • We don’t have sufficient time. The topic can easily be a 5 day tutorial ! ¡ Dive into Spark internals • That is for another day ¡ The underlying computation, communication, constraints & distribution is a fascinating subject • Paco does a good job explaining them ¡ A passive talk • Nope. Interactive & hands-on
  • 4. About Me o Chief Data Scientist at BlackArrow.tv o Have been speaking at OSCON, PyCon, PyData et al The Nuthead band ! o Reviewing Packt Book “Machine Learning with Spark” o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark” o Have done lots of things: • Big Data (Retail, Bioinformatics, Financial, AdTech), • Written Books (Web 2.0, Wireless, Java,…) • Standards, some work in AI, • Guest Lecturer at Naval PG School,… • Planning MS-CFinance or Statistics or Computational Math o Volunteer as Robotics Judge at First Lego League World Competitions o @ksankar, doubleclix.wordpress.com
  • 5. Spark & Data Science DevOps 2:00
  • 6. Close Encounters - 1st ◦ This Tutorial - 2nd ◦ Do More Hands-on Walkthrough - 3rd ◦ Listen To Lectures ◦ More competitions …
  • 7. Spark Installation o Install Spark 1.1.0 on your local machine o https://spark.apache.org/downloads.html • Pre-built for Hadoop 2.4 is fine o Download & uncompress o Remember the path & use it wherever you see /usr/local/spark/ o I have downloaded it into /usr/local & have a softlink spark to the latest version
  • 8. Tutorial Materials o Github : https://github.com/xsankar/cloaked-ironman o Clone or download zip o Open terminal o cd ~/cloaked-ironman o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark o Note : o I have a soft link “spark” in my /usr/local that points to the spark version that I use. For example ln -s spark-1.1.0/ spark o Click on ipython dashboard o Just look through the ipython notebooks
  • 9. Data Science - Context Reason Model Deploy o Scalable Model Deployment o Big Data automation purpose built appliances (soft/hard) o Manage SLAs response times Collect Store Transform o Volume o Velocity o Streaming Data o Canonical form o Data catalog o Data Fabric across the organization o Access to multiple sources of data o Think Hybrid – Big Data Apps, Appliances Infrastructure o Metadata o Monitor counters Metrics o Structured vs. Multi-structured o Flexible Selectable § Data Subsets § Attribute sets o Refine model with § Extended Data subsets § Engineered Attribute sets o Validation run across a larger data set Data Management Data Science Visualize Recommend Predict Explore o Dynamic Data Sets o 2 way key-value tagging of datasets o Extended attribute sets o Advanced Analytics o Performance o Scalability o Refresh Latency o In-memory Analytics o Advanced Visualization o Interactive Dashboards o Map Overlay o Infographics ¤ Bytes to Business a.k.a. Build the full stack ¤ Find Relevant Data For Business ¤ Connect the Dots
  • 10. Velocity Volume Variety Data Science - Context Context Connectedness Interface Inference Intelligence “Data of unusual size” that can't be brute forced o Three Amigos o Interface = Cognition o Intelligence = Compute(CPU) & Computational(GPU) o Infer Significance & Causality
  • 11. Day in the life of a (super) Model Intelligence Inference Data Representation Interface Algorithms Attributes Parameters Data (Scoring) Model Selection Reason Learn Models Visualize, Recommend, Explore Model Assessment Dimensionality Reduction Feature Selection
  • 12. Data Science Maturity Model Spark Isolated Analytics Integrated Analytics Aggregated Analytics Automated Analytics Data Small Data Larger Data set Big Data Big Data Factory Model Context Local Domain Cross-domain + External Cross domain + External Model, Reason Deploy • Single set of boxes, usually owned by the Model Builders • Departmental • Deploy - Central Analytics Infrastructure • Models still owned operated by Modelers • Partly Enterprise-wide • Central Analytics Infrastructure • Model Reason – by Model Builders • Deploy, Operate – by ops • Residuals and other metrics monitored by modelers • Enterprise-wide • Distributed Analytics Infrastructure • AI Augmented models • Model Reason – by Model Builders • Deploy, Operate – by ops • Data as a monetized service, extending to eco system partners • Reports • Dashboards • Dashboards + some APIs • Dashboards + Well defined APIs + programming models Type • Descriptive Reactive • + Predictive • + Adaptive • Adaptive Datasets • All in the same box • Fixed data sets, usually in temp data spaces • Flexible Data Attribute sets • Dynamic datasets with well-defined refresh policies Workload • Skunk works • Business relevant apps with approx SLAs • High performance appliance clusters • Appliances and clusters for multiple workloads including real time apps • Infrastructure for emerging technologies Strategy • Informal definitions • Data definitions buried in the analytics models • Some data definitions • Data catalogue, metadata Annotations • Big Data MDM Strategy
  • 13. The Sense & Sensibility of a DataScientist DevOps Factory = Operational Lab = Investigative http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
  • 16. RDD – The workhorse of Spark o Resilient Distributed Datasets • Collection that can be operated in parallel o Transformations – create RDDs • Map, Filter,… o Actions – Get values • Collect, Take,… o We will apply these operations during this tutorial
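The split between transformations (which only describe a new dataset) and actions (which actually return values) can be sketched without Spark. The `TinyRDD` class below is a hypothetical, single-machine stand-in written for illustration; it is not the PySpark API, and it ignores partitioning and resilience entirely. It only mimics the lazy-until-action behaviour of `map`, `filter`, `collect` and `take`.

```python
from itertools import islice


class TinyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source):
        # source: zero-argument callable that returns a fresh iterator
        self._source = source

    @classmethod
    def parallelize(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    # Transformations: build a new TinyRDD, compute nothing yet
    def map(self, fn):
        return TinyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return TinyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the pipeline and hand back plain Python values
    def collect(self):
        return list(self._source())

    def take(self, n):
        return list(islice(self._source(), n))


calls = []  # records every element that square() actually touches


def square(x):
    calls.append(x)
    return x * x


rdd = TinyRDD.parallelize(range(1, 11))
evens_squared = rdd.map(square).filter(lambda v: v % 2 == 0)
lazy_calls = len(calls)              # still 0: no action has run yet
first_two = evens_squared.take(2)    # pulls only as many elements as needed
everything = evens_squared.collect()
```

In real PySpark the same chain would be written against an RDD created by the SparkContext; the point here is only that nothing runs until `take` or `collect` is called.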
  • 17. Algorithm spectrum o Regression o Logit o CART o Ensemble : Random Forest o Clustering o KNN o Genetic Alg o Simulated Annealing o Collab Filtering o SVM o Kernels o SVD o NNet o Boltzmann Machine o Feature Learning Machine Learning Cute Math Artificial Intelligence
  • 18. Not all MLlib APIs are available in Python (as of 1.1.0) API Spark 1.1.0 Spark 1.2.0 Java/Scala Python Basic Statistics ✔ ✔ Linear Models ✔ ✔ Decision Trees ✔ ✔ Random Forest ✖ ✖ Collab Filtering ✔ ✔ Clustering-KMeans ✔ ✔ Clustering-Hierarchical ✖ ✖ SVD ✔ ✖ PCA ✔ ✖ Standard Scaler, Normalizer ✔ ✖ Model Evaluation-PR/ROC Spark 1.2 MLlib JIRA http://bit.ly/1ywotkm
  • 19. Statistical Toolbox o Sample data : Car mileage data https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py
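The two toolbox operations named in the agenda, column summaries and correlations, amount to the arithmetic sketched below in plain Python. The four car rows are made up for illustration and are not the tutorial's car mileage file. In PySpark the corresponding calls would be `Statistics.colStats` and `Statistics.corr` from `pyspark.mllib.stat`; this sketch only shows what they compute for one pair of columns.

```python
import math
import statistics

# Hypothetical rows: (weight in 1000 lb, miles per gallon)
cars = [(2.0, 30.0), (3.0, 25.0), (4.0, 20.0), (5.0, 15.0)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


weights, mpgs = zip(*cars)

# Column summary for the mpg column
summary = {
    "mean_mpg": statistics.mean(mpgs),
    "min_mpg": min(mpgs),
    "max_mpg": max(mpgs),
}

# Heavier cars get fewer miles per gallon in this toy data,
# so the coefficient comes out strongly negative
r = pearson(weights, mpgs)
```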
  • 20.
  • 21. “Mood Of the Union” with TF-IDF 2:20
  • 22. Scenario – Mood Of the Union o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country. o If so, can we infer the mood of the country by analyzing SOTU ? o If we embark on this line of thought, how would we do it with Spark & python ? o Is it different from Hadoop-MapReduce ? o Is it better ?
  • 23. POA (Plan Of Action) o Collect State of the Union speech by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama o Read the 7 SOTU from the 7 presidents into 7 RDDs o Create word vectors o Transform into word frequency vectors o Remove stock common words o Inspect top n words to see if they reflect the sentiment of the time o Compute set difference and see how new words have cropped up o Compute TF-IDF (homework!)
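The plan above (parse text, count words, drop common words, look at the top n, take a set difference) can be walked through in plain Python before trying it on RDDs. Everything below is an illustrative assumption: the two "speeches" are invented sentences, the stop list is a tiny placeholder, and `Counter` stands in for the map / reduceByKey step the notebooks perform in Spark.

```python
import re
from collections import Counter

# Invented snippets standing in for two speeches (not real SOTU text)
speech_a = "The economy and the war: the war tests the nation."
speech_b = "The economy and jobs, jobs and health care for the nation."

STOP_WORDS = {"the", "and", "for", "a", "of", "to"}  # assumed, tiny stop list


def word_counts(text):
    words = re.findall(r"[a-z']+", text.lower())       # parse: lowercase, keep letter runs
    kept = [w for w in words if w not in STOP_WORDS]    # remove common words
    return Counter(kept)                                # word -> frequency


def top_n(counts, n):
    # sort by count descending, ties broken alphabetically
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


counts_a = word_counts(speech_a)
counts_b = word_counts(speech_b)

top_a = top_n(counts_a, 2)

# Words that appear in speech B but never in speech A
new_in_b = sorted(set(counts_b) - set(counts_a))
```

With seven speeches the same three helpers would be applied to each text, and the set difference taken between any pair of presidents.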
  • 24. Lookout for these interesting Spark features o RDD Map-Reduce o How to parse input o Removing common words o Sort rdd by value
  • 25. iPython notebook at https://github.com/xsankar/cloaked-ironman Read & Create word vector
  • 26. Remove Common Words – 1 of 3 iPython notebook at https://github.com/xsankar/cloaked-ironman
  • 27. Remove Common Words – 2 of 3
  • 28. Remove Common Words – 3 of 3
  • 29. FDR vs. Barack Obama as reflected by SOTU
  • 30. Barack Obama vs. Bill Clinton
  • 31. GWB vs Abe Lincoln as reflected by SOTU
  • 32. Epilogue o Interesting Exercise o Highlights • Map-reduce in a couple of lines ! • But it is not exactly the same as Hadoop Mapreduce (see the excellent blog by Sean Owen [1]) • Set differences using subtractByKey • Ability to sort a map by values (or any arbitrary function, for that matter) o To Explore as homework: • TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf • Haven’t seen it in python for 1.1. [1] http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
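For the TF-IDF homework, a plain-Python sketch may help before hunting for a Python API. It assumes the smoothed inverse document frequency described in the MLlib feature-extraction guide, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) the number of documents containing term t. The three word lists are invented and already have stop words removed.

```python
import math
from collections import Counter

# Invented, pre-tokenized documents (stop words already removed)
docs = {
    "a": ["war", "nation", "war"],
    "b": ["jobs", "nation", "health"],
    "c": ["jobs", "economy"],
}


def tf_idf(docs):
    m = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    # smoothed IDF (assumed to match the MLlib guide's definition)
    idf = {t: math.log((m + 1) / (df[t] + 1)) for t in df}
    # term frequency is the raw count within each document
    return {
        name: {t: tf * idf[t] for t, tf in Counter(words).items()}
        for name, words in docs.items()
    }


scores = tf_idf(docs)
```

A word used often in one speech but rarely elsewhere ("war" in document "a") scores higher than a word shared across speeches ("nation").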
  • 34. Scenario – Clustering with Spark o InterGallactic Airlines runs the GallacticHoppers frequent flyer program and has data about the customers who participate in it. o The airline’s execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy. o So the business wants to customize promotions for its frequent flyer program. o Can they just have one type of promotion? o Should they have different types of incentives? o Who exactly are the customers in their GallacticHoppers program? o Recently they have deployed an infrastructure with Spark o Can Spark help with this business problem?
  • 35. Clustering - Theory o Clustering is unsupervised learning o While computers can dissect a dataset into “similar” clusters, they still need human direction & domain knowledge to interpret & guide o Two types: • Centroid based clustering – k-means clustering • Tree based clustering – hierarchical clustering o Spark implements the Scalable KMeans++ • Paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
  • 36. Look out for these interesting Spark features o Application of the Statistics toolbox o Center & Scale RDD o Filter RDDs
  • 37. Clustering - API o from pyspark.mllib.clustering import KMeans o KMeans.train o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||") o k = number of clusters to create, default=2 o initializationMode = the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||" o KMeansModel.predict o Maps a point to a cluster
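To make the train/predict pair concrete, here is a minimal single-machine sketch of the basic k-means loop (Lloyd's algorithm). It is not Spark's implementation: the initialization simply takes the first k points instead of "random" or "k-means||", and the function names and sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def predict(centers, point):
    """Index of the closest center (the role KMeansModel.predict plays)."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))

def train(points, k, max_iterations=100):
    """Alternate between assigning points and recomputing centers until stable."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points (assumption)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[predict(centers, p)].append(p)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                # New center = per-dimension mean of the cluster members.
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = train(points, k=2)
print(centers, predict(centers, (0.0, 0.0)), predict(centers, (10.0, 10.0)))
```

The result depends on the starting centers, which is why the real API exposes runs and initializationMode.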
  • 38. Data
  • 39. Read Data & Create RDD
  • 42. But Data is not even
  • 43. So let us center & scale the data and try again
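Centering and scaling means subtracting each column's mean and dividing by its standard deviation, so that a large-valued feature such as miles does not dominate the distance calculation. A minimal sketch, assuming the population standard deviation and hypothetical (miles, flights) rows:

```python
import math

def center_and_scale(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    # Population standard deviation (an assumption); a constant column is left unscaled.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
            for c, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical (miles, flights) rows on very different scales.
rows = [(28000.0, 1.0), (19000.0, 3.0), (41000.0, 2.0), (12000.0, 6.0)]
scaled = center_and_scale(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, both columns contribute comparably to the Euclidean distance that k-means minimizes.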
  • 44. Let us try with 5 clusters – looks good
  • 45. Let us map the cluster to our data
  • 46. And interpret them o We have multiple cluster types: • 1: Very active – give them the most attention • 3: Very active on-line, few flights – give them on-line coupons • 4: Relatively new customers, not that active – give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates o Note: • This is just a sample interpretation. • In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable. • Maybe 3 is more suited to create targeted promotions
  • 47. Epilogue o KMeans in Spark has enough controls o It does a decent job o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right) o We can see that the Scalable KMeans has control over runs, parallelism et al. (Homework: explore the scalability) o We were able to interpret the results with domain knowledge and arrive at a scheme to address the business opportunity o Naturally we would tweak the clusters to fit the business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE (Within Set Sum of Squared Errors) is the minimum.
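The WSSE remark can be checked with a few lines of plain Python: adding centers can only lower the error, down to zero when every point is its own center, so the minimum WSSE alone cannot pick k. The points and candidate centers below are hypothetical.

```python
def wsse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]
four_centers = list(points)  # degenerate case: every point is its own cluster

for centers in (one_center, two_centers, four_centers):
    print(len(centers), wsse(points, centers))
```

The useful signal is where the drop flattens out, combined with whether the resulting clusters can be interpreted and acted upon.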
  • 49. Predicting Survivors with Classification 3:45
  • 50. Classification - Scenario o This is a knowledge exercise o Classify survival from the Titanic data o Gives us a quick dataset to run & test classification o Titanic Passenger Metadata • Small • 3 Predictors • Class • Sex • Age • Survived?
  • 51. Classifying Classifiers (taxonomy diagram; node labels as extracted): Statistical, Structural, Regression, Naïve Bayes, Bayesian Networks, Rule-based, Distance-based, Neural Networks, Production Rules, Decision Trees, Multi-layer Perceptron, Functional, Ensemble, Nearest Neighbor, Linear, Spectral Wavelet, kNN, Random Forests, Learning Vector Quantization, Logistic Regression [1], Boosting, SVM o [1] Max Entropy Classifier o Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
  • 52. (Diagram labels) Regression, Classifiers, Continuous Variables, Categorical Variables, Decision Trees, k-NN (Nearest Neighbors), Bias, Variance, Model Complexity, Over-fitting, Bagging, Boosting, CART
  • 53. Classification - Spark API o Logistic Regression o SVMWithSGD o DecisionTree o Data as LabeledPoint (we will see in a moment) o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100) o impurity – "entropy" or "gini" o maxBins = control to throttle communication at the expense of accuracy • Larger = higher accuracy • Smaller = less communication (as # of bins = number of instances) o Data adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning o Intelligent framework - need this for scale
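The impurity parameter selects the measure a tree uses to score candidate splits. A minimal sketch of the two standard definitions follows; the helper names are hypothetical, and this is the textbook formula, not Spark's internal code.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(left, right, impurity):
    """Size-weighted impurity of the two sides of a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

mixed = [0, 0, 1, 1]
print(gini(mixed), entropy(mixed))
print(split_impurity([0, 0], [1, 1], gini))     # perfect split
print(split_impurity([0, 1], [0, 1], entropy))  # useless split
```

A tree picks the split with the lowest weighted impurity; both measures are zero for a pure node and largest for an even class mix.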
  • 54. Look out for these interesting Spark features o Concept of LabeledPoint & how to create an RDD of LPs o Print the tree o Calculate Accuracy & MSE from RDDs
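A labeled point is just a (label, features) pair, and accuracy and MSE are both averages over (label, prediction) pairs. The sketch below uses a namedtuple as a stand-in for pyspark.mllib.regression.LabeledPoint so it runs without Spark. The CSV layout and the toy prediction rule are assumptions for illustration, not the notebook's code or a trained model.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint so the sketch runs without Spark.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

def parse_passenger(line):
    """Hypothetical CSV layout: survived,pclass,sex,age (sex already encoded as 0/1)."""
    survived, pclass, sex, age = (float(v) for v in line.split(","))
    return LabeledPoint(survived, [pclass, sex, age])

def accuracy(labels_and_preds):
    """Fraction of pairs where the prediction equals the label."""
    pairs = list(labels_and_preds)
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def mse(labels_and_preds):
    """Mean squared difference between label and prediction."""
    pairs = list(labels_and_preds)
    return sum((label - pred) ** 2 for label, pred in pairs) / len(pairs)

lines = ["1,1,1,29", "0,3,0,22", "0,3,0,35", "1,2,1,14", "0,3,1,40"]
points = [parse_passenger(line) for line in lines]

# Toy rule standing in for a trained model: predict 1.0 when the sex feature is 1.
labels_and_preds = [(p.label, 1.0 if p.features[1] == 1.0 else 0.0) for p in points]
print(accuracy(labels_and_preds), mse(labels_and_preds))
```

For 0/1 labels and 0/1 predictions the MSE is simply 1 minus the accuracy, so the two numbers carry the same information in this setting.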
  • 55. Read data & extract features
  • 57. Extract labels & features
  • 60. Decision Tree – Best Practices o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100) o maxDepth: tune with data/model selection o maxBins: set low, monitor communications, increase if needed o # RDD partitions: set to # of cores • Usually the recommendation is that RDDs should be over-partitioned, i.e. “more partitions than cores”: tasks take different times, we need to utilize the compute power, and in the end they average out • But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help • The Joe Bradley talk (reference below) has interesting insights: https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
  • 61. Future … o Actually we should split the data into training & test sets o Then use different feature sets to see if we can increase the accuracy o Leave it as homework o In 1.2 … o Random Forest • Bagging • PR for Random Forest o Boosting o Alpine lab sequoia Forest: coordinating merge o Model Selection Pipeline; Design Doc
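The first homework item, splitting into training and test sets, can be sketched in plain Python. The 70/30 fraction, the seed and the function name are assumptions; on an RDD a sampling-based split would play the same role.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for parsed passenger records
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

The model is then trained on train_rows only, and accuracy is reported on test_rows, which the model has never seen.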
  • 62. Boosting o Goal • Model Complexity (-) • Variance (-) • Prediction Accuracy (+) o “Output of weak classifiers into a powerful committee” o Final prediction = weighted majority vote o Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them o AdaBoost (Adaptive Boosting) o Boosting vs Bagging • Bagging – independent trees - Spark shines here • Boosting – successively weighted
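The "higher weight for misclassified points" step can be made concrete with one round of AdaBoost in one common formulation (labels in {-1, +1}, classifier weight alpha = 0.5 * ln((1 - err) / err)). The labels and predictions below are hypothetical, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One reweighting round; returns (alpha, new normalized weights).
    Assumes labels/predictions are in {-1, +1} and 0 < weighted error < 1."""
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)  # vote weight of this weak classifier
    # Correct points (y * h = +1) shrink, misclassified points (y * h = -1) grow.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels      = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]   # the weak classifier gets the last point wrong
weights     = [0.2] * 5            # start with uniform weights

alpha, new_weights = adaboost_round(weights, labels, predictions)
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After the update the single misclassified point carries half of the total weight, so the next weak classifier is pushed to get it right.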
  • 63. Random Forests [+] o Goal • Model Complexity (-) • Variance (-) • Prediction Accuracy (+) o Builds a large collection of de-correlated trees & averages them o Improves Bagging by selecting i.i.d.* random variables for splitting o Simpler to train & tune o “Do remarkably well, with very little tuning required” – ESLII o Less susceptible to over-fitting (than boosting) o Many RF implementations • Original version - Fortran-77! By Breiman/Cutler • Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab o * i.i.d. – independent identically distributed o [+] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  • 64. Ensemble Methods o Goal • Model Complexity (-) • Variance (-) • Prediction Accuracy (+) o Two steps • Develop a set of learners • Combine the results to develop a composite predictor o Ensemble methods can take the form of: • Using different algorithms • Using the same algorithm with different settings • Assigning different parts of the dataset to different classifiers o Bagging & Random Forests are examples of ensemble methods o Ref: Machine Learning In Action
  • 65. Random Forests o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables o Simpler because it requires only two parameters – no. of predictors (typically √k) & no. of trees (500 for a large dataset, 150 for a smaller one) o Error prediction • For each iteration, predict for the data that is not in the sample (OOB, i.e. out-of-bag, data) • Aggregate the OOB predictions • Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate • Can use this to search for the optimal # of predictors • We will see how close this is to the actual error in the Heritage Health Prize o Assumes equal cost for mis-prediction. Can add a cost function o Proximity matrix applications like adding missing data, dropping outliers o Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, Berk; A Brief Overview of RF by Dan Steinberg
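The OOB idea rests on a property of bootstrap sampling: drawing n rows with replacement leaves roughly a third of the rows out of each tree's sample, and those rows act as a free test set for that tree. A small sketch of the sampling step only (no trees are trained; the seed and names are arbitrary):

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement: the in-bag sample for one tree."""
    return [rng.randrange(n) for _ in range(n)]

def oob_indices(n, in_bag):
    """Rows never drawn for this tree: its out-of-bag (OOB) set."""
    chosen = set(in_bag)
    return [i for i in range(n) if i not in chosen]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
in_bag = bootstrap_indices(n, rng)
oob = oob_indices(n, in_bag)
oob_fraction = len(oob) / n
print(oob_fraction)
```

The expected OOB fraction approaches 1/e (about 0.37) as n grows. Each row is then scored only by the trees that did not see it, and those predictions are aggregated into the OOB error estimate.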
  • 67. Linear Regression - API o LabeledPoint: the features and labels of a data point o LinearModel: weights, intercept o LinearRegressionModelBase: predict() o LinearRegressionModel o LinearRegressionWithSGD: train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False) o LassoModel: least-squares fit with an l_1 penalty term o LassoWithSGD: train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None) o RidgeRegressionModel: least-squares fit with an l_2 penalty term o RidgeRegressionWithSGD: train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
  • 69. Use LR model for prediction & calculate MSE
  • 70. Step size is important, the model can diverge !
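Why the step size matters can be seen on a one-parameter least-squares problem. This sketch uses plain full-batch gradient descent with a fixed step, which is a simplification and not MLlib's SGD; the data (y = 2x) and the two step values are hypothetical.

```python
def fit_slope(xs, ys, step, iterations=100):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # exact relationship: y = 2x

print(fit_slope(xs, ys, step=0.01))  # small step: settles near 2.0
print(fit_slope(xs, ys, step=0.2))   # large step: overshoots further on every update
```

Here each update multiplies the error (w - 2) by (1 - 15 * step), so any step above 2/15 makes the error grow instead of shrink. With unscaled features the safe range can be very small, which is one more reason to center and scale first.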
  • 72.
  • 73.
  • 75. Recommendation & Personalization - Spark o Automated Analytics - Let Data tell story o Feature Learning, AI, Deep Learning o Learning Models - fit parameters as it gets more data o Dynamic Models – model selection based on context o Knowledge Based o Demographic Based o Content Based o Collaborative Filtering o Item Based o User Based o Latent Factor based o User Rating o Purchased o Looked/Not purchased o Spark (in 1.1.0) implements the user based ALS collaborative filtering o Ref: • ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu (AT&T Labs, Florham Park, NJ), Y. Koren, C. Volinsky • ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
  • 76. Spark Collaborative Filtering API o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1) o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01) o MatrixFactorizationModel.predict(self, user, product) o MatrixFactorizationModel.predictAll(self, usersProducts)
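What a matrix factorization model does at prediction time can be sketched in a few lines: each user and each product has a vector of `rank` latent factors, and the predicted rating is their dot product. The factor values below are made up for illustration; a trained ALS model would learn them from the ratings.

```python
# Hypothetical latent factors (rank 2); training would learn these from ratings.
user_factors = {1: [1.0, 0.5], 2: [0.2, 1.5]}
product_factors = {10: [4.0, 1.0], 20: [0.5, 3.0]}

def predict(user, product):
    """Predicted rating = dot product of the user and product factor vectors."""
    return sum(u * p for u, p in zip(user_factors[user], product_factors[product]))

def predict_all(user_product_pairs):
    """Score many (user, product) pairs, returning (user, product, rating) triples."""
    return [(u, p, predict(u, p)) for u, p in user_product_pairs]

for user, product, rating in predict_all([(1, 10), (1, 20), (2, 10), (2, 20)]):
    print(user, product, round(rating, 2))
```

In this toy setup user 1 leans towards product 10 and user 2 towards product 20, purely because of how their factor vectors line up.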
  • 80. Epilogue o We explored interesting APIs in Spark o ALS-Collab Filtering o RDD Operations • Join (HashJoin) • In memory, Grace, Recursive hash join o http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
  • 82. Reference 1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark 2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering 3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is 4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/ 5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup 6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html 7. http://blogs.gartner.com/matthew-davis/
  • 83. Essential Reading List o A few useful things to know about machine learning - by Pedro Domingos • http://dl.acm.org/citation.cfm?id=2347755 o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf o http://www.no-free-lunch.org/ o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y. • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o Avoid these three mistakes, James Faghmo • https://medium.com/about-data/73258b3848a4 o Leakage in Data Mining: Formulation, Detection, and Avoidance • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 84. For your reading & viewing pleasure … An ordered List ① An Introduction to Statistical Learning • http://www-bcf.usc.edu/~gareth/ISL/ ② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning • http://online.stanford.edu/course/statistical-learning-winter-2014 ③ Prof. Pedro Domingos • https://class.coursera.org/machlearning-001/lecture/preview ④ Prof. Andrew Ng • https://class.coursera.org/ml-003/lecture/preview ⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 ⑥ Mathematicalmonk @ YouTube • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA ⑦ The Elements Of Statistical Learning • http://statweb.stanford.edu/~tibs/ElemStatLearn/ o http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 85. References: o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn o Just The Basics, Strata 2013, William Cukierski & Ben Hamner • http://strataconf.com/strata2013/public/schedule/detail/27291 o The Problem of Multiple Testing • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
  • 86. The Beginning As The End How did we do ? 4:45