The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
"There are no facts, only interpretations." - Friedrich Nietzsche
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
@ksankar // doubleclix.wordpress.com
http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI
October 29, 2014

Agenda
o Spark, Data Science & DevOps
• Spark, Python & Machine Learning
• Goals/non-goals
• Intro to Spark
• Stack, Mechanisms – RDD
• Datasets : SOTU, Titanic, Frequent Flier
• Statistical Toolbox
• Summary, Correlations
o “Mood Of the Union”
• State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
• Map reduce, parse text
o Clustering
• K-means for Gallactic Hoppers!
o Break [3:15-3:45)
o Predicting Survivors with Classification
• Decision Trees
• NaiveBayes (Titanic data set)
o Linear Regression
o Recommendation Engine
• Collab Filtering w/ MovieLens
o Discussions/Slack
Oct 29: 2:00-3:15 (75 min), 3:45-5:00 (75 min) = 150 min
[20] 2:00 – 2:20
[30] 2:20 – 2:50
[25] 2:50 – 3:15
[30] 3:45 – 4:15
[10] 4:15 – 4:25
[20] 4:25 – 4:45
[15] 4:45 – 5:00

Goals & non-goals
Goals
o Understand how to program Machine Learning with Spark & Python
o Focus on programming ML application
o Give you a focused time to work thru examples
• Work with me. I will wait if you want to catch-up
o Less theory, more usage - let us see if this works
o As straightforward as possible
• The programs can be optimized
Non-goals
o Go deep into the algorithms
• We don’t have sufficient time. The topic can easily be a 5 day tutorial !
o Dive into spark internals
• That is for another day
o The underlying computation, communication, constraints & distribution is a fascinating subject
• Paco does a good job explaining them
o A passive talk
• Nope. Interactive & hands-on

About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, Pydata et al
[Photo: The Nuthead band !]
o Reviewing Packt Book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech),
• Written Books (Web 2.0, Wireless, Java,…)
• Standards, some work in AI,
• Guest Lecturer at Naval PG School,…
• Planning MS-CFinance or Statistics or Computational Math
o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
o @ksankar, doubleclix.wordpress.com

Spark, Data Science & DevOps (2:00)

Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthrough
— 3rd
◦ Listen To Lectures
◦ More competitions …

Spark Installation
o Install Spark 1.1.0 on your local machine
o https://spark.apache.org/downloads.html
• Pre-built for Hadoop 2.4 is fine
o Download & uncompress
o Remember the path & use it wherever you see /usr/local/spark/
o I have downloaded it into /usr/local & have a softlink spark to the latest version

Tutorial Materials
o Github : https://github.com/xsankar/cloaked-ironman
o Clone or download zip
o Open terminal
o cd ~/cloaked-ironman
o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark
o Note :
o I have a soft link “spark” in my /usr/local that points to the spark version that I use. For example ln -s spark-1.1.0/ spark
o Click on ipython dashboard
o Just look thru the ipython notebooks

Data Science - Context
Reason, Model, Deploy
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
Collect, Store, Transform
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
• Data Subsets
• Attribute sets
o Refine model with
• Extended Data subsets
• Engineered Attribute sets
o Validation run across a larger data set
Data Management | Data Science
Visualize, Recommend, Predict, Explore
o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
o Bytes to Business a.k.a. Build the full stack
o Find Relevant Data For Business
o Connect the Dots

Data Science - Context
Velocity, Volume, Variety
Context, Connectedness
Interface, Inference, Intelligence
“Data of unusual size” that can't be brute forced
o Three Amigos
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer Significance & Causality

Day in the life of a (super) Model
[Diagram labels:]
Intelligence, Inference, Data Representation, Interface
Algorithms, Attributes, Parameters
Data (Scoring)
Model Selection
Reason & Learn
Models
Visualize, Recommend, Explore
Model Assessment
Dimensionality Reduction
Feature Selection

Data Science Maturity Model & Spark
Isolated Analytics | Integrated Analytics | Aggregated Analytics | Automated Analytics
Data
• Small Data
• Larger Data set
• Big Data
• Big Data Factory Model
Context
• Local
• Domain
• Cross-domain + External
• Cross domain + External
Model, Reason & Deploy
• Single set of boxes, usually owned by the Model Builders
• Departmental
• Deploy - Central Analytics Infrastructure
• Models still owned & operated by Modelers
• Partly Enterprise-wide
• Central Analytics Infrastructure
• Model & Reason – by Model Builders
• Deploy, Operate – by ops
• Residuals and other metrics monitored by modelers
• Enterprise-wide
• Distributed Analytics Infrastructure
• AI Augmented models
• Model & Reason – by Model Builders
• Deploy, Operate – by ops
• Data as a monetized service, extending to eco system partners
• Reports
• Dashboards
• Dashboards + some APIs
• Dashboards + Well defined APIs + programming models
Type
• Descriptive & Reactive
• + Predictive
• + Adaptive
• Adaptive
Datasets
• All in the same box
• Fixed data sets, usually in temp data spaces
• Flexible Data & Attribute sets
• Dynamic datasets with well-defined refresh policies
Workload
• Skunk works
• Business relevant apps with approx SLAs
• High performance appliance clusters
• Appliances and clusters for multiple workloads including real time apps
• Infrastructure for emerging technologies
Strategy
• Informal definitions
• Data definitions buried in the analytics models
• Some data definitions
• Data catalogue, metadata & Annotations
• Big Data MDM Strategy

The Sense & Sensibility of a Data Scientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

RDD – The workhorse of Spark
o Resilient Distributed Datasets
• Collection that can be operated on in parallel
o Transformations – create RDDs
• Map, Filter,…
o Actions – Get values
• Collect, Take,…
o We will apply these operations during this tutorial
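
The split between transformations and actions can be sketched in plain Python before touching Spark. This is only an analogy, not the Spark API: generators stand in for lazy transformations, and `list()` / `islice()` stand in for actions such as collect and take. Unlike an RDD, a generator can be consumed only once.

```python
from itertools import islice

# Plain-Python analogy (NOT the Spark API): nothing is computed while the
# pipeline is being described, only when a result is requested.
numbers = range(1, 11)

# "Transformations": lazy, they only describe the computation.
squares = map(lambda x: x * x, numbers)
even_squares = filter(lambda x: x % 2 == 0, squares)

# "Action" in the spirit of take(3): pulls just enough data through.
first_three = list(islice(even_squares, 3))
print(first_three)  # [4, 16, 36]

# "Action" in the spirit of collect(): materializes whatever is left.
rest = list(even_squares)
print(rest)  # [64, 100]
```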

Algorithm spectrum
o Regression
o Logit
o CART
o Ensemble : Random Forest
o Clustering
o KNN
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning
Machine Learning | Cute Math | Artificial Intelligence

Not all MLlib APIs are available in Python (as of 1.1.0)
API | Spark 1.1.0 Java/Scala | Spark 1.1.0 Python
Basic Statistics | ✔ | ✔
Linear Models | ✔ | ✔
Decision Trees | ✔ | ✔
Random Forest | ✖ | ✖
Collab Filtering | ✔ | ✔
Clustering-KMeans | ✔ | ✔
Clustering-Hierarchical | ✖ | ✖
SVD | ✔ | ✖
PCA | ✔ | ✖
Standard Scaler, Normalizer | ✔ | ✖
Model Evaluation-PR/ROC | |
Spark 1.2.0: see the Spark 1.2 MLlib JIRA, http://bit.ly/1ywotkm

Statistical Toolbox
o Sample data : Car mileage data
https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py

“Mood Of the Union” with TF-IDF (2:20)

Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & Python ?
o Is it different from Hadoop-MapReduce ?
o Is it better ?

POA (Plan Of Action)
o Collect State of the Union speech by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect top n words to see if they reflect the sentiment of the time
o Compute set difference and see how new words have cropped up
o Compute TF-IDF (homework!)
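
The plan above can be tried out in plain Python before moving to RDDs. The sketch below is not the tutorial notebook: the two "speeches" are made-up stand-ins, the stop-word list is an assumed toy list, and `Counter` plays the role that map + reduceByKey would play in Spark.

```python
import re
from collections import Counter

# Hypothetical stand-in texts; the tutorial reads the real speeches from files.
speech_old = "The union must preserve the peace and the treasury"
speech_new = "The economy and jobs: jobs for the middle class"

# Assumed toy stop-word list, for illustration only.
COMMON_WORDS = {"the", "and", "for", "must", "of", "a", "to"}

def word_counts(text):
    """Lower-case, split into words, drop common words, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in COMMON_WORDS)

old_counts = word_counts(speech_old)
new_counts = word_counts(speech_new)

# Top-n words by frequency (sort by value, descending).
top_word = new_counts.most_common(1)
print(top_word)  # [('jobs', 2)]

# Set difference: words in the newer speech that never appear in the older one.
new_words = sorted(set(new_counts) - set(old_counts))
print(new_words)  # ['class', 'economy', 'jobs', 'middle']
```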

Lookout for these interesting Spark features
o RDD Map-Reduce
o How to parse input
o Removing common words
o Sort rdd by value

iPython notebook at https://github.com/xsankar/cloaked-ironman
Read & Create word vector

Remove Common Words – 1 of 3

FDR vs. Barack Obama as reflected by SOTU

GWB vs Abe Lincoln as reflected by SOTU

Epilogue
o Interesting Exercise
o Highlights
• Map-reduce in a couple of lines !
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen [1])
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter)
o To Explore as homework:
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
• Haven’t seen it in python for 1.1.
[1] http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
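
For the TF-IDF homework, a minimal plain-Python sketch may help to fix the idea before looking for it in MLlib. The documents below are invented, term frequency is taken as the raw count, and the smoothed inverse document frequency log((N + 1) / (df + 1)) is the form the MLlib feature-extraction page describes, as far as I recall; treat both choices as assumptions.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (stand-ins for the speeches).
docs = {
    "a": "war union war".split(),
    "b": "union jobs".split(),
}

def tf_idf(documents):
    """Return {doc: {term: tf * idf}} with tf = raw count and
    idf = log((N + 1) / (df + 1)), an assumed smoothing choice."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each term once per document
    scores = {}
    for name, words in documents.items():
        term_freq = Counter(words)
        scores[name] = {
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        }
    return scores

scores = tf_idf(docs)
# "union" occurs in every document, so its weight drops to zero, while
# "war" (frequent in one document only) gets the highest weight.
print(scores["a"])
```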

Scenario – Clustering with Spark
o InterGallactic Airlines has the GallacticHoppers frequent flyer program & has data about the customers who participate in the program.
o The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for its frequent flier program.
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in their GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help in this business problem ?

Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, they still need human direction & domain knowledge to interpret & guide
o Two types:
• Centroid based clustering – k-means clustering
• Tree based Clustering – hierarchical clustering
o Spark implements the Scalable Kmeans++
• Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

o Application of Statistics toolbox
o Center & Scale RDD
o Filter RDDs

Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default=2
o initializationMode = The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||
o KMeansModel.predict
o Maps a point to a cluster
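
To make train / predict less of a black box, here is a small plain-Python sketch of the classic k-means loop. It is not MLlib's implementation: it runs on one machine and simply takes the first k points as initial centers (an assumption made to keep the example deterministic) instead of k-means||. `wsse` is the within-set sum of squared errors used to compare different values of k.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centers):
    """Index of the nearest center: the 'predict' step."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, max_iterations=100):
    # Assumed simplification: first k points as initial centers (not k-means||).
    centers = [list(p) for p in points[:k]]
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

def wsse(points, centers):
    """Within-set sum of squared errors."""
    return sum(sq_dist(p, centers[closest(p, centers)]) for p in points)

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0)]
centers = kmeans(points, k=2)
print(centers)                        # [[1.0, 1.5], [9.0, 8.5]]
print(wsse(points, centers))          # 1.0
print(closest((8.0, 8.0), centers))   # 1
```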

Data

So let us center & scale the data and try again
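
Centering and scaling matters because k-means works on distances, so a column measured in thousands (say, miles flown) would otherwise swamp a column measured in single digits. A minimal plain-Python sketch follows; the data are invented and the population standard deviation (divide by n) is an assumed choice.

```python
import math

def center_and_scale(rows):
    """Subtract each column's mean and divide by its standard deviation.
    Uses the population standard deviation (divide by n), an assumed choice."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical rows: (miles flown, flights taken) on very different scales.
rows = [(1000.0, 1.0), (3000.0, 3.0)]
scaled = center_and_scale(rows)
print(scaled)  # [[-1.0, -1.0], [1.0, 1.0]]
```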

Let us try with 5 clusters
Looks Good

Let us map the cluster to our data

And interpret them
We have multiple cluster types:
• 1 : Very Active – Give them the most attention
• 3 : Very Active on-line, few flights – Give them on-line coupons
• 4 : Relatively new customers, not that active – Give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates
Note :
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 is more suited to create targeted promotions

Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (Homework : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a
scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit the business viability. 20 clusters
with corresponding promotion schemes are unwieldy, even if the WSSE is the
minimum.

Predicting Survivors with Classification (3:45)

Titanic Passenger Metadata
• Small
• 3 Predictors
• Class
• Sex
• Age
• Survived?
Classification - Scenario
o This is a knowledge exercise
o Classify survival from the titanic data
o Gives us a quick dataset to run & test classification

Classifying Classifiers
[Taxonomy diagram; labels as extracted:]
Statistical; Structural; Regression; Naïve Bayes; Bayesian Networks; Rule-based; Distance-based; Neural Networks; Production Rules; Decision Trees; Multi-layer Perceptron; Functional; Ensemble; Nearest Neighbor; Linear; Spectral Wavelet; kNN; Random Forests; Learning Vector Quantization; Logistic Regression [1]; Boosting; SVM
[1] Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

[Diagram labels:] Regression: Continuous Variables; Classifiers: Categorical Variables; Decision Trees; k-NN (Nearest Neighbors); Bias; Variance; Model Complexity; Over-fitting; Bagging; Boosting; CART

Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
• Larger = Higher Accuracy
• Smaller = less communication (as # of bins = number of instances)
o data adaptive – i.e. decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o intelligent framework - need this for scale
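
The two impurity options are easy to compute by hand, which helps when reading a printed tree. The sketch below uses the textbook definitions (Gini = 1 - Σ p², entropy = -Σ p log2 p) in plain Python; it is not MLlib's internal code and the label lists are made up.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

mixed = [0, 0, 1, 1]   # hypothetical node with a 50/50 split of survivors
pure = [1, 1, 1, 1]    # hypothetical node where everyone survived

print(gini(mixed))     # 0.5
print(entropy(mixed))  # 1.0
print(gini(pure))      # 0.0
```

A split is good when the weighted impurity of the child nodes is lower than the impurity of the parent; both measures are zero for a pure node.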

o Concept of Labeled Point & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
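
A labeled point is just a label plus a feature vector, and accuracy and MSE come from pairing each label with its prediction. The plain-Python sketch below illustrates this under loud assumptions: the `LabeledPoint` namedtuple is a stand-in for the MLlib class, the four rows are invented (not the real Titanic data), and `predict` is a toy rule rather than a trained tree.

```python
from collections import namedtuple

# Stand-in for the MLlib LabeledPoint: a label plus a list of features.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Hypothetical rows: (survived, pclass, sex, age). Not the real data set.
raw = [
    (1, 1, "female", 29.0),
    (0, 3, "male", 22.0),
    (1, 3, "male", 30.0),
    (0, 1, "male", 45.0),
]

def to_labeled_point(row):
    survived, pclass, sex, age = row
    # Encode sex as 1.0 (female) / 0.0 (male) so every feature is numeric.
    return LabeledPoint(float(survived),
                        [float(pclass), 1.0 if sex == "female" else 0.0, age])

data = [to_labeled_point(r) for r in raw]

def predict(features):
    # Toy rule standing in for a trained model: predict survival for women.
    return features[1]

labels_and_preds = [(lp.label, predict(lp.features)) for lp in data]

accuracy = sum(1 for label, pred in labels_and_preds if label == pred) / len(data)
mse = sum((label - pred) ** 2 for label, pred in labels_and_preds) / len(data)
print(accuracy)  # 0.75
print(mse)       # 0.25
```

With RDDs the same pairs would typically come from zipping the labels with the model's predictions, followed by a filter/count for accuracy and a map/reduce for MSE.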

Read data & extract features

Decision Tree – Best Practices
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
maxDepth | Tune with Data/Model Selection
maxBins | Set low, monitor communications, increase if needed
# RDD partitions | Set to # of cores
• Usually the recommendation is that the RDD partitions should be over partitioned, i.e. “more partitions than cores”: because tasks take different times, we need to utilize the compute power and in the end they average out
• But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over partitioning doesn’t help
• Joe Bradley talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup

Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as Homework
o In 1.2 …
o Random Forest
• Bagging
• PR for Random Forest
o Boosting
o Alpine lab sequoia Forest: coordinating merge
o Model Selection Pipeline ; Design Doc
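
For the homework on splitting the data into training and test sets, a minimal plain-Python sketch is shown below. The 70/30 split and the seed are arbitrary assumptions, and the integer rows are placeholders for labeled points.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))              # placeholder for 10 labeled points
train, test = train_test_split(rows)
print(len(train), len(test))        # 7 3
```

The model is then trained on `train` only, and accuracy is reported on `test`, which the model has not seen.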

Boosting
o Goal
• Model Complexity (-)
• Variance (-)
• Prediction Accuracy (+)
• “Output of weak classifiers into a powerful committee”
• Final Prediction = weighted majority vote
• Later classifiers get misclassified points
– With higher weight,
– So they are forced
– To concentrate on them
• AdaBoost (Adaptive Boosting)
• Boosting vs Bagging
– Bagging – independent trees - Spark shines here
– Boosting – successively weighted
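
The "weighted majority vote" can be sketched in a few lines of plain Python. The classifier weight alpha = 0.5 * ln((1 - err) / err) is one common AdaBoost convention (some texts drop the 0.5), and the votes and weights below are invented numbers for illustration.

```python
import math

def classifier_weight(error_rate):
    """One common AdaBoost convention: lower error means a larger say."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def weighted_vote(predictions, weights):
    """predictions: +1/-1 from each weak classifier; weights: their alphas."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1

# Hypothetical committee: one strong classifier says +1, two weak ones say -1.
votes = [+1, -1, -1]
alphas = [0.9, 0.3, 0.4]
decision = weighted_vote(votes, alphas)
print(decision)  # 1 (a simple unweighted majority would have said -1)
```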

Random Forests+
o Goal
• Variance (-)
• Builds large collection of de-correlated trees & averages them
• Improves Bagging by selecting i.i.d* random variables for splitting
• Simpler to train & tune
• “Do remarkably well, with very little tuning required” – ESLII
• Less susceptible to over fitting (than boosting)
• Many RF implementations
– Original version - Fortran-77 ! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Ensemble Methods
o Goal
• Variance (-)
• Two Step
– Develop a set of learners
– Combine the results to develop a composite predictor
• Ensemble methods can take the form of:
– Using different algorithms,
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
• Bagging & Random Forests are examples of ensemble method
Ref: Machine Learning In Action

Random Forests
o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables
o Simpler because it requires two parameters – no. of Predictors (typically √k) & no. of trees (500 for large dataset, 150 for smaller)
o Error prediction
• For each iteration, predict for dataset that is not in the sample (OOB data)
• Aggregate OOB predictions
• Calculate Prediction Error for the aggregate, which is basically the OOB estimate of error rate
• Can use this to search for optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
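
Two ingredients from the list above, the bootstrap sample with its out-of-bag (OOB) rows and the random subset of about √k predictors, can be sketched in plain Python. This is an illustration under assumptions (row and feature counts are made up, seeds are arbitrary), not a forest implementation.

```python
import math
import random

def bootstrap_with_oob(n_rows, seed):
    """Sample row indices with replacement; rows never drawn are out-of-bag."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob

def random_feature_subset(n_features, seed):
    """Pick about sqrt(k) of the k features, without replacement."""
    rng = random.Random(seed)
    size = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), size))

in_bag, oob = bootstrap_with_oob(n_rows=20, seed=1)
features = random_feature_subset(n_features=9, seed=1)

# Each tree is trained on its in-bag rows and scored on its OOB rows;
# aggregating those scores over all trees gives the OOB error estimate.
print(len(in_bag), len(features))  # 20 3
```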

Linear Regression - API
LabeledPoint | The features and labels of a data point
LinearModel | weights, intercept
LinearRegressionModelBase | predict()
LinearRegressionModel |
LinearRegressionWithSGD | train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel | Least-squares fit with an l_1 penalty term.
LassoWithSGD | train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel | Least-squares fit with an l_2 penalty term.
RidgeRegressionWithSGD | train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)

Use LR model for prediction & calculate MSE

Step size is important, the model can diverge !
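
Why the step size matters can be seen on a one-parameter problem. The sketch below fits y = w·x by plain batch gradient descent on the mean squared error; it is not Spark's SGD (which, among other things, samples mini-batches), and the data and step values are chosen only to show one run that converges and one that blows up.

```python
def fit_slope(xs, ys, step, iterations):
    """Batch gradient descent for y = w * x, minimizing mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(iterations):
        # d/dw of mean((w*x - y)^2) = (2/n) * sum((w*x - y) * x)
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= step * grad
    return w

# Made-up data lying exactly on y = 2x, so the best slope is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_small = fit_slope(xs, ys, step=0.05, iterations=100)  # settles near 2
w_large = fit_slope(xs, ys, step=0.5, iterations=20)    # overshoots further each step

print(w_small)
print(w_large)
```

With the small step each update shrinks the error; with the large step each update overshoots the minimum by more than it corrected, so the estimate moves further away on every iteration.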

Recommendation & Personalization - Spark
Automated Analytics - Let Data tell story
Feature Learning, AI, Deep Learning
Learning Models - fit parameters as it gets more data
Dynamic Models – model selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
o Item Based
o User Based
o Latent Factor based
o User Rating
o Purchased
o Looked/Not purchased
Spark (in 1.1.0) implements the user based ALS collaborative filtering
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu ; AT&T Labs., Florham Park, NJ ; Koren, Y. ; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan

Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1,
alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
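
What predict and predictAll return can be pictured with a tiny hand-made factorization. In a matrix factorization model every user and every product gets a short vector of latent factors (its length is the rank), and the predicted rating is their dot product. The factor values below are invented, and the two helper functions are illustrative stand-ins, not the MLlib methods.

```python
# Hypothetical latent factors with rank = 2 (in practice ALS learns these).
user_factors = {1: [1.0, 0.5], 2: [0.0, 2.0]}
product_factors = {10: [2.0, 0.0], 20: [1.0, 1.0]}

def predict(user, product):
    """Predicted rating = dot product of the user and product factor vectors."""
    return sum(u * p for u, p in zip(user_factors[user], product_factors[product]))

def predict_all(user_product_pairs):
    """Predict for many (user, product) pairs at once."""
    return [(u, p, predict(u, p)) for u, p in user_product_pairs]

print(predict(1, 10))                   # 2.0
print(predict_all([(1, 20), (2, 20)]))  # [(1, 20, 1.5), (2, 20, 2.0)]
```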

Epilogue
o We explored interesting APIs in Spark
o ALS-Collab Filtering
o RDD Operations
• Join (HashJoin)
• In memory, Grace, Recursive hash join
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx

Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/

Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmo
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

References:
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf

The Beginning As The End
How did we do ? (4:45)
