The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
1. “I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
"There are no facts, only interpretations." - Friedrich Nietzsche
“The shrewd guess, the fertile hypothesis, the courageous leap to a
tentative conclusion – these are the most valuable coin of the thinker at
work” -- Jerome Seymour Bruner
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
@ksankar // doubleclix.wordpress.com
http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI
October 29, 2014
2. Agenda
o Spark & Data Science DevOps
• Spark, Python & Machine Learning
• Goals/non-goals
• Intro to Spark
• Stack, Mechanisms – RDD
• Datasets: SOTU, Titanic, Frequent Flier
• Statistical Toolbox
• Summary, Correlations
o “Mood Of the Union”
• State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
• Map reduce, parse text
o Clustering
• K-means for Gallactic Hoppers!
o Break [3:15-3:45)
o Predicting Survivors with Classification
• Decision Trees
• Naive Bayes (Titanic data set)
o Linear Regression
o Recommendation Engine
• Collab Filtering w/ MovieLens
o Discussions/Slack
Oct 29: 2:00-3:15 (75 min), 3:45-5:00 (75 min) = 150 min
[20] 2:00 – 2:20, [30] 2:20 – 2:50, [25] 2:50 – 3:15, [30] 3:45 – 4:15, [10] 4:15 – 4:25, [20] 4:25 – 4:45, [15] 4:45 – 5:00
3. Goals & non-goals
Goals
¤ Understand how to program Machine Learning with Spark & Python
¤ Focus on programming the ML application
¤ Give you a focused time to work thru examples
§ Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
§ The programs can be optimized
Non-goals
¡ Go deep into the algorithms
• We don’t have sufficient time. The topic could easily be a 5-day tutorial!
¡ Dive into Spark internals
• That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
• Paco does a good job explaining them
¡ A passive talk
• Nope. Interactive & hands-on
4. About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData et al
The Nuthead band!
o Reviewing the Packt book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech)
• Written books (Web 2.0, Wireless, Java, …)
• Standards, some work in AI
• Guest Lecturer at Naval PG School, …
• Planning MS - CFinance or Statistics or Computational Math
o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
o @ksankar, doubleclix.wordpress.com
6. Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthrough
— 3rd
◦ Listen To Lectures
◦ More competitions …
7. Spark Installation
o Install Spark 1.1.0 on your local machine
o https://spark.apache.org/downloads.html
• Pre-built for Hadoop 2.4 is fine
o Download & uncompress
o Remember the path & use it wherever you see /usr/local/spark/
o I have downloaded it in /usr/local & have a softlink spark to the latest version
8. Tutorial Materials
o Github: https://github.com/xsankar/cloaked-ironman
o Clone or download zip
o Open terminal
o cd ~/cloaked-ironman
o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark
o Note:
o I have a soft link “spark” in my /usr/local that points to the spark version that I use. For example ln -s spark-1.1.0/ spark
o Click on the ipython dashboard
o Just look thru the ipython notebooks
9. Data Science - Context
Reason, Model & Deploy:
o Scalable Model Deployment
o Big Data automation & purpose-built appliances (soft/hard)
o Manage SLAs & response times
Collect, Store & Transform:
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set
Data Management / Data Science
Visualize, Recommend, Predict & Explore:
o Dynamic Data Sets
o 2-way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
10. Data Science - Context
Volume, Velocity, Variety
Context, Connectedness, Interface, Inference, Intelligence
“Data of unusual size” that can't be brute-forced
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
11. Day in the life of a (super) Model
Intelligence
Inference
Data Representation
Interface
Algorithms
Attributes
Parameters
Data (Scoring)
Model Selection
Reason & Learn
Models
Visualize, Recommend, Explore
Model Assessment
Dimensionality Reduction
Feature Selection
12. Data Science Maturity Model & Spark
Stages: Isolated Analytics → Integrated Analytics → Aggregated Analytics → Automated Analytics
• Data: Small Data | Larger Data set | Big Data | Big Data Factory Model
• Context: Local | Domain | Cross-domain + External | Cross domain + External
• Model, Reason & Deploy:
- Single set of boxes, usually owned by the Model Builders; Departmental
- Deploy - Central Analytics Infrastructure; Models still owned & operated by Modelers; Partly Enterprise-wide
- Central Analytics Infrastructure; Model & Reason – by Model Builders; Deploy, Operate – by ops; Residuals and other metrics monitored by modelers; Enterprise-wide
- Distributed Analytics Infrastructure; AI Augmented models; Model & Reason – by Model Builders; Deploy, Operate – by ops; Data as a monetized service, extending to ecosystem partners
• Reports | Dashboards | Dashboards + some APIs | Dashboards + Well defined APIs + programming models
• Type: Descriptive & Reactive | + Predictive | + Adaptive | Adaptive
• Datasets: All in the same box | Fixed data sets, usually in temp data spaces | Flexible Data & Attribute sets | Dynamic datasets with well-defined refresh policies
• Workload: Skunk works | Business relevant apps with approx SLAs | High performance appliance clusters | Appliances and clusters for multiple workloads including real time apps; Infrastructure for emerging technologies
• Strategy: Informal definitions; Data definitions buried in the analytics models | Some data definitions | Data catalogue, metadata & Annotations | Big Data MDM Strategy
13. The Sense & Sensibility of a Data Scientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
16. RDD – The workhorse of Spark
o Resilient Distributed Datasets
• Collection that can be operated on in parallel
o Transformations – create RDDs
• Map, Filter,…
o Actions – Get values
• Collect, Take,…
o We will apply these operations during this tutorial
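The transformation/action split can be illustrated without Spark: Python generator pipelines are lazy the way RDD transformations are, and nothing runs until an "action" consumes them. A plain-Python sketch (an analogue for intuition, not the RDD API):

```python
# Plain-Python analogue: transformations are lazy, actions trigger the work.
data = range(1, 11)                            # stand-in for an RDD of 1..10

squared = (x * x for x in data)                # "map"    - lazy, no work yet
evens = (x for x in squared if x % 2 == 0)     # "filter" - still lazy

result = list(evens)                           # "collect" - the action runs the pipeline
print(result)                                  # [4, 16, 36, 64, 100]
```

In Spark, the same shape is rdd.map(...).filter(...).collect().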
17. Algorithm spectrum
o Regression
o Logit
o CART
o Ensemble: Random Forest
o Clustering
o KNN
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning
The spectrum spans Machine Learning, Cute Math & Artificial Intelligence.
18. Not all MLlib APIs are available in Python (as of 1.1.0)
API (Spark 1.1.0)             Java/Scala   Python
Basic Statistics                  ✔           ✔
Linear Models                     ✔           ✔
Decision Trees                    ✔           ✔
Random Forest                     ✖           ✖
Collab Filtering                  ✔           ✔
Clustering-KMeans                 ✔           ✔
Clustering-Hierarchical           ✖           ✖
SVD                               ✔           ✖
PCA                               ✔           ✖
Standard Scaler, Normalizer       ✔           ✖
Model Evaluation-PR/ROC
Spark 1.2 MLlib JIRA: http://bit.ly/1ywotkm
19. Statistical Toolbox
o Sample data : Car mileage data
https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py
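The correlations example linked above calls MLlib's Statistics.corr; the Pearson computation it performs is easy to sketch in plain Python. The weight/mpg pairs below are made up just to exercise the function:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical weight (x1000 lb) vs mpg pairs - heavier cars, fewer mpg
weight = [3.5, 2.2, 4.1, 1.9, 3.0]
mpg = [18.0, 28.0, 15.0, 33.0, 21.0]
print(round(pearson(weight, mpg), 3))   # strongly negative
```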
22. Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU?
o If we embark on this line of thought, how would we do it with Spark & Python?
o Is it different from Hadoop MapReduce?
o Is it better?
23. POA (Plan Of Action)
o Collect the State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set difference & see how new words have cropped up
o Compute TF-IDF (homework!)
24. Lookout for these interesting Spark features
o RDD Map-Reduce
o How to parse input
o Removing common words
o Sort an RDD by value
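The pipeline in the last two slides — map each word to (word, 1), reduce by key with +, drop the common words, sort by value — can be sketched in plain Python; the text and stop-word list are placeholders:

```python
# Plain-Python sketch of the Spark pipeline:
#   flatMap(split) -> filter stop words -> map((w, 1)) -> reduceByKey(add) -> sortBy(value)
text = "the state of the union is strong and the union endures"
stop_words = {"the", "of", "is", "and"}          # placeholder "stock words"

pairs = [(w, 1) for w in text.split() if w not in stop_words]   # map + filter

counts = {}
for word, one in pairs:                          # reduceByKey(lambda a, b: a + b)
    counts[word] = counts.get(word, 0) + one

top = sorted(counts.items(), key=lambda kv: -kv[1])   # sort by value, descending
print(top)   # [('union', 2), ('state', 1), ('strong', 1), ('endures', 1)]
```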
25. iPython notebook at https://github.com/xsankar/cloaked-ironman
Read & Create word vector
26. Remove Common Words – 1 of 3
iPython notebook at https://github.com/xsankar/cloaked-ironman
32. Epilogue
o Interesting Exercise
o Highlights
• Map-reduce in a couple of lines!
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen1)
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter)
o To Explore as homework:
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
• Haven’t seen it in Python for 1.1.
1http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
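For the TF-IDF homework, the arithmetic itself is small: tf = count of the term in a document, idf = log(N / number of documents containing the term). A plain-Python sketch with three toy "speeches" (MLlib's HashingTF/IDF smooths the idf slightly differently):

```python
import math

# Three toy "speeches"; real input would be the SOTU word vectors.
docs = [
    ["war", "economy", "nation"],
    ["peace", "economy", "nation"],
    ["economy", "jobs", "nation"],
]

N = len(docs)
# document frequency: in how many docs does each term appear?
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tf_idf(term, doc):
    tf = doc.count(term)                 # raw term count in this document
    idf = math.log(N / df[term])         # rare terms get a higher idf
    return tf * idf

# "economy" appears in every doc -> idf = log(3/3) = 0, so tf-idf = 0
print(tf_idf("economy", docs[0]))        # 0.0
# "war" appears in one doc -> idf = log(3)
print(round(tf_idf("war", docs[0]), 4))
```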
34. Scenario – Clustering with Spark
o InterGallactic Airlines have the GallacticHoppers frequent flyer program & have data about the customers who participate in the program.
o The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions to their frequent flier program.
o Can they just have one type of promotion?
o Should they have different types of incentives?
o Who exactly are the customers in their GallacticHoppers program?
o Recently they have deployed an infrastructure with Spark.
o Can Spark help with this business problem?
35. Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, it still needs human direction & domain knowledge to interpret & guide
o Two types:
• Centroid based clustering – k-means clustering
• Tree based clustering – hierarchical clustering
o Spark implements Scalable K-Means++
• Paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
36. Lookout for these interesting Spark features
o Application of Statistics toolbox
o Center & Scale an RDD
o Filter RDDs
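Centering & scaling a feature is just (x - mean) / stddev; a plain-Python sketch of what we do to the feature columns before KMeans (the balances are made up):

```python
import math

def center_scale(values):
    """Transform values to zero mean and unit variance (population stddev)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# made-up frequent-flier mileage balances, just to exercise the function
balances = [28000.0, 97000.0, 5000.0, 43000.0]
scaled = center_scale(balances)
print([round(s, 3) for s in scaled])
```

Without this step, a feature measured in miles would dominate one measured in flight counts when k-means computes distances.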
37. Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default=2
o initializationMode = the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
o Maps a point to a cluster
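Behind KMeans.train is Lloyd's iteration: assign each point to its nearest center, then move each center to the mean of its cluster (the k-means|| part only improves the initial centers). A tiny 1-D plain-Python sketch with made-up mileage numbers, not MLlib's scalable implementation:

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain Lloyd's algorithm on 1-D points; centers is the initial guess."""
    for _ in range(iterations):
        # assignment step: nearest center wins (what KMeansModel.predict does)
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

miles = [1.0, 2.0, 1.5, 80.0, 85.0, 90.0]   # two obvious groups of fliers
print(sorted(kmeans_1d(miles, centers=[0.0, 50.0])))   # [1.5, 85.0]
```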
46. And interpret them
We have multiple cluster types:
• 1: Very Active – Give them the most attention
• 3: Very Active on-line, few flights – Give them on-line coupons
• 4: Relatively new customers, not that active – Give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates
Note:
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 is more suited to create targeted promotions
iPython notebook at https://github.com/xsankar/cloaked-ironman
47. Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems right)
o We can see that Scalable KMeans has control over runs, parallelism et al. (Homework: explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.
50. Classification - Scenario
Titanic Passenger Metadata:
• Small
• 3 Predictors: Class, Sex, Age
• Survived?
o This is a knowledge exercise
o Classify survival from the titanic data
o Gives us a quick dataset to run & test classification
iPython notebook at https://github.com/xsankar/cloaked-ironman
51. Classifying Classifiers
Statistical:
• Regression, Logistic Regression1
• Naïve Bayes
• Bayesian Networks
Structural:
• Rule-based: Production Rules, Decision Trees (Ensemble: Random Forests, Boosting)
• Distance-based:
- Functional: Linear, Spectral, Wavelet, SVM
- Nearest Neighbor: kNN, Learning Vector Quantization
• Neural Networks: Multi-layer Perceptron
1Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
52. Regression & Classifiers
• Regression → Continuous Variables
• Classifiers → Categorical Variables
• Decision Trees, k-NN (Nearest Neighbors), CART
• Bias vs. Variance, Model Complexity, Over-fitting
• Bagging, Boosting
53. Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
• Larger = higher accuracy
• Smaller = less communication (as # of bins → number of instances)
o Data adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o Intelligent framework - need this for scale
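The impurity argument picks the split-scoring function: gini = 1 - Σ pᵢ² and entropy = -Σ pᵢ log₂ pᵢ over the class proportions. A plain-Python sketch on toy survival labels, just to see what the tree optimizes:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 max for two classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1.0 max for two classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["survived"] * 6
mixed = ["survived"] * 3 + ["died"] * 3

print(gini(pure), gini(mixed))   # 0.0 0.5
print(entropy(mixed))            # 1.0
```

A split is good when it moves the children toward the pure end of these scores.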
54. Lookout for these interesting Spark features
o Concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
55. Read data & extract features
iPython notebook at https://github.com/xsankar/cloaked-ironman
60. Decision Tree – Best Practices
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
• maxDepth: tune with data/model selection
• maxBins: set low, monitor communications, increase if needed
• # RDD partitions: set to # of cores
• Usually the recommendation is that the RDD should be over-partitioned, i.e. “more partitions than cores”, because tasks take different times; we need to utilize the compute power and in the end they average out
• But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
• Joe Bradley's talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
61. Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as homework
o In 1.2 …
o Random Forest
• Bagging
• PR for Random Forest
o Boosting
o Alpine lab sequoia Forest: coordinating merge
o Model Selection Pipeline; Design Doc
62. Boosting
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ “Output of weak classifiers into a powerful committee”
◦ Final Prediction = weighted majority vote
◦ Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs Bagging
– Bagging – independent trees - Spark shines here
– Boosting – successively weighted
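The "weighted majority vote" can be sketched in a few lines of plain Python; the votes and weights below are made up, not learned by AdaBoost:

```python
def weighted_vote(predictions, weights):
    """Combine weak classifiers' votes (+1/-1) by their weights, AdaBoost-style."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score >= 0 else -1

# three hypothetical weak classifiers voting on one point
votes = [1, -1, 1]            # two say +1, one says -1 ...
weights = [0.2, 0.9, 0.3]     # ... but the dissenter carries the most weight

print(weighted_vote(votes, weights))   # -1: weighted vote overrules the headcount
```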
63. Random Forests+
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to over fitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
64. Ensemble Methods
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Two Step
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
65. Random Forests
o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables – the no. of predictors (typically √k) & the no. of trees (500 for large datasets, 150 for smaller)
o Error prediction
• For each iteration, predict for the data that is not in the sample (OOB data)
• Aggregate OOB predictions
• Calculate Prediction Error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective: Berk
A Brief Overview of RF by Dan Steinberg
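The OOB bookkeeping is simple to sketch: each tree trains on a bootstrap sample (drawn with replacement), and the rows it never saw form its out-of-bag set. Plain Python, with a fixed seed so the illustration is repeatable:

```python
import random

random.seed(42)   # fixed seed: illustration only

n = 10                                                # ten training rows, indexed 0..9
bootstrap = [random.randrange(n) for _ in range(n)]   # sample with replacement
oob = set(range(n)) - set(bootstrap)                  # rows this tree never trained on

print(sorted(set(bootstrap)))
print(sorted(oob))   # predict these with the tree, aggregate across trees
# On average about (1 - 1/n)^n ~ 36.8% of rows land out-of-bag for each tree.
```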
67. Linear Regression - API
• LabeledPoint: the features and labels of a data point
• LinearModel: weights, intercept
• LinearRegressionModelBase: predict()
• LinearRegressionModel
• LinearRegressionWithSGD: train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
• LassoModel: least-squares fit with an l_1 penalty term
• LassoWithSGD: train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
• RidgeRegressionModel: least-squares fit with an l_2 penalty term
• RidgeRegressionWithSGD: train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
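What the ...WithSGD trainers do on each iteration — step the weights against the gradient of the squared error — can be sketched in plain Python on a one-feature problem (batch gradient rather than minibatch, no intercept, no regularization; the data points are made up):

```python
def train_linear(data, iterations=100, step=0.1):
    """Gradient descent for y = w * x (single weight, no intercept)."""
    w = 0.0
    n = len(data)
    for _ in range(iterations):
        # gradient of (1/n) * sum((w*x - y)^2) with respect to w
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in data)
        w -= step * grad
    return w

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # roughly y = 2x
w = train_linear(data)
print(round(w, 2))   # close to 2
```

The lasso and ridge variants add an l_1 or l_2 penalty term to that loss before differentiating, which is what regParam scales.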
75. Recommendation & Personalization - Spark
Automated Analytics - let data tell the story:
• Feature Learning, AI, Deep Learning
• Learning Models - fit parameters as it gets more data
• Dynamic Models – model selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
• Item Based
• User Based
o Latent Factor based
• User Rating
• Purchased
• Looked/Not purchased
Spark (in 1.1.0) implements the user based ALS collaborative filtering
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; AT&T Labs, Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
76. Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1,
alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
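Whatever rank is chosen, what ALS learns is one latent-factor vector per user and per product; MatrixFactorizationModel.predict(user, product) then reduces to their dot product. A plain-Python sketch with made-up rank-3 factors (not learned by ALS):

```python
def predict(user_factors, product_factors):
    """The prediction step boils down to a dot product of latent factors."""
    return sum(u * p for u, p in zip(user_factors, product_factors))

# hypothetical rank-3 factors, as if returned by ALS.train(ratings, rank=3)
users = {"alice": [1.2, 0.3, 0.9], "bob": [0.1, 1.5, 0.2]}
products = {"star_wars": [1.0, 0.2, 1.1], "notebook": [0.2, 1.4, 0.1]}

for name, uf in users.items():
    scores = {m: round(predict(uf, pf), 2) for m, pf in products.items()}
    print(name, scores)   # each user's predicted rating per movie
```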
80. Epilogue
o We explored interesting APIs in Spark
o ALS-Collab Filtering
o RDD Operations
• Join (HashJoin)
• In memory, Grace, Recursive hash join
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
82. Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
83. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmo
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
84. For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class - Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
85. References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf