Introduction to Machine
Learning with Scikit-Learn
Plan of Study
- Preface
- What is Machine Learning?
- An Architecture for ML Data Products
- What is Scikit-Learn?
- Data Handling and Loading
- Model Evaluation
- Regressions
- Classification
- Clustering
- Workshop
Preface
On teaching Machine Learning...
Is Machine Learning a one-semester course?

[Figure: Venn diagram placing Machine Learning at the intersection of
Statistics, Artificial Intelligence, and Computer Science, surrounded by
topics such as probability, normalization, distributions, smoothing, Bayes
theorem, regression, logits, optimization, planning, computer vision,
natural language processing, reinforcement, neural models, anomaly
detection, entropy, function approximation, data mining, graph algorithms,
and big data.]
[Figure: the Machine Learning Practitioner Continuum runs from Hacker to
Academic; Machine Learning Practitioner Domains run from Statistician to
Expert Programmer.]
Context to Data Mining & Statistics

[Figure: diagram relating Data, the Machine or App, and Users to the
overlapping fields of Data Mining, Machine Learning, Statistics, and
Computer Science.]
What is Machine Learning?
Learning by Example
Given a bunch of examples (data), extract a
meaningful pattern upon which to act.
Problem Domain                              Machine Learning Class
Infer a function from labeled data          Supervised learning
Find structure of data without feedback     Unsupervised learning
Interact with environment towards goal      Reinforcement learning
How do you make predictions?
What patterns do you see?
What is the Y value?
How do you determine red from blue?
Types of Algorithms by Output

Input training data to fit a model, which is then used to make
predictions for new incoming inputs ...

Type of Output                                Algorithm Category
Output is one or more discrete classes        Classification (supervised)
Output is continuous                          Regression (supervised)
Output is membership in a similar group       Clustering (unsupervised)
Output is the distribution of inputs          Density Estimation
Output is simplified from higher dimensions   Dimensionality Reduction
Classification
Given labeled input data (with two or more labels), fit a
function that can determine, for any input, what its label is.
Regression
Given continuous input data, fit a function that is able to
predict a continuous output value for new input data.
Clustering
Given data, determine a pattern of associated data points
or clusters via their similarity or distance from one another.
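To make these three task types concrete, here is a minimal sketch using scikit-learn (introduced later in this deck); the toy arrays and parameter values are illustrative assumptions, not part of the original slides.

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans
import numpy as np

# Toy data: four instances with two features each (hypothetical values).
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])

# Classification: labeled inputs, discrete output.
clf = LogisticRegression()
clf.fit(X, [0, 0, 1, 1])
print(clf.predict([[1.5, 1.5]]))        # -> array([0])

# Regression: continuous output.
reg = LinearRegression()
reg.fit(X, [3.0, 3.0, 17.0, 17.0])
print(reg.predict([[5.0, 5.0]]))

# Clustering: no labels; group by similarity.
km = KMeans(n_clusters=2)
km.fit(X)
print(km.labels_)                       # cluster membership per instance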
“Model” is an overloaded term.
• Model family describes, at the broadest possible level, the
connection between the variables of interest.
• Model form specifies exactly how the variables of interest
are connected within the framework of the model family.
http://had.co.nz/stat645/model-vis.pdf
Hadley Wickham (2015)
• A fitted model is a concrete instance of the
model form where all parameters have been
estimated from data, and the model can be
used to generate predictions.
Dimensions and Features
In order to do machine learning you need a data set containing
instances (examples) that are composed of features from which
you compose dimensions.
Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data # X.shape == (n_samples, n_features)
y = digits.target # y.shape == (n_samples,)
Feature Space
Feature space refers to the n-dimensions where your variables live (not
including a target variable or class). The term is used often in ML literature
because in ML all variables are features (usually) and feature extraction is the
art of creating a space with decision boundaries.
Target
1. Y ≡ thickness of car tires after some testing period

Variables
1. X1 ≡ distance travelled in test
2. X2 ≡ time duration of test
3. X3 ≡ amount of chemical C in tires

The feature space is R^3, or more accurately, the positive quadrant in R^3, as all
the X variables can only be positive quantities.
http://stats.stackexchange.com/questions/46425/what-is-feature-space
Mappings
Domain knowledge about tires might suggest that the speed the vehicle was
moving at is important, hence we generate another variable, X4 (this is the
feature extraction part):

X4 = X1 / X2 ≡ the speed of the vehicle during testing.

This extends our old feature space into a new one, the positive part of R^4.

A mapping is a function, ϕ, from R^3 to R^4:

ϕ(x1, x2, x3) = (x1, x2, x3, x1 / x2)
http://stats.stackexchange.com/questions/46425/what-is-feature-space
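A quick numpy sketch of this mapping; the column layout and toy values are assumptions for illustration.

import numpy as np

# Columns: X1 (distance), X2 (duration), X3 (chemical C) - toy values.
X = np.array([
    [100.0, 2.0, 0.5],
    [250.0, 5.0, 0.7],
])

def phi(X):
    """Map R^3 to R^4 by appending X4 = X1 / X2 (the speed)."""
    x4 = (X[:, 0] / X[:, 1]).reshape(-1, 1)
    return np.hstack([X, x4])

print(phi(X))   # each row now lives in the positive part of R^4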
Your Task
Given a data set of instances of size N, create
a model that is fit (built) from the data by
extracting features and dimensions. Then use
that model to predict outcomes …
1. Data Wrangling (normalization, standardization, imputing)
2. Feature Analysis/Extraction
3. Model Selection/Building
4. Model Evaluation
5. Operationalize Model
A Tour of Machine Learning
Algorithms
Models: Instance Methods
Compare instances in the data set with a similarity
measure to find the best matches.
- Suffers from the curse of dimensionality.
- Focus is on feature representation and similarity
metrics between instances (a kNN sketch follows the list below).
● k-Nearest Neighbors (kNN)
● Self-Organizing Maps (SOM)
● Learning Vector Quantization (LVQ)
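A minimal kNN sketch with scikit-learn, assuming pre-split feature and label arrays (X_train, y_train, X_test, y_test) like those built later in this deck:

from sklearn.neighbors import KNeighborsClassifier

# kNN stores the training instances and classifies a new point by
# majority vote among its k closest training examples.
knn = KNeighborsClassifier(n_neighbors=5)   # k controls model complexity
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # mean accuracy on held-out data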
Models: Regression
Model the relationship of independent variables X
to a dependent variable Y by iteratively
optimizing the error made in predictions.
● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)
Models: Regularization Methods
Extend another method (usually regression) by
penalizing complexity (to minimize overfitting)
- simple, popular, powerful
- better at generalization
● Ridge Regression
● LASSO (Least Absolute Shrinkage & Selection Operator)
● Elastic Net
Models: Decision Trees
Model of decisions based on data attributes.
Predictions are made by following forks in a
tree structure until a decision is made. Used for
classification & regression.
● Classification and Regression Tree (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
Models: Bayesian
Explicitly apply Bayes’ Theorem for
classification and regression tasks. Usually by
fitting a probability function constructed via the
chain rule and a naive simplification of Bayes.
● Naive Bayes
● Averaged One-Dependence Estimators (AODE)
● Bayesian Belief Network (BBN)
Models: Kernel Methods
Map input data into higher dimensional vector
space where the problem is easier to model.
Named after the “kernel trick” which computes
the inner product of images of pairs of data.
● Support Vector Machines (SVM)
● Radial Basis Function (RBF)
● Linear Discriminant Analysis (LDA)
Models: Clustering Methods
Organize data into groups whose members
share maximum similarity (defined usually by a
distance metric). Two main approaches:
centroids and hierarchical clustering.
● k-Means
● Affinity Propagation
● OPTICS (Ordering Points to Identify Cluster Structure)
● Agglomerative Clustering
Models: Artificial Neural Networks
Inspired by biological neural networks, ANNs are
nonlinear function approximators that estimate
functions with a large number of inputs.
- System of interconnected neurons that activate
- Deep learning extends simple networks recursively
● Perceptron
● Back-Propagation
● Hopfield Network
● Restricted Boltzmann Machine (RBM)
● Deep Belief Networks (DBN)
Models: Ensembles
Models composed of multiple weak models that
are trained independently and whose outputs are
combined to make an overall prediction (a sketch follows the list below).
● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest
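A random forest sketch, again assuming pre-split training and test arrays; the number of trees is an illustrative choice:

from sklearn.ensemble import RandomForestClassifier

# Bagging of decision trees: each tree fits a bootstrap sample of the
# data, and predictions are combined by majority vote.
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))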
Models: Other
The preceding list was not comprehensive; other
algorithm and model classes include:
● Conditional Random Fields (CRF)
● Markovian Models (HMMs)
● Dimensionality Reduction (PCA, PLS)
● Rule Learning (Apriori, Brill)
● More ...
An Architecture for Operationalizing
Machine Learning Algorithms

Architecture of Machine Learning Operations

[Figure: in the build phase, training data and labels become feature
vectors that feed an estimation algorithm, producing a predictive model;
in the operational phase, new data becomes a feature vector that the
predictive model turns into a prediction, with feedback flowing back into
the build phase.]
The Learning Part of Machine Learning

[Figure: diagram connecting Model Building, Initial Fixtures, the
Service/API, and Feedback.]
Deploying Machine Learning as a Web Service
Annotation Service Example
Architecture Demo
https://github.com/DistrictDataLabs/product-classifier
What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are
called SciKits. SciKit-Learn provides machine
learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and NumPy
● Standard Python API interface
● Sits on top of C libraries, LAPACK and LibSVM, wrapped via Cython
● Open Source: BSD License
Probably the best general ML framework out there.
Where did it come from?
Started as a Google Summer of Code project in
2007 by David Cournapeau, then used as a
thesis project by Matthieu Brucher.
In 2010, INRIA pushed the first public release,
and sponsors the project, as do Google,
Tinyclues, and the Python Software
Foundation.
Who uses Scikit-Learn?
Primary Features
- Generalized Linear Models
- SVMs, kNN, Bayes, Decision Trees, Ensembles
- Clustering and Density algorithms
- Cross Validation
- Grid Search
- Pipelining
- Model Evaluations
- Dataset Transformations
- Dataset Loading
A Guide to Scikit-Learn

Scikit-Learn API
Object-oriented interface centered around the
concept of an Estimator:
“An estimator is any object that learns from data; it may
be a classification, regression or clustering algorithm or
a transformer that extracts/filters useful features from
raw data.”
- Scikit-Learn Tutorial
The Scikit-Learn Estimator API
class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict response of ``X``."""
        # compute predictions ``pred``
        return pred
Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape
(n_samples, n_features).
- y is a 1D array with shape (n_samples,).
- predict(X) returns the class or value.
- predict_proba() returns a 2D array of
shape (n_samples, n_classes).
Basic methodology
from sklearn import svm
estimator = svm.SVC(gamma=0.001)
estimator.fit(X, y)
estimator.predict(X_new)  # X_new: a 2D array of new samples to classify
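For instance, a complete, runnable version of this methodology on the digits data set loaded earlier; the train/test split is an added step, not on the original slide:

from sklearn import svm
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_digits

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2)

estimator = svm.SVC(gamma=0.001)
estimator.fit(X_train, y_train)           # sets the state of the estimator
print(estimator.predict(X_test[:5]))      # predicted digit classes
print(estimator.score(X_test, y_test))    # mean accuracy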
Wrapping fit and predict
We’ve already discussed a broad workflow; the
following is a development workflow:

[Figure: development workflow from Raw Data through Load & Transform
Data, Feature Extraction, Build Model, and Evaluate Model, alongside
Feature Evaluation.]
Transformers
class Transformer(Estimator):

    def transform(self, X):
        """Transforms the input data."""
        # transform ``X`` to ``X_prime``
        return X_prime

from sklearn import preprocessing
from sklearn.preprocessing import Imputer

Xt = preprocessing.normalize(X)  # Normalizer
Xt = preprocessing.scale(X)      # StandardScaler

imputer = Imputer(missing_values='NaN', strategy='mean')
Xt = imputer.fit_transform(X)
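One caveat worth sketching: a transformer should be fit on the training data only and then reused on the test data, so that no test information leaks into the model (split names assumed from the surrounding examples):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xt_train = scaler.fit_transform(X_train)   # learn mean/std from training data
Xt_test = scaler.transform(X_test)         # reuse them; never refit on test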
Evaluation
Underfitting
Not enough information to accurately model real life.
Can be due to high bias, or an overly simplistic model.
Solution: Cross Validation

Overfitting
Creating a model that has too many parameters or is too
complex: “memorization of the data” - the model
can’t generalize very well.
Solution: Benchmark Testing, Ridge Regression,
Feature Analyses, Dimensionality Reduction
Error: Bias vs Variance
Bias: the difference
between expected
(average) prediction of the
model and the correct
value.
Variance: how the
predictions for a given point
vary between different
realizations of the model.
http://scott.fortmann-roe.com/docs/BiasVariance.html
Bias vs. Variance Trade-Off
Related to model complexity:
The more parameters are added
to the model (the more complex
it becomes), the more bias is
reduced and variance is increased.
Sources of complexity:
- k (nearest neighbors)
- epochs (neural nets)
- # of features
- learning rate
http://scott.fortmann-roe.com/docs/BiasVariance.html
Cross Validation (classification)
Assess how the model will generalize to an independent data set
(i.e. data not in the training set).
1. Divide data into training and test splits
2. Fit the model on training, predict on test
3. Determine accuracy, precision, and recall
4. Repeat k times with different splits, then average (e.g. the F1 score)

                  Predicted Class A    Predicted Class B
Actual A          True A               False B              #A
Actual B          False A              True B               #B
                  #P(A)                #P(B)                total
https://en.wikipedia.org/wiki/Precision_and_recall
accuracy = (true positives + true negatives) / total

precision = true positives / (true positives + false positives)

recall = true positives / (true positives + false negatives)

F1 score = 2 * (precision * recall) / (precision + recall)
Cross Validation in Scikit-Learn
from sklearn import metrics
from sklearn import cross_validation as cv
splits = cv.train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = ClassifierEstimator()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print metrics.classification_report(expected, predicted)
print metrics.confusion_matrix(expected, predicted)
print metrics.f1_score(expected, predicted)
MSE & Coefficient of Determination
In regressions we can determine how well the
model fits by computing the mean square error
and the coefficient of determination.
MSE = np.mean((predicted-expected)**2)
R² is a measure of “goodness of fit” and is a
value ∈ [0, 1] where 1 is a perfect fit.
K-Part Cross Validation
from sklearn import metrics
from sklearn import cross_validation as cv
splits = cv.train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = RegressionEstimator()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print metrics.mean_squared_error(expected, predicted)
print metrics.r2_score(expected, predicted)
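The snippet above still uses a single train/test split; an actual k-fold version, which is what “k-part” cross validation usually means, might look like this (reusing the deck's placeholder RegressionEstimator and the same era-appropriate cross_validation module):

from sklearn import cross_validation as cv

# Fit and score the model k times, holding out a different fold each time.
model = RegressionEstimator()
scores = cv.cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean())    # average R^2 across the 5 folds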
How do you evaluate clusters?
Visualization (but only in 2D), or other quantitative
evaluation measures; one option is sketched below.
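One such measure, sketched here as an assumption beyond the slides: the silhouette coefficient, which scores how much closer each point is to its own cluster than to the next nearest one.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=3)            # cluster count is an illustrative guess
labels = km.fit_predict(X)           # cluster membership for each instance
print(silhouette_score(X, labels))   # in [-1, 1]; higher is better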
Unstable Data
Randomness is a significant part of data in the real
world, but problems with the data can significantly affect
results:
- outliers
- skew
- missing information
- incorrect data
Solution: seam testing/integration testing
Unpredictable Future
Machine learning models attempt to predict the future
as new inputs come in - but human systems and
processes are subject to change.
Solution: Precision/Recall tracking over time
Standardized Data Model Demo
(Wheat Kernel Sizes)
A Tour of Scikit-Learn
Workshop
Select a data set from:
https://archive.ics.uci.edu/ml/index.html
- Lay out the data in our data model
- Choose regression, classification, or clustering
and build the best model you can from it.
- Report an evaluation of the model built
- Visualize aspects of your model
- Compare and contrast different algorithms
Submit your code via pull request to the repository!
Advanced Scikit-Learn
Pipelines
sklearn.pipeline.Pipeline(steps)
- Sequentially apply repeatable transformations, ending in a final
estimator, so that the whole chain can be validated at every step.
- Each step (except for the last) must implement
Transformer, i.e. fit and transform methods.
- The Pipeline itself implements the methods of both the
Transformer and Estimator interfaces.
The Scikit-Learn Transformer API
class Transformer(Estimator):

    def transform(self, X):
        """Transforms the input data."""
        # transform ``X`` to ``X_prime``
        return X_prime
Pipelined Feature Extraction
The most common use for the Pipeline is to
combine multiple feature extraction methodologies
into a single, repeatable processing step; a sketch
follows below.
- FeatureUnion
- SelectKBest
- TruncatedSVD
- DictVectorizer
An example of a distance-based ML pipeline:
https://github.com/mclumd/shaku/blob/master/Shaku.ipynb
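A sketch of such a combination, assuming a numeric feature matrix with enough columns for the chosen component counts; the outputs of both transformers are concatenated and fed to a single estimator:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression

# Concatenate two feature extraction strategies, then classify.
features = FeatureUnion([
    ('svd', TruncatedSVD(n_components=10)),   # dense latent components
    ('kbest', SelectKBest(k=5)),              # top univariate features
])
model = Pipeline([('features', features), ('clf', LogisticRegression())])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))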
Pipelined Model
>>> from sklearn import linear_model
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.pipeline import make_pipeline
>>> model = make_pipeline(PolynomialFeatures(2), linear_model.Ridge())
>>> model.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=2,
include_bias=True, interaction_only=False)), ('ridge', Ridge
(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver='auto', tol=0.001))])
>>> mean_squared_error(y_test, model.predict(X_test))
3.1498887586451594
>>> model.score(X_test, y_test)
0.97090576345108104
Grid Search

Ridge Regression
- Prevent overfit/collinearity by penalizing the size of the
coefficients - minimize the penalized residual sum of
squares:

    min_w ||Xw - y||² + α ||w||²

- Said another way, shrink the coefficients toward zero.
- Here α > 0 is a complexity parameter that controls
shrinkage. The larger α is, the more robust the model is to
collinearity.
- Said another way, this is the bias/variance tradeoff: the larger
the ridge alpha, the higher the bias and the lower the variance.
>>> clf = linear_model.Ridge(alpha=0.5)
>>> clf.fit(X_train, y_train)
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver='auto', tol=0.001)
>>> print mean_squared_error(y_test, clf.predict(X_test))
8.34260312032
>>> clf.score(X_test, y_test)
0.92129741176557278
Choosing alpha
We can search for the best parameter using RidgeCV,
which is a form of grid search that uses a more efficient
form of leave-one-out cross-validation.
>>> import numpy as np
>>> n_alphas = 200
>>> alphas = np.logspace(-10, -2, n_alphas)
>>> clf = linear_model.RidgeCV(alphas=alphas)
>>> clf.fit(X_train, y_train)
>>> print clf.alpha_
0.0010843659686896108
>>> clf.score(X_test, y_test)
0.92542477512171173
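The general-purpose tool the section title refers to is GridSearchCV, which cross-validates every combination in a parameter grid; a sketch for ridge alpha, with illustrative grid values (using the era-appropriate grid_search module):

from sklearn import linear_model
from sklearn.grid_search import GridSearchCV

search = GridSearchCV(
    linear_model.Ridge(),
    param_grid={'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]},
    cv=5,                            # 5-fold cross validation per candidate
)
search.fit(X_train, y_train)
print(search.best_params_)           # the best alpha found on the grid
print(search.best_estimator_.score(X_test, y_test))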
Error as a function of alpha
>>> from sklearn.cross_validation import train_test_split as tts
>>> from sklearn.metrics import mean_squared_error
>>> import matplotlib.pyplot as plt
>>> clf = linear_model.Ridge(fit_intercept=False)
>>> errors = []
>>> for alpha in alphas:
...     splits = tts(dataset.data, dataset.target('Y1'), test_size=0.2)
...     X_train, X_test, y_train, y_test = splits
...     clf.set_params(alpha=alpha)
...     clf.fit(X_train, y_train)
...     error = mean_squared_error(y_test, clf.predict(X_test))
...     errors.append(error)
...
>>> axe = plt.gca()
>>> axe.plot(alphas, errors)
>>> plt.show()
Questions, Comments?