Scikit-Learn is a powerful machine learning library implemented in Python on top of the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib, enabling extremely fast analysis of small to medium sized data sets. It is open source and commercially usable, and it contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a data scientist's toolkit for machine learning on incoming data sets.
The purpose of this one-day course is to serve as an introduction to machine learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms for a variety of machine learning tasks and learn how to implement these tasks on our data using Scikit-Learn and Python. In particular, we will structure our machine learning models as though we were producing a data product, an actionable model that can be used in larger programs or algorithms, rather than as simply a research or investigation methodology.
2. Plan of Study
- Preface
- What is Machine Learning?
- An Architecture for ML Data Products
- What is Scikit-Learn?
- Data Handling and Loading
- Model Evaluation
- Regressions
- Classification
- Clustering
- Workshop
4. Is Machine Learning a one-semester course?
[Diagram: machine learning draws on Statistics, Artificial Intelligence, and Computer Science, spanning topics such as probability, normalization, distributions, smoothing, Bayes' theorem, regression, logits, optimization, planning, computer vision, natural language processing, reinforcement learning, neural models, anomaly detection, entropy, function approximation, data mining, graph algorithms, and big data.]
8. Learning by Example
Given a bunch of examples (data), extract a meaningful pattern upon which to act.
Problem Domain                               Machine Learning Class
Infer a function from labeled data           Supervised learning
Find structure of data without feedback      Unsupervised learning
Interact with environment towards goal       Reinforcement learning
13. Training data is input to fit a model, which is then used to predict outputs for incoming data ...
Types of Algorithms by Output

Type of Output                                 Algorithm Category
Output is one or more discrete classes         Classification (supervised)
Output is continuous                           Regression (supervised)
Output is membership in a similar group        Clustering (unsupervised)
Output is the distribution of inputs           Density Estimation
Output is simplified from higher dimensions    Dimensionality Reduction
16. Clustering
Given data, determine a pattern of associated data points
or clusters via their similarity or distance from one another.
17. “Model” is an overloaded term.
• Model family describes, at the broadest possible level, the
connection between the variables of interest.
• Model form specifies exactly how the variables of interest
are connected within the framework of the model family.
• A fitted model is a concrete instance of the model form where
all parameters have been estimated from data, and the model
can be used to generate predictions.
Hadley Wickham (2015), http://had.co.nz/stat645/model-vis.pdf
18. Dimensions and Features
In order to do machine learning you need a data set containing
instances (examples) that are composed of features, from which
you compose dimensions.
Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data # X.shape == (n_samples, n_features)
y = digits.target # y.shape == (n_samples,)
19. Feature Space
Feature space refers to the n dimensions in which your variables live (not
including a target variable or class). The term is used often in ML literature
because in ML all variables are (usually) features, and feature extraction is
the art of creating a space with decision boundaries.
Target
1. Y ≡ thickness of car tires after some testing period

Variables
1. X1 ≡ distance travelled in test
2. X2 ≡ time duration of test
3. X3 ≡ amount of chemical C in tires

The feature space is R³, or more accurately, the positive octant of R³, as all
the X variables can only be positive quantities.
http://stats.stackexchange.com/questions/46425/what-is-feature-space
20. Mappings
Domain knowledge about tires might suggest that the speed the vehicle was
moving at is important, hence we generate another variable, X4 (this is the
feature extraction part):

X4 = X1 / X2 ≡ the speed of the vehicle during testing.

This extends our old feature space into a new one, the positive part of R⁴.

A mapping is a function, ϕ, from R³ to R⁴:

ϕ(x1, x2, x3) = (x1, x2, x3, x1/x2)
http://stats.stackexchange.com/questions/46425/what-is-feature-space
21. Your Task
Given a data set of N instances, create a model
that is fit (built) from the data by extracting
features and dimensions. Then use that model
to predict outcomes …
1. Data Wrangling (normalization, standardization, imputing)
2. Feature Analysis/Extraction
3. Model Selection/Building
4. Model Evaluation
5. Operationalize Model
23. Models: Instance Methods
Compare instances in the data set with a similarity
measure to find the best matches.
- Suffers from the curse of dimensionality.
- Focuses on feature representation and
similarity metrics between instances.
● k-Nearest Neighbors (kNN)
● Self-Organizing Maps (SOM)
● Learning Vector Quantization (LVQ)
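For example, a minimal kNN sketch in Scikit-Learn, reusing the digits data from earlier (the choice of k=5 is arbitrary, not tuned):

from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Fit a 5-nearest-neighbor classifier; distance is Euclidean by default.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
print(model.predict(X[:10]))  # predict the first ten instances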
24. Models: Regression
Model the relationship of independent variables X
to a dependent variable Y by iteratively
minimizing the error made in predictions.
● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)
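For example, a minimal ordinary least squares sketch (the diabetes data set here is an illustrative choice):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target

# Ordinary least squares: minimize the residual sum of squares.
model = LinearRegression()
model.fit(X, y)

print(model.coef_)        # one coefficient per independent variable
print(model.score(X, y))  # R² on the training data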
26. Models: Decision Trees
Model of decisions based on data attributes.
Predictions are made by following forks in a
tree structure until a decision is made. Used for
classification & regression.
● Classification and Regression Tree (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
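For example, a minimal CART-style sketch on the iris data (max_depth=3 is an arbitrary complexity cap):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target

# Predictions follow forks on feature thresholds down to a leaf.
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))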
27. Models: Bayesian
Explicitly apply Bayes’ Theorem to
classification and regression tasks, usually by
fitting a probability function constructed via the
chain rule and a naive simplification of Bayes’ Theorem.
● Naive Bayes
● Averaged One-Dependence Estimators (AODE)
● Bayesian Belief Network (BBN)
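For example, a minimal naive Bayes sketch with a Gaussian likelihood (the iris data is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

data = load_iris()
X, y = data.data, data.target

# Assumes features are conditionally independent given the class,
# each modeled with a Gaussian likelihood.
model = GaussianNB()
model.fit(X, y)
print(model.predict_proba(X[:3]))  # posterior class probabilities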
28. Models: Kernel Methods
Map input data into a higher dimensional vector
space where the problem is easier to model.
Named after the “kernel trick”, which computes
the inner products of the images of pairs of data
points without ever computing the mapping explicitly.
● Support Vector Machines (SVM)
● Radial Basis Function (RBF)
● Linear Discriminant Analysis (LDA)
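For example, a minimal SVM sketch with an RBF kernel on the digits data (C and gamma are illustrative values, not tuned):

from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# The RBF kernel implicitly maps inputs to a very high dimensional space;
# only kernel values between pairs of points are ever computed.
model = SVC(kernel='rbf', C=1.0, gamma=0.001)
model.fit(X, y)
print(model.predict(X[:10]))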
29. Models: Clustering Methods
Organize data into groups whose members
share maximum similarity (usually defined by a
distance metric). Two main approaches:
centroids and hierarchical clustering.
● k-Means
● Affinity Propagation
● OPTICS (Ordering Points to Identify Cluster Structure)
● Agglomerative Clustering
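For example, a minimal k-means sketch on the digits data (choosing k=10, one cluster per digit, is an assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data  # no target needed: clustering is unsupervised

model = KMeans(n_clusters=10)
model.fit(X)

print(model.labels_[:10])            # cluster membership of first ten points
print(model.cluster_centers_.shape)  # (10, 64): one centroid per cluster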
30. Models: Artificial Neural Networks
Inspired by biological neural networks, ANNs are
nonlinear function approximators that estimate
functions with a large number of inputs.
- System of interconnected neurons that activate
- Deep learning extends simple networks recursively
● Perceptron
● Back-Propagation
● Hopfield Network
● Restricted Boltzmann Machine (RBM)
● Deep Belief Networks (DBN)
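At the time of writing, Scikit-Learn itself ships only simple neural models such as Perceptron and BernoulliRBM; a minimal perceptron sketch on the digits data:

from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

digits = load_digits()
X, y = digits.data, digits.target

# A single layer of weights trained with the perceptron update rule.
model = Perceptron()
model.fit(X, y)
print(model.score(X, y))  # mean accuracy on the training data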
31. Models: Ensembles
Models composed of multiple weak models that
are trained independently and whose outputs
are combined to make an overall prediction.
● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest
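For example, minimal bagging and random forest sketches on the digits data (n_estimators=10 is an arbitrary choice):

from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Bagging: many trees fit on bootstrap samples, predictions combined.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)
bag.fit(X, y)

# Random forest: bagging plus random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=10)
forest.fit(X, y)
print(forest.predict(X[:5]))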
32. Models: Other
The preceding list is not comprehensive; other
algorithm and model classes include:
● Conditional Random Fields (CRF)
● Markovian Models (HMMs)
● Dimensionality Reduction (PCA, PLS)
● Rule Learning (Apriori, Brill)
● More ...
34. Architecture of Machine Learning Operations
[Diagram: In the build phase, training data and labels are converted into
feature vectors and fed to an estimation algorithm, producing a predictive
model. In the operational phase, new data is converted into a feature vector
and passed to the predictive model to yield a prediction.]
41. What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are
called SciKits. SciKit-Learn provides machine
learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and Numpy
● Standard Python API interface
● Sits on top of C libraries, LAPACK and LibSVM, interfaced via Cython
● Open Source: BSD License
Probably the best general ML framework out there.
42. Where did it come from?
Started as a Google Summer of Code project in
2007 by David Cournapeau, then used as a
thesis project by Matthieu Brucher.
In 2010, INRIA pushed the first public release,
and sponsors the project, as do Google,
Tinyclues, and the Python Software
Foundation.
46. Scikit-Learn API
Object-oriented interface centered around the
concept of an Estimator:
“An estimator is any object that learns from data; it may
be a classification, regression or clustering algorithm or
a transformer that extracts/filters useful features from
raw data.”
- Scikit-Learn Tutorial
47. The Scikit-Learn Estimator API

class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data. """
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict response of ``X``. """
        # compute predictions ``pred``
        return pred
48. Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape
(n_samples, n_features).
- y is a 1D array with shape (n_samples,).
- predict(X) returns the class or value.
- predict_proba() returns a 2D array of
shape (n_samples, n_classes).
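A minimal sketch of these conventions, using logistic regression on the digits data (the choice of estimator is illustrative):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X, y = digits.data, digits.target  # X: (1797, 64), y: (1797,)

model = LogisticRegression()
model.fit(X, y)                          # sets the state of the estimator

print(model.predict(X[:2]))              # class predictions
print(model.predict_proba(X[:2]).shape)  # (2, 10) == (n_samples, n_classes)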
50. Wrapping fit and predict
We’ve already discussed a broad workflow; the
following is a development workflow:
[Diagram: Raw Data → Load & Transform Data → Feature Extraction →
Build Model → Evaluate Model, with a Feature Evaluation feedback loop.]
51. Transformers

class Transformer(Estimator):

    def transform(self, X):
        """Transforms the input data. """
        # transform ``X`` to ``X_prime``
        return X_prime

from sklearn import preprocessing
from sklearn.preprocessing import Imputer

Xt = preprocessing.normalize(X)  # Normalizer
Xt = preprocessing.scale(X)      # StandardScaler

imputer = Imputer(missing_values='NaN', strategy='mean')
Xt = imputer.fit_transform(X)
53. Underfitting
Not enough information to accurately model real life.
Can be due to high bias, or an overly simplistic model.
Solution: Cross Validation
54. Overfitting
A model with too many parameters, or one that is too
complex, “memorizes the data” and can’t generalize
very well.
Solution: Benchmark Testing, Ridge Regression,
Feature Analysis, Dimensionality Reduction
55. Error: Bias vs Variance
Bias: the difference
between expected
(average) prediction of the
model and the correct
value.
Variance: how the
predictions for a given point
vary between different
realizations of the model.
http://scott.fortmann-roe.com/docs/BiasVariance.html
56. Bias vs. Variance Trade-Off
Related to model complexity:
the more parameters added
to the model (the more
complex it becomes), the more bias is
reduced and the more variance is increased.
Sources of complexity:
- k (nearest neighbors)
- epochs (neural nets)
- # of features
- learning rate
http://scott.fortmann-roe.com/docs/BiasVariance.html
57. Cross Validation (classification)
Assess how the model will generalize to an independent data set
(e.g. data not in the training set).
1. Divide data into training and test splits
2. Fit the model on training, predict on test
3. Determine accuracy, precision, and recall
4. Repeat k times with different splits, then average (e.g. as an F1 score)

           Predicted Class A    Predicted Class B
Actual A   True A               False B              #A
Actual B   False A              True B               #B
           #P(A)                #P(B)                total
59. Cross Validation in Scikit-Learn
from sklearn import metrics
from sklearn import cross_validation as cv
splits = cv.train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = ClassifierEstimator()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print metrics.classification_report(expected, predicted)
print metrics.confusion_matrix(expected, predicted)
print metrics.f1_score(expected, predicted)
60. MSE & Coefficient of Determination
In regressions we can determine how well the
model fits by computing the mean square error
and the coefficient of determination.
MSE = np.mean((predicted - expected) ** 2)

R² is a predictor of “goodness of fit” and is a
value ∈ [0, 1], where 1 is a perfect fit.
61. K-Part Cross Validation
from sklearn import metrics
from sklearn import cross_validation as cv
splits = cv.train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = RegressionEstimator()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print metrics.mean_squared_error(expected, predicted)
print metrics.r2_score(expected, predicted)
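Note that the snippet above performs only a single train/test split; a sketch of a true k-part evaluation uses cross_val_score (here k=5, and RegressionEstimator remains a placeholder for a real estimator):

from sklearn import cross_validation as cv

model = RegressionEstimator()  # placeholder estimator, as above

# Fit and score the model on each of 5 different train/test splits;
# the default scoring for regressors is R².
scores = cv.cross_val_score(model, X, y, cv=5)
print(scores.mean())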
62. How to evaluate clusters?
Visualization (but only in 2D)
Other evaluation measures exist as well (e.g. silhouette scores, sketched below).
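One such measure is the silhouette coefficient; a minimal sketch on the digits data (ten clusters is an assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

digits = load_digits()
X = digits.data

model = KMeans(n_clusters=10)
labels = model.fit_predict(X)

# Mean silhouette coefficient over all samples: values near 1 indicate
# dense, well separated clusters; values near 0 indicate overlap.
print(silhouette_score(X, labels))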
63. Unstable Data
Randomness is a significant part of real-world data,
but problems with the data can significantly affect
results:
- outliers
- skew
- missing information
- incorrect data
Solution: seam testing/integration testing
64. Unpredictable Future
Machine learning models attempt to predict the future
as new inputs come in, but human systems and
processes are subject to change.
Solution: Precision/Recall tracking over time
67. Workshop
Select a data set from:
https://archive.ics.uci.edu/ml/index.html
- Lay out the data in our data model
- Choose regression, classification, or clustering
and build the best model you can from it.
- Report an evaluation of the model built
- Visualize aspects of your model
- Compare and contrast different algorithms
Submit your code via pull request to the repository!
69. Pipelines
sklearn.pipeline.Pipeline(steps)
- Sequentially applies repeatable transformations before a
final estimator, and can be validated at every step.
- Each step (except for the last) must implement
Transformer, e.g. fit and transform methods.
- Pipeline itself implements both the Transformer and
the Estimator interfaces.
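A minimal sketch of a two-step pipeline on the digits data (the step names are arbitrary):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X, y = digits.data, digits.target

# Every step but the last is a Transformer; the last is any Estimator.
model = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

model.fit(X, y)  # fit_transforms each step, then fits the classifier
print(model.predict(X[:5]))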
70. The Scikit-Learn Transformer API

class Transformer(Estimator):

    def transform(self, X):
        """Transforms the input data. """
        # transform ``X`` to ``X_prime``
        return X_prime
71. Pipelined Feature Extraction
The most common use for the Pipeline is to
combine multiple feature extraction methodologies
into a single, repeatable processing step.
- FeatureUnion
- SelectKBest
- TruncatedSVD
- DictVectorizer
An example of a distance based ML pipeline:
https://github.com/mclumd/shaku/blob/master/Shaku.ipynb
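A sketch of pipelined feature extraction along these lines, again on the digits data; the component choices and parameters are illustrative, not tuned:

from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

digits = load_digits()
X, y = digits.data, digits.target

# Concatenate two feature extraction paths into a single feature space.
features = FeatureUnion([
    ('svd', TruncatedSVD(n_components=10)),
    ('best', SelectKBest(k=8)),
])

model = Pipeline([
    ('features', features),
    ('clf', LogisticRegression()),
])

model.fit(X, y)
print(model.predict(X[:5]))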
74. Ridge Regression
- Prevent overfit/collinearity by penalizing the size of
coefficients: minimize the penalized residual sum of
squares

    min_w ||Xw − y||² + α ||w||²

- Said another way, shrink the coefficients toward zero.
- Here α > 0 is a complexity parameter that controls
shrinkage. The larger α, the more robust the model is to
collinearity.
- Said another way, this is the bias/variance tradeoff: the larger
the ridge alpha, the higher the bias and the lower the variance.
76. Choosing alpha
We can search for the best parameter using RidgeCV,
which is a form of grid search that uses a more efficient
form of leave-one-out cross-validation.

>>> import numpy as np
>>> from sklearn import linear_model
>>> n_alphas = 200
>>> alphas = np.logspace(-10, -2, n_alphas)
>>> clf = linear_model.RidgeCV(alphas=alphas)
>>> clf.fit(X_train, y_train)
>>> print clf.alpha_
0.0010843659686896108
>>> clf.score(X_test, y_test)
0.92542477512171173
77. Error as a function of alpha

>>> from sklearn import linear_model
>>> from sklearn.cross_validation import train_test_split as tts
>>> from sklearn.metrics import mean_squared_error
>>> import matplotlib.pyplot as plt

>>> # ``dataset`` is assumed to have been loaded earlier by a custom
>>> # loader whose ``target`` accessor takes the response column name.
>>> clf = linear_model.Ridge(fit_intercept=False)
>>> errors = []
>>> for alpha in alphas:
...     splits = tts(dataset.data, dataset.target('Y1'), test_size=0.2)
...     X_train, X_test, y_train, y_test = splits
...     clf.set_params(alpha=alpha)
...     clf.fit(X_train, y_train)
...     error = mean_squared_error(y_test, clf.predict(X_test))
...     errors.append(error)
...
>>> axe = plt.gca()
>>> axe.plot(alphas, errors)
>>> plt.show()