Visual diagnostics at scale

SciPy 2019

Dr. Rebecca Bilbro
Chief Data Scientist, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro

Census Dataset
● 500K instances
● 50 features (age, occupation, education, sex, ethnicity, marital status)

Sarcasm Dataset
● 50K instances
● 5K features (“love”, 🙄, “totally”, “best”, “surprise”, “Sherlock”, capitalization, timestamp)

Sensor Dataset
● 5M instances
● 15 features (Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, Toluene ppmv)

Scaling pain points are dataset-specific
● Many features
● Many instances
● Feature variance
● Heteroskedasticity
● Covariance
● Noise
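
Which of these pain points applies to a given dataset can be checked before any model is fit. The following is a minimal sketch, assuming a dense numeric feature matrix; the function name `profile_dataset` and the choice of summary statistics are illustrative, not something shown in the talk.

```python
import numpy as np


def profile_dataset(X):
    """Summarize size, per-feature variance, and pairwise correlation of X.

    X is assumed to be a dense 2D numeric array (instances x features)
    with at least two features.
    """
    X = np.asarray(X, dtype=float)
    n_instances, n_features = X.shape

    variances = X.var(axis=0)

    # Correlation between every pair of features. Constant features would
    # produce NaN here; this sketch does not handle that case.
    corr = np.corrcoef(X, rowvar=False)
    off_diagonal = corr[~np.eye(n_features, dtype=bool)]

    return {
        "n_instances": n_instances,
        "n_features": n_features,
        "min_variance": float(variances.min()),
        "max_variance": float(variances.max()),
        "max_abs_correlation": float(np.abs(off_diagonal).max()),
    }


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[:, 1] = 2.0 * X[:, 0]   # a perfectly correlated pair of features
X[:, 3] *= 100.0          # one feature on a much larger scale

profile = profile_dataset(X)
print(profile)
```

A large gap between the smallest and largest variance suggests scaling is needed, and a correlation near 1 points to redundant features.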

Logistic Regression Fit Times (seconds)
500 - 5M instances / 5 - 50 features
10 seconds

Multilayer Perceptron Fit Times (seconds)
500 - 5M instances / 5 - 50 features
5 min, 48 seconds

Support Vector Machine Fit Times (seconds)
500 - 500K instances / 5 - 50 features
5 hours, 24 seconds 😵
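
The talk does not show how these timings were collected. A comparable measurement could be sketched as follows, assuming synthetic data from scikit-learn's `make_classification` and a hypothetical helper named `time_fit`; the sizes are kept small so the example finishes quickly.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def time_fit(estimator, n_instances, n_features, seed=0):
    """Return the wall-clock seconds needed to fit estimator on synthetic data."""
    X, y = make_classification(
        n_samples=n_instances, n_features=n_features, random_state=seed
    )
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


# Small sizes so the sketch finishes quickly; the slides go up to 5M instances.
fit_times = {}
for n_instances in (500, 5000):
    for n_features in (5, 50):
        model = LogisticRegression(max_iter=1000)
        fit_times[(n_instances, n_features)] = time_fit(
            model, n_instances, n_features
        )

for (n_instances, n_features), seconds in sorted(fit_times.items()):
    print(f"{n_instances:>6} instances, {n_features:>2} features: {seconds:.4f} s")
```

Swapping in another estimator only requires changing the line that constructs `model`.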

How to optimize?
● Be patient
● Be wrong
● Be rich
● Steer

The Model Selection Triple
Arun Kumar, et al. http://bit.ly/2abVNrI

Models are aggregations
So are visualizations

Use visualizations
to steer model selection

Adventures in Model Visualization

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from yellowbrick.features import ParallelCoordinates

    data = load_iris()

    # The slide indexes into a grid of axes (axes[idx]) that it never defines;
    # a single axes is created here so the snippet stands alone.
    _, ax = plt.subplots()

    oz = ParallelCoordinates(ax=ax, fast=True)
    oz.fit_transform(data.data, data.target)
    oz.finalize()

Panel captions:
● Each point drawn individually as a connected line segment
● With standardization
● Points grouped by class, each class drawn as a single segment
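
The difference between drawing each point individually and drawing each class as a single segment can be sketched with plain matplotlib. This illustrates the general idea on synthetic data and is not Yellowbrick's actual implementation; `draw_instances` and `draw_classes` are hypothetical names.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def draw_instances(ax, X, y):
    """Draw every instance as its own line: one plot call per row."""
    positions = np.arange(X.shape[1])
    for row, label in zip(X, y):
        ax.plot(positions, row, color=f"C{label}", alpha=0.25)


def draw_classes(ax, X, y):
    """Draw each class with a single plot call.

    The rows of a class are laid end to end with a NaN between them;
    matplotlib breaks a line wherever it meets a NaN.
    """
    n_features = X.shape[1]
    for label in np.unique(y):
        rows = X[y == label]
        gap = np.full((rows.shape[0], 1), np.nan)
        ys = np.hstack([rows, gap]).ravel()
        xs = np.tile(
            np.append(np.arange(n_features, dtype=float), np.nan),
            rows.shape[0],
        )
        ax.plot(xs, ys, color=f"C{label}", alpha=0.25)


rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
y = np.repeat([0, 1, 2], 50)

fig, (ax_slow, ax_fast) = plt.subplots(ncols=2, figsize=(10, 4))
draw_instances(ax_slow, X, y)
draw_classes(ax_fast, X, y)
print(len(ax_slow.lines), len(ax_fast.lines))
```

Both panels show the same segments, but the second creates one line artist per class instead of one per instance, which matters once the instance count reaches the hundreds of thousands.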

    from yellowbrick.features import Rank2D
    from yellowbrick.pipeline import VisualPipeline
    from yellowbrick.model_selection import CVScores
    from yellowbrick.regressor import PredictionError

    # features, model, and cv are assumed to be defined earlier
    viz_pipe = VisualPipeline([
        ('rank2d', Rank2D(features=features, algorithm='covariance')),
        ('prederr', PredictionError(model)),
        ('cvscores', CVScores(model, cv=cv, scoring='r2'))
    ])
Visual Pipelines

Machine learning is not particularly well-suited to object-oriented programming

    class Estimator(object):

        def fit(self, X, y=None):
            """
            Fits estimator to data.
            """
            # set state of self
            return self

        def predict(self, X):
            """
            Predict response of X
            """
            # compute predictions pred
            return pred


    class Transformer(Estimator):

        def transform(self, X):
            """
            Transforms the input data.
            """
            # transform X to X_prime
            return X_prime


    class Pipeline(Transformer):

        @property
        def named_steps(self):
            """
            Returns a sequence of estimators
            """
            return self.steps

        @property
        def _final_estimator(self):
            """
            Terminating estimator
            """
            return self.steps[-1]
The scikit-learn API
self.X
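
The outline above is schematic. A small runnable instance of the same fit/transform/predict convention, using only NumPy, might look like the sketch below; `MeanCenterer` and `MajorityClassifier` are hypothetical classes written for illustration and are not part of scikit-learn.

```python
import numpy as np


class MeanCenterer:
    """Transformer: learns column means in fit, subtracts them in transform."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows calls to be chained

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_


class MajorityClassifier:
    """Estimator: always predicts the most frequent label seen in fit."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = np.array([0, 1, 1])

X_centered = MeanCenterer().fit(X).transform(X)
predictions = MajorityClassifier().fit(X_centered, y).predict(X_centered)
print(X_centered.mean(axis=0), predictions)
```

All learned state lives on the object after `fit`, which is what lets a pipeline treat every step the same way.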

    class Visualizer(Estimator):

        def draw(self):
            """
            Draw called from scikit-learn methods.
            """
            return self.ax

        def finalize(self):
            self.set_title()
            self.legend()

        def poof(self):
            self.finalize()
            plt.show()


    import matplotlib.pyplot as plt
    from yellowbrick.base import Visualizer


    class MyVisualizer(Visualizer):

        def __init__(self, ax=None, **kwargs):
            super(MyVisualizer, self).__init__(ax=ax, **kwargs)

        def fit(self, X, y=None):
            self.draw(X)
            return self

        def draw(self, X):
            # fall back to the current matplotlib axes if none was given
            if self.ax is None:
                self.ax = plt.gca()
            self.ax.plot(X)

        def finalize(self):
            self.set_title("My Visualizer")
The Yellowbrick API
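
The same fit, draw, finalize flow can be tried without Yellowbrick installed. The sketch below uses a hypothetical `MiniVisualizer` class that only mimics the shape of the API; it is not Yellowbrick's `Visualizer` base class.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np


class MiniVisualizer:
    """Hypothetical stand-in for a visualizer: an estimator that owns an axes."""

    def __init__(self, ax=None, title="Mini Visualizer"):
        self.ax = ax
        self.title = title

    def fit(self, X, y=None):
        # Drawing happens as a side effect of the scikit-learn style call.
        self.draw(X)
        return self

    def draw(self, X):
        if self.ax is None:
            self.ax = plt.gca()
        self.ax.plot(X)
        return self.ax

    def finalize(self):
        self.ax.set_title(self.title)
        return self.ax


fig, (left, right) = plt.subplots(ncols=2)
X = np.arange(10) ** 2

# Passing ax places the visualizer in one panel of a larger figure.
viz = MiniVisualizer(ax=right).fit(X)
viz.finalize()
print(right.get_title(), len(right.lines), len(left.lines))
```

Because the caller supplies the axes, the left panel stays untouched and remains available for another plot.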

A tool for students
vs.
A tool for practitioners?

Yellowbrick Quick Methods
    from sklearn.linear_model import Lasso
    from yellowbrick.regressor import ResidualsPlot

    # Option 1: scikit-learn style
    viz = ResidualsPlot(Lasso())
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)
    viz.poof()

    from sklearn.linear_model import Lasso
    from yellowbrick.regressor import residuals_plot

    # Option 2: Quick Method
    viz = residuals_plot(
        Lasso(), X_train, y_train, X_test, y_test
    )
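
How a quick method relates to the object-oriented style can be sketched as a thin wrapper function. The example below is an assumption about the general pattern, not Yellowbrick's source; `ResidualsSketch` and `residuals_sketch` are hypothetical names, and the data is synthetic.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso


class ResidualsSketch:
    """Hypothetical visualizer: fits a regressor, then plots test residuals."""

    def __init__(self, model, ax=None):
        self.model = model
        self.ax = ax if ax is not None else plt.gca()

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def score(self, X, y):
        predicted = self.model.predict(X)
        self.ax.scatter(predicted, y - predicted, alpha=0.5)
        return self.model.score(X, y)

    def finalize(self):
        self.ax.axhline(0, color="black", linewidth=1)
        self.ax.set_xlabel("Predicted value")
        self.ax.set_ylabel("Residual")


def residuals_sketch(model, X_train, y_train, X_test, y_test, ax=None):
    """Quick method: one call that runs fit, score, and finalize."""
    viz = ResidualsSketch(model, ax=ax)
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)
    viz.finalize()
    return viz


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

fig, ax = plt.subplots()
viz = residuals_sketch(Lasso(alpha=0.01), X[:150], y[:150], X[150:], y[150:], ax=ax)
print(ax.get_xlabel(), len(ax.collections))
```

Returning the visualizer keeps the fitted model and the axes reachable, so the one-line form does not close off further customization.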

.. plot::
    :context: close-figs
    :include-source: False
    :alt: Recursive Feature Elimination

    from sklearn.svm import SVC
    from sklearn.datasets import make_classification
    from yellowbrick.features import RFECV

    # Create a dataset with only 3 informative features
    X, y = make_classification(
        n_samples=1000, n_features=25, n_informative=3,
        n_redundant=2, n_repeated=0, n_classes=8,
        n_clusters_per_class=1, random_state=0
    )

    viz = RFECV(SVC(kernel='linear', C=1))
    viz.fit(X, y)
    viz.poof()
The Plot Directive
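
The `.. plot::` directive is provided by matplotlib's Sphinx extension, so it has to be enabled in the documentation's `conf.py`. The fragment below sketches a typical setup; the option values are assumptions for illustration, not Yellowbrick's actual configuration.

```python
# conf.py (fragment)
extensions = [
    "matplotlib.sphinxext.plot_directive",
]

# Assumed project-wide defaults for plot directives. An individual
# directive can still override them with options such as :include-source:.
plot_include_source = True
plot_html_show_source_link = False
plot_formats = ["png"]
```

With this in place, the figures in the documentation are regenerated from the example code on every build, so they cannot drift out of date.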

=========================================== test session starts ============================================
platform darwin -- Python 3.7.1, pytest-5.0.0, py-1.8.0, pluggy-0.12.0
rootdir: /Users/rbilbro/pyjects/yb, inifile: setup.cfg
plugins: flakes-4.0.0, cov-2.7.1
collected 932 items
tests/__init__.py s... [ 0%]
tests/base.py s [ 0%]
tests/conftest.py s [ 0%]
tests/fixtures.py s [ 0%]
tests/images.py s [ 0%]
tests/rand.py s [ 0%]
tests/test_base.py s............ [ 2%]
...........................................................................................................
...........................................................................................................
...........................................................................................................
...........................................................................................................
tests/test_utils/test_target.py s............ [ 68%]
tests/test_utils/test_timer.py s..... [ 68%]
tests/test_utils/test_types.py s.................................................................... [ 70%]
....x................................x.............................................................. [ 72%]
.... [ 73%]
tests/test_utils/test_wrapper.py s....
===================== 854 passed, 72 skipped, 6 xfailed, 33 warnings in 225.96 seconds =====================
Also Testing

Machine-learning oriented aggregation
[Figure panels: YB (current); Seaborn]

Brushing and Filtering
[Figure panels: OK for only 5 features; not good for 23 features]

Parallelization with joblib
[Figure panels: Elbow Curve; Validation Curve]
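
An elbow curve needs one independent model fit per candidate number of clusters, which makes it a natural fit for joblib. The following is a minimal sketch of that idea using scikit-learn's `KMeans` on synthetic blobs; it is not Yellowbrick's implementation, and `inertia_for_k` is a hypothetical helper.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_for_k(X, k, seed=0):
    """Fit k-means with k clusters and return the within-cluster sum of squares."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    model.fit(X)
    return model.inertia_


X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
ks = list(range(1, 9))

# Each value of k is an independent fit, so the fits can run concurrently.
# Threads are used here for portability; processes are joblib's default.
inertias = Parallel(n_jobs=2, prefer="threads")(
    delayed(inertia_for_k)(X, k) for k in ks
)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

`Parallel` returns results in the order the tasks were submitted, so `ks` and `inertias` can be plotted against each other directly.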

Figures & Axes
YB wraps a matplotlib axes.Axes object
● Visualizers behave as part of a larger figure
● Make multi-axis plots for publications, etc.
● Give users control over size, style, interaction
But what to do as visualizers become more complex, e.g. multi-axis in their own right?
➔ AxesGrid Toolkit (e.g. make_axes_locatable)
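
As a sketch of what the AxesGrid approach could look like, the example below starts from a single user-supplied axes and uses `make_axes_locatable` to attach a histogram panel beside a scatter plot. The data is synthetic and the layout values are arbitrary choices for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable

rng = np.random.default_rng(0)
predicted = rng.uniform(0, 10, size=300)
residuals = rng.normal(scale=1.0, size=300)

# The caller still hands over a single axes, as with any visualizer.
fig, ax = plt.subplots()
ax.scatter(predicted, residuals, alpha=0.5)

# Carve a second axes out of the space occupied by the first one.
divider = make_axes_locatable(ax)
hist_ax = divider.append_axes("right", size="20%", pad=0.1, sharey=ax)
hist_ax.hist(residuals, bins=30, orientation="horizontal")
hist_ax.tick_params(labelleft=False)

print(len(fig.axes))
```

The visualizer keeps its one-axes interface from the user's point of view, while internally managing more than one panel.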

Other places we’re looking
● Altair
● Bokeh
● Pandas
● Seaborn
● Datashader
● ...suggestions?

● ML experimentation is in tension with time, $$$, reality.
● Human-driven steering is useful for data of any size.
● The stakes are much higher for big data.
● Scikit-YB supports visual steering via Visualizer objects.
● Wrapping both scikit-learn and Matplotlib APIs is tricky!
● The path forward involves optimized aggregations, including zoom-and-filter, brushing, parallelization, and multi-axis plotting.
Main Points
