Practical Machine Learning in Python

Matt Spitz
via
@mattspitz

This is the Age of Aquarius Data
• Data is plentiful
• application logs
• external APIs
• Facebook, Twitter
• public datasets
• Analysis adds value
• understanding your users
• dynamic application decisions
• Storage / CPU time is cheap


Machine Learning in Python
• Python is well-suited for data analysis
• Versatile
• quick and dirty scripts
• full-featured, realtime applications
• Mature ML packages
• tons of choices (see: mloss.org)
• plug-and-play or DIY


Classification Problem: Terminology
• Data points
• feature set: “interesting” facts about an event/thing
• label: a description of that event/thing
• Classification
• training set: a bunch of labeled feature sets
• given a training set, build a classifier to predict labels for unlabeled feature sets


SluggerML
• Two questions
• What features are strong predictors for home runs and strikeouts?
• Given a particular situation, with what probability will the batter hit a home run or strike out?
• Feature sets represent game state for a plate appearance
• game: day vs. night, wind direction...
• at-bat: inning, #strikes, left-right matchup...
• batter/pitcher: age, weight, fielding position...
• Labels represent outcome
• HR (home run), K (strikeout), OTHER
• Poor Man’s Sabermetrics


SluggerML: Example
• Training set
• {game_daynight: day, batter_age: 24, pitcher_weight: 211}
• label: HR
• label: K
• {game_daynight: night, batter_age: 27, pitcher_weight: 195}
• label: OTHER
• Classifier predictions
• {game_daynight: night, batter_age: 36, pitcher_weight: 225}
• 2.6% HR 15.6% K
• 2.2% HR 19.1% K
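A minimal sketch of how this data could be laid out in Python: each training example is a (feature set, label) pair, and a classifier maps a feature set to label probabilities. The feature names and values are taken from the slide; the pairing of feature sets with labels and the helper `train_baseline` are assumptions made for illustration. The baseline ignores the features entirely, so it only demonstrates the interface, not a useful model.

```python
from collections import Counter

# Each training example is a (feature_set, label) pair.
# The pairing of these feature sets with these labels is illustrative.
training_set = [
    ({"game_daynight": "day", "batter_age": 24, "pitcher_weight": 211}, "HR"),
    ({"game_daynight": "night", "batter_age": 27, "pitcher_weight": 195}, "OTHER"),
]


def train_baseline(examples):
    """Return a classifier that ignores features and predicts label frequencies."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    priors = {label: n / total for label, n in counts.items()}

    def classify(feature_set):
        # A real classifier would condition on feature_set; this one does not.
        return dict(priors)

    return classify


classify = train_baseline(training_set)
prediction = classify(
    {"game_daynight": "night", "batter_age": 36, "pitcher_weight": 225}
)
print(prediction)
```

Any real classifier would replace the body of `classify` while keeping the same shape: feature set in, `{label: probability}` out.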


SluggerML: Gathering Data
• Sources
• Retrosheet
• play-by-play logs for every game since 1956
• Sean Lahman’s Baseball Archive
• detailed stats about individual players
• Coalescing
• 1st pass, Lahman: create player database
• shelve module
• 2nd pass, Retrosheet: track game state, join on player db
• Scrubbing
• ensure consistency


SluggerML: Gathering Data
• Training set
• regular-season games from 1980-2011
• 5,669,301 plate appearances
• 135,602 home runs
• 871,226 strikeouts
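These counts give the base rates that any per-situation prediction can be compared against. A quick calculation, using only the numbers on the slide:

```python
plate_appearances = 5_669_301
home_runs = 135_602
strikeouts = 871_226

# Base rates over the whole training set.
p_hr = home_runs / plate_appearances
p_k = strikeouts / plate_appearances
p_other = 1.0 - p_hr - p_k

print(f"P(HR)    = {p_hr:.2%}")
print(f"P(K)     = {p_k:.2%}")
print(f"P(OTHER) = {p_other:.2%}")
```

Roughly 2.4% of plate appearances end in a home run and roughly 15.4% in a strikeout, so a classifier that always answers OTHER is right about 82% of the time.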


Selecting a Toolkit: Tradeoffs
• Speed
• offline vs. realtime
• Transparency
• internal visibility
• customizability
• Support
• maturity
• community


Selecting a Toolkit: High-Level Options
• External bindings
• python interfaces to popular packages
• Matlab, R, Octave, SHOGUN Toolbox
• transition legacy workflows
• Python implementations
• collections of algorithms
• (mostly) python
• external subcomponents
• DIY
• building blocks


Selecting a Toolkit: Python Implementations
• nltk
• focus on NLP
• book: Natural Language Processing with Python (O’Reilly, 2009)
• mlpy
• regression, classification, clustering
• PyML
• focus on SVM
• PyBrain
• focus on neural networks


Selecting a Toolkit: Python Implementations
• mdp-toolkit
• data processing management
• nodes represent tasks in a data workflow
• scheduling, parallelization
• scikit-learn
• supervised, unsupervised, feature selection, visualization
• heavy development, large team
• excellent documentation
• active community


Selecting a Toolkit: Do It Yourself
• Basic building blocks
• NumPy
• SciPy
• C/C++ implementations
• LIBLINEAR
• LIBSVM
• OpenCV
• ...your own?


SluggerML: Two Questions
• What features are strong predictors for home runs and strikeouts?
• Given a particular situation, with what probability will the batter hit a home run or strike out?


SluggerML: Feature Selection
• Identifies predictive features
• strongly correlated with labels
• predictive: max_benchpress
• not predictive: favorite_cookie
• scikit-learn: chi-square feature selection
• Visualizing significance
• for each well-supported value, find correlation with HR/K
• “well-supported”: >= 0.05% of samples with feature=value
• correlation: ( P(HR | feature=value) / P(HR) ) - 1
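The correlation measure above can be computed in a single pass with plain counters. The sketch below assumes the (feature set, label) layout used for training data; the function name `value_correlations`, the feature name `batter_team`, and the ten toy examples are made up for illustration and are not results from the real dataset. The `min_support` default of 0.0005 corresponds to the 0.05% threshold on the slide.

```python
from collections import Counter


def value_correlations(examples, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1} for well-supported values."""
    total = len(examples)
    value_counts = Counter()   # how often each feature value occurs
    joint_counts = Counter()   # how often it occurs together with the label
    label_count = 0
    for features, outcome in examples:
        value = features[feature]
        value_counts[value] += 1
        if outcome == label:
            joint_counts[value] += 1
            label_count += 1
    if label_count == 0:
        return {}  # the label never occurs, so the ratio is undefined
    p_label = label_count / total
    result = {}
    for value, n in value_counts.items():
        if n / total < min_support:
            continue  # not "well-supported"
        result[value] = (joint_counts[value] / n) / p_label - 1
    return result


# Toy data only: 3 of 5 "home" examples and 1 of 5 "visiting" examples are HR.
toy = (
    [({"batter_team": "home"}, "HR")] * 3
    + [({"batter_team": "home"}, "OTHER")] * 2
    + [({"batter_team": "visiting"}, "HR")] * 1
    + [({"batter_team": "visiting"}, "OTHER")] * 4
)
correlations = value_correlations(toy, "batter_team", "HR")
print({value: round(c, 3) for value, c in correlations.items()})
```

A value of 0 means the feature value makes no difference to the label rate; +0.5 means the label is 50% more likely than the base rate, and -0.5 means 50% less likely.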


Batter: Home vs. Visiting
[Bar chart: correlation (y-axis, -50.0% to 50.0%) with Home Run and Strikeout, for home team vs. visiting team]


Batter: Fielding Position
[Bar chart: correlation (y-axis, -50.0% to 50.0%) with Home Run and Strikeout, by fielding position: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH]


Game: Temperature (˚F)
[Bar chart: correlation (y-axis, -50.0% to 50.0%) with Home Run and Strikeout, by temperature in 5-degree buckets from 35-39 to 100-104]


Game: Year
[Bar chart: correlation (y-axis, -50.0% to 50.0%) with Home Run and Strikeout, by period: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011]


SluggerML: Realtime Classification
• Given features, predict label probabilities
• nltk: NaiveBayesClassifier
• Web frontend
• gunicorn, nginx
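To show what "given features, predict label probabilities" involves, here is a small pure-Python naive Bayes with add-one smoothing. It is an illustrative stand-in, not nltk's implementation: the class `TinyNaiveBayes`, the feature name `strikes`, and the four training examples are assumptions. The method name `prob_classify` mirrors the one on nltk classifiers, but here it returns a plain dict.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self, examples):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values_seen = defaultdict(set)       # feature -> distinct values
        for features, label in examples:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values_seen[name].add(value)
        self.total = sum(self.label_counts.values())

    def prob_classify(self, features):
        """Return {label: P(label | features)} under the independence assumption."""
        log_scores = {}
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / self.total)
            for name, value in features.items():
                if name not in self.values_seen:
                    continue  # feature never seen in training: ignore it
                count = self.value_counts[(label, name)][value]
                n_values = len(self.values_seen[name])
                score += math.log((count + 1) / (n_label + n_values))
            log_scores[label] = score
        # Normalize in log space to avoid underflow with many features.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Toy training data, invented for this example.
train = [
    ({"game_daynight": "day", "strikes": 0}, "HR"),
    ({"game_daynight": "day", "strikes": 2}, "K"),
    ({"game_daynight": "night", "strikes": 2}, "K"),
    ({"game_daynight": "night", "strikes": 1}, "OTHER"),
]
clf = TinyNaiveBayes(train)
probs = clf.prob_classify({"game_daynight": "night", "strikes": 2})
print({label: round(p, 3) for label, p in probs.items()})
```

For a realtime service, the classifier would be trained once at startup (or loaded from disk) and `prob_classify` called per request, since prediction only touches the stored counts.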


Tips and Tricks
• Persistent classifier internals
• once trained, save and reuse
• depends on implementation
• string representation may exist
• create your own
• Using generators where possible
• avoid keeping data in memory
• single-pass algorithms
• conversion pass before training
• Multicore text processing
• scrubbing: low memory footprint
• multiprocessing module


The Fine Print™
• Plug-and-play is easy!
• Don’t blindly apply ML
• understand your data
• understand your algorithms
• ml-class.org is an excellent resource


Thanks!
github.com/mattspitz/sluggerml
slideshare.net/mattspitz/practical-machine-learning-in-python

@mattspitz
