This document discusses practical machine learning in Python. It introduces the SluggerML baseball analytics project, which uses machine learning to predict the probability of home runs and strikeouts given various game and player features. Scikit-learn is identified as a good Python machine learning toolkit due to its speed, transparency, and large community support. The document outlines SluggerML's approach for gathering baseball play-by-play data, selecting predictive features, building classifiers, and providing real-time predictions through a web interface.
2. Practical Machine Learning in Python 2
This is the Age of Aquarius Data
• Data is plentiful
• application logs
• external APIs
• Facebook, Twitter
• public datasets
• Analysis adds value
• understanding your users
• dynamic application decisions
• Storage / CPU time is cheap
Machine Learning in Python
• Python is well-suited for data analysis
• Versatile
• quick and dirty scripts
• full-featured, realtime applications
• Mature ML packages
• tons of choices (see: mloss.org)
• plug-and-play or DIY
Classification Problem: Terminology
• Data points
• feature set: “interesting” facts about an event/thing
• label: a description of that event/thing
• Classification
• training set: a bunch of labeled feature sets
• given a training set, build a classifier to predict labels for unlabeled feature sets
SluggerML
• Two questions
• What features are strong predictors for home runs and strikeouts?
• Given a particular situation, with what probability will the batter hit a home run or strike out?
• Feature sets represent game state for a plate appearance
• game: day vs. night, wind direction...
• at-bat: inning, #strikes, left-right matchup...
• batter/pitcher: age, weight, fielding position...
• Labels represent outcome
• HR (home run), K (strikeout), OTHER
• Poor Man’s Sabermetrics
SluggerML: Example
• Training set
• {game_daynight: day, batter_age: 24, pitcher_weight: 211}
• label: HR
• {game_daynight: day, batter_age: 36, pitcher_weight: 242}
• label: K
• {game_daynight: night, batter_age: 27, pitcher_weight: 195}
• label: OTHER
• Classifier predictions
• {game_daynight: night, batter_age: 36, pitcher_weight: 225}
• 2.6% HR 15.6% K
• {game_daynight: day, batter_age: 20, pitcher_weight: 216}
• 2.2% HR 19.1% K
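To make the example concrete, here is a minimal from-scratch naive Bayes sketch over the three training examples above. It is a stand-in for illustration, not the SluggerML code or nltk's implementation: it assumes every feature value is treated as a category and uses add-one smoothing, and with only three examples its probabilities will not match the percentages on the slide.

```python
from collections import Counter, defaultdict

# Toy training set copied from the slide: (feature set, label) pairs.
TRAINING_SET = [
    ({"game_daynight": "day", "batter_age": 24, "pitcher_weight": 211}, "HR"),
    ({"game_daynight": "day", "batter_age": 36, "pitcher_weight": 242}, "K"),
    ({"game_daynight": "night", "batter_age": 27, "pitcher_weight": 195}, "OTHER"),
]


def train(labeled_feature_sets):
    """Count labels and (feature, value) occurrences per label in one pass."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    values_seen = defaultdict(set)       # feature -> set of observed values
    for features, label in labeled_feature_sets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label][(name, value)] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen


def predict_proba(model, features):
    """Return {label: probability} using add-one (Laplace) smoothing."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(label)
        for name, value in features.items():
            # Number of distinct values for this feature, counting the
            # queried value even if it never appeared in training.
            n_values = len(values_seen[name] | {value})
            score *= (value_counts[label][(name, value)] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


model = train(TRAINING_SET)
probabilities = predict_proba(
    model, {"game_daynight": "night", "batter_age": 36, "pitcher_weight": 225}
)
```

Treating ages and weights as categories only works once there is enough data per value; bucketing them into ranges would be a natural next step.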
SluggerML: Gathering Data
• Sources
• Retrosheet
• play-by-play logs for every game since 1956
• Sean Lahman’s Baseball Archive
• detailed stats about individual players
• Coalescing
• 1st pass, Lahman: create player database
• shelve module
• 2nd pass, Retrosheet: track game state, join on player db
• Scrubbing
• ensure consistency
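The two-pass coalescing step might look roughly like the sketch below. The record layouts and field names (`player_id`, `birth_year`, `batter`, and so on) are hypothetical simplifications; the real Lahman and Retrosheet files have different formats and many more fields.

```python
import os
import shelve
import tempfile

# Hypothetical, simplified records standing in for the real data files.
LAHMAN_ROWS = [
    {"player_id": "bat001", "birth_year": 1975, "weight": 190},
    {"player_id": "pit001", "birth_year": 1980, "weight": 225},
]
RETROSHEET_PLAYS = [
    {"year": 2011, "inning": 3, "batter": "bat001", "pitcher": "pit001", "outcome": "HR"},
    {"year": 2011, "inning": 4, "batter": "bat001", "pitcher": "pit001", "outcome": "K"},
]


def build_player_db(path, rows):
    """First pass: store one record per player, keyed by player id."""
    with shelve.open(path) as db:
        for row in rows:
            db[row["player_id"]] = row


def labeled_feature_sets(path, plays):
    """Second pass: join each play on the player db, yield (features, label)."""
    with shelve.open(path) as db:
        for play in plays:
            batter = db[play["batter"]]
            pitcher = db[play["pitcher"]]
            features = {
                "at_bat_inning": play["inning"],
                "batter_age": play["year"] - batter["birth_year"],
                "pitcher_weight": pitcher["weight"],
            }
            yield features, play["outcome"]


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "players")
    build_player_db(db_path, LAHMAN_ROWS)
    examples = list(labeled_feature_sets(db_path, RETROSHEET_PLAYS))
```

Keeping the player records in a shelf means the second pass only holds one play and two player records in memory at a time.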
SluggerML: Gathering Data
• Training set
• regular-season games from 1980-2011
• 5,669,301 plate appearances
• 135,602 home runs
• 871,226 strikeouts
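As a quick worked calculation, the counts above give the overall base rates for each label. Only the three counts come from the slide; the variable names are illustrative.

```python
# Counts from the training set slide.
PLATE_APPEARANCES = 5_669_301
HOME_RUNS = 135_602
STRIKEOUTS = 871_226

# Everything that is neither a home run nor a strikeout is labeled OTHER.
other = PLATE_APPEARANCES - HOME_RUNS - STRIKEOUTS

p_hr = HOME_RUNS / PLATE_APPEARANCES
p_k = STRIKEOUTS / PLATE_APPEARANCES
p_other = other / PLATE_APPEARANCES

print(f"P(HR)    = {p_hr:.2%}")     # P(HR)    = 2.39%
print(f"P(K)     = {p_k:.2%}")      # P(K)     = 15.37%
print(f"P(OTHER) = {p_other:.2%}")  # P(OTHER) = 82.24%
```

These base rates are the yardstick for the per-situation predictions: a prediction is only interesting to the extent that it moves away from them.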
Selecting a Toolkit: Tradeoffs
• Speed
• offline vs. realtime
• Transparency
• internal visibility
• customizability
• Support
• maturity
• community
Selecting a Toolkit: High-Level Options
• External bindings
• python interfaces to popular packages
• Matlab, R, Octave, SHOGUN Toolbox
• transition legacy workflows
• Python implementations
• collections of algorithms
• (mostly) python
• external subcomponents
• DIY
• building blocks
Selecting a Toolkit: Python Implementations
• nltk
• focus on NLP
• book: Natural Language Processing with Python (O’Reilly ’09)
• mlpy
• regression, classification, clustering
• PyML
• focus on SVM
• PyBrain
• focus on neural networks
Selecting a Toolkit: Python Implementations
• mdp-toolkit
• data processing management
• nodes represent tasks in a data workflow
• scheduling, parallelization
• scikit-learn
• supervised, unsupervised, feature selection, visualization
• heavy development, large team
• excellent documentation
• active community
Selecting a Toolkit: Do It Yourself
• Basic building blocks
• NumPy
• SciPy
• C/C++ implementations
• LIBLINEAR
• LIBSVM
• OpenCV
• ...your own?
SluggerML: Two Questions
• What features are strong predictors for home runs and strikeouts?
• Given a particular situation, with what probability will the batter hit a home run or strike out?
SluggerML: Feature Selection
• Identifies predictive features
• strongly correlated with labels
• predictive: max_benchpress
• not predictive: favorite_cookie
• scikit-learn: chi-square feature selection
• Visualizing significance
• for each well-supported value, find correlation with HR/K
• “well-supported”: >= 0.05% of samples with feature=value
• correlation: ( P(HR | feature=value) / P(HR) ) - 1
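The significance measure above can be sketched in a few lines of plain Python. The function name, the `batter_team` feature, and the sample counts are made up for illustration; only the formula and the 0.05% support threshold come from the slide.

```python
from collections import Counter


def label_lift(labeled_feature_sets, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1}.

    Only "well-supported" values are kept: those appearing in at least
    min_support (0.05% by default) of all samples.
    """
    total = 0
    label_total = 0
    value_counts = Counter()
    value_label_counts = Counter()
    for features, observed in labeled_feature_sets:  # single pass
        total += 1
        hit = int(observed == label)
        label_total += hit
        if feature in features:
            value = features[feature]
            value_counts[value] += 1
            value_label_counts[value] += hit
    if total == 0 or label_total == 0:
        return {}
    p_label = label_total / total
    return {
        value: (value_label_counts[value] / count) / p_label - 1
        for value, count in value_counts.items()
        if count / total >= min_support
    }


# Made-up sample: 3 HR in 10 home plate appearances, 1 HR in 10 visiting.
samples = (
    [({"batter_team": "home"}, "HR")] * 3
    + [({"batter_team": "home"}, "OTHER")] * 7
    + [({"batter_team": "visiting"}, "HR")] * 1
    + [({"batter_team": "visiting"}, "OTHER")] * 9
)
lift = label_lift(samples, "batter_team", "HR")
```

In this made-up sample P(HR) is 0.2, so the home value comes out near +50% and the visiting value near -50%. The function reads its input once, so it also accepts a generator.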
SluggerML: Feature Selection
Batter: Home vs. Visiting
[Chart: correlation with Home Run and Strikeout, y-axis from -50.0% to 50.0%, for home team vs. visiting team]
SluggerML: Feature Selection
Batter: Fielding Position
[Chart: correlation with Home Run and Strikeout, y-axis from -50.0% to 50.0%, for positions P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH]
SluggerML: Feature Selection
Game: Year
[Chart: correlation with Home Run and Strikeout, y-axis from -50.0% to 50.0%, for 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011]
SluggerML: Realtime Classification
• Given features, predict label probabilities
• nltk: NaiveBayesClassifier
• Web frontend
• gunicorn, nginx
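A web frontend of this kind could be as small as one WSGI callable. The sketch below is an assumption about the shape, not the SluggerML code: `predict_proba` is a hypothetical stub returning fixed placeholder numbers where a trained classifier would be consulted, and features are assumed to arrive as query-string parameters.

```python
import json
from urllib.parse import parse_qsl


def predict_proba(features):
    """Hypothetical stub; a real app would query a classifier loaded at startup."""
    return {"HR": 0.026, "K": 0.156, "OTHER": 0.818}  # placeholder values


def app(environ, start_response):
    """WSGI entry point: query-string features in, JSON label probabilities out."""
    features = dict(parse_qsl(environ.get("QUERY_STRING", "")))
    body = json.dumps(predict_proba(features)).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

If this were saved as a module named, say, `sluggerml_web.py` (a made-up name), gunicorn could serve it with `gunicorn sluggerml_web:app`, with nginx proxying requests in front of it.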
Tips and Tricks
• Persistent classifier internals
• once trained, save and reuse
• depends on implementation
• string representation may exist
• create your own
• Using generators where possible
• avoid keeping data in memory
• single-pass algorithms
• conversion pass before training
• Multicore text processing
• scrubbing: low memory footprint
• multiprocessing module
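The first two tips can be sketched together: stream feature sets through a generator, train in a single pass, then pickle the result for reuse. The tab-separated line format and the label-count "model" are hypothetical placeholders for whatever your toolkit actually produces.

```python
import os
import pickle
import tempfile
from collections import Counter

# Hypothetical input format: "label<TAB>name=value,name=value".
LINES = [
    "HR\tgame_daynight=day,batter_age=24\n",
    "K\tgame_daynight=day,batter_age=36\n",
    "K\tgame_daynight=night,batter_age=27\n",
]


def read_feature_sets(lines):
    """Lazily parse lines into (features, label) pairs, one at a time."""
    for line in lines:
        label, _, raw = line.rstrip("\n").partition("\t")
        features = dict(pair.split("=", 1) for pair in raw.split(",") if pair)
        yield features, label


def save_model(model, path):
    """Persist trained internals so the web app does not retrain on startup."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load previously saved internals; only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Single-pass "training": here just label counts, consumed from the generator.
model = Counter(label for _, label in read_feature_sets(LINES))

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pickle")
    save_model(model, model_path)
    restored = load_model(model_path)
```

In practice `lines` would be an open file object, so the data never has to fit in memory at once.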
The Fine Print™
• Plug-and-play is easy!
• Don’t blindly apply ML
• understand your data
• understand your algorithms
• ml-class.org is an excellent resource
Data is everywhere: clickstream data, for one. Users are bad at managing FB permissions, so you can get a lot out of the Graph API. There’s value in learning about data: how people use your site, feature or advertisement personalization. One thing that enables this is that resources are cheap these days.
Python is a fantastic programming environment for data processing and analytics. On one end of the spectrum, quick and dirty scripts; on the other, full-featured applications ready for deployment at scale. There’s a wide variety of toolkits for off-the-shelf analysis or building out your own data processing applications.
For this talk, I’m discussing one flavor of analytics and machine learning: the classification problem. Intuition: the training set is what you know about the world; you train a classifier to predict the things that you don’t.
As a concrete example, I started playing around with some baseball stats to illustrate how one might go about building ML applications in Python. Even if you’re not into baseball, you know that the iconic visions of success and failure are the home run and the strikeout. In all the movies, hitting a home run is equivalent to getting the girl, and striking out is seen as a major setback.
As with any machine learning problem, you want to get your data into a classifier-consumable format. That is, labeled feature sets. For each play in the game, keep track of the game state and output a labeled feature bundle representing the situation and its outcome: HR, K, (other)
Speed: offline means a deadline of hours or days; realtime means a user is waiting on the other side (user actions => milliseconds). Transparency: seeing what’s going on with an algorithm in case the docs aren’t clear; modifying or patching an algorithm to meet your needs. Support: maturity and active development. How strong is the community around the project? Are there tutorials available?
External bindings: interface with external packages if you’ve done some analysis already and want to transition to Python without throwing away code. Python toolkits provide sets of algorithms, mostly Python implementations; they often use external packages with C bindings, and some even use other toolkits. DIY: use the external packages yourself.
To give a sampling of what’s available, I chose some toolkits that were last updated within a year. As a disclaimer: this is not exhaustive, just a sampling; some of these tools I’ve used, some I haven’t; I’m sure I’ve missed your favorite, and for that I apologize. Different packages focus on different things, so one isn’t necessarily going to suit all of your needs.
There was buzz around scikit-learn last year; I checked it out recently and it’s been built out a lot.
NumPy: fast and efficient arrays. SciPy: scientific tools and algorithms built on NumPy. You can also use popular C/C++ implementations through Python bindings. Python is a modular language, so you can always sub out your implementation without disrupting your workflow too much. Now, as an example of applying these toolkits...
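First question (which features are predictive): speed isn’t critical. Second question (predicting a given situation): speed is critical (imagine that you’re a coach). Baseball is slow, but it’s not THAT slow.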
Feature selection identifies predictive features: certain values are strongly correlated with certain labels. sklearn: I wasn’t clear on the documented usage, so I looked at the code.
for a coach
Don’t we need to train our classifier to run our web application? Save the internals on disk! Pickle them or pull out a textual representation (another argument for using a package that allows you to do this). Why compute things twice? Use generators: with lots and lots of data, avoid keeping it all in memory. Prefer single-pass algorithms (Bayes), or do a first-pass conversion to compact data (NumPy vectors, not Python objects). It’s not always possible, but keep it in mind. Take advantage of multiple cores: if your processing step has a minimal memory footprint (just one line at a time), do it on multiple cores. Either run multiple processes on different input files or use the multiprocessing module, which is great at this.
You don’t need to know everything about the algorithms you use, but you can’t just blindly apply these things and hope that they magically work. ml-class.org: a free class that provides an excellent foundation and starting point for understanding ML. In no time, you, too, can be a number muncher.
Source code for SluggerML is on GitHub; it’s kind of a mess, and I’m sorry about that. And I’m @mattspitz on the twitters.