We've been taught that "data science" is the esoteric domain of PhDs,
but like anything else, it's easy once you understand it. This talk
explains the basics of data science, covering concepts in supervised
learning (including a detailed explanation of decision trees and
random forests) as well as examples of unsupervised learning
algorithms. Far from being a dry and academic topic, data science and machine learning are useful and practical analytical tools. (This talk is intended for a general audience.)
Topics will include:
1) An introduction to supervised learning using the popular decision
tree algorithm
2) The concepts of training and scoring, and the meaning of "real time"
machine learning
3) Model validation using holdout sets
4) Model complexity and overfitting; understanding bias and variance;
using ensembles to reduce variance
5) An overview of unsupervised learning models including clustering,
topic modeling and anomaly detection
and more!
2. About me
• 10+ years experience in data science at various consumer
web companies
• Worked on web search at Yahoo and Microsoft
• Led the Mobile data science team at Groupon
• Joined BigML as VP Data Science in July 2013
• Joined JLL Spark as VP Data in July 2017
• Advisor to High Fidelity Genetics
3. Finding meaningful patterns in data
• The famous “Iris” data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
(Photos: Iris setosa, Iris versicolor, Iris virginica)
7. (Scatter plot of Petal Width (cm) vs. Petal Length (cm); four new points are placed on the model’s decision regions, with predictions of Iris setosa, Iris versicolor, and Iris virginica, plus a second Iris virginica.)
Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.
8. Training versus Scoring
• This process had two steps: training and scoring
• When training on historical data, you’re using data gathered over
some length of time
• When scoring new data points, you want the answer immediately
(in “real time”)
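As a concrete sketch of these two steps (scikit-learn is my assumption here; the talk’s live demos use BigML):

```python
# Training then scoring, as a minimal sketch.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Training: fit a decision tree to the 150 historical flower measurements.
model = DecisionTreeClassifier()
model.fit(iris.data, iris.target)

# Scoring: a new flower gets a prediction immediately ("real time").
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length/width, petal length/width (cm)
print(iris.target_names[model.predict(new_flower)[0]])  # -> 'setosa'
```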
9. (Scatter plot with two highlighted rectangles: a large one that predicts “blue” with high confidence and explains a large chunk of the data (high support), and a small one that predicts “blue” with low confidence and explains a small chunk of the data (low support).)
10. Support and Confidence
• A rectangle with a large number of data points has high “support”
• A rectangle that is purely one color has high “confidence”
• If there is a small number of data points, confidence is low even if
it’s purely one color
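One standard way to make confidence shrink with sample size is the lower bound of the Wilson score interval; the talk doesn’t name a formula, so treat this sketch as an illustrative assumption:

```python
import math

def confidence(majority, n, z=1.96):
    """Wilson score lower bound: purity, discounted when support is small."""
    if n == 0:
        return 0.0
    p = majority / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)

print(confidence(50, 50))  # large, pure rectangle -> ~0.93 (high confidence)
print(confidence(3, 3))    # tiny, pure rectangle  -> ~0.44 (low confidence)
```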
11. “Decision Tree”
(The Petal Width (cm) vs. Petal Length (cm) scatter plot, redrawn as a tree; reconstructed from the figure:)

50 red, 50 blue, 50 green
  Width <= 0.8  ->  50 red  (“leaf node”)
  Width > 0.8   ->  50 blue, 50 green
    Width > 1.75   ->  45 blue  (“leaf node”)
    Width <= 1.75  ->  5 blue, 50 green
      Length <= 5  ->  1 blue, 48 green  (“leaf node”)
      Length > 5   ->  4 blue, 2 green  (“leaf node”)
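This tree isn’t hand-picked: training on the Iris petal measurements with, for example, scikit-learn (my assumption about tooling) recovers essentially the same splits, including the 0.8 and 1.75 thresholds:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width only, as in the figure

tree = DecisionTreeClassifier(max_depth=3).fit(X, iris.target)
# Print the learned splits as text.
print(export_text(tree, feature_names=["petal length (cm)", "petal width (cm)"]))
```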
12. • Data is just a table of values
• Each row is an instance, an example
of the concept to be learned
• Each column is an attribute or
feature of the instance
• The column we want to predict is the
label or output
• Because we have a label, this is
supervised learning
(Figure: the data table, with two rows tagged “instance” and columns tagged “feature” and “label”.)
13. Demo: The General Social Survey
• Sociology survey given in the United States since 1972
• Data is 39,000 responses, almost 400 questions each
• Demographic data like income, race, gender, education, marital status
• Many questions about personal beliefs
• “Should an atheist be allowed to teach college, or not?”
• “Are we spending the right amount of money on education?”
• Can we predict income from these responses?
14. How good is our model?
• The model looks good, but how do we quantify this?
15. (Diagram: 100% of the data is split into an 80% training set and a 20% holdout set.)
1. Train a model using the 80% training set
2. Pretend the 20% holdout is new data, and feed it to the model
3. Check the accuracy of the predictions
In the example shown, 3 out of 4 holdout predictions are correct, so accuracy = 75%.
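The same three steps in code (scikit-learn assumed):

```python
# A holdout evaluation sketch.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# 1. Split: 80% for training, 20% held out.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# 2. Train on the 80%, then pretend the 20% is brand-new data.
model = DecisionTreeClassifier().fit(X_train, y_train)
predictions = model.predict(X_test)

# 3. Check the accuracy of the predictions.
print(accuracy_score(y_test, predictions))
```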
16. Predicting political views
• What happens if we predict political views instead of income?
• A different subset of variables becomes important!
20. The Value of Predictive Modeling
• Provides deep insight into your data
• Finds the small subset of important variables
• Extremely useful for business!
21. Demo: The StumbleUpon Dataset
• StumbleUpon is an app that recommends web pages
• Dataset of 7,400 web pages is provided, with each page labeled as
either “evergreen” or “ephemeral”
• We want to predict the page’s class using this historical data
While some pages we recommend, such as news
articles or seasonal recipes, are only relevant for a
short period of time, others maintain a timeless
quality and can be recommended to users long after
they are discovered. In other words, pages can
either be classified as "ephemeral" or "evergreen".
22. Training a model on StumbleUpon data
• Live demo: training a model on StumbleUpon data
• Key concepts:
• “Bag of words” text analysis
• Evaluating the model using a holdout set
• Combining multiple models to improve accuracy
• The “ensemble” of multiple models has better accuracy!
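A sketch of the “bag of words” step (scikit-learn assumed; the three toy pages below are invented for illustration, not taken from the dataset):

```python
# "Bag of words": turn raw page text into word counts a model can split on.
from sklearn.feature_extraction.text import CountVectorizer

pages = [
    "easy chocolate chip cookie recipe",      # evergreen
    "election night live results tonight",    # ephemeral
    "how to tie a bow tie step by step",      # evergreen
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)  # one row per page, one column per word

print(vectorizer.get_feature_names_out())
print(X.toarray())  # these counts become the features for a classifier
```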
23. “Ensembles” of Models
• Training multiple models on random subsets of the data gave us a
better result!
• Why?
24. Bias and Variance
• We train a model with the goal of fitting it correctly to the data
• When a model isn’t flexible enough, it may underfit the data, and we
say it has high bias
• When a model is too flexible, it may overfit the data, and we say it
has high variance
For a formal definition of bias and variance, see
Thomas Dietterich’s paper on the subject
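A quick way to see both failure modes is to sweep a tree’s depth and compare training accuracy against holdout accuracy (my own sketch, scikit-learn assumed):

```python
# Model flexibility vs. overfitting, via tree depth.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2, random_state=0)

for depth in (1, 3, None):  # too rigid, about right, unlimited
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # High bias: poor scores on both sets. High variance: a big gap between them.
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```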
27. Decision trees have high variance
• Decision trees can represent complex functions
• But they are prone to overfitting; they have high variance
• If you draw enough lines, you can create a “model” that just
memorizes the dataset!
29. Decision trees have high variance
• We can reduce this problem by:
• Taking several random samples from the original data set
• Training a decision tree on each sample
• Having these trees vote on the class
• Goal: Get the expressiveness of a decision tree, with less overfitting
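Those three bullets are the whole recipe; a hand-rolled sketch (numpy and scikit-learn assumed):

```python
# A bare-bones "ensemble of trees" (bagging) sketch.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train each tree on its own random sample, drawn with replacement.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Score by majority vote across the trees.
new_flower = [[6.1, 2.8, 4.7, 1.2]]
votes = [int(t.predict(new_flower)[0]) for t in trees]
print(np.bincount(votes).argmax())  # the winning class
```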
37. Benefits of a Decision Tree Ensemble
• Voted boundary is more accurate than for a single tree
• “Best of both worlds”: Get most of the expressiveness of decision
trees with lower variance
• We’re actually taking advantage of the variance by feeding a different
random sample to each tree and seeing what happens!
38. Why draw straight lines in decision trees?
• Imagine you have 400 variables in your dataset
• You only need to examine 400 variables to draw
the “best” axis-parallel line between the dots
• If you want a diagonal line in two dimensions,
there are (400 choose 2) or 79,800
combinations of variables to examine
• Some biology datasets have 100,000 variables!
• (100,000 choose 2) = 4,999,950,000
combinations of 2 variables!
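The combinatorics, checked directly:

```python
# Verifying the pair counts from the slide.
from math import comb

print(comb(400, 2))      # 79,800 pairs of variables
print(comb(100_000, 2))  # 4,999,950,000 pairs
```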
39. Popular algorithms for supervised learning
• We got pretty deep into Decision Trees and ensembles of trees
• Other popular algorithms for supervised learning:
• Support Vector Machines
• Neural Nets (“Deep Learning”)
• Check out BigML’s automated deep learning!
40. Recap: Supervised Learning Topics
• Definition of supervised learning
• Training and scoring a model
• Support and confidence
• Model evaluation using a holdout set
• Bias and variance, underfitting and overfitting
• Using ensembles to improve models
• … And a whole lot about decision trees!
42. What if we don’t have labels?
• Can we still get insight into our data if we don’t know the
colors of the dots?
• Since we don’t have labels, this is unsupervised learning
• Clustering: Find “clumps” of unlabeled data that might be interesting
• Anomaly detection: Find outliers in unlabeled data
• Topic Modeling: Identify topics in free text
43. Clustering
• Concept: Find “lumps” of data that exist in distinct clusters
• K-means clustering (a code sketch follows this list):
1. Choose the number of clusters k that you are looking for
2. Choose initial “centroids” for the clusters
3. Assign each data point to its closest centroid
4. Recompute each centroid as the mean of the points assigned to it
5. Repeat steps 3 and 4 until the k centroids stop moving
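A bare-bones implementation of those five steps (my own sketch, numpy assumed):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids settle.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two obvious blobs; k-means should find one centroid per blob.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```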
57. Demo: The Whisky Dataset
• Data on the flavors of 86 single-malt Scotch whiskies
• No labels, just a bunch of taste information
• Can we get insight into this dataset?
58. Demo: Breast Cancer Dataset
• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• We can train a highly accurate predictive model with this
data
59. Demo: Breast Cancer Dataset
• What if we remove the labels of “benign” and “malignant”?
60. (Figure: 10 lines are needed to isolate this data point, so it is not anomalous.)
61. (Figure: only 4 lines are needed to isolate this data point, so it is highly anomalous.)
62. Demo: Anomaly Detection
• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
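A sketch of that workflow (scikit-learn assumed; its bundled Wisconsin dataset has 569 biopsies rather than the 699 used in the demo):

```python
# Train an anomaly detector on unlabeled biopsy data, then keep the scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X, _ = load_breast_cancer(return_X_y=True)  # discard the labels on purpose

detector = IsolationForest(random_state=0).fit(X)
scores = -detector.score_samples(X)  # higher score = more anomalous

# These scores can act as "labels" for training a follow-up predictive model.
print(scores[:5])
```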
64. Minority Report
• Anomaly detection works great on large unlabeled datasets,
especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!
65. Topic Modeling using LDA
• Uncovers groups of related words (“topics”) in documents
• Does not require an external corpus (e.g. training on Wikipedia)
• No semantic parsing of text
• Unsupervised
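A toy sketch of LDA (scikit-learn assumed; the four documents are invented for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the kids play games with other children in the park",
    "stocks fell as the market reacted to interest rates",
    "children love playground games in the park",
    "investors watch the stock market and bond rates",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # bag-of-words counts
vocab = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for topic in lda.components_:  # each topic is a distribution over words
    print([vocab[i] for i in topic.argsort()[-4:]])  # its four heaviest words
```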
69. The (assumed) generative process
(Diagram of the LDA generative story, outermost to innermost:)
• A distribution over topic distributions, fixed for this corpus
• A distribution over topics, specific to each document
• A topic, which is a distribution over words (the figure shows Topic 1, Topic 2, Topic 3)
• A distribution over word distributions, fixed for this corpus
• A word in a document (the figure’s example word is “children”; it also shows Word 1, Word 2, Word 3)
76. How do we get such “good” topics?
• Imagine that each document can only belong to one topic
• Does that make it easier or harder to find “good” clusters of words?
• LDA allows documents to belong to multiple topics
77. Recap: Unsupervised Learning Topics
• Unsupervised learning uses unlabeled data
• Clustering: Finding clumps in unlabeled data
• Anomaly Detection: Finding “weird” instances in unlabeled data
• Topic Modeling: Extracting meaningful topics from free text
78. Final Thought
• Supervised learning has many different algorithms to solve one
problem (predicting the output)
• Unsupervised learning has many different algorithms to solve many
different problems
David Gerster
gerster@bigml.com