1. Small Data Machine Learning
Andrei Zmievski
The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic.
Questions - now and later
2. WORK
We are all superheroes, because we help our customers keep their mission-critical apps
running smoothly. If interested, I can show you a demo of what I’m working on. Come find
me.
11. @a
For those of you who don’t know me…
Acquired the @a account in October 2008.
Had a different account earlier, but then @k asked if I wanted it…
I know many other single-letter Twitterers.
24. REPLYCLEANER
Even with false negatives, it reduces the garbage to the point where visual filtering is possible
- uses a trained model to classify tweets into good/bad
- blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
34. “Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)
Machine learning concerns the construction and study of systems that can learn from data.
40. supervised vs. unsupervised
Supervised: the dataset is labeled with the correct answers.
Unsupervised: no labels in the dataset; the algorithm needs to find structure on its own. Example: clustering.
We will be talking about classification, a supervised learning process.
44–49. features · parameters = prediction
features: X = [1, # of rooms, sq. m, house age, yard?]
parameters: θ = [45.7, 102.3, 0.94, -10.1, 83.0]
prediction: θ · X = 758,013
Feature vector and weights vector; a 1 is added to pad the feature vector (it accounts for the initial offset / bias / intercept weight and simplifies calculation).
The dot product produces a linear predictor.
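To make the dot product concrete, here is a minimal PHP sketch (my code, not from the deck); the function name and the bias-padding convention are assumptions:

<?php
// Pad the feature vector with a leading 1 (the bias/intercept term),
// then take the dot product with the weights to get the linear predictor.
function linearPredictor(array $features, array $theta): float
{
    $x = array_merge([1.0], $features);
    $z = 0.0;
    foreach ($x as $i => $xi) {
        $z += $theta[$i] * $xi;
    }
    return $z;
}

// With θ = [45.7, 102.3, 0.94, -10.1, 83.0] and X = [rooms, sq. m, age, yard],
// linearPredictor(X, θ) is the predicted house price.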
57–59. LINEAR REGRESSION
[scatter plot: whisky age (5–35 years) on the x-axis vs. whisky price ($40–200) on the y-axis, with a fitted line]
Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price.
Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
60–61. LOGISTIC REGRESSION
$g(z) = \dfrac{1}{1 + e^{-z}}, \qquad z = \theta \cdot X$
Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at z = 0.
z is just our old dot product, the linear predictor. The logistic function transforms its unbounded output into a bounded one.
62. LOGISTIC REGRESSION
$h_\theta(X) = \dfrac{1}{1 + e^{-\theta \cdot X}}$
The probability that y = 1 for input X.
If the hypothesis describes spam, then given X = the body of an email, h(X) = 0.7 means there’s a 70% chance it’s spam. Thresholding on that is up to you.
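As a quick sketch, the hypothesis is one line of PHP on top of the dot product (my code, not the talk’s):

<?php
// h_theta(X) = 1 / (1 + e^(-theta . X)): the probability that y = 1.
function hypothesis(array $theta, array $x): float
{
    $z = 0.0;
    foreach ($x as $i => $xi) {   // dot product theta . X
        $z += $theta[$i] * $xi;
    }
    return 1.0 / (1.0 + exp(-$z));
}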
68. independent & discriminant
Independent: feature A should not co-occur with (correlate highly with) feature B.
Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
69. possible features
‣ @a at the end of the tweet
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)
‣ …and more
70. feature = extractor(tweet)
For each feature, write a small function that takes a tweet and returns a numeric value
(floating-point).
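A minimal sketch of what such extractors could look like in PHP; the closures and feature choices here are illustrative assumptions, not the actual ReplyCleaner code:

<?php
// Each extractor maps a tweet to one numeric feature value.
$extractors = [
    'a_at_end'      => fn(string $t): float => preg_match('/@a$/i', rtrim($t)) ? 1.0 : 0.0,
    'mention_count' => fn(string $t): float => (float) preg_match_all('/@\w+/', $t),
    'hashtag_count' => fn(string $t): float => (float) preg_match_all('/#\w+/', $t),
    'length'        => fn(string $t): float => (float) mb_strlen($t),
];

// Build the feature vector, padded with the leading 1 (bias term).
function extractFeatures(string $tweet, array $extractors): array
{
    $x = [1.0];
    foreach ($extractors as $extract) {
        $x[] = $extract($tweet);
    }
    return $x;
}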
75–76. Language Detection
pear/Text_LanguageDetect
pecl/textcat
Can’t trust the language field in the user’s profile data.
Used character N-grams and character sets for detection.
Has its own error rate, so it needs some post-processing.
77. EnglishNotEnglish
✓ Clean up text (remove mentions, links, etc)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English; else:
✓ If not a character set-determined language, try harder:
  ✓ Tokenize into words
  ✓ Difference with English vocabulary
  ✓ If words remain, run a parts-of-speech tagger on each
  ✓ For NNS, VBZ, and VBD tags, run a stemming algorithm
  ✓ If the result is in the English vocabulary, remove it from remaining
  ✓ If the remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words)
  ✓ If ratio < 20%, pretend it’s English
A lot of this is heuristic-based, after some trial and error.
Seems to help with my corpus.
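A hypothetical sketch of the first two steps using pear/Text_LanguageDetect (one of the two libraries above); the clean-up regex and the fallback behavior are my assumptions:

<?php
require_once 'Text/LanguageDetect.php';

function detectTweetLanguage(string $tweet): string
{
    // Clean up text: remove mentions and links before detection.
    $clean = trim(preg_replace('~@\w+|https?://\S+~', '', $tweet));

    $detector = new Text_LanguageDetect();
    $lang = $detector->detectSimple($clean);

    // Detection has its own error rate; treat unknowns as English.
    return $lang ?: 'english';
}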
83. OVERSAMPLING
Oversampling: use multiple copies of good tweets to equalize with the bad ones.
Problem: the bias gets very high; each good tweet would have to be copied 100 times, and would not contribute any variance to the good category.
88–92. chance → feature
90%: “good” language
70%: no hashtags
25%: 1 hashtag
5%: 2 hashtags
2%: @a at the end
85%: rand length > 10
The actual synthesis is somewhat more complex and was also trial-and-error based.
Synthesized tweets + existing good tweets = 2/3 of the # of bad tweets in the training corpus (limited to 1000).
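A rough PHP sketch of the synthesis idea; as noted, the real procedure was more complex, and the helper below is my guess at its shape, driven by the chance table above:

<?php
// Draw one synthetic "good" tweet's feature values from the chance table.
function synthesizeGoodFeatures(): array
{
    $chance = fn(int $pct): bool => mt_rand(1, 100) <= $pct;

    $r = mt_rand(1, 100);
    $hashtags = $r <= 70 ? 0 : ($r <= 95 ? 1 : 2);   // 70% / 25% / 5%

    return [
        'good_language' => $chance(90) ? 1 : 0,
        'hashtags'      => $hashtags,
        'a_at_end'      => $chance(2) ? 1 : 0,
        'length'        => $chance(85) ? mt_rand(11, 140) : mt_rand(1, 10), // length > 10 with 85% chance
    ];
}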
93. Model Training
We have the hypothesis (decision function) and the training set.
How do we actually determine the weights/parameters?
94–96. COST FUNCTION
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$
Measures how far the prediction of the system is from the reality.
The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.
98. LOGISTIC COST
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Correct guess: cost = 0. Incorrect guess: cost = huge.
When y = 1 and h(x) is 1 (good guess), the cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y = 0.
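Put together, a minimal PHP sketch of J(θ) with the logistic cost (my code; hypothesis() is the sigmoid sketch from earlier):

<?php
// Average logistic cost over m labeled examples.
function cost(array $theta, array $X, array $y): float
{
    $m = count($X);
    $sum = 0.0;
    for ($i = 0; $i < $m; $i++) {
        $h = hypothesis($theta, $X[$i]);
        // -log(h) when y = 1, -log(1 - h) when y = 0
        $sum += $y[$i] == 1 ? -log($h) : -log(1.0 - $h);
    }
    return $sum / $m;
}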
100. GRADIENT DESCENT
Random starting point.
Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step.
Repeat.
Imagine a ball rolling down a hill.
102. $\theta_i := \theta_i - \alpha \dfrac{\partial J(\theta)}{\partial \theta_i}$ (for each parameter)
Have to update them simultaneously (the whole vector at a time).
103. learning rate
$\alpha$ in the update rule controls how big a step you take.
If α is big, you have an aggressive gradient descent; if α is too small, you take tiny steps and it takes too long.
If α is too big, it can overshoot the minimum and fail to converge.
104. derivative, aka “the slope”
$\frac{\partial J(\theta)}{\partial \theta_i}$ indicates the steepness of the descent step for each weight, i.e. the direction.
Keep going for a number of iterations or until the cost falls below a threshold (convergence).
Graph the cost function versus the # of iterations and see where it starts to approach 0; past that point are diminishing returns.
105. THE UPDATE ALGORITHM
$\theta_i := \theta_i - \alpha \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right) x_i^{(j)}$
The derivative for logistic regression simplifies to this term.
Have to update the weights simultaneously!
106–115. X1 = [1 12.0]   y1 = 1
X2 = [1 -3.5]   y2 = 0
θ = [0.1 0.1]   α = 0.05
h(X1) = 1 / (1 + e^-(0.1 · 1 + 0.1 · 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 · 1 + 0.1 · -3.5)) = 0.438
T0 = 0.1 - 0.05 · ((h(X1) - y1) · X1[0] + (h(X2) - y2) · X2[0])
   = 0.1 - 0.05 · ((0.786 - 1) · 1 + (0.438 - 0) · 1)
   ≈ 0.089
Hypothesis for each data point based on the current parameters.
Each parameter is updated in order and the result is saved to a temporary (T0 holds the update for θ0).
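The same update as a PHP sketch (my code; hypothesis() is the sigmoid from earlier), which reproduces the worked example above:

<?php
// One batch gradient-descent step; all parameters are updated
// simultaneously via a temporary copy of theta.
function gradientStep(array $theta, array $X, array $y, float $alpha): array
{
    $next = $theta;
    foreach ($theta as $i => $ti) {
        $sum = 0.0;
        foreach ($X as $j => $xj) {
            $sum += (hypothesis($theta, $xj) - $y[$j]) * $xj[$i];
        }
        $next[$i] = $ti - $alpha * $sum;
    }
    return $next;
}

$theta = gradientStep([0.1, 0.1], [[1, 12.0], [1, -3.5]], [1, 0], 0.05);
// $theta[0] ≈ 0.089, matching the T0 calculation above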
122. TRAINING / TEST DATA
Split the data into a training set and a test set. Train the model on the training set, then test the results on the test set.
Rinse, lather, repeat feature selection/synthesis/training until the results are “good enough”.
Pick the best parameters and save them (to a DB or elsewhere).
123. Putting It All Together
Let’s put our model to use, finally.
The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though…
124. Load the model
The weights we calculated via training.
Easiest is to load them from a DB (this can also be used to test different models).
125–128. HARD-CODED RULES
SKIP:
‣ truncated retweets: "RT @A ..."
‣ @-mentions of friends
‣ tweets from friends
We apply some hardcoded rules to filter out the tweets we are certain are good or bad.
The truncated RT ones don’t show up on the Web or in other tools anyway, so it’s fine to skip those.
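As an illustrative sketch (the rule list is from the slide; the code and the $friendIds lookup are my assumptions):

<?php
// Hard-coded pre-filter, applied before the model runs.
function shouldSkip(array $tweet, array $friendIds): bool
{
    // Truncated retweets: "RT @A ..." never show up on the Web anyway.
    if (preg_match('/^RT @\w+/i', $tweet['text'])) {
        return true;
    }
    // Tweets from friends are trusted as good.
    if (in_array($tweet['user']['id'], $friendIds)) {
        return true;
    }
    return false;
}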
134. Finally
$h_\theta(X) = \dfrac{1}{1 + e^{-(\theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots)}}$
If h > threshold, the tweet is bad; otherwise it’s good.
Remember that the output of h() is 0..1 (a probability).
The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
135–137. 3 SIMPLE STEPS
‣ extract features
‣ run the model
‣ act on the result
Invoke the feature extractor to construct the feature vector for this tweet.
Evaluate the decision function over the feature vector (plug the extracted feature values into the equation).
Use the output of the classifier.
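Wired together, the three steps are only a few lines of PHP; this sketch reuses the earlier extractFeatures() and hypothesis() sketches, and blockUser() is a hypothetical stand-in for the action taken:

<?php
function handleTweet(array $tweet, array $theta, array $extractors): void
{
    $x = extractFeatures($tweet['text'], $extractors); // 1. extract features
    $h = hypothesis($theta, $x);                       // 2. run the model
    if ($h > 0.9) {                                    // 3. act on the result
        blockUser($tweet['user']['id']);               // hypothetical helper
    }
}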
139–144. Lessons Learned
‣ The Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
‣ Blocking is the only option (and it’s final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
‣ Streaming API delivery is incomplete: some tweets are shown on the website, but never seen through the API.
‣ ReplyCleaner judged to be ~80% effective: lots of room for improvement.
‣ PHP sucks at math-y stuff.
145. NEXT STEPS
★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model.
★ More features.
★ Grammar analysis: to eliminate the common “@a bar” or “two @a time” occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm out manual classification to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick alpha, often faster than GD.
★ Wish pecl/scikit-learn existed.
146. TOOLS
★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-Of-Speech tagging
★ SplFixedArray (memory savings and slightly faster)
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
147. LEARN
★ Coursera.org ML course
★ Ian Barber’s blog
★ FastML.com