1. Learning Linear Models with Hadoop
Ulrich Rückert
© 2012 Datameer, Inc. All rights reserved.
Thursday, March 28, 2013
2. Agenda
What are linear models anyway?
How to learn linear models with Hadoop
Demo
Tips, tricks and caveats
Conclusion
3. Predictive Analytics

Example learning task
• Ad on a bookseller's web page
• Will a customer buy this book?
• Training set: observations on previous customers
• Test set: new customers
• Let's learn a linear model!

Training data (Age and Income are the attributes, BuysBook is the target attribute):

Age   Income   BuysBook
24    60000    yes
65    80000    no
60    95000    no
35    52000    yes
20    45000    yes
43    75000    yes
26    51000    yes
52    47000    no
47    38000    no
25    22000    no
33    47000    yes

Test data, passed through the learned model for prediction:

Age   Income   BuysBook   Prediction
22    67000    ?          yes
39    41000    ?          no
4. Linear Models

What's in the black box?
• Let's pretend all attributes are expert ratings
• Large positive value means yes
• Small value means no
• Intermediate value: don't know

Let the experts vote
• Sum over the ratings in each row
• Larger than the threshold: predict yes
• Smaller: predict no

Expert1   Expert2   BuysBook
24        60        ?
64        80        ?
60        96        ?
5. Linear Models

What's in the black box?
• Let's pretend all attributes are expert ratings
• Large positive value means yes
• Small value means no
• Intermediate value: don't know

Let the experts vote
• Sum over the ratings in each row
• Larger than the threshold: predict yes
• Smaller: predict no

Threshold: 97

Expert 1   Expert 2     Sum   > threshold?
24       + 60       =   84    no
64       + 80       =   144   yes
60       + 96       =   156   yes

Expert1   Expert2   Prediction
24        60        no
64        80        yes
60        96        yes
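A minimal sketch of this voting rule in Python (the function name `predict_vote` is chosen here, not from the deck; the rows and the threshold of 97 are the slide's example):

```python
def predict_vote(ratings, threshold):
    """Let the experts vote: sum the ratings and compare to a threshold."""
    return "yes" if sum(ratings) > threshold else "no"

# The three rows from the slide's table, with threshold 97:
rows = [(24, 60), (64, 80), (60, 96)]
predictions = [predict_vote(row, 97) for row in rows]
# sums 84, 144, 156 -> "no", "yes", "yes"
```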
6. Linear Models

Assign a weight to each expert
• Expert is mostly correct: large weight
• Expert is uninformative: zero weight
• Expert is consistently wrong: negative weight

Learning models
• A linear model consists of the weights and the threshold
• Learn by finding the weights with the lowest error on the training data

Weight 1   Weight 2   Threshold
0.75       0.25       48

Expert 1      Expert 2          Sum   > threshold?
0.75 • 24   + 0.25 • 60    =    33    no
0.75 • 64   + 0.25 • 80    =    68    yes
0.75 • 60   + 0.25 • 96    =    69    yes

Expert1   Expert2   Prediction
24        60        no
64        80        yes
60        96        yes
7. Linear Models

Assign a weight to each expert
• Expert is mostly correct: large weight
• Expert is uninformative: zero weight
• Expert is consistently wrong: negative weight

Learning models
• A linear model consists of the weights and the threshold
• Learn by finding the weights with the lowest error on the training data

Weight 1   Weight 2   Threshold
0          0.25       18

Expert 1    Expert 2         Sum   > threshold?
0 • 24    + 0.25 • 60   =    15    no
0 • 64    + 0.25 • 80   =    20    yes
0 • 60    + 0.25 • 96   =    24    yes

Expert1   Expert2   Prediction
24        60        no
64        80        yes
60        96        yes
8. Linear Models

Assign a weight to each expert
• Expert is mostly correct: large weight
• Expert is uninformative: zero weight
• Expert is consistently wrong: negative weight

Learning models
• A linear model consists of the weights and the threshold
• Learn by finding the weights with the lowest error on the training data

Weight 1   Weight 2   Threshold
-0.5       0.25       -8

Expert 1       Expert 2         Sum   > threshold?
-0.5 • 24    + 0.25 • 60   =    3     yes
-0.5 • 64    + 0.25 • 80   =    -12   no
-0.5 • 60    + 0.25 • 96   =    -6    yes

Expert1   Expert2   Prediction
24        60        yes
64        80        no
60        96        yes
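The weighted vote on these slides is just a weighted sum compared against a threshold. A short Python sketch (names are illustrative, not from the deck), reproducing two of the slides' weight settings:

```python
def predict_linear(weights, attributes, threshold):
    """Weighted expert vote: weighted sum compared against the threshold."""
    score = sum(w * a for w, a in zip(weights, attributes))
    return "yes" if score > threshold else "no"

rows = [(24, 60), (64, 80), (60, 96)]

# Slide 6: weights (0.75, 0.25), threshold 48 -> no, yes, yes
preds_6 = [predict_linear((0.75, 0.25), row, 48) for row in rows]

# Slide 8: Expert1 weighted negatively, threshold -8 -> yes, no, yes
preds_8 = [predict_linear((-0.5, 0.25), row, -8) for row in rows]
```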
9. Learning Linear Models

Stochastic Gradient Descent (SGD)
• Main idea: start with default weights
• For each row, check whether the current weights predict correctly
• On a misclassification: adjust the weights

How to adjust the weights?
• If the row's class is positive: add the row to the weights
• If it is negative: subtract the row

Flow: start with default weights → read the next training row → do the weights predict the correct label? → yes: read the next row; no: adjust the weights, then read the next row
10. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
1          -1         0

Next training row:

Age   Income   BuysBook
24    60       +1

Age         Income         Sum   > threshold?
1 • ?     + -1 • ?     =   ?     ?
11. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
1          -1         0

Age          Income          Sum    > threshold?
1 • 24     + -1 • 60     =   -36    -1

Age   Income   BuysBook
24    60       +1

The model predicts -1 but the true class is +1: the row is misclassified, so the weights get adjusted.
12. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
25         59         0

(weights after adding the row: 1 + 24 = 25, -1 + 60 = 59)

Age           Income          Sum     > threshold?
25 • 24     + 59 • 60     =   4140    +1

Age   Income   BuysBook
24    60       +1
13. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
25         59         -1

(threshold after the update: 0 - (+1) = -1)

Age           Income          Sum     > threshold?
25 • 24     + 59 • 60     =   4140    +1

Age   Income   BuysBook
24    60       +1
14. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
25         59         -1

Next training row:

Age   Income   BuysBook
30    30       -1

Age          Income         Sum   > threshold?
25 • ?     + 59 • ?     =   ?     ?
15. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
25         59         -1

Age           Income          Sum     > threshold?
25 • 30     + 59 • 30     =   2520    +1

Age   Income   BuysBook
30    30       -1

The model predicts +1 but the true class is -1: another misclassification, so the weights get adjusted again.
16. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
-5         29         -1

(weights after subtracting the row: 25 - 30 = -5, 59 - 30 = 29)

Age           Income          Sum    > threshold?
-5 • 30     + 29 • 30     =   720    +1

Age   Income   BuysBook
30    30       -1

Note the row is still scored above the threshold: a single update does not always fix a misclassified row.
17. Learning Linear Models

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
-5         29         0

(threshold after the update: -1 - (-1) = 0)

Age           Income          Sum    > threshold?
-5 • 30     + 29 • 30     =   720    +1

Age   Income   BuysBook
30    30       -1
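The walk-through on slides 10-17 can be reproduced with a small runnable version of the pseudocode (Python here; names such as `perceptron_step` are illustrative, not from the deck):

```python
def predict(weights, threshold, attributes):
    """Predict +1 if the weighted sum exceeds the threshold, else -1."""
    score = sum(w * a for w, a in zip(weights, attributes))
    return +1 if score > threshold else -1

def perceptron_step(weights, threshold, attributes, label):
    """One iteration of the slides' update rule: on a misclassification,
    add (or subtract) the row to the weights and bump the threshold."""
    if predict(weights, threshold, attributes) != label:
        weights = [w + label * a for w, a in zip(weights, attributes)]
        threshold += -label
    return weights, threshold

# Initial model from slide 10, then the two training rows from the slides:
weights, threshold = [1, -1], 0
weights, threshold = perceptron_step(weights, threshold, [24, 60], +1)
# now weights == [25, 59], threshold == -1 (slides 12-13)
weights, threshold = perceptron_step(weights, threshold, [30, 30], -1)
# now weights == [-5, 29], threshold == 0 (slides 16-17)
```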
18. Learning - Convergence

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += row.class * row.attributes;
        threshold += -row.class;
    endif
end
20. Learning - Convergence

repeat
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += 0.001 * row.class * row.attributes;
        threshold += -row.class;
    endif
end

A small fixed step size (here 0.001) keeps any single row from yanking the weights around.
22. Learning - Convergence

for i = 1 to ∞
    row = readNextRow();
    if(predict(weights, row.attributes) != row.class)
        weights += (1/i) * row.class * row.attributes;
        threshold += -row.class;
    endif
end

With a decreasing step size of 1/i, later corrections get ever smaller, which lets the weights converge.
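A runnable version of the decreasing-step-size loop, following the slide's pseudocode (which decays the weight updates as 1/i but leaves the threshold step at ±1); the toy data and names are illustrative:

```python
def sgd_decreasing_rate(rows, n_passes=100):
    """Perceptron-style SGD where the i-th correction is scaled by 1/i,
    so later corrections get ever smaller and the weights can settle."""
    weights, threshold, i = [0.0, 0.0], 0.0, 0
    for _ in range(n_passes):
        for attributes, label in rows:
            i += 1
            score = sum(w * a for w, a in zip(weights, attributes))
            if (1 if score > threshold else -1) != label:
                weights = [w + (1.0 / i) * label * a
                           for w, a in zip(weights, attributes)]
                threshold += -label
    return weights, threshold
```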
24. Learning - Margin

for i = 1 to ∞
    row = readNextRow();
    if(margin(weights, row.attributes, threshold) <= 1)
        weights += (1/i) * row.class * row.attributes;
        threshold += -row.class;
    endif
end

Weight 1   Weight 2   Threshold
0.5        0.25       26.5

Age            Income           Sum   > threshold?
0.5 • 24     + 0.25 • 60    =   27    +1

Age   Income   BuysBook
24    60       +1

The row is classified correctly, but only by a margin of 0.5 over the threshold. Because the margin is at most 1, the update still fires, pushing the decision boundary further away from the row.
27. Learning - Regularization

for i = 1 to ∞
    row = readNextRow();
    if(margin(weights, row.attributes, threshold) <= 1)
        weights += (1/i) * row.class * row.attributes;
        threshold += -row.class;
    endif
end

Attributes are often correlated
• Contributions cancel out
• This leads to unreasonably large weights...
• ... and models which are not robust to noise

Regularization
• Make sure the weights don't get too large
• L2 regularization: weights stay proportional to attribute quality

Weight 1   Weight 2   Threshold
0.5        0.5        30

Age            Income          Sum   > threshold?
0.5 • 24     + 0.5 • 60    =   42    +1

Age   Income   BuysBook
24    60       +1
28. Learning - Regularization

for i = 1 to ∞
    row = readNextRow();
    if(margin(weights, row.attributes, threshold) <= 1)
        weights += (1/i) * row.class * row.attributes;
        threshold += -row.class;
    endif
end

Attributes are often correlated
• Contributions cancel out
• This leads to unreasonably large weights...
• ... and models which are not robust to noise

Regularization
• Make sure the weights don't get too large
• L2 regularization: weights stay proportional to attribute quality

Weight 1   Weight 2   Threshold
1000       -399.3     30

Age             Income             Sum   > threshold?
1000 • 24     + -399.3 • 60    =   42    +1

Age   Income   BuysBook
24    60       +1

These wildly different weights produce the same score of 42 on this row: correlated attributes let huge contributions cancel out.
29. Learning - Regularization

for i = 1 to ∞
    row = readNextRow();
    if(margin(weights, row.attributes, threshold) <= 1)
        weights += (1/i) * row.class * row.attributes;
        threshold += -row.class;
    endif
    weights = i/(i+r) * weights;
end

Attributes are often correlated
• Contributions cancel out
• This leads to unreasonably large weights...
• ... and models which are not robust to noise

Regularization
• Make sure the weights don't get too large
• L2 regularization: weights stay proportional to attribute quality

Weight 1   Weight 2   Threshold
1000       -399.3     30

Age             Income             Sum   > threshold?
1000 • 24     + -399.3 • 60    =   42    +1

Age   Income   BuysBook
24    60       +1
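The shrinking step from the pseudocode, `weights = i/(i+r) * weights`, in isolation (a sketch; `r` is a regularization strength, and a larger `r` shrinks harder):

```python
def shrink(weights, i, r):
    """L2-style regularization step: after iteration i, scale every
    weight by i/(i+r), pulling the weight vector toward zero."""
    factor = i / (i + r)
    return [factor * w for w in weights]

# The oversized weights from slide 28 get pulled back toward zero:
w = shrink([1000.0, -399.3], i=1, r=1)   # factor 1/2: halves each weight
```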
30. Implementation on Hadoop

Map-Reduce
• Input data must be in random order
• Mapper: send the data to the reducer in random order
• Reducer: run the actual stochastic gradient descent

Evaluation and Parameter Selection
• Perform several runs with varying parameters
• Learn on the training set, evaluate on the test set
• Many runs with partial data are often better than one run with all the data
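A toy sketch of that split in Python (this simulates the shuffle locally; in a real job the mapper would emit (random key, row) pairs and Hadoop's sort would deliver them to the reducer in key order — the names here are illustrative, not Datameer's implementation):

```python
import random

def mapper(rows):
    """Tag each row with a random key; sorting by key randomizes the order."""
    return [(random.random(), row) for row in rows]

def reducer(keyed_rows):
    """Receive rows ordered by their random keys and run the actual SGD."""
    weights, threshold = [0.0, 0.0], 0.0
    for _, (attributes, label) in sorted(keyed_rows):
        score = sum(w * a for w, a in zip(weights, attributes))
        if (1 if score > threshold else -1) != label:
            weights = [w + label * a for w, a in zip(weights, attributes)]
            threshold += -label
    return weights, threshold

# Two training rows from the earlier slides:
model = reducer(mapper([([24, 60], +1), ([30, 30], -1)]))
```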
31. Demo
32. Learning Linear Models

Stochastic Gradient Descent: Pros and Cons
• One sweep over the data: easy to implement on top of Hadoop
• Flexible: supports support vector machines, logistic regression, etc.
• Provides a good-enough estimate instead of the optimum
• Parameter selection and evaluation are crucial

Alternative: convex optimization
• Formulate learning as a numerical optimization problem
• On Hadoop: usually L-BFGS
• See Vowpal Wabbit for a large-scale implementation
33. Conclusion
Linear Models
• Prediction based on weighted vote and threshold
Stochastic Gradient Descent
• Adjust weight vector iteratively for each misclassified row
• Decreasing step size to ensure convergence
• Margins and regularization for robustness
Implementation
• Mapper provides random order, reducer performs SGD
• Evaluation and parameter selection are crucial
34. Thanks
urueckert@datameer.com