News from Mahout

2. whoami – Ted Dunning
Chief Application Architect, MapR Technologies
Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
(we’re hiring)
Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning
©MapR Technologies - Confidential 2
3. Slides and such (available late tonight):
– http://www.mapr.com/company/events/nyhug-03-05-2013
Hash tags: #mapr #nyhug #mahout
5. New in Mahout
0.8 is coming soon (1-2 months)
gobs of fixes
QR decomposition is 10x faster
– makes ALS 2-3 times faster
May include Bayesian Bandits
Super fast k-means
– fast
– online (!?!)
– fast
Possible new edition of MiA coming
– Japanese and Korean editions released, Chinese coming
8. We have a product to sell … from a web-site
9. What picture? What tag-line? What call to action?
[Mock ad: "Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!"]
10. The Challenge
Design decisions affect probability of success
– Cheesy web-sites don’t even sell cheese
The best designers do better when allowed to fail
– Exploration juices creativity
But failing is expensive
– If only because we could have succeeded
– But also because offending or disappointing customers is bad
11. More Challenges
Too many designs
– 5 pictures
– 10 tag-lines
– 4 calls to action
– 3 background colors
=> 5 x 10 x 4 x 3 = 600 designs
It gets worse quickly
– What about changes on the back-end?
– Search engine variants?
– Checkout process variants?
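The combinatorics above are easy to check mechanically; a minimal sketch, where the option lists are placeholders standing in for the actual design choices:

```python
from itertools import product

# Hypothetical design options, matching the counts on the slide
pictures = range(5)
tag_lines = range(10)
calls_to_action = range(4)
background_colors = range(3)

# Every combination of choices is a distinct design
designs = list(product(pictures, tag_lines, calls_to_action, background_colors))
print(len(designs))  # 5 * 10 * 4 * 3 = 600
```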
12. Example – A/B testing in real-time
I have 15 versions of my landing page
Each visitor is assigned to a version
– Which version?
A conversion or sale or whatever can happen
– How long to wait?
Some versions of the landing page are horrible
– Don’t want to give them traffic
13. A Quick Diversion
You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?
14. A Philosophical Conclusion
Probability as expressed by humans is subjective and depends on information and experience
16. 5 heads out of 10 throws
17. 2 heads out of 12 throws
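The two coin-flip outcomes above correspond to Beta posteriors over the heads probability; a minimal sketch, assuming the conventional uniform Beta(1,1) prior:

```python
# Under a uniform Beta(1,1) prior, observing h heads in n throws gives a
# Beta(h+1, n-h+1) posterior for the heads probability.
def posterior(heads, throws):
    a, b = heads + 1, throws - heads + 1
    return a, b, a / (a + b)   # shape parameters and posterior mean

print(posterior(5, 10))   # (6, 6, 0.5): symmetric around one half
print(posterior(2, 12))   # (3, 11, ~0.21): shifted toward tails
```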
18. So now you understand Bayesian probability
19. Another Quick Diversion
Let’s play a shell game
This is a special shell game
It costs you nothing to play
The pea has constant probability of being under each shell
(trust me)
How do you find the best shell?
How do you find it while maximizing the number of wins?
21. Interim Thoughts
Can you identify winners or losers without trying them out?
Can you ever completely eliminate a shell with a bad streak?
Should you keep trying apparent losers?
22. So now you understand multi-armed bandits
23. Conclusions
Can you identify winners or losers without trying them out?
No
Can you ever completely eliminate a shell with a bad streak?
No
Should you keep trying apparent losers?
Yes, but at a decreasing rate
24. Is there an optimum strategy?
25. Bayesian Bandit
Compute distributions based on data so far
Sample p1, p2 and p3 from these distributions
Pick shell i where i = argmax_i p_i
Lemma 1: The probability of picking shell i will match the probability that it is the best shell
Lemma 2: This is as good as it gets
26. And it works!
[Figure: regret versus n (0 to 1100 trials); the Bayesian Bandit with a Gamma-Normal model keeps regret below ε-greedy with ε = 0.05]
28. The Code
Select an alternative:
select = function(k) {
  # k[i,1] counts failures and k[i,2] counts successes for alternative i
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # sample a conversion rate from the Beta posterior for alternative i
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  which.max(p0)
}
Select and learn:
run = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    j = test(i)       # test() returns the observed outcome column (1 or 2)
    k[i,j] = k[i,j]+1
  }
  k
}
But we already know how to count!
29. The Basic Idea
We can encode a distribution by sampling
Sampling allows unification of exploration and exploitation
Can be extended to more general response models
30. The Original Problem
[Diagram: the mock landing page ("Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5!") with its design elements labeled x1, x2, x3]
31. Response Function
p(win) = w( Σᵢ θᵢ xᵢ )
[Figure: the link function w, a sigmoid rising from 0 toward 1 as its argument runs from −6 to 6, crossing 0.5 at 0]
32. Generalized Banditry
Suppose we have an infinite number of bandits
– suppose they are each labeled by two real numbers x and y in [0,1]
– also that expected payoff is a parameterized function of x and y
E[z] = f(x, y | θ)
– now assume a distribution for θ that we can learn online
Selection works by sampling θ, then computing f
Learning works by propagating updates back to θ
– If f is linear, this is very easy
– For special other kinds of f it isn’t too hard
Don’t just have to have two labels, could have labels and context
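The select-by-sampling-θ, then learn-by-propagating loop above can be sketched for the easy linear case. Everything here (the class name, the unit-noise Gaussian posterior, the toy arms and payoffs) is an illustrative assumption, not Mahout code:

```python
import numpy as np

rng = np.random.default_rng(42)

class LinearThompson:
    """Thompson sampling for E[z] = x·theta with a Gaussian posterior over
    theta (Bayesian linear regression with unit noise)."""

    def __init__(self, dim, prior_var=1.0):
        self.A = np.eye(dim) / prior_var   # posterior precision matrix
        self.b = np.zeros(dim)             # precision-weighted mean

    def select(self, candidates):
        # sample theta from the posterior, then take argmax of f(x) = x·theta
        cov = np.linalg.inv(self.A)
        theta = rng.multivariate_normal(cov @ self.b, cov)
        return int(np.argmax(candidates @ theta))

    def learn(self, x, z):
        # propagate the observed payoff z back to the posterior over theta
        self.A += np.outer(x, x)
        self.b += z * x

# toy run: three "bandits" labeled by 2-d feature vectors
arms = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_theta = np.array([0.1, 0.4])          # unknown to the learner
bandit = LinearThompson(dim=2)
for _ in range(500):
    i = bandit.select(arms)
    payoff = float(arms[i] @ true_theta + rng.normal(scale=0.1))
    bandit.learn(arms[i], payoff)
```

Because selection samples from the posterior rather than taking its mean, arms with uncertain payoff keep getting occasional traffic, which is exactly the explore/exploit unification from the shell game.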
33. Context Variables
[Diagram: the same mock landing page with design variables x1, x2, x3 plus context variables user.geo, env.time, env.day_of_week, env.weekend]
34. Caveats
Original Bayesian Bandit only requires real-time
Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online
Bandit variables can include content, time of day, day of week
Context variables can include user id, user features
Bandit × context variables provide the real power
35. You can do this yourself!
38. What is Quality?
Robust clustering not a goal
– we don’t care if the same clustering is replicated
Generalization is critical
Agreement to “gold standard” is a non-issue
45. For Example
[Figure: two adjacent clusters of points and a distortion inequality on D(X); grouping these two clusters seriously hurts squared distance]
47. Typical k-means Failure
[Figure: selecting two seeds here cannot be fixed with Lloyd's algorithm; the result is that these two clusters get glued together]
48. Ball k-means
Provably better for highly clusterable data
Tries to find initial centroids in the "core" of each real cluster
Avoids outliers in centroid computation

initialize centroids randomly, with a distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest centroid
    recompute each centroid using only points much closer to it than to the closest other centroid
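A sketch of the pseudocode above in Python; the `trim` fraction and the k-means++-style seeding rule are illustrative choices standing in for the "distance-maximizing tendency", not the Mahout implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def ball_kmeans(X, k, iters=3, trim=0.5):
    # seed with a distance-maximizing tendency (k-means++-style): each new
    # seed is drawn with probability proportional to squared distance from
    # the seeds chosen so far
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids)

    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)
        for j in range(k):
            mask = nearest == j
            if not mask.any():
                continue
            # distance from each assigned point to the nearest *other* centroid
            other = np.partition(dist[mask], 1, axis=1)[:, 1]
            # recompute the centroid from the cluster "core" only, which
            # keeps outliers out of the centroid computation
            core = dist[mask, j] < trim * other
            if core.any():
                centroids[j] = X[mask][core].mean(axis=0)
    return centroids

# two well-separated blobs; centroids should land near (0,0) and (10,10)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
centroids = ball_kmeans(X, 2)
```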
50. Still Not a Win
Ball k-means is nearly guaranteed with k = 2
Probability of successful seeding drops exponentially with k
Alternative strategy has high probability of success, but takes O(nkd + k³d) time
But for big data, k gets large
51. Surrogate Method
Start with sloppy clustering into lots of clusters
κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Results are provably good for highly clusterable data
53. Algorithm Costs
Surrogate methods
– fast, sloppy single-pass clustering with κ = k log n
– fast, sloppy search for the nearest cluster: O(d log κ) = O(d(log k + log log n)) per point
– fast, in-memory, high-quality clustering of the κ weighted centroids:
  O(κkd + k³d) = O(k²d log n + k³d) for small k, high quality
  O(κd log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
54. Algorithm Costs
How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 × 10 × 26 ≈ 500,000
– d(log k + log log n) = 10 × (11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
56. How It Works
For each point:
– find the approximately nearest centroid (at distance d)
– if d > threshold, start a new centroid
– else if u > d / threshold (u a uniform random draw), start a new centroid
– else add the point to the nearest centroid
If the number of centroids exceeds κ ≈ C log N:
– recursively cluster the centroids with a higher threshold
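A minimal sketch of this loop for 1-d data; the distance-proportional spawn rule and all names here are illustrative assumptions, not the Mahout implementation:

```python
import random

random.seed(1)

def streaming_sketch(points, threshold, kappa):
    """Sketch of the loop above for 1-d data: centroids is a list of
    [location, weight] pairs. Far points start new centroids; moderately
    far points start one with probability d/threshold (a distance-
    proportional rule); everything else folds into its nearest centroid."""
    centroids = []
    for p in points:
        if not centroids:
            centroids.append([p, 1.0])
            continue
        c = min(centroids, key=lambda cand: abs(cand[0] - p))
        d = abs(c[0] - p)
        if d > threshold or random.random() < d / threshold:
            centroids.append([p, 1.0])                 # start a new centroid
        else:
            c[0] = (c[0] * c[1] + p) / (c[1] + 1)      # weighted running mean
            c[1] += 1
        if len(centroids) > kappa:
            # too many centroids: recursively re-cluster with a higher threshold
            expanded = [cc[0] for cc in centroids for _ in range(int(cc[1]))]
            centroids = streaming_sketch(expanded, threshold * 2, kappa)
    return centroids

# two tight 1-d blobs near 0 and 5; the sketch should keep centroids near both
data = [random.gauss(0, 0.05) for _ in range(100)] + \
       [random.gauss(5, 0.05) for _ in range(100)]
sketch = streaming_sketch(data, threshold=1.0, kappa=50)
```

Because points only ever fold into centroids within the threshold, the sketch preserves total weight and keeps at least one centroid near every real cluster, which is what makes it a usable surrogate for the full data.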
58. But Wait, …
Finding nearest centroid is inner loop
This could take O( d κ ) per point and κ can be big
Happily, approximate nearest centroid works fine
60. LSH Bit-match Versus Cosine
[Figure: cosine similarity (−1 to 1) plotted against the number of matching LSH bits (0 to 64)]
62. Parallel Speedup?
[Figure: time per point (μs) versus number of threads (1 to 20); the threaded version tracks perfect scaling out to about 16 threads, while the non-threaded version stays near 200 μs]
63. Quality
Ball k-means implementation appears significantly better than simple k-means
Streaming k-means + ball k-means appears to be about as good as ball k-means alone
All evaluations on 20 newsgroups with held-out data
Figure of merit is mean and median squared distance to nearest cluster
64. Contact Me!
We’re hiring at MapR in US and Europe
MapR software available for research use
Get the code as part of Mahout trunk (or 0.8 very soon)
Contact me at tdunning@maprtech.com or @ted_dunning
Share news with @apachemahout
Editor's notes: The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back-and-forth, as in the introduction or final slide, where we can each say we are hiring in turn.

The overall thrust of the presentation is for you to make these points:
– Amex does lots of modeling
– it is expensive
– having a way to quickly test models and new variables would be awesome
– so we worked on a new project with MapR

My part will say the following:
– knn basic pictorial motivation (could move to you if you like)
– describe knn quality metric of overlap
– show how a bad metric breaks knn (optional)
– quick description of LSH and projection search
– picture of why k-means search is cool
– motivate k-means speed as a tool for k-means search
– describe the single-pass k-means algorithm
– describe basic data structures
– show parallel speedup

Our summary should state that we have achieved:
– super-fast k-means clustering
– an initial version of super-fast knn search with good overlap