Doing-the-impossible
2. © 2014 MapR Technologies 2
• "Decoder ring"
• "the next thing I want to do is this"
• Flajolet
3.
• What's the problem?
– speed
– feasibility
– communication
– incremental computation
– tree-based pre-computation
• What do we need?
– on-line version
– associative version
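To make "on-line" and "associative" concrete, here is a minimal Python sketch using the mean as the statistic. The `MeanSummary` class and its method names are illustrative assumptions, not from the slides. `add` is the on-line update, and `merge` is the associative combine that lets partial results be joined in any tree shape.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class MeanSummary:
    """Hypothetical summary: a (count, total) pair is enough to recover the mean."""
    count: int = 0
    total: float = 0.0

    def add(self, x: float) -> "MeanSummary":
        # On-line update: fold in one value without revisiting old data.
        return MeanSummary(self.count + 1, self.total + x)

    def merge(self, other: "MeanSummary") -> "MeanSummary":
        # Associative combine: (a + b) + c == a + (b + c), so partial
        # summaries can be merged in any order or tree shape.
        return MeanSummary(self.count + other.count, self.total + other.total)

    def mean(self) -> float:
        return self.total / self.count


def summarize(values):
    """Build a summary by streaming over the values once."""
    return reduce(MeanSummary.add, values, MeanSummary())


# Two partitions summarized independently, then combined.
left = summarize([1.0, 2.0, 3.0])
right = summarize([4.0, 5.0])
combined = left.merge(right)
```

The mean works because a small fixed-size summary captures everything needed. Median, top-k and distinct counts have no exact summary of this kind, which is why the rest of the deck turns to approximations.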
4.
• Why is that hard (impossible)?
– pathological inputs
– median ... any element of the first half of the data could be the median
– k-th most common ... any element could occur often enough in the second half to be the biggest
– unique elements ... hashing loses information; any compact representation must have false positives or negatives
5.
• What can we do?
– give up ... a slow, but exact answer may not be sooo bad
– give up ... a fast, but inexact answer may not be sooo bad
• The good news:
– approximate can be very, very close to exact
6.
The Classic Problems
• Most common (top-40)
• Count distinct
• Quantiles, with focus on extremes
7.
Classic Solutions
• Leaky counters
– Forget values, remember uncertainties
• Count min sketch
– Many small hash tables
• Count distinct with HyperLogLog
– Many hashes again
• New Solution - Quantiles by t-digest
– A new low in clustering
8.
Classic Solutions - Leaky counters
• Intuition:
– Common elements are rarely rare, rare elements are always rare
• Leaky counter:
– new element inserted with count = 1, error = ceiling((N-1)/w)
– every w samples, drop every element with f + error < ceiling(N/w), where f is the element's count
• Adaptation to heavy hitters is trivial
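The leaky counter can be sketched in Python in the style of lossy counting. This is an illustration under assumptions, not the presenter's code. The class and method names are hypothetical. The error term and drop rule follow the common formulation (error = current bucket minus 1, drop when count + error <= bucket), which may differ slightly from the slide's shorthand.

```python
class LossyCounter:
    """Approximate frequency counts using bounded memory (illustrative sketch)."""

    def __init__(self, w: int):
        self.w = w          # window (bucket) width, roughly 1 / epsilon
        self.n = 0          # samples seen so far
        self.entries = {}   # key -> [count, error]

    def add(self, key) -> None:
        self.n += 1
        bucket = (self.n + self.w - 1) // self.w   # ceiling(n / w)
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            # A new key may have been seen and dropped before; its
            # undercount is at most bucket - 1.
            self.entries[key] = [1, bucket - 1]
        if self.n % self.w == 0:
            # End of a window: forget keys that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def heavy_hitters(self, threshold: float) -> dict:
        """Keys whose true frequency could be at least threshold * n."""
        cutoff = threshold * self.n
        return {k: c for k, (c, err) in self.entries.items() if c + err >= cutoff}


# One common key interleaved with fifty keys that each appear once.
counter = LossyCounter(w=10)
for i in range(50):
    counter.add("a")
    counter.add(f"u{i}")
```

Rare keys are dropped at each window boundary, while the common key survives with its full count. That is the sense in which heavy hitters fall out almost for free.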
9.
Classic Solutions - Count min sketch
• Intuition:
– A gazillion hashed counters can't all be wrong
• Big array of counters; each row has a different hash function
• To insert, increment the counter that the key hashes to in each row
• Probe by taking the minimum of the hashed counters for the probe key
• Oops... finding heavy hitters is tricky ... requires keeping log n sketches
10.
Increment Hashed Locations to Insert
[Figure: key a increments the counter at position h_i(a) in each row i]
11.
Probe Using min of Counts
$\min_i k[h_i(a)]$
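Insert and probe can be sketched together in Python. This is a minimal illustration under assumptions. The class name, the default `depth` and `width`, and the use of salted `hashlib.blake2b` as the per-row hash are choices made here, not taken from the slides.

```python
import hashlib


class CountMinSketch:
    """Rows of counters, one hash function per row (illustrative sketch)."""

    def __init__(self, depth: int = 4, width: int = 1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # Salting by row number gives each row its own hash function h_i.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Insert: increment the hashed location in every row.
        for i in range(self.depth):
            self.rows[i][self._index(i, key)] += count

    def estimate(self, key: str) -> int:
        # Probe: collisions only ever inflate a counter, so the minimum
        # over rows is the least-contaminated estimate.
        return min(self.rows[i][self._index(i, key)] for i in range(self.depth))


sketch = CountMinSketch()
for _ in range(5):
    sketch.add("a")
for _ in range(2):
    sketch.add("b")
```

Because counters are only ever incremented, `estimate` can overcount but never undercount. The sketch does not store keys, which is why listing the heavy hitters needs extra machinery.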
12.
Classic Solutions - Count distinct with HyperLogLog
• Intuition:
– The smallest of n uniform samples is expected to be about 1/n (exactly 1/(n+1))
– Hashing turns anything into a uniform distribution
– Hashing again turns anything into a new uniform distribution
• Best done with pictures
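For readers who prefer code to pictures, here is a simplified HyperLogLog-style estimator in Python. It is a sketch under assumptions, not a production implementation. The class name, the default `p`, and the choice of `hashlib.blake2b` are illustrative, and the refinements found in real libraries are left out. A very small hashed value shows up as a long run of leading zero bits, so each register remembers the longest run seen in its bucket.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (illustrative sketch)."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # 64 uniform bits
        bucket = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the first 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Register-wise max is associative, so sketches combine in any order.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)           # small-range correction
        return raw


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates hash identically and change nothing
```

With 1024 registers the relative standard error is roughly 1.04 / sqrt(1024), about 3%, so the estimate for 10,000 distinct keys should land within a few percent of the truth.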
14.
[Figure: plot of ix; both axes run from 0.0 to 1.0]
15.
[Figure: plot of hash(ix); both axes run from 0.0 to 1.0]
17.
[Figure, two panels. Original distribution: x ~ G(0.2, 0.2); mean = 1, median = 0.1, 5th percentile = 10^−6. After hashing: x-axis from 0.0 to 1.0]
19.
Repeated Minimum
10 samples
Min is ~ 0.1
20.
[Figure: PDF of Min(x); observed minimum value (100 samples x 10,000 replications)]
21.
[Figure: PDF of Min(x); observed minimum value (100 samples x 10,000 replications) with theoretical distribution overlaid]
22.
[Figure: PDF of Min(x); observed minimum value (100 samples x 10,000 replications) with theoretical distribution overlaid; mean = 0.0099]
24.
[Figure: PDF of Min(x) against observed minimum log10(value); mean = −2.3, 10^−2.3 = 0.0056; second axis labeled Error, 1e−05 to 0.1]
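The repeated-minimum experiment behind these histograms can be reproduced with a short Python simulation. This is a sketch with assumed function names and an arbitrary seed. It compares the observed mean of the minimum, and of its log10, against the closed forms 1/(n+1) and −H_n / ln 10, where H_n is the n-th harmonic number.

```python
import math
import random


def simulate_minimum(n_samples: int = 100, replications: int = 10_000, seed: int = 0):
    """Return (mean of min, mean of log10(min)) over many replications."""
    rng = random.Random(seed)
    minima = [
        # 1 - U lies in (0, 1], which keeps log10 well defined.
        min(1.0 - rng.random() for _ in range(n_samples))
        for _ in range(replications)
    ]
    mean_min = sum(minima) / replications
    mean_log10 = sum(math.log10(m) for m in minima) / replications
    return mean_min, mean_log10


def theoretical(n_samples: int = 100):
    """Closed forms for the minimum of n uniform samples."""
    harmonic = sum(1.0 / k for k in range(1, n_samples + 1))
    return 1.0 / (n_samples + 1), -harmonic / math.log(10)


observed = simulate_minimum()
expected = theoretical()
```

For n = 100 the closed forms give 1/101, about 0.0099, for the mean and about −2.25 for the mean of log10. Averaging in log space corresponds to a minimum of about 0.0056, noticeably below the arithmetic mean.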
25.
T-digest for Quantiles
• Intuition:
– 1-d k-means with size cap
– Make size cap depend on distance to nearest end
• Experimental verification
– Distribution in cluster very uniform
– Accuracy far better than alternatives, especially at extremes
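The two intuition bullets can be turned into a toy Python sketch: 1-d clustering where each cluster's size cap shrinks toward either end of the distribution. This is a heavily simplified illustration, not the t-digest library. The class name, the `compression` parameter, and the cap formula 4·N·q·(1−q)/compression are assumptions chosen to show the idea, and there is no interpolation or compaction pass.

```python
import bisect
import random


class TinyTDigest:
    """Toy quantile sketch: sorted centroids with a quantile-dependent size cap."""

    def __init__(self, compression: float = 100.0):
        self.compression = compression
        self.means = []     # centroid means, kept sorted
        self.weights = []   # number of points in each centroid
        self.total = 0

    def _cap(self, idx: int) -> float:
        # q is the quantile at the centroid's center; the cap is largest
        # near the median and shrinks to 1 near either end.
        before = sum(self.weights[:idx])
        q = (before + self.weights[idx] / 2) / self.total
        return max(1.0, 4 * self.total * q * (1 - q) / self.compression)

    def add(self, x: float) -> None:
        self.total += 1
        pos = bisect.bisect_left(self.means, x)
        neighbors = [i for i in (pos - 1, pos) if 0 <= i < len(self.means)]
        if neighbors:
            nearest = min(neighbors, key=lambda i: abs(self.means[i] - x))
            if self.weights[nearest] + 1 <= self._cap(nearest):
                # Room left: absorb the point and update the running mean.
                self.weights[nearest] += 1
                self.means[nearest] += (x - self.means[nearest]) / self.weights[nearest]
                return
        # No centroid with room nearby: start a new singleton centroid.
        self.means.insert(pos, x)
        self.weights.insert(pos, 1)

    def quantile(self, q: float) -> float:
        # Return the mean of the centroid containing the target rank.
        target = q * self.total
        cumulative = 0.0
        for mean, weight in zip(self.means, self.weights):
            cumulative += weight
            if cumulative >= target:
                return mean
        return self.means[-1]


rng = random.Random(42)
digest = TinyTDigest()
for _ in range(10_000):
    digest.add(rng.random())
```

Centroids near the tails stay tiny, so extreme quantiles are tracked almost exactly. Centroids near the median are allowed to grow, which is what keeps the summary small.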