Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak

Probabilistic Data Structures
and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code: see the nbviewer link in the Summary

Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● we need to trade accuracy for scalability or speed
● we need to analyse a stream of data
Catch:
● despite typically achieving good results, there is a
chance of bad worst-case behaviour
● use on large datasets (law of large numbers)

Code: Approximation
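import random

def average(values):  # not defined on the slide; assumed to be the arithmetic mean
    return sum(values) / len(values)

x = [random.randint(0, 80000) for _ in range(10000)]
y = [i >> 8 for i in x]  # trim the 8 low-order bits off each integer
z = x[:500]  # 5% sample (x is uniformly random, so a prefix is a fair sample)
avx = average(x)
avy = average(y) * 2**8  # scale back up by 8 bits
avz = average(z)
print(avx)
print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / avx))
print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / avx))

Output from one run (values change on every run):
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%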

Code: Sampling Data
Interview question:
Get K samples from an infinite stream
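
One common answer to this question is reservoir sampling. The sketch below is not taken from the talk's notebook; the function name `reservoir_sample` and the `rng` parameter are illustrative. It keeps K items in memory and, after seeing i+1 items, holds each of them with equal probability K/(i+1). A truly infinite stream never finishes, so the usage line samples from a finite prefix.

```python
import random
from itertools import count, islice

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: 5 samples from the first million items of an endless counter.
sample = reservoir_sample(islice(count(), 1000000), 5)
```

Memory use stays at K items no matter how long the stream runs, which is the property the interview question is after.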

Probabilistic Data Structures
Generally they:
● Use less space than a full dataset
● Require higher CPU load
● Are stream-friendly
● Can be parallelized
● Have a controlled error rate

Hash functions
One-way function:
key of arbitrary length ->
message of fixed length
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2) for key1 != key2
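
As an illustration, here is a minimal sketch of 32-bit FNV-1 written in plain Python. The helper `find_collision` is hypothetical and only exists for this example: it keeps just the low bits of each hash so that collisions become easy to find. With 300 keys and 256 possible 8-bit values, at least one collision is guaranteed by the pigeonhole principle.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime

def fnv1_32(key: bytes) -> int:
    """32-bit FNV-1: multiply by the prime, then XOR in the next byte."""
    h = FNV_OFFSET_BASIS
    for byte in key:
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result within 32 bits
        h ^= byte
    return h

def find_collision(keys, bits):
    """Return two distinct keys whose hashes agree on the low `bits` bits, or None."""
    mask = (1 << bits) - 1
    seen = {}
    for key in keys:
        short = fnv1_32(key) & mask
        if short in seen and seen[short] != key:
            return seen[short], key
        seen[short] = key
    return None

keys = [("key%d" % i).encode("utf-8") for i in range(300)]
pair = find_collision(keys, bits=8)  # 300 keys into 256 values: must collide
```

The input can be any length, but the output always fits in 32 bits, which is why distinct keys must eventually share a hash.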

Hash collisions and performance
● Cryptographic hashes (like bcrypt) are not ideal for our use
● We need a fast algorithm with the lowest number of collisions:
Hash           Lowercase                  Random UUID           Numbers
=============  =========================  ====================  ====================
Murmur         145 ns, 6 collisions       259 ns, 5 collisions  92 ns, 0 collisions
FNV-1          184 ns                     730 ns                92 ns
DJB2           156 ns                     437 ns                93 ns
SDBM           148 ns                     484 ns                90 ns
SuperFastHash  164 ns                     344 ns                118 ns
CRC32          250 ns                     946 ns                130 ns
LoseLose       338 ns, 215178 collisions  -                     -
by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Murmur2 collisions
● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs

Hash randomness visualised (hashmap)
● Great: murmur2 on a sequence of numbers
● Not so great: DJB2 on a sequence of numbers

Comparison: Locality Sensitive Hashing (LSH)
Image hashes
Kernelized locality-sensitive hashing for scalable image search
B. Kulis, K. Grauman - Computer Vision, 2009 IEEE 12th …, 2009
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...

Membership test: Bloom filter
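A Bloom filter is probabilistic but only yields false positives, never false negatives.
Hash each item k times to get k indices into a bit field with positions 1..m, and set those bits.
To test an item w, look at its k bits:
● at least one 0 means w definitely isn't in the set
● all 1s means w probably is in the set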
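
The description above can be sketched as a small class. This is an illustrative implementation, not the one used in the talk: the class name `BloomFilter` is made up here, indices run from 0 to m-1 rather than 1..m, and the k indices are derived from one SHA-256 digest by double hashing, which is one of several reasonable choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                             # number of bits in the field
        self.k = k                             # number of indices per item
        self.bits = bytearray((m + 7) // 8)    # bit field, all zeros

    def _indices(self, item):
        # Assumption: derive k indices from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # Any 0 bit: definitely absent. All 1s: probably present.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))

bf = BloomFilter(m=10000, k=5)
for word in ("cataract", "periti", "roquette"):
    bf.add(word)
```

After this, `"cataract" in bf` is always True, while a word that was never added is reported as present only if all k of its bits happen to have been set by other items.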

Use Bloom filter to serve requests

Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom
filter false positives.
Pell et al., PNAS 2012

Counting Distinct Elements
In: infinite stream of data
Question: how many distinct elements are there?
is similar to:
In: coin flips
Question: how many times has the coin been flipped?

Coin flips: intuition
● Long runs of HEADs in random series are rare.
● The longer you look, the more likely you see a long one.
● Long runs are very rare and are correlated with how
many coins you’ve flipped.
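
A quick simulation makes the intuition concrete. This sketch is an addition, not part of the talk; `longest_heads_run` is a hypothetical helper. It prints the longest run of heads for increasing numbers of flips, and the run length tends to grow roughly with the logarithm of the number of flips, which is what lets a run length stand in for a count.

```python
import random

def longest_heads_run(n_flips, rng):
    """Flip a fair coin n_flips times and return the longest run of heads."""
    longest = current = 0
    for _ in range(n_flips):
        if rng.random() < 0.5:   # heads
            current += 1
            longest = max(longest, current)
        else:                    # tails ends the current run
            current = 0
    return longest

rng = random.Random(42)
for n in (10, 1000, 100000):
    print(n, longest_heads_run(n, rng))
```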

Cardinality estimation
Basic algorithm:
● n=0
● For each input item:
○ Hash item into bit string
○ Count trailing zeroes in bit string
○ If this count > n:
■ Let n = count
● Estimated cardinality (“count distinct”) = 2^n
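
The basic algorithm above can be sketched as follows. The choice of SHA-1 truncated to 64 bits is an assumption made for this example (the slide does not name a hash), and `hash64`, `trailing_zeros` and `estimate_cardinality` are illustrative names.

```python
import hashlib

def hash64(item):
    """Hash any item to a 64-bit integer (assumption: SHA-1, first 8 bytes)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def trailing_zeros(value, width=64):
    """Number of trailing zero bits; an all-zero value counts as `width`."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1  # isolate the lowest set bit

def estimate_cardinality(stream):
    n = 0
    for item in stream:
        n = max(n, trailing_zeros(hash64(item)))
    return 2 ** n

estimate = estimate_cardinality(range(10000))
```

The estimate is always a power of two and comes from a single extreme value, so it is very noisy. Repeated items hash to the same bit string and therefore never change the result.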

Cardinality estimation: HyperLogLog
Demo: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
"HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm",
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007
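
For readers who want to see the moving parts, here is a compact sketch of the HyperLogLog estimator. It is not the code behind the demo, and several choices are assumptions made for illustration: a 64-bit hash taken from SHA-1, 2^p registers with p >= 7, and only the small-range (linear counting) correction. A Python list of integers also uses far more memory than the packed registers a real implementation would use.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                  # number of index bits (assumed p >= 7)
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")    # 64-bit hash
        idx = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in `rest`
        # (64 - p + 1 when `rest` is all zeros).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=10)
for i in range(100000):
    hll.add(i)
approx = hll.count()
```

With m registers the typical relative error is about 1.04 / sqrt(m), so p=10 (1024 registers) gives roughly 3%.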

Count-min sketch
count(value) = min{ w_1[h_1(value)], ..., w_d[h_d(value)] }
Frequency histogram estimation with a chance of over-counting
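
A minimal sketch of the formula above: d rows of counters w_1..w_d, one hash function h_i per row, and a query that takes the minimum over the rows. The class name and the way the d hash functions are built (salting SHA-1 with the row number) are assumptions made for this example.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width    # counters per row
        self.depth = depth    # number of rows d, one hash function each
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # h_row(value): SHA-1 salted with the row number, reduced modulo width.
        data = ("%d:%s" % (row, value)).encode("utf-8")
        digest = hashlib.sha1(data).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, value)] += amount

    def count(self, value):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][self._index(row, value)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word, times in (("shawl", 50), ("longans", 7), ("whigs", 1)):
    cms.add(word, times)
```

Because counters are only ever incremented, `cms.count(x)` is never below the true frequency of `x`; it can only over-count.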

Machine Learning: Feature hashing
High-dimensional machine learning without a feature dictionary
by Andrew Clegg, “Approximate methods for scalable data mining”
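
A minimal sketch of the idea, assuming text tokens as features: each token is hashed straight to a column index, so no token-to-index dictionary has to be built or stored. The function names, the use of MD5 and the bucket count of 32 are illustrative choices, not taken from the talk.

```python
import hashlib

def bucket(token, n_buckets):
    """Map a token to a column index by hashing it."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_features(tokens, n_buckets=32):
    """Turn a list of tokens into a fixed-length count vector."""
    vector = [0] * n_buckets
    for token in tokens:
        vector[bucket(token, n_buckets)] += 1
    return vector

doc = "the quick brown fox jumps over the lazy dog".split()
vec = hash_features(doc)
```

Two different tokens can land in the same bucket, which is the accuracy given up in exchange for a fixed-size vector and no dictionary.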

Locality-sensitive hashing
To approximate nearest neighbours
by Andrew Clegg, “Approximate methods for scalable data mining”
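
One well-known LSH family for cosine similarity uses random hyperplanes; the slide does not say which family it has in mind, so treat this as one possible instance. Each hyperplane contributes one bit (which side the vector falls on), and vectors separated by a small angle get signatures with a small Hamming distance. All function names here are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(int(sum(v * w for v, w in zip(vector, plane)) >= 0)
                 for plane in hyperplanes)

def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(dim=3, n_bits=64)
query = signature([1.0, 0.0, 0.0], planes)
near = signature([1.0, 0.1, 0.0], planes)
far = signature([-1.0, 0.2, 0.0], planes)
```

Candidates for nearest neighbours are then the items whose signatures are close to the query's, for example `hamming(query, near)` compared with `hamming(query, far)`, instead of comparing full vectors.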

Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1alpha (UC Berkeley and MIT)

BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data

References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html

Summary
● know the data structures
● know what you sacrifice
● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov
