Hash Functions FTW

Hash Functions FTW*
Fast Hashing, Bloom Filters & Hash-Oriented Storage

Sunny Gleason

* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions

What’s in this Presentation

• Hash Function Survey
• Hash Performance
• Bloom Filters
• HashFile : Hash Storage

Hash Functions
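int getIntHash(byte[] data);   // 32-bit
long getLongHash(byte[] data); // 64-bit

int v1 = hash("foo".getBytes()); int v2 = hash("goo".getBytes());

int hash(byte[] value) { // a simple rotate-and-XOR hash
  int h = 0;
  // >>> (unsigned shift) so the rotated-in bits are not sign-extended
  for (byte b : value) { h = (h << 5) ^ (h >>> 27) ^ b; }
  // floorMod keeps the result non-negative; PRIME is a constant not shown on the slide
  return Math.floorMod(h, PRIME);
}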

Hash Functions

• Goal : v1 has many bit differences from v2
• Desirable Properties:
• Uniform Distribution - few collisions
• Very Fast Computation
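
The "many bit differences" goal can be measured directly. The sketch below is illustrative only: the class and method names are hypothetical, and rotatingHash is the simple hash from the earlier slide with the final "% PRIME" step left out so that raw bits can be compared.

```java
final class BitDiffDemo {
    // The slide's rotate-and-XOR hash, without the final "% PRIME" step.
    static int rotatingHash(byte[] value) {
        int h = 0;
        for (byte b : value) {
            // rotateLeft(h, 5) is (h << 5) | (h >>> 27)
            h = Integer.rotateLeft(h, 5) ^ b;
        }
        return h;
    }

    // Number of bit positions in which two 32-bit hash values differ.
    static int bitDifference(int v1, int v2) {
        return Integer.bitCount(v1 ^ v2);
    }
}
```

For "foo" and "goo", which differ in a single input bit, bitDifference on the two rotatingHash values is 1: rotation and XOR only move that one bit around. A hash with stronger mixing would be expected to flip roughly half of the 32 output bits.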

Hash Applications
Goal: O(1) access
• Hash Table
• Hash Set
• Bloom Filter

Popular Hash Functions
• FNV Hash
• DJB Hash
• Jenkins Hash
• Murmur2
• New (Promising?): CrapWow
• Awesome & Slow: SHA-1, MD5 etc.
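
For reference, two of the functions listed above fit in a few lines each. This is a minimal sketch using the commonly published constants for 32-bit FNV-1a and for Bernstein's additive "times 33" hash; the class name is hypothetical, and the code is not taken from the g414-hash project.

```java
final class HashSurvey {
    // 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime.
    static int fnv1a32(byte[] data) {
        int h = 0x811c9dc5;           // FNV offset basis
        for (byte b : data) {
            h ^= (b & 0xff);          // treat the byte as unsigned
            h *= 16777619;            // FNV prime (0x01000193)
        }
        return h;
    }

    // Bernstein's additive hash: h = h * 33 + byte, starting at 5381.
    // (cdb uses an XOR variant of the same idea.)
    static int djb2(byte[] data) {
        int h = 5381;
        for (byte b : data) {
            h = ((h << 5) + h) + (b & 0xff);
        }
        return h;
    }
}
```

Both loops do one cheap arithmetic step per input byte, which is why functions of this family are fast compared with SHA-1 or MD5.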

Evaluating Hash Functions
• Hash Function “Zoo”
• Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
• Performance (MM ops/s): [bar chart; series labels and values not recoverable]

A Strawman “Set”
• N keys, K bytes per key
• Allocate array of size K * N bytes
• Utilize array storage as:
• a heap or tree: O(lg N) insert/lookup/remove
• a hash: O(1) insert/lookup/remove
• What if we don’t have room for K*N
bytes?

Bloom Filter
• Key Point: give up on storing all the keys
• Store r bits per key instead of K bytes
• Allocate bit vector of size: M = r * N,
where N is expected number of entries
• Use multiple hash functions of key to
determine which bits to set
• Premise: if hash functions are well-
distributed, few collisions, high accuracy
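
The bullets above can be sketched as a small in-memory filter. This is an illustration under stated assumptions, not the g414-hash implementation: the class name is hypothetical, the k bit positions are derived from two seeded FNV-1a passes (double hashing) instead of k independent hash functions, and k follows the 0.7 × bits-per-key rule of thumb from the tuning slide.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;   // M = r * N
    private final int numHashes; // k

    SimpleBloomFilter(int expectedEntries, int bitsPerKey) {
        this.numBits = Math.multiplyExact(expectedEntries, bitsPerKey);
        this.numHashes = Math.max(1, (int) Math.round(0.7 * bitsPerKey));
        this.bits = new BitSet(numBits);
    }

    // FNV-1a style pass with a caller-supplied starting value.
    private static int seededHash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position for a key: h1 + i * h2, reduced into [0, numBits).
    private int index(byte[] key, int i) {
        int h1 = seededHash(key, 0x811c9dc5);
        int h2 = seededHash(key, 0x01234567) | 1; // arbitrary second seed, forced odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because java.util.BitSet is indexed by int, this sketch tops out near 2 billion bits; filters at the billion-entry scale would need a long-indexed bit vector.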

Tuning Bloom Filters
Let r = M bits / N keys (r: num bits/key)
Let k = 0.7 * r (k: num hashes to use)
Let p = 0.6185 ** r (p: probability of false positives)

Working backwards, we can use desired false
positive rate p to tune the data structure space
consumption:

r = 8, p = 2.1e-2 r = 16, p = 4.5e-4
r = 24, p = 9.8e-6 r = 32, p = 2.1e-7
r = 40, p = 4.5e-9 r = 48, p = 9.6e-11
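
The same arithmetic as a small helper, including the "working backwards" step from a target false positive rate to bits per key. The class and method names are hypothetical; the constants 0.6185 and 0.7 are the ones given above.

```java
final class BloomTuning {
    static final double BASE = 0.6185;

    // p = 0.6185 ** r
    static double falsePositiveRate(double bitsPerKey) {
        return Math.pow(BASE, bitsPerKey);
    }

    // k = 0.7 * r, rounded to a whole number of hashes (at least 1)
    static int numHashes(double bitsPerKey) {
        return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
    }

    // Inverse of falsePositiveRate: smallest whole r with 0.6185 ** r <= targetP
    static int bitsPerKeyFor(double targetP) {
        return (int) Math.ceil(Math.log(targetP) / Math.log(BASE));
    }

    // Bit vector size in bytes for n keys at r bits per key (M = r * N bits)
    static long totalBytes(long n, int bitsPerKey) {
        return (n * bitsPerKey + 7) / 8;
    }
}
```

For a 1% target, bitsPerKeyFor(0.01) gives 10 bits per key and numHashes(10) gives 7 hashes; at 8 bits per key, 100MM entries need totalBytes(100_000_000L, 8) = 100,000,000 bytes.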

Bloom Filter Performance
100MM entries, 8bits/key : 833k ops/s
100MM entries, 32bits/key : 256k ops/s
1BN entries, 8bits/key : 714k ops/s
1BN entries, 32bits/key : 185k ops/s

Hypothesis : difference between 100MM and
1BN is due to locality of memory access in
smaller bit vector

Hash-Oriented Storage
• HashFile : 64-bit clone of djb’s constant db
“CDB”

• Plain ol’ Key/Value storage:
add(byte[] k, byte[] v), byte[] lookup(byte[] k)

• Constant aka “Immutable” Data Store
create(), add(k, v) ... , build() ... before lookup(k)

• Use properties of hash table to achieve
O(1) disk seeks per lookup

HashFile Structure
• Header (fixed width): table pointers,
contains offsets of hash tables and count of
elements per table
• Body (variable width): contains
concatenation of all keys and values (with
data lengths)
• Footer (fixed width): hash “tables”
containing long hash values of keys
alongside long offsets into body

HashFile Diagram
HEADER BODY FOOTER
p1s3p2s4p3s2p4s1 k1v1k2v2k3v3k4v4k5v5k6v6k7v7 hk7o7hk3o3hk4o4hk1o1

• Create: initialize empty header, start appending
keys/values while recording offsets and hash values
of keys

• Build: take list of hash values and offsets and turn
them into hash tables, backfill header with values

• Lookup: compute hash(key), compute offset into
table (hash modulo size of table), use table to find
offset into body, return the value from body
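
The create / build / lookup cycle above can be modeled in memory. The sketch below is a simplification under several assumptions: the class name MiniHashFile is hypothetical, there is one hash table instead of several, the table lives in two long arrays where the real format has an on-disk header and footer, and 64-bit FNV-1a with linear probing stands in for whatever HashFile actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MiniHashFile {
    private final ByteArrayOutputStream bodyOut = new ByteArrayOutputStream();
    private final List<long[]> pending = new ArrayList<>(); // {hash, body offset}
    private byte[] body;        // concatenated [keyLen][valLen][key][value] records
    private long[] slotHash;    // "footer": hash value per slot
    private long[] slotOffset;  // "footer": body offset + 1 per slot, 0 = empty

    static long hash(byte[] key) { // 64-bit FNV-1a (an assumption, see above)
        long h = 0xcbf29ce484222325L;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Create phase: append the record and remember its hash and offset.
    void add(byte[] k, byte[] v) {
        pending.add(new long[] { hash(k), bodyOut.size() });
        byte[] lens = ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array();
        bodyOut.write(lens, 0, lens.length);
        bodyOut.write(k, 0, k.length);
        bodyOut.write(v, 0, v.length);
    }

    // Build phase: turn the recorded (hash, offset) pairs into a hash table.
    void build() {
        body = bodyOut.toByteArray();
        int size = Math.max(2, pending.size() * 2); // half-full, so probing always ends
        slotHash = new long[size];
        slotOffset = new long[size];
        for (long[] e : pending) {
            int i = (int) Long.remainderUnsigned(e[0], size);
            while (slotOffset[i] != 0) i = (i + 1) % size;
            slotHash[i] = e[0];
            slotOffset[i] = e[1] + 1;
        }
    }

    // Lookup: hash modulo table size, probe, then read the record from the body.
    byte[] lookup(byte[] k) {
        long h = hash(k);
        int size = slotHash.length;
        for (int i = (int) Long.remainderUnsigned(h, size); slotOffset[i] != 0; i = (i + 1) % size) {
            if (slotHash[i] != h) continue;
            int start = (int) (slotOffset[i] - 1);
            ByteBuffer rec = ByteBuffer.wrap(body, start, body.length - start);
            byte[] key = new byte[rec.getInt()];
            byte[] value = new byte[rec.getInt()];
            rec.get(key);
            if (Arrays.equals(key, k)) {
                rec.get(value);
                return value;
            }
        }
        return null; // key not present
    }
}
```

As on the slide, add(k, v) calls must all come before build(), and lookup(k) is only valid after build(). On disk, the table probe and the body read are the two places where a seek can occur.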

HashFile Performance
• Spec: ≤ 2 disk seeks per lookup
• Number of seeks independent of number
of entries
• X25E SSD: 1BN 8-byte keys, values (41GB):
650μs lookup w/ cold cache, up to 700x
faster as filesystem cache warms, 0.9μs
when in-memory
• With 100MM entries (4GB), cold cache is
~600μs (from locality), 0.6μs warm

Conclusions

• Be aware of different Hash Functions and
their collision / performance tradeoffs
• Bloom Filters are extremely useful for fast,
large-scale set membership
• HashFile provides excellent performance in
cases where a static K/V store suffices

Future Work
• Implement cWow hash in Java
• Extend HashFile with configurable hash,
pointer, and key/value lengths to conserve
space (reduce 24 bytes-per-KV overhead)
• Implement a read-write (non-constant)
version of HashFile
• Bloom Filter that spills to SSD

Thank You!
...Any questions? :)

References
• GitHub Project: g414-hash (hash
function, bloom filter, HashFile
implementations)
• Wikipedia: Hash Function, Bloom Filter
• Non-Cryptographic Hash Function Zoo
• DJB CDB, sg-cdb (java implementation)
