2. Build the toolbox
Finding the best hash function for your use-case
1. How many types of hash functions are there?
2. Why are they different?
3. How are they related to each other?
4. How are they related to machine learning?
3. Are they the same or different?
Depends on the hash function
Small change in input => large change in output
OR
Small change in input => No change in output
4. Takeaway 1 : exact versus fuzzy
Exact : minimize collisions (small change in input => big change in output)
Fuzzy : maximize collisions (small change in input => no change in output)
Fuzzy + data-independent : locality sensitive
Fuzzy + data-dependent : learning to hash (using machine learning)
5. Takeaway 2 : part of the key changing
How to quickly update the hash when part of the key is changing?
1. Tabulation hash : the key is a sum of individual pieces
2. Rabin fingerprint : the key is a stream
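A minimal sketch of a tabulation hash, assuming (hypothetically) a 32-bit key split into four 8-bit pieces: each piece indexes its own table of random 64-bit values, and the per-piece hashes are "summed" with XOR.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# One table of 256 random 64-bit values per byte of a (hypothetical) 32-bit key
TABLES = [[random.getrandbits(64) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key: int) -> int:
    """Split the 32-bit key into 4 bytes; XOR together one table entry per byte."""
    h = 0
    for i in range(4):
        piece = (key >> (8 * i)) & 0xFF
        h ^= TABLES[i][piece]
    return h
```

Because the hash is an XOR of per-piece lookups, changing one piece of the key only means XOR-ing out the old table entry and XOR-ing in the new one.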
6. Not going to cover
● Regular hash functions like FNV, Jenkins, MurmurHash, etc.
● Cryptographic hash functions
● Extendible hashing
● Linear probing, etc.
● Perfect hashing
8. Chess game analysis
Build a game tree
Was this position reached before?
Number of unique chess games that can be played is about 10^40 to 10^120 [Shannon number]
9. Problem
Map each game position to a single random number
Change that random number on every move
Need a hash key which is a SUM of the individual positions
10. Chess combinations
How many combinations define a board?
1. 64 x 32
2. 64 x 18
3. 64 x 13
4. 64 x 12
Answer : 64 x 13
13 = 1 empty + 2 x (pawn, king, queen, bishop, rook, knight)
11. Zobrist hash
13 x 64 board positions = 832 combinations
Create a random bitstring for each piece at each of the 64 positions
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L114-L116
Piece  Random bitstring
  1    0x234234fa
  2    0x78ebfa21
  …    0x45e64564
 13    0x974e4534
13. Zobrist hash is incremental
Advantage of XOR : On every move, erase prev position and add new position
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L771
14. Why does it work?
Zobrist hash is a type of tabulation hash
Collision when : (x1 ^ x2 ^ … ^ x64) = (y1 ^ y2 ^ … ^ y64)
In other words : x1 ^ x2 ^ … ^ x64 ^ y1 ^ y2 ^ … ^ y64 = 0
How many bits are enough?
64 bits for chess, different for other games
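The slides above can be sketched as follows, with the table sizes from the deck (13 piece codes x 64 squares = 832 random bitstrings); the move helper assumes a plain non-capturing move.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

PIECES, SQUARES = 13, 64  # 1 empty + 2 x 6 piece types, per the deck

# One random 64-bit string per (piece, square) pair : 13 x 64 = 832 entries
ZOBRIST = [[random.getrandbits(64) for _ in range(SQUARES)] for _ in range(PIECES)]

def full_hash(board):
    """board[square] is a piece code in 0..12 (0 = empty); XOR all 64 entries."""
    h = 0
    for square, piece in enumerate(board):
        h ^= ZOBRIST[piece][square]
    return h

def make_move(h, piece, src, dst):
    """Incremental update for a plain non-capturing move."""
    h ^= ZOBRIST[piece][src]  # erase the piece from its old square
    h ^= ZOBRIST[0][src]      # the old square becomes empty
    h ^= ZOBRIST[0][dst]      # the new square is no longer empty
    h ^= ZOBRIST[piece][dst]  # add the piece on its new square
    return h
```

Four XORs per move instead of a 64-term rehash: that is the incremental advantage.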
16. Problem
How to build a hash function on a streaming window?
Stream : 1, 2, 3, 4, 5, 6, 7, 8
● One solution was to create “shingles” and hash them : hash(1234), hash(2345), hash(3456), …
● No incremental update!
17. Decimal base example
Let’s say number = 2312425254, prime = 97
hash(231) = 231 % 97
hash(312) = [(hash(231) - 2 x 100) x 10 + 2] % 97
hash(124) = [(hash(312) - 3 x 100) x 10 + 4] % 97
In general : new_hash = [(old_hash - first_digit x 100) x 10 + new_digit] % 97
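The update rule above, rolled over the whole number (window size and prime as in the slide):

```python
def rolling_hashes(digits: str, window: int = 3, prime: int = 97):
    """Hash every `window`-digit substring, reusing the previous hash."""
    h = int(digits[:window]) % prime
    out = [h]
    power = 10 ** (window - 1)  # place value of the digit that drops out
    for i in range(window, len(digits)):
        leaving = int(digits[i - window])
        entering = int(digits[i])
        # drop the leading digit, shift left, append the new digit
        h = ((h - leaving * power) * 10 + entering) % prime
        out.append(h)
    return out
```

Each window costs O(1): `rolling_hashes("2312425254")` reproduces 231 % 97, 312 % 97, 124 % 97, and so on, without rehashing each window from scratch.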
18. Arithmetic over a smaller set of numbers
Galois Field (GF) = a smaller set of numbers
GF(4) has the four elements (00, 01, 10, 11), but it is not the integers mod 4
Can do addition and multiplication
19. Rabin fingerprint
Does arithmetic over a smaller set of numbers
Decimal base                                  | Rabin fingerprint in Galois field (GF)
Prime number 97                               | Irreducible polynomial (say x^2 + x + 1)
2031                                          | 0b1011 becomes x^3 + x + 1
2031 (mod p)                                  | (x^3 + x + 1) mod (irreducible poly)
Fingerprint = 2031 (mod 97)                   | Fingerprint = (x^3 + x + 1) (mod irreducible poly)
Easy to update a stream :                     | Easy update in binary Galois fields
hash(124) = [(hash(312) - 300) x 10 + 4] % 97 |
20. Computing division mod p
https://github.com/opendedup/rabinfingerprint/blob/master/src/org/rabinfingerprint/polynomial/Polynomials.java
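A minimal sketch of this division, with GF(2) polynomials packed into Python ints as bit masks (bit i is the coefficient of x^i); it mirrors the idea in the linked Java code, not its exact API.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of polynomial a modulo polynomial p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # subtraction in GF(2) is XOR : cancel the leading term of a
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a

# e.g. poly_mod(0b1011, 0b111) == 0b10,
# i.e. (x^3 + x + 1) mod (x^2 + x + 1) = x
```

Because carries never propagate in GF(2), the whole long division is just shifts and XORs.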
22. Probability of error
Just as (mod p) in decimal can lead to collisions...
Similarly, in Galois Field GF(2^k)
If k > log(nm/e), the probability of error is less than “e”
Where
1. Pattern length = n
2. Text length = m
3. Fingerprint polynomial degree has to be >= k
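A quick worked calculation of the bound, assuming the log is base 2:

```python
import math

def min_fingerprint_bits(n: int, m: int, e: float) -> int:
    """Smallest integer k with k > log2(n * m / e)."""
    return math.floor(math.log2(n * m / e)) + 1

# pattern length 10^3, text length 10^6, error below 10^-6 : 50 bits suffice
```

So even demanding a one-in-a-million error rate over a megabyte of text keeps the fingerprint comfortably inside a 64-bit word.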
23. Exact versus fuzzy
Exact : minimize collisions (small change in input => big change in output)
Fuzzy : maximize collisions (small change in input => no change in output)
Fuzzy + data-independent : locality sensitive
Fuzzy + data-dependent : learning to hash (using machine learning)
25. Problem of spatial hash
Map any location on the earth to a fuzzy hash
Ability to find neighbouring locations only based on their hash
Ability to reduce or increase the resolution
26. GeoHash
Geohash is the result of a binary search down the grid
1. Even bit = 0 if left of the longitude midpoint, else = 1
2. Odd bit = 0 if below the latitude midpoint, else = 1
https://www.researchgate.net/figure/Geohash-binary-code_fig2_332061286
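The binary search above, as a sketch (standard geohash ranges; a bit is 0 when the point falls in the lower half of the current interval):

```python
def geohash_bits(lat: float, lon: float, nbits: int) -> str:
    """Interleave longitude (even bits) and latitude (odd bits) halvings."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    for i in range(nbits):
        if i % 2 == 0:  # even bit : halve the longitude interval
            mid = (lon_lo + lon_hi) / 2
            if lon < mid:
                bits.append("0")
                lon_hi = mid
            else:
                bits.append("1")
                lon_lo = mid
        else:           # odd bit : halve the latitude interval
            mid = (lat_lo + lat_hi) / 2
            if lat < mid:
                bits.append("0")
                lat_hi = mid
            else:
                bits.append("1")
                lat_lo = mid
    return "".join(bits)
```

Truncating the bitstring widens the cell, which is what gives geohash its adjustable resolution.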
27. GeoHash
Problems with geohash
1. The Earth is a sphere, but the rectangle area changes based on latitude
2. In some cases, neighbours may not be adjacent
29. Uber H3
Why hexagons? The Earth is like a soccer ball
The key idea is “tessellation” (regular tiling)
Many hexagons + few pentagons can cover a soccer ball
30. Uber H3
Hexagon has an advantage
1. Fixed number of neighbours
2. Fixed distance to all 6 neighbours
32. Uber H3 advantages
1. Map any location into a 64-bit number
geoToH3 (latitude, longitude, resolution) => 64 bit number
2. Find adjacent cells using the coordinate system
3. Define route from point A to point B
33. Uber H3 advantages
4. Is one cell inside another? (yes, do a prefix match)
5. Want to save space? (yes, truncate the hash to reduce resolution)
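A toy model of the prefix trick behind points 4 and 5. It assumes (hypothetically) that each extra resolution level appends a 3-bit child index to the cell id; this is NOT H3's real bit layout, only the idea.

```python
# Toy model : a cell id is a sequence of 3-bit child indices, one per
# resolution level. This is not H3's actual encoding, only the prefix idea.

def parent(cell: int, res: int, parent_res: int) -> int:
    """Truncate the trailing levels to get the coarser-resolution ancestor."""
    return cell >> (3 * (res - parent_res))

def contains(coarse: int, coarse_res: int, fine: int, fine_res: int) -> bool:
    """Containment reduces to a prefix match on the child-index digits."""
    return parent(fine, fine_res, coarse_res) == coarse
```

For example, a resolution-3 cell with child path [5, 2, 6] has the resolution-1 ancestor 5, and "save space" is just dropping trailing digits.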
34. Exact versus fuzzy
Exact : minimize collisions (small change in input => big change in output)
Fuzzy : maximize collisions (small change in input => no change in output)
Fuzzy + data-independent : locality sensitive
Fuzzy + data-dependent : learning to hash (using machine learning)
38. How the graph is stored and fetched
New features are being developed (newsfeed, albums, notifications)
For each GraphQL query, many servers are contacted to fetch objects
Frontend :=> GraphQL :=> PHP layer
Facebook is read-intensive : 90 percent of requests are reads
39. Sharding challenge
Assign social graph nodes to servers such that
1. Reduce fanout : ensure objects which will be fetched together are on the same machine.
Especially a problem with celebrities (i.e. how to decide the closest neighbours?)
2. Stability of assignment : avoid continuously moving objects between servers
40. Consistent hashing is not optimal
Earlier, they used consistent hashing
Each object has a random “fbid” (64 bit integer)
FBID (mod P) is mapped to a slot in some virtual ring
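For contrast, a generic consistent-hash ring sketch; the server names and vnode count are made up, and this is not Facebook's implementation.

```python
import bisect
import hashlib

class ConsistentRing:
    """Map object ids to servers via positions on a hash ring."""

    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` pseudo-random positions on the ring
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.keys = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key) -> int:
        digest = hashlib.sha256(str(key).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, fbid: int) -> str:
        """Pick the first ring position clockwise from hash(fbid)."""
        i = bisect.bisect(self.keys, self._hash(fbid)) % len(self.ring)
        return self.ring[i][1]
```

The assignment is stable when servers come and go, but because the placement is random it does nothing to keep friends together, which is the gap the social hash fills.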
41. Building the social hash
Two step process : static and dynamic
Run bipartite graph partitioning using Pregel.
Each iteration begins with the previous assignment to provide stability
[Diagram : bipartite graph with queries (Query1, Query2, …, QueryN) on one side and objects (fbid 12, 212, 1232, 86, 3, …) on the other]
42. The custom hash function
“...1.5B+ Facebook users into 21,000 balanced groups such that each user shares her group with at least 50% of her friends” (from their paper)
44. There is a gradation in hashing...
You can create a hash function which is tuned to your requirements
1. Exact : FNV, MD5
2. Syntactic : PhotoDNA
3. Deeper Syntactic : Geometric hash, SIFT
4. Semantic hash : Machine learning classifier
https://github.com/facebook/ThreatExchange/blob/master/hashing/hashing.pdf/
45. For example, perceptual hash
Apply an averaging and normalizing process
1. Normalize the size
2. Reduce the color
3. Average the color
4. Use Image transforms (DCT) to extract features from image
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
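The averaging process above is the "aHash" described in the linked post; a minimal sketch on an already-downscaled grayscale grid (no image library; the pixel grid is assumed given).

```python
def average_hash(pixels) -> int:
    """pixels : 2-D list of grayscale values (an already-downscaled image,
    e.g. 8x8). Bit = 1 where a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    h = 0
    for p in flat:
        h = (h << 1) | (1 if p > mean else 0)
    return h

def hamming(a: int, b: int) -> int:
    """Distance between two hashes = number of differing bits."""
    return bin(a ^ b).count("1")
```

Small edits barely move the mean, so near-duplicates land a small Hamming distance apart: exactly the "maximize collisions" behaviour of a fuzzy hash.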
46. Other fuzzy Image/Audio/Text hashes
SIFT
Geometric hash : find rotated objects, by using a common reference frame
Bag of words, TF-IDF
48. Learning-to-hash => neural network
Social Hash => clustering using GNN (Graph neural network)
Perceptual/Geometric Hash => do it via CNN
Use of Doc2Vec, Node2Vec
Learned indexes paper by Google
49. Hashing with a neural network
A neural network can be taught to find similarities (feature extraction)
When does a neural network make sense?
1. If you want a deeper semantic match
2. If you have enough training data
3. If you are willing to trust it (i.e. you have no control over the hash function code)
50. Conclusion
Exact
1. Zobrist Hash : XOR of individual piece hashes
2. Rolling Hash : streaming (drop the hash of the first char, add the hash of the new char)
Fuzzy
3. Spatial Hash : Uber H3
4. Social Hash : keep related graph nodes closer
Neural network classifier : it is building a fuzzy, data-dependent hash function