2. Build the toolbox
Finding the best hash function for your use-case
1. How many types of hash functions are there?
2. Why are they different?
3. How are they related to each other?
4. How are they related to machine learning?
3. Are they the same or different?
Depends on the hash function
Small change in input => large change in output
OR
Small change in input => No change in output
4. Takeaway 1 : exact versus fuzzy
Exact : minimize collisions (small change in input => big change in output)
Fuzzy : maximize collisions (small change in input => no change in output)
Fuzzy + data-independent : locality sensitive
Fuzzy + data-dependent : learning to hash (using machine learning)
5. Takeaway 2 : part of the key changing
How to quickly update the hash when part of the key is changing?
1. Tabulation hash : the key is a sum of individual pieces
2. Rabin fingerprint : the key is a stream
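A minimal sketch of a tabulation hash, assuming (hypothetically) a 32-bit key split into four 8-bit pieces: each piece indexes its own table of random 64-bit values, and the per-piece hashes are "summed" with XOR.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# One table of 256 random 64-bit values per byte of a (hypothetical) 32-bit key
TABLES = [[random.getrandbits(64) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key: int) -> int:
    """Split the 32-bit key into 4 bytes; XOR together one table entry per byte."""
    h = 0
    for i in range(4):
        piece = (key >> (8 * i)) & 0xFF
        h ^= TABLES[i][piece]
    return h
```

Because the hash is an XOR of per-piece lookups, changing one piece of the key only means XOR-ing out the old table entry and XOR-ing in the new one.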
6. Not going to cover
● Regular hash functions like FNV, Jenkins, MurmurHash, etc.
● Cryptographic hash functions
● Extendible hashing
● Linear probing, etc.
● Perfect hashing
8. Chess game analysis
Build a game tree
Was this position reached before?
Number of unique chess games that can be played is about 10^40 to 10^120 [Shannon number]
9. Problem
Map each game position to a single random number
Change that random number on every move
Need a hash key which is a SUM of the individual positions
10. Chess combinations
How many combinations define a board?
1. 64 x 32
2. 64 x 18
3. 64 x 13
4. 64 x 12
Answer : 64 x 13
13 = 1 empty + 2 x (pawn, king, queen, bishop, rook, knight)
11. Zobrist hash
13 x 64 board positions = 832 combinations
Create a random bitstring for each piece at each of the 64 positions
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L114-L116
Piece  Random bitstring
  1    0x234234fa
  2    0x78ebfa21
  …    0x45e64564
 13    0x974e4534
13. Zobrist hash is incremental
Advantage of XOR : On every move, erase prev position and add new position
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L771
14. Why does it work?
Zobrist hash is a type of tabulation hash
Collision when : (x1 ^ x2 ^ … ^ x64) = (y1 ^ y2 ^ … ^ y64)
In other words : x1 ^ x2 ^ … ^ x64 ^ y1 ^ y2 ^ … ^ y64 = 0
How many bits are enough?
64 bits for chess, different for other games
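The slides above can be sketched as follows, with the table sizes from the deck (13 piece codes x 64 squares = 832 random bitstrings); the move helper assumes a plain non-capturing move.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

PIECES, SQUARES = 13, 64  # 1 empty + 2 x 6 piece types, per the deck

# One random 64-bit string per (piece, square) pair : 13 x 64 = 832 entries
ZOBRIST = [[random.getrandbits(64) for _ in range(SQUARES)] for _ in range(PIECES)]

def full_hash(board):
    """board[square] is a piece code in 0..12 (0 = empty); XOR all 64 entries."""
    h = 0
    for square, piece in enumerate(board):
        h ^= ZOBRIST[piece][square]
    return h

def make_move(h, piece, src, dst):
    """Incremental update for a plain non-capturing move."""
    h ^= ZOBRIST[piece][src]  # erase the piece from its old square
    h ^= ZOBRIST[0][src]      # the old square becomes empty
    h ^= ZOBRIST[0][dst]      # the new square is no longer empty
    h ^= ZOBRIST[piece][dst]  # add the piece on its new square
    return h
```

Four XORs per move instead of a 64-term rehash: that is the incremental advantage.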
16. Problem
How to build a hash function on a streaming window?
Stream : 1, 2, 3, 4, 5, 6, 7, 8
● One solution was to create “shingles” and hash them : hash(1234), hash(2345), hash(3456), …
● No incremental update!
17. Decimal base example
Let’s say number = 2312425254, prime = 97
hash(231) = 231 % 97
hash(312) = [(hash(231) - 2 x 100) x 10 + 2] % 97
hash(124) = [(hash(312) - 3 x 100) x 10 + 4] % 97
In general : new_hash = [(old_hash - first_digit x 100) x 10 + new_digit] % 97
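The update rule above, rolled over the whole number (window size and prime as in the slide):

```python
def rolling_hashes(digits: str, window: int = 3, prime: int = 97):
    """Hash every `window`-digit substring, reusing the previous hash."""
    h = int(digits[:window]) % prime
    out = [h]
    power = 10 ** (window - 1)  # place value of the digit that drops out
    for i in range(window, len(digits)):
        leaving = int(digits[i - window])
        entering = int(digits[i])
        # drop the leading digit, shift left, append the new digit
        h = ((h - leaving * power) * 10 + entering) % prime
        out.append(h)
    return out
```

Each window costs O(1): `rolling_hashes("2312425254")` reproduces 231 % 97, 312 % 97, 124 % 97, and so on, without rehashing each window from scratch.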
18. Arithmetic over a smaller set of numbers
Galois Field (GF) = a smaller set of numbers
GF(4) has the four elements (00, 01, 10, 11), but it is not the integers mod 4
Can do addition and multiplication
19. Rabin fingerprint
Does arithmetic over a smaller set of numbers
Decimal base                                  | Rabin fingerprint in Galois field (GF)
Prime number 97                               | Irreducible polynomial (say x^2 + x + 1)
2031                                          | 0b1011 becomes x^3 + x + 1
2031 (mod p)                                  | (x^3 + x + 1) mod (irreducible poly)
Fingerprint = 2031 (mod 97)                   | Fingerprint = (x^3 + x + 1) (mod irreducible poly)
Easy to update a stream :                     | Easy update in binary Galois fields
hash(124) = [(hash(312) - 300) x 10 + 4] % 97 |
20. Computing division mod p
https://github.com/opendedup/rabinfingerprint/blob/master/src/org/rabinfingerprint/polynomial/Polynomials.java
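A minimal sketch of this division, with GF(2) polynomials packed into Python ints as bit masks (bit i is the coefficient of x^i); it mirrors the idea in the linked Java code, not its exact API.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of polynomial a modulo polynomial p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # subtraction in GF(2) is XOR : cancel the leading term of a
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a

# e.g. poly_mod(0b1011, 0b111) == 0b10,
# i.e. (x^3 + x + 1) mod (x^2 + x + 1) = x
```

Because carries never propagate in GF(2), the whole long division is just shifts and XORs.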
22. Probability of error
Just as (mod p) in decimal can lead to collisions...
Similarly, in Galois Field GF(2^k)
If k > log(nm/e), the probability of error is less than “e”
Where
1. Pattern length = n
2. Text length = m
3. Fingerprint polynomial degree has to be >= k
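A quick worked calculation of the bound, assuming the log is base 2:

```python
import math

def min_fingerprint_bits(n: int, m: int, e: float) -> int:
    """Smallest integer k with k > log2(n * m / e)."""
    return math.floor(math.log2(n * m / e)) + 1

# pattern length 10^3, text length 10^6, error below 10^-6 : 50 bits suffice
```

So even demanding a one-in-a-million error rate over a megabyte of text keeps the fingerprint comfortably inside a 64-bit word.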
23. Exact versus fuzzy
Exact : minimize collisions (small change in input => big change in output)
Fuzzy : maximize collisions (small change in input => no change in output)
Fuzzy + data-independent : locality sensitive
Fuzzy + data-dependent : learning to hash (using machine learning)
25. Problem of spatial hash
Map any location on the earth to a fuzzy hash
Ability to find neighbouring locations only based on their hash
Ability to reduce or increase the resolution
26. GeoHash
Geohash is the result of a binary search down the grid
1. Even bit = 0 if left of the longitude midpoint, else = 1
2. Odd bit = 0 if below the latitude midpoint, else = 1
https://www.researchgate.net/figure/Geohash-binary-code_fig2_332061286
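The binary search above, as a sketch (standard geohash ranges; a bit is 0 when the point falls in the lower half of the current interval):

```python
def geohash_bits(lat: float, lon: float, nbits: int) -> str:
    """Interleave longitude (even bits) and latitude (odd bits) halvings."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    for i in range(nbits):
        if i % 2 == 0:  # even bit : halve the longitude interval
            mid = (lon_lo + lon_hi) / 2
            if lon < mid:
                bits.append("0")
                lon_hi = mid
            else:
                bits.append("1")
                lon_lo = mid
        else:           # odd bit : halve the latitude interval
            mid = (lat_lo + lat_hi) / 2
            if lat < mid:
                bits.append("0")
                lat_hi = mid
            else:
                bits.append("1")
                lat_lo = mid
    return "".join(bits)
```

Truncating the bitstring widens the cell, which is what gives geohash its adjustable resolution.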
27. GeoHash
Problems with geohash
1. The Earth is a sphere, but the rectangle area changes based on latitude
2. In some cases, neighbours may not be adjacent
29. Uber H3
Why hexagons? The Earth is like a soccer ball
The key idea is “tessellation” (regular tiling)
Many hexagons + few pentagons can cover a soccer ball
30. Uber H3
Hexagon has an advantage
1. Fixed number of neighbours
2. Fixed distance to all 6 neighbours
32. Uber H3 advantages
1. Map any location into a 64-bit number
geoToH3 (latitude, longitude, resolution) => 64 bit number
2. Find adjacent cells using the coordinate system
3. Define route from point A to point B
33. Uber H3 advantages
4. Is one cell inside another? (yes, do a prefix match)
5. Want to save space? (yes, truncate the hash to reduce resolution)
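A toy model of the prefix trick behind points 4 and 5. It assumes (hypothetically) that each extra resolution level appends a 3-bit child index to the cell id; this is NOT H3's real bit layout, only the idea.

```python
# Toy model : a cell id is a sequence of 3-bit child indices, one per
# resolution level. This is not H3's actual encoding, only the prefix idea.

def parent(cell: int, res: int, parent_res: int) -> int:
    """Truncate the trailing levels to get the coarser-resolution ancestor."""
    return cell >> (3 * (res - parent_res))

def contains(coarse: int, coarse_res: int, fine: int, fine_res: int) -> bool:
    """Containment reduces to a prefix match on the child-index digits."""
    return parent(fine, fine_res, coarse_res) == coarse
```

For example, a resolution-3 cell with child path [5, 2, 6] has the resolution-1 ancestor 5, and "save space" is just dropping trailing digits.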
34. Exact versus fuzzy
Exact : minimize collisions (small change in input => big change in output)
Fuzzy : maximize collisions (small change in input => no change in output)
Fuzzy + data-independent : locality sensitive
Fuzzy + data-dependent : learning to hash (using machine learning)
38. How the graph is stored and fetched
New features are being developed (newsfeed, albums, notifications)
For each GraphQL query, many servers are contacted to fetch objects
Frontend :=> GraphQL :=> PHP layer
Facebook is read-intensive : 90 percent of requests are reads
39. Sharding challenge
Assign social graph nodes to servers such that
1. Reduce fanout : ensure objects which will be fetched together are on the same machine.
Especially a problem with celebrities (i.e. how to decide the closest neighbours?)
2. Stability of assignment : avoid continuously moving objects between servers
40. Consistent hashing is not optimal
Earlier, they used consistent hashing
Each object has a random “fbid” (64 bit integer)
FBID (mod P) is mapped to a slot in some virtual ring
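For contrast, a generic consistent-hash ring sketch; the server names and vnode count are made up, and this is not Facebook's implementation.

```python
import bisect
import hashlib

class ConsistentRing:
    """Map object ids to servers via positions on a hash ring."""

    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` pseudo-random positions on the ring
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.keys = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key) -> int:
        digest = hashlib.sha256(str(key).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, fbid: int) -> str:
        """Pick the first ring position clockwise from hash(fbid)."""
        i = bisect.bisect(self.keys, self._hash(fbid)) % len(self.ring)
        return self.ring[i][1]
```

The assignment is stable when servers come and go, but because the placement is random it does nothing to keep friends together, which is the gap the social hash fills.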
41. Building the social hash
Two step process : static and dynamic
Run bipartite graph partitioning using Pregel.
Each iteration begins with the previous assignment to provide stability
[Diagram : bipartite graph with queries (Query1, Query2, …, QueryN) on one side and objects (fbid 12, 212, 1232, 86, 3, …) on the other]
42. The custom hash function
“...1.5B+ Facebook users into 21,000 balanced groups such that each user shares her group with at least 50% of her friends” (from their paper)
44. There is a gradation in hashing...
You can create a hash function which is tuned to your requirements
1. Exact : FNV, MD5
2. Syntactic : PhotoDNA
3. Deeper Syntactic : Geometric hash, SIFT
4. Semantic hash : Machine learning classifier
https://github.com/facebook/ThreatExchange/blob/master/hashing/hashing.pdf/
45. For example, perceptual hash
Apply an averaging and normalizing process
1. Normalize the size
2. Reduce the color
3. Average the color
4. Use Image transforms (DCT) to extract features from image
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
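The averaging process above is the "aHash" described in the linked post; a minimal sketch on an already-downscaled grayscale grid (no image library; the pixel grid is assumed given).

```python
def average_hash(pixels) -> int:
    """pixels : 2-D list of grayscale values (an already-downscaled image,
    e.g. 8x8). Bit = 1 where a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    h = 0
    for p in flat:
        h = (h << 1) | (1 if p > mean else 0)
    return h

def hamming(a: int, b: int) -> int:
    """Distance between two hashes = number of differing bits."""
    return bin(a ^ b).count("1")
```

Small edits barely move the mean, so near-duplicates land a small Hamming distance apart: exactly the "maximize collisions" behaviour of a fuzzy hash.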
46. Other fuzzy Image/Audio/Text hashes
SIFT
Geometric hash : find rotated objects, by using a common reference frame
Bag of words, TF-IDF
48. Learning-to-hash => neural network
Social Hash => clustering using GNN (Graph neural network)
Perceptual/Geometric Hash => do it via CNN
Use of Doc2Vec, Node2Vec
Learned indexes paper by Google
49. Hashing with a neural network
A neural network can be taught to find similarities (feature extraction)
When does a neural network make sense?
1. If you want a deeper semantic match
2. If you have enough training data
3. If you are willing to trust it (i.e. you have no control over the hash function code)
50. Conclusion
Exact
1. Zobrist Hash : XOR of individual piece hashes
2. Rolling Hash : streaming (drop the hash of the first char, add the hash of the new char)
Fuzzy
3. Spatial Hash : Uber H3
4. Social Hash : keep related graph nodes closer
Neural network classifier : it is building a fuzzy, data-dependent hash function