The hash function landscape
Sandeep Joshi
Aug 14, 2020
Build the toolbox
Finding the best hash function for your use-case
1. How many types of hash functions are there ?
2. Why are they different ?
3. How are they related to each other ?
4. How are they related to Machine learning ?
Are they same or different ?
Depends on the hash function
Small change in input => large change in output
OR
Small change in input => No change in output
Takeaway 1 : exact versus fuzzy
Exact : minimize collisions
  Small change in input => big change in output
Fuzzy : maximize collisions
  Small change in input => no change in output
  Data-independent : Locality sensitive hashing
  Data-dependent : Learning to hash (using Machine Learning)
Takeaway 2 : part of the key changing
How to quickly update the hash when part of the key is changing ?
1. Tabulation hash : key is sum of individual pieces
2. Rabin fingerprint : key is a stream
Not going to cover
● Regular hash functions like FNV, Jenkins, MurmurHash, etc
● Cryptographic hash functions
● Extendible hashing
● Linear Probing, etc
● Perfect hash
Zobrist/Tabulation hash
Chess game analysis
Build a game tree
Was this position reached before ?
Number of unique chess games that can be played is about 10^40 to 10^120 [Shannon number]
Problem
Map game position to single random number
Change that random number on every move
Need a hash key which is SUM of individual positions
Chess combinations
How many combinations define a board ?
1. 64 x 32
2. 64 x 18
3. 64 x 13
4. 64 x 12
Answer : 64 x 13
13 = 1 empty + 2 x (pawn, king, queen, bishop, rook, knight)
Zobrist hash
13 x 64 board positions = 832 combinations
Create a random bitstring for each of the 13 states at each of the 64 positions
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L114-L116
1 0x234234fa
2 0x78ebfa21
... 0x45e64564
13 0x974e4534
Zobrist hash
Hash key = XOR of 64 keys
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L345-L349
1 0x234234fa
2 0x78ebfa21
... 0x45e64564
13 0x974e4534
Zobrist hash is incremental
Advantage of XOR : On every move, erase prev position and add new position
https://github.com/official-stockfish/Stockfish/blob/4006f2c9132db034a27a94be33070d6aaab75b24/src/position.cpp#L771
Why does it work ?
Zobrist hash is a type of Tabulation hash
Collision when : (x1 ^ x2 ^ … ^ x64 ) = (y1 ^ y2 ^ … ^ y64)
In other words : x1 ^ x2 ^ … ^ x64 ^ y1 ^ y2 ^ … ^ y64 = 0
How many bits are enough ?
64-bits for chess, different for other games
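A minimal Python sketch of the Zobrist scheme described above, assuming the 13-states-per-square model from the slides (Stockfish itself is organised differently). The piece names, the seed, and the helper names `full_hash` and `move` are illustrative choices, not taken from any engine.

```python
import random

# 13 states per square: empty + 2 colours x 6 piece types (names are illustrative).
PIECES = ["empty", "wP", "wN", "wB", "wR", "wQ", "wK",
          "bP", "bN", "bB", "bR", "bQ", "bK"]
NUM_SQUARES = 64

rng = random.Random(2020)  # fixed seed so the key table is reproducible
# One random 64-bit key per (square, state) pair: 64 x 13 = 832 keys.
ZOBRIST = [[rng.getrandbits(64) for _ in PIECES] for _ in range(NUM_SQUARES)]
INDEX = {name: i for i, name in enumerate(PIECES)}


def full_hash(board):
    """XOR together the key of every square's current state (64 keys)."""
    h = 0
    for square, piece in enumerate(board):
        h ^= ZOBRIST[square][INDEX[piece]]
    return h


def move(board, h, src, dst):
    """Move the piece on src to dst and update the hash incrementally."""
    piece, captured = board[src], board[dst]
    # XOR is its own inverse: XOR-ing a key a second time erases it.
    h ^= ZOBRIST[src][INDEX[piece]] ^ ZOBRIST[src][INDEX["empty"]]
    h ^= ZOBRIST[dst][INDEX[captured]] ^ ZOBRIST[dst][INDEX[piece]]
    board[src], board[dst] = "empty", piece
    return h


board = ["empty"] * NUM_SQUARES
board[12] = "wP"
board[60] = "bK"
h = full_hash(board)
h = move(board, h, 12, 28)  # touches 4 keys instead of recomputing all 64
```

The incremental update costs four XORs per move regardless of board size, and moving the piece back restores the earlier hash value.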
Rabin fingerprint
Problem
How to build a hash function on a streaming window ?
● One solution was to create “shingles” and hash them
● No incremental update !
hash(1234), hash(2345), hash(3456), ….
1, 2, 3, 4, 5, 6, 7, 8
Decimal base example
Let’s say number = 2312425254, prime = 97
hash(231) = 231 % 97
hash(312) = 312 % 97 = [(hash(231) - 200) x 10 + 2] % 97
hash(124) = [(hash(312) - 300) x 10 + 4] % 97
In general : new_hash = [(old_hash - first_digit x 100) x 10 + last_digit] % 97
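The decimal example can be checked with a short Python sketch. The function name `rolling_hashes` and its parameters are hypothetical; the window of 3 digits and the prime 97 come from the slide.

```python
def rolling_hashes(digits, window=3, prime=97, base=10):
    """Hash every `window`-digit slice of `digits`, updating incrementally."""
    high = base ** (window - 1)        # weight of the digit leaving the window (100)
    h = int(digits[:window]) % prime   # first window is hashed directly
    hashes = [h]
    for i in range(window, len(digits)):
        first_digit = int(digits[i - window])  # digit that slides out
        last_digit = int(digits[i])            # digit that slides in
        # Python's % returns a non-negative result even if the left side is negative.
        h = ((h - first_digit * high) * base + last_digit) % prime
        hashes.append(h)
    return hashes


hashes = rolling_hashes("2312425254")
print(hashes[:3])  # hash(231), hash(312), hash(124) -> [37, 21, 27]
```

Each step does a constant amount of work, whereas hashing every shingle from scratch costs work proportional to the window length.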
Arithmetic over smaller set of numbers
Galois Field (GF) = smaller set of numbers
F4 is not simply (00, 01, 10, 11) with arithmetic mod 4; its elements are polynomials over GF(2)
Can do addition and multiplication
Rabin fingerprint
Does arithmetic over a smaller set of numbers
Decimal base | Rabin fingerprint in Galois field (GF)
Prime number 97 | Irreducible polynomial (say x^2 + x + 1)
2031 | binary 1011 becomes (x^3 + x + 1)
2031 (mod p) | (x^3 + x + 1) mod (irreducible poly)
Fingerprint = 2031 (mod 97) | Fingerprint = (x^3 + x + 1) (mod irreducible poly)
Easy to update a stream : hash(124) = [(hash(312) - 300) x 10 + 4] % 97 | Easy update in binary Galois fields
Computing division mod p
https://github.com/opendedup/rabinfingerprint/blob/master/src/org/rabinfingerprint/polynomial/Polynomials.java
Updating fingerprint
https://github.com/opendedup/rabinfingerprint/blob/master/src/org/rabinfingerprint/fingerprint/RabinFingerprintPolynomial.java
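A rough Python sketch of the same idea over GF(2), independent of the Java library linked above. Polynomials are stored as integers whose bits are the coefficients, so addition and subtraction are both XOR. The polynomial 0x11B (x^8 + x^4 + x^3 + x + 1) is an assumption chosen only because it is a well-known irreducible polynomial; practical fingerprints use a much higher degree. The names `poly_mod`, `fingerprint` and `roll` are hypothetical.

```python
P = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2) (illustrative choice)


def poly_mod(a, p):
    """Remainder of polynomial a divided by p, with coefficients in GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a by XOR-ing in a shifted copy of p.
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def fingerprint(data, p=P):
    """Fingerprint of a byte string: its bits read as a polynomial, mod p."""
    f = 0
    for byte in data:
        f = poly_mod((f << 8) ^ byte, p)
    return f


def roll(f, out_byte, in_byte, window, p=P):
    """Slide the window one byte: remove out_byte, append in_byte."""
    f ^= poly_mod(out_byte << (8 * (window - 1)), p)  # subtract = XOR in GF(2)
    return poly_mod((f << 8) ^ in_byte, p)


data = b"hash function landscape"
window = 4
f = fingerprint(data[:window])
rolled = [f]
for i in range(window, len(data)):
    f = roll(f, data[i - window], data[i], window)
    rolled.append(f)
```

As a sanity check on the slide's example, `poly_mod(0b1011, 0b111)` reduces x^3 + x + 1 modulo x^2 + x + 1 and returns `0b10`, that is, x.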
probability of error
Just as (mod p) in Decimal can lead to collision...
Similarly, in Galois Field GF(2^k)
If k > log(nm/e), probability of error is less than “e”
Where
1. Pattern length = n
2. Text length = m
3. Fingerprint polynomial length has to be >= k
Spatial hash
(fuzzy but data independent)
Problem of spatial hash
Map any location on the earth to a fuzzy hash
Ability to find neighbouring locations only based on their hash
Ability to reduce or increase the resolution
GeoHash
Geohash is the result of binary search down the grid
1. Even bit (longitude) = 0 if the point is left (west) of the current midpoint, else = 1
2. Odd bit (latitude) = 0 if the point is below (south of) the current midpoint, else = 1
https://www.researchgate.net/figure/Geohash-binary-code_fig2_332061286
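The binary search can be sketched in Python as below, following the common Geohash convention: bits alternate between longitude and latitude starting with longitude, a bit is 1 when the point lies in the upper or eastern half of the current interval, and every 5 bits become one base-32 character. The function name `geohash_encode` is a hypothetical helper, not a library call.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a Geohash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    for i in range(precision * 5):
        if i % 2 == 0:                      # even bit: halve the longitude range
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:                               # odd bit: halve the latitude range
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
    chars = []
    for j in range(0, len(bits), 5):        # 5 bits per base-32 character
        value = 0
        for b in bits[j:j + 5]:
            value = (value << 1) | b
        chars.append(BASE32[value])
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, precision=3)
fine = geohash_encode(57.64911, 10.40744, precision=8)
```

A shorter hash is always a prefix of a longer hash of the same point, which is what makes it possible to reduce the resolution by truncating.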
GeoHash
Problems with geohash
1. Earth is a sphere, but the area of each rectangle changes with latitude
2. In some cases, neighbours may not be adjacent
Google S2
Hilbert curve - space filling curve
Uber H3
Why Hexagons ? Earth is like a soccer ball
The key idea is “tessellation” (regular tiling)
Many hexagons + few pentagons can cover a soccer ball
Uber H3
Hexagon has an advantage
1. Fixed number of neighbours
2. Fixed distance to all 6 neighbours
Uber H3
Hierarchical : 3 bits per resolution - up to 15 resolutions
110 110/101 110/101/011
Uber H3 advantages
1. Map any location into a 64-bit number
geoToH3 (latitude, longitude, resolution) => 64 bit number
2. Find adjacent cells using the coordinate system
3. Define route from point A to point B
Uber H3 advantages
4. Is one cell inside another ? (yes, do a prefix match)
5. Want to save space ? (yes, truncate the hash to reduce resolution)
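The prefix-match and truncation ideas can be illustrated with a toy hierarchical index in Python. This is only a sketch of the "3 bits per resolution" idea using the 110/101/011 example from the slide; it is not the real H3 bit layout, which is a fixed 64-bit format with additional fields. The names `make_cell`, `parent` and `contains` are hypothetical.

```python
BITS_PER_LEVEL = 3  # one 3-bit digit per resolution level, as on the slide


def make_cell(digits):
    """Pack a path of child digits into a (value, resolution) pair."""
    value = 0
    for d in digits:
        value = (value << BITS_PER_LEVEL) | d
    return value, len(digits)


def parent(cell, resolution):
    """Truncate a cell to a coarser resolution by dropping trailing digits."""
    value, res = cell
    return value >> (BITS_PER_LEVEL * (res - resolution)), resolution


def contains(outer, inner):
    """A cell contains another if it is a prefix of the finer cell's digits."""
    return outer[1] <= inner[1] and parent(inner, outer[1]) == outer


coarse = make_cell([0b110])                # 110
middle = make_cell([0b110, 0b101])         # 110/101
fine = make_cell([0b110, 0b101, 0b011])    # 110/101/011
other = make_cell([0b010, 0b101])          # a cell under a different parent

print(contains(coarse, fine))   # True
print(contains(other, fine))    # False
```

Truncating `fine` to resolution 2 yields `middle`, which is the "save space by reducing resolution" operation.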
Social hash
(fuzzy but data dependent)
Facebook graph
https://www.slideshare.net/AayushShrestha1/facebook-open-graph-api-and-how-to-use-it/6
How graph is stored and fetched
New features are being developed (newsfeed, albums, notifications)
For each GraphQL query, many servers are contacted to fetch objects
Frontend :=> GraphQL :=> PHP layer
Facebook is read intensive : 90 percent of requests are Reads
Sharding challenge
Assign social graph nodes to servers such that
1. Reduce fanout : Ensure objects which will be fetched together are on the same machine.
This is especially hard for celebrities (i.e. how do you decide their closest neighbours ?)
2. Stability of assignment : avoid continuously moving objects between servers
Consistent hashing is not optimal
Earlier, they used consistent hashing
Each object has a random “fbid” (64 bit integer)
FBID (mod P) mapped to a slot in some virtual ring
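To see why a modulo-style placement gives poor fanout, here is a small Python sketch that counts how many servers each query touches. The fbids, the queries, the value P = 3 and the `grouped` table are all made-up illustrations, and the grouped placement is written by hand; it stands in for the output of a partitioner and is not Facebook's algorithm.

```python
def fanout(query_fbids, assign):
    """Number of distinct servers that must be contacted for one query."""
    return len({assign(fbid) for fbid in query_fbids})


P = 3  # hypothetical number of slots


def by_modulo(fbid):
    """Placement that ignores access patterns: fbid (mod P)."""
    return fbid % P


# Hand-written placement that co-locates objects fetched together (assumption).
grouped = {12: 0, 212: 0, 1232: 0, 86: 1, 3: 1}


def by_group(fbid):
    return grouped[fbid]


queries = [[12, 212, 1232], [86, 3]]  # objects fetched together per query

modulo_fanout = sum(fanout(q, by_modulo) for q in queries)
group_fanout = sum(fanout(q, by_group) for q in queries)
print(modulo_fanout, group_fanout)  # 4 2
```

With random-looking ids, the modulo placement scatters each query across servers, while a data-dependent placement keeps each query on one server.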
Building the social hash
Two step process : static and dynamic
Run Bipartite graph partition using Pregel.
Each iteration begins with the previous assignment to provide stability
[Figure: bipartite graph linking queries (Query1, Query2, ... QueryN) to the objects they fetch (Fbid 12, Fbid 212, Fbid 1232, Fbid 86, Fbid 3)]
The custom hash function
"...1.5B+ Facebook users into 21,000 balanced groups such that each user shares her group with at least 50% of her friends" (from their paper)
Fuzzy match in general
There is a gradation in hashing...
You can create a hash function which is tuned to your requirements
1. Exact : FNV, MD5
2. Syntactic : PhotoDNA
3. Deeper Syntactic : Geometric hash, SIFT
4. Semantic hash : Machine learning classifier
https://github.com/facebook/ThreatExchange/blob/master/hashing/hashing.pdf/
For example, perceptual hash
Apply an averaging and normalizing process
1. Normalize the size
2. Reduce the color
3. Average the color
4. Use Image transforms (DCT) to extract features from image
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
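A simplified Python sketch of the first three steps, in the style of an "average hash". It assumes the image is already grayscale and given as a 2D list whose height and width are multiples of 8, and it skips the DCT step entirely. The names `average_hash` and `hamming` are hypothetical helpers.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit hash of a grayscale image (2D list of numbers)."""
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // size, width // size
    # Step 1: normalize the size by averaging each block down to one value.
    small = []
    for r in range(size):
        for c in range(size):
            block = [pixels[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            small.append(sum(block) / len(block))
    # Steps 2-3: compare every value against the overall average.
    mean = sum(small) / len(small)
    bits = 0
    for v in small:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 gradient, a uniformly brighter copy, and an inverted copy.
image = [[x * 8 + y for x in range(16)] for y in range(16)]
brighter = [[v + 20 for v in row] for row in image]
inverted = [[255 - v for v in row] for row in image]

print(hamming(average_hash(image), average_hash(brighter)))  # 0
print(hamming(average_hash(image), average_hash(inverted)))  # 64
```

A uniform brightness change leaves the hash untouched because every value and the mean shift together, which is the "small change in input, no change in output" behaviour of a fuzzy hash.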
Other fuzzy Image/Audio/Text hashes
SIFT
Geometric hash : find rotated objects, by using a common reference frame
Bag of words, TF-IDF
Neural network classifier
https://opendatascience.com/a-beginners-guide-to-understanding-convolutional-neural-networks/
Learning-to-hash => neural network
Social Hash => clustering using GNN (Graph neural network)
Perceptual/Geometric Hash => do it via CNN
Use of Doc2Vec, Node2Vec
Learned indexes paper by Google
Hashing with a neural network
Neural network can be taught to find similarities ( feature extraction )
When does a neural network make sense ?
1. If you want a deeper semantic match
2. If you have enough training data
3. If you are willing to trust the model (i.e. you have no control over the hash function code)
Conclusion
Exact
1. Zobrist Hash : addition of individual hashes
2. Rolling Hash : streaming (drop hash of first char, add hash of last char)
Fuzzy
3. Spatial Hash : Uber H3
4. Social Hash : keep related graph nodes closer
Neural network classifier : it is building a fuzzy, data-dependent hash function
hash
Uber H3 : https://www.youtube.com/watch?v=UILoSqvIM2w
Learning-to-hash vs LSH : https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/LTHSurvey.pdf
Learned Indexes : https://learning2hash.github.io/papers.html
Zobrist : https://rjlipton.wordpress.com/2012/04/14/tabulation-hashing-and-independence/
Zobrist : https://content.iospress.com/articles/icga-journal/icg28302
Zobrist : https://cs.stackexchange.com/questions/33807/prove-that-this-family-of-hash-function-is-3-wise-independent-but-not-4-wis
Wegman Carter : k-independent hashing
Hyatt and Cozzie
Uber H3 : https://github.com/uber/h3/blob/f621d07cbf15c3b78243b429f24eca009bdb1f13/src/h3lib/lib/faceijk.c
