Variable Bit Quantisation for LSH
Sean Moran
Victor Lavrenko, Miles Osborne
School of Informatics
University of Edinburgh
ACL Sofia, August 2013
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Locality Sensitive Hashing (LSH)
Variable Bit Quantisation
Evaluation
Summary
Fast search in large-scale datasets
Problem: retrieve nearest neighbour(s) to a given query item
[Figure: a query point q among database points x and y, with the 1-NN of q marked.]
Naïve approach: compare query to all N database items
Scales linearly O(N) with the size of the database
Impractical for all but the smallest of databases
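The linear scan described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `nearest_neighbour` and the toy data are hypothetical and not taken from the slides.

```python
import math

def nearest_neighbour(query, database):
    """Return (index, distance) of the database item closest to the query."""
    best_index, best_dist = -1, math.inf
    for i, item in enumerate(database):  # one comparison per item: O(N)
        d = math.dist(query, item)       # Euclidean distance (assumed metric)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Toy database of 2-D points (illustrative values only).
database = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, dist = nearest_neighbour((0.9, 1.2), database)
```

Every query touches every database item, which is the cost that hashing-based search tries to avoid.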
[1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11.
Hashing-based search using binary codes
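[Figure: database items and the query are each mapped by the hash function H to binary codes (e.g. 110101, 010111, 010101, 111101) held in a hash table; similarity is computed between the query code and the stored codes to return the nearest neighbours.]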
Why transform our data to binary codes?
Constant O(1) query time with respect to the dataset size.
Binary code comparison requires few machine instructions (XOR followed by popcount).
Binary codes are extremely compact, e.g. 16 MB to store 1 million 128-bit encoded data points.
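A short sketch of the two points above: comparing codes with XOR followed by a bit count, and the storage arithmetic for 1 million 128-bit codes. Codes are held as Python integers here, which is an assumption made for illustration; the example codes are the ones shown on the earlier hash table slide.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits: XOR the codes, then count the set bits (popcount)."""
    return bin(a ^ b).count("1")

d = hamming_distance(0b110101, 0b010111)  # the codes differ in 2 positions

# Storage for 1 million codes of 128 bits each.
n_items, bits_per_code = 1_000_000, 128
total_bytes = n_items * bits_per_code // 8
total_megabytes = total_bytes / 1_000_000  # 16.0
```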
Applications
Recommendation
Near duplicate detection
Content-based IR
Noun clustering
Location recognition
Computing similarity preserving binary codes
Locality Sensitive Hashing (LSH):
Randomised algorithm for approximate nearest neighbour search
Generates similarity preserving binary codes for a dataset
Preserved similarity depends on the selected hash family
Probabilistic guarantee on accuracy versus search time
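One common hash family draws random hyperplanes and records on which side of each hyperplane a point falls. The sketch below assumes that family (Gaussian normal vectors, one bit per hyperplane); the helper names `make_hyperplanes` and `lsh_code` are hypothetical and this is not presented as the authors' implementation.

```python
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw n_planes random normal vectors with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def lsh_code(point, hyperplanes):
    """One bit per hyperplane: 1 if the projection onto its normal is non-negative."""
    bits = []
    for normal in hyperplanes:
        projection = sum(w * x for w, x in zip(normal, point))
        bits.append("1" if projection >= 0.0 else "0")
    return "".join(bits)

planes = make_hyperplanes(n_planes=8, dim=3)
code_x = lsh_code([1.0, 2.0, 3.0], planes)
code_scaled = lsh_code([2.0, 4.0, 6.0], planes)   # same direction as x
code_neg = lsh_code([-1.0, -2.0, -3.0], planes)   # opposite direction to x
```

With this family, points in the same direction share a code and points in opposite directions differ in every bit, so the preserved similarity is angular.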
Locality Sensitive Hashing
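[Figure: points in the x-y plane partitioned by hyperplanes h1 and h2 with normal vectors n1 and n2; the four regions carry the codes 00, 01, 10 and 11, formed from the bits h1(p) and h2(p).]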
Low Dimensional Projection
[Figure: data points projected onto the normal vectors n1 and n2.]
Single Bit Quantisation (SBQ)
[Figure: a single threshold t splits the projected dimension n2 into two regions, encoded 0 and 1.]
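Single bit quantisation can be sketched as a comparison of each projected value against one threshold. The default threshold of zero and the sample projections are assumptions for illustration.

```python
def sbq(projections, threshold=0.0):
    """Quantise each projected value to one bit using a single threshold t."""
    return [1 if p >= threshold else 0 for p in projections]

# Projected values along one hyperplane normal (illustrative numbers).
bits = sbq([-0.8, -0.05, 0.05, 1.2])
```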
Plus many more...
Kernel methods [1]
Spectral methods [2] [3]
Neural networks [4]
Loss based methods [5]
All use single bit quantisation (SBQ)...
[1] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS ’09.
[2] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08.
[3] J. Wang, S. Kumar and S.-F. Chang. Semi-supervised hashing for large-scale search. PAMI ’12.
[4] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08.
[5] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS ’09.
Problem 1: SBQ leads to high quantisation errors
Highest point density typically occurs around zero:
[Figure: histogram of projected values (x-axis: Projected Value, −1.5 to 1.5; y-axis: Count, 0 to 6000).]
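Closer points can have a greater Hamming distance than distant points:
[Figure: threshold t on projected dimension n2 with regions 0 and 1; nearby points on opposite sides of t receive different codes, while distant points on the same side receive the same code.]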
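A small numeric illustration of this effect, assuming a threshold at zero and three hypothetical projected values a, b and c: a and b are close but straddle the threshold, while b and c are far apart on the same side.

```python
def sbq_bit(projection, threshold=0.0):
    """Single bit quantisation of one projected value."""
    return 1 if projection >= threshold else 0

a, b, c = -0.05, 0.05, 1.40  # illustrative projected values

near_gap = abs(a - b)   # small distance in the projected space
far_gap = abs(b - c)    # large distance in the projected space

near_hamming = sbq_bit(a) ^ sbq_bit(b)  # 1: the close pair gets different codes
far_hamming = sbq_bit(b) ^ sbq_bit(c)   # 0: the distant pair gets the same code
```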
Problem 2: some hyperplanes are better than others
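[Figure: the data projected onto n1 (Projected Dimension 1) and onto n2 (Projected Dimension 2), shown next to the x-y plane with hyperplanes h1 and h2, normal vectors n1 and n2, and region codes 00, 01, 10 and 11.]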
Our Solution: Variable Bit Quantisation
[Figure: Variable Bit Quantisation turns continuous data into a binary string using three components: Multiple Bit Encoding, Threshold Optimisation and Variable Bit Allocation.]
Multiple Bit Encoding: Natural Binary Code
1 bit: threshold t on n2 gives regions 0, 1.
2 bits: thresholds t1, t2, t3 on n2 give regions 00, 01, 10, 11.
3 bits: thresholds t1 to t7 on n2 give regions 000, 001, 010, 011, 100, 101, 110, 111.
etc...
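The natural binary code above can be sketched as: find the region a projected value falls into, then write the region index in binary. The threshold values and the function name `encode` are assumptions for illustration; b bits need 2^b − 1 sorted thresholds.

```python
from bisect import bisect_right

def encode(projection, thresholds, n_bits):
    """Natural binary code of the region that contains the projected value."""
    assert len(thresholds) == 2 ** n_bits - 1, "b bits need 2^b - 1 thresholds"
    region = bisect_right(thresholds, projection)  # region index 0 .. 2^b - 1
    return format(region, f"0{n_bits}b")

# Three illustrative thresholds t1 < t2 < t3 give four 2-bit regions.
thresholds = [-0.5, 0.0, 0.5]
codes = [encode(p, thresholds, 2) for p in (-0.9, -0.2, 0.3, 0.8)]
```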
Multiple Bit Encoding [1]
[Figure: 2 bits; thresholds t1, t2, t3 on n2 give regions 00, 01, 10, 11.]
Decimal distance = 3 − 1 = 2 (codes 11 and 01)
Decimal distance = 2 − 0 = 2 (codes 10 and 00)
[1] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12.
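The decimal distances above can be reproduced with a few lines. The sketch also compares them with the Hamming distance on the same codes, to show why the decimal distance suits a natural binary code; the function names are hypothetical.

```python
def decimal_distance(code_a, code_b):
    """Absolute difference of the decimal values of two multi-bit codes."""
    return abs(int(code_a, 2) - int(code_b, 2))

def hamming(code_a, code_b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(code_a, code_b))

d_11_01 = decimal_distance("11", "01")  # 3 - 1
d_10_00 = decimal_distance("10", "00")  # 2 - 0

# Adjacent regions 01 and 10 are one step apart in decimal, but differ in
# both bits, so Hamming distance treats them like the far-apart 00 and 11.
adjacent_decimal = decimal_distance("01", "10")
adjacent_hamming = hamming("01", "10")
```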
Threshold Optimisation
Multiple bits per hyperplane require multiple thresholds.
F1-based optimisation using pairwise constraints matrix S:
TP: # Sij = 1 pairs in the same region.
FP: # Sij = 0 pairs in the same region.
FN: # Sij = 1 pairs in different regions.
[Figure: two placements of thresholds t1, t2, t3 on n2 (regions 00, 01, 10, 11), one with F1 score 1.00 and one with F1 score 0.44.]
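The scoring of one candidate threshold placement can be sketched from the TP, FP and FN definitions above, using the standard F1 formula. The projected values, the neighbour pairs in S and both threshold sets are made up for illustration, and the search over candidate thresholds is not shown.

```python
from bisect import bisect_right
from itertools import combinations

def f1_for_thresholds(projections, thresholds, S):
    """F1 of a sorted threshold set against the pairwise constraints matrix S."""
    regions = [bisect_right(thresholds, p) for p in projections]
    tp = fp = fn = 0
    for i, j in combinations(range(len(projections)), 2):
        same_region = regions[i] == regions[j]
        if S[i][j] == 1 and same_region:
            tp += 1          # neighbours kept in the same region
        elif S[i][j] == 0 and same_region:
            fp += 1          # non-neighbours sharing a region
        elif S[i][j] == 1:
            fn += 1          # neighbours split across regions
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Eight projected points forming four neighbour pairs (illustrative data).
projections = [-0.9, -0.8, -0.2, -0.1, 0.2, 0.3, 0.8, 0.9]
S = [[0] * 8 for _ in range(8)]
for i, j in [(0, 1), (2, 3), (4, 5), (6, 7)]:
    S[i][j] = S[j][i] = 1

good = f1_for_thresholds(projections, [-0.5, 0.0, 0.5], S)
bad = f1_for_thresholds(projections, [-0.85, -0.15, 0.25], S)
```

The first placement separates the four pairs cleanly; the second cuts through three of them and merges non-neighbours, so its score drops.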
Variable Bit Allocation
F1 score is a measure of the neighbourhood preserving quality of a hyperplane:
[Figure: hyperplane n1 (F1 score: 0.25) is assigned 0 bits, since 0 thresholds do just as well as one or more thresholds; hyperplane n2 (F1 score: 1.00, thresholds t1, t2, t3, regions 00, 01, 10, 11) is assigned 2 bits, since 3 thresholds perfectly preserve the neighbourhood structure.]
Compute bit allocation that maximises the cumulative F1
score across all hyperplanes subject to a bit budget B.
Bit allocation solved as a binary integer linear program (BILP).
Variable Bit Allocation
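\[
\begin{aligned}
\max \quad & \lVert F \circ Z \rVert \\
\text{subject to} \quad & \lVert Z_h \rVert = 1, \quad h \in \{1 \ldots B\} \\
& \lVert Z \circ D \rVert \le B \\
& Z \text{ is binary}
\end{aligned}
\]
F contains the F1 scores per hyperplane, per bit count
Z is an indicator matrix specifying the bit allocation
D is a constraint matrix
B is the bit budget
\lVert \cdot \rVert denotes the Frobenius L1 norm
\circ denotes the Hadamard product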


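Example, with rows b0, b1, b2 (0, 1 and 2 bits) and columns h1, h2 (hyperplanes):
\[
F = \begin{pmatrix} 0.25 & 0.25 \\ 0.35 & 0.50 \\ 0.40 & 1.00 \end{pmatrix}
\qquad
D = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{pmatrix}
\qquad
Z = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}
\]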
Sparse solution possible: lower quality hyperplanes can be discarded.
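For a problem as small as this example, the allocation can be checked by enumerating every assignment instead of calling an integer programming solver. The sketch below uses the F matrix from the slide; the bit budget of 2 is an assumption (the slide does not state it), and `allocate_bits` is a hypothetical helper, not the solver used by the authors.

```python
from itertools import product

# F[b][h]: F1 score of hyperplane h when it is given b bits (from the example).
F = [[0.25, 0.25],
     [0.35, 0.50],
     [0.40, 1.00]]
budget = 2  # assumed bit budget

def allocate_bits(F, budget):
    """Exhaustively pick one bit count per hyperplane, maximising total F1."""
    n_options, n_planes = len(F), len(F[0])
    best_score, best_alloc = -1.0, None
    for alloc in product(range(n_options), repeat=n_planes):
        if sum(alloc) > budget:      # total bits must stay within the budget
            continue
        score = sum(F[bits][h] for h, bits in enumerate(alloc))
        if score > best_score:
            best_score, best_alloc = score, alloc
    return best_alloc, best_score

allocation, score = allocate_bits(F, budget)

# Indicator matrix Z[b][h] = 1 when hyperplane h receives b bits.
Z = [[1 if allocation[h] == b else 0 for h in range(len(F[0]))]
     for b in range(len(F))]
```

Under that budget the enumeration gives h1 zero bits and h2 two bits, which matches the Z shown in the example. Exhaustive search grows exponentially with the number of hyperplanes, so it only serves as a check on toy inputs.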
Evaluation Protocol
Task: Text and image retrieval on three standard datasets: CIFAR-10, TDT-2 and Reuters-21578.
Projections: LSH [1], Shift-invariant kernel hashing (SIKH) [2], Binary LSI (BLSI) [3], Spectral Hashing (SH) [4] and PCA-Hashing (PCAH) [5].
Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ) [6], Double-Bit Quantisation (DBQ) [7].
Hamming Ranking: how well do we retrieve the ε-NN of query points?
[1] P. Indyk and R. Motwani. Approximate nearest neighbors: removing the curse of dimensionality. STOC ’98.
[2] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS ’09.
[3] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08.
[4] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08.
[5] J. Wang, S. Kumar and S.-F. Chang. Semi-supervised hashing for large-scale search. PAMI ’12.
[6] W. Kong, W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12.
[7] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI ’12.
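Hamming ranking, as named in the protocol above, can be sketched as sorting the database by Hamming distance to the query code. Codes are held as Python integers and the values are the example codes from the earlier hash table slide; the function names are hypothetical, and the precision-recall computation is not shown.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")

def hamming_ranking(query_code, database_codes):
    """Database indices ordered from closest to farthest from the query code."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming(query_code, database_codes[i]))

database_codes = [0b110101, 0b010111, 0b010101, 0b111101]
ranking = hamming_ranking(0b010101, database_codes)
```

Ties keep their database order because Python's sort is stable.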
AUPRC across different projection methods
Dataset   CIFAR-10 (32 bits)             Reuters-21578 (128 bits)
          SBQ    MQ     DBQ    VBQ       SBQ    MQ     DBQ    VBQ
SIKH      0.042  0.046  0.047  0.161     0.102  0.112  0.087  0.389
LSH       0.119  0.091  0.066  0.207     0.276  0.201  0.175  0.538
BLSI      0.038  0.135  0.111  0.231     0.100  0.030  0.030  0.156
SH        0.051  0.144  0.111  0.202     0.033  0.028  0.030  0.154
PCAH      0.036  0.132  0.107  0.219     0.095  0.034  0.027  0.154
VBQ is an effective quantisation scheme for both image and text datasets.
VBQ and a cheap projection (e.g. LSH) can outperform SBQ and an expensive projection (e.g. PCA).
AUPRC for LSH across a broad bit range
[Figure: AUPRC (y-axis) against Number of Bits (x-axis) for SBQ, MQ and VBQ. (a) Reuters-21578, 32 to 256 bits; (b) CIFAR-10, 8 to 64 bits.]
VBQ is effective across a wide range of bits.
See paper for further results.
Summary
Proposed a general data-driven technique to adaptively assign variable bits per LSH hyperplane
Hyperplanes better preserving the neighbourhood structure are afforded more bits from the bit budget
VBQ substantially increased retrieval performance across standard text and image datasets
Future work: evaluate with an LSH system that uses hash tables for fast retrieval
Thank you for your attention
Sean Moran
sean.moran@ed.ac.uk
www.seanjmoran.com
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 41/41

Weitere ähnliche Inhalte

Ähnlich wie Variable Bit Quantisation for LSH Improves Similarity Search

PhD thesis defence slides
PhD thesis defence slidesPhD thesis defence slides
PhD thesis defence slidesSean Moran
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filterxlight
 
2015 agbt sv_poster_justin
2015 agbt sv_poster_justin2015 agbt sv_poster_justin
2015 agbt sv_poster_justinGenomeInABottle
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Universitat Politècnica de Catalunya
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Thomas Keane
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingSeonghyun Kim
 
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...Venkat Projects
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsDimitris Kontokostas
 
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS CodesUtilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS CodesALYAA AL-BARRAK
 

Ähnlich wie Variable Bit Quantisation for LSH Improves Similarity Search (9)

PhD thesis defence slides
PhD thesis defence slidesPhD thesis defence slides
PhD thesis defence slides
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
2015 agbt sv_poster_justin
2015 agbt sv_poster_justin2015 agbt sv_poster_justin
2015 agbt sv_poster_justin
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS CodesUtilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
 

Kürzlich hochgeladen

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Variable Bit Quantisation for LSH Improves Similarity Search

  • 1. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Sean Moran Victor Lavrenko, Miles Osborne School of Informatics University of Edinburgh ACL Sofia, August 2013 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 1/41
  • 2. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 2/41
  • 3. Variable Bit Quantisation for LSH Fast search in large-scale datasets Problem: retrieve nearest neighbour(s) to a given query item q? y x 1-NN 1-NN q x y Na¨ıve approach: compare query to all N database items Scales linearly O(N) with the size of the database Impractical for all but the smallest of databases Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 3/41
  • 4. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 4/41
  • 5. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 5/41
  • 6. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 6/41
  • 7. Variable Bit Quantisation for LSH Hashing-based search using binary codes Database Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 7/41
  • 8. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 8/41
  • 9. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 9/41
  • 10. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 10/41
  • 11. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Compute Similarity Nearest Neighbours Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 11/41
  • 12. Variable Bit Quantisation for LSH Why transform our data to binary codes? Constant O(1) query time with respect to the dataset size. Binary code comparison requires few machine instructions (XOR followed by popcount). Binary codes are extremely compact e.g. 16Mb to store 1 million, 128 bit encoded data points. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 12/41
  • 13. Variable Bit Quantisation for LSH Applications Recommendation Near duplicate detection Content based IR Recommendation Near Duplicate Detection Noun Clustering Location Recognition Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 13/41
  • 14. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 14/41
  • 15. Variable Bit Quantisation for LSH Computing similarity preserving binary codes Locality Sensitive Hashing (LSH): Randomised algorithm for approximate nearest neighbour search Generates similarity preserving binary codes for a dataset Preserved similarity depends on the selected hash family Probabilistic guarantee on accuracy versus search time Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 15/41
  • 16. Variable Bit Quantisation for LSH Locality Sensitive Hashing x y n2 n1 h1 h2 11 0100 10 h1(p) (p)h2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 16/41
  • 17. Variable Bit Quantisation for LSH Low Dimensional Projection n2 n1 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 17/41
  • 18. Variable Bit Quantisation for LSH Single Bit Quantisation (SBQ) 0 1 t n2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 18/41
  • 19. Variable Bit Quantisation for LSH Plus many more... Kernel methods [1] Spectral methods [2] [3] Neural networks [4] Loss based methods [5] All use single bit quantisation (SBQ)... [1] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09. [2] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08. [3] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12. [4] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08. [5] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS ’09. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 19/41
  • 20. Variable Bit Quantisation for LSH Problem 1: SBQ leads to high quantisation errors Highest point density typically occurs around zero: −1.5 −1 −0.5 0 0.5 1 1.5 0 1000 2000 3000 4000 5000 6000 Projected Value Count Closer points can have a greater Hamming distance than distant points: 0 1 t n2 Same codeDifferent code Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 20/41
  • 21. Problem 2: some hyperplanes are better than others. [Figure: points x and y projected onto Projected Dimension 1 (n1) and Projected Dimension 2 (n2), with hyperplanes h1 and h2 giving regions 00, 01, 10 and 11.]
  • 25. Our Solution: Variable Bit Quantisation. [Diagram: Variable Bit Quantisation takes continuous data to a binary string and has three components: Multiple Bit Encoding, Threshold Optimisation and Variable Bit Allocation.]
  • 27. Multiple Bit Encoding: Natural Binary Code. 1 bit: one threshold t splits the projected dimension n2 into regions 0 and 1. 2 bits: thresholds t1, t2, t3 give regions 00, 01, 10, 11. 3 bits: thresholds t1 to t7 give regions 000, 001, 010, 011, 100, 101, 110, 111. And so on.
  • 28. Multiple Bit Encoding [1]. 2 bits: thresholds t1, t2, t3 on the projected dimension n2 give regions 00, 01, 10, 11. Regions are compared by the decimal distance between their codes: 11 and 01 give 3 − 1 = 2; 10 and 00 give 2 − 0 = 2. [1] W. Kong, W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12.
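The encoding on the two slides above can be sketched as follows: a projected value is mapped to the index of the region it falls in, the index is written in natural binary code, and codes are compared by decimal distance. The helper names and the example thresholds are assumptions for illustration; a value lying exactly on a threshold is placed in the upper region here, which is an arbitrary choice.

```python
from bisect import bisect_right

def region_index(value, thresholds):
    """Index of the region containing value, given sorted thresholds t1 < t2 < ..."""
    return bisect_right(thresholds, value)

def nbc(index, bits):
    """Natural binary code of a region index, zero-padded to the given width."""
    return format(index, f"0{bits}b")

def decimal_distance(code_a, code_b):
    """Absolute difference of the integer values of two codes."""
    return abs(int(code_a, 2) - int(code_b, 2))

def manhattan_distance(codes_a, codes_b):
    """Sum of decimal distances over all projected dimensions."""
    return sum(decimal_distance(a, b) for a, b in zip(codes_a, codes_b))

if __name__ == "__main__":
    thresholds = [-0.5, 0.0, 0.5]           # t1, t2, t3 for a 2-bit dimension
    code = nbc(region_index(0.3, thresholds), bits=2)
    print(code)                             # region between t2 and t3
    print(decimal_distance("11", "01"))     # the slide's first example
    print(decimal_distance("10", "00"))     # the slide's second example
```

Adjacent regions 01 and 10 have Hamming distance 2 but decimal distance 1, which is why decimal distance tracks the distance along the projected dimension more faithfully for natural binary codes.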
  • 30. Threshold Optimisation. Multiple bits per hyperplane require multiple thresholds. F1-based optimisation using a pairwise constraints matrix S. TP: number of Sij = 1 pairs in the same region. FP: number of Sij = 0 pairs in the same region. FN: number of Sij = 1 pairs in different regions. [Figure: two placements of thresholds t1, t2, t3 on the projected dimension n2 with regions 00, 01, 10, 11; one achieves an F1 score of 1.00, the other 0.44.]
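The TP, FP and FN counts defined above can be turned into an F1 score for any candidate set of thresholds. The sketch below only scores thresholds; it does not reproduce whatever search procedure the authors use to find them. The data, the function name `f1_for_thresholds` and the representation of S as a set of index pairs (i, j) with i < j and Sij = 1 are all assumptions.

```python
from bisect import bisect_right
from itertools import combinations

def f1_for_thresholds(values, neighbours, thresholds):
    """F1 score of a threshold set on one projected dimension.

    values:     projected value of each data point
    neighbours: set of (i, j) pairs, i < j, for which S_ij = 1
    thresholds: candidate thresholds for this dimension
    """
    ordered = sorted(thresholds)
    regions = [bisect_right(ordered, v) for v in values]
    tp = fp = fn = 0
    for i, j in combinations(range(len(values)), 2):
        same_region = regions[i] == regions[j]
        is_neighbour = (i, j) in neighbours
        if same_region and is_neighbour:
            tp += 1          # neighbours kept together
        elif same_region:
            fp += 1          # non-neighbours lumped together
        elif is_neighbour:
            fn += 1          # neighbours split apart
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

if __name__ == "__main__":
    values = [-1.0, -0.9, -0.3, -0.2, 0.2, 0.3, 0.9, 1.0]
    neighbours = {(0, 1), (2, 3), (4, 5), (6, 7)}
    print(f1_for_thresholds(values, neighbours, [-0.5, 0.0, 0.5]))
    print(f1_for_thresholds(values, neighbours, [-0.95, 0.25, 0.95]))
```

In this toy data the first threshold set places each neighbour pair in its own region, while the second splits three of the four pairs and merges unrelated points, so it scores much lower.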
  • 32. Variable Bit Allocation. The F1 score is a measure of the neighbourhood preserving quality of a hyperplane. [Figure: one hyperplane with F1 score 0.25 is assigned 0 bits, since 0 thresholds do just as well as one or more thresholds; another with F1 score 1.00 is assigned 2 bits, since 3 thresholds (t1, t2, t3; regions 00, 01, 10, 11) perfectly preserve the neighbourhood structure.] Compute the bit allocation that maximises the cumulative F1 score across all hyperplanes subject to a bit budget B. Bit allocation is solved as a binary integer linear program (BILP).
  • 33. Variable Bit Allocation. $\max \|F \circ Z\|$ subject to $\|Z_h\| = 1$ for $h \in \{1 \ldots B\}$, $\|Z \circ D\| \le B$, $Z$ is binary. F contains the F1 scores per hyperplane, per bit count. Z is an indicator matrix specifying the bit allocation. D is a constraint matrix. B is the bit budget. $\|\cdot\|$ denotes the Frobenius L1 norm and $\circ$ the Hadamard product.
  • 34. Variable Bit Allocation. $\max \|F \circ Z\|$ subject to $\|Z_h\| = 1$ for $h \in \{1 \ldots B\}$, $\|Z \circ D\| \le B$, $Z$ is binary. Example with rows b0, b1, b2 (bit counts) and columns h1, h2 (hyperplanes): $F = \begin{pmatrix} 0.25 & 0.25 \\ 0.35 & 0.50 \\ 0.40 & 1.00 \end{pmatrix}$, $D = \begin{pmatrix} 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{pmatrix}$, $Z = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}$. Sparse solution possible: lower quality hyperplanes can be discarded.
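The example above is small enough to check by exhaustive search rather than with a BILP solver. The sketch below enumerates every assignment of a bit count to each hyperplane, keeps those within the budget, and returns the one with the highest summed F1. The bit budget of 2 is an assumption (the slide does not state it for this example), as are the function names; the cost of row b is taken to be b bits, matching D.

```python
from itertools import product

# F[b][h]: F1 score of hyperplane h when given b bits (values from the slide).
F = [
    [0.25, 0.25],  # b0: 0 bits
    [0.35, 0.50],  # b1: 1 bit
    [0.40, 1.00],  # b2: 2 bits
]

def allocate_bits(scores, budget):
    """Brute-force the allocation: exactly one bit count per hyperplane, total <= budget."""
    num_counts, num_planes = len(scores), len(scores[0])
    best_alloc, best_score = None, float("-inf")
    for alloc in product(range(num_counts), repeat=num_planes):
        if sum(alloc) > budget:        # corresponds to ||Z o D|| <= B
            continue
        score = sum(scores[b][h] for h, b in enumerate(alloc))  # ||F o Z||
        if score > best_score:
            best_alloc, best_score = alloc, score
    return best_alloc, best_score

def indicator_matrix(alloc, num_counts):
    """Build Z: Z[b][h] = 1 if hyperplane h receives b bits, else 0."""
    return [[1 if alloc[h] == b else 0 for h in range(len(alloc))] for b in range(num_counts)]

if __name__ == "__main__":
    alloc, score = allocate_bits(F, budget=2)
    print(alloc, score)
    print(indicator_matrix(alloc, len(F)))
```

With the assumed budget of 2, the search gives h1 zero bits and h2 two bits, which matches the Z shown on the slide: h1 is effectively discarded. Exhaustive search grows exponentially with the number of hyperplanes, so it is only useful for checking small cases like this one.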
  • 36. Evaluation Protocol. Task: text and image retrieval on three standard datasets: CIFAR-10, TDT-2 and Reuters-21578. Projections: LSH [1], Shift-invariant kernel hashing (SIKH) [2], Binary LSI (BLSI) [3], Spectral Hashing (SH) [4] and PCA-Hashing (PCAH) [5]. Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ) [6], Double-Bit Quantisation (DBQ) [7]. Hamming Ranking: how well do we retrieve the ε-NN of query points? [1] P. Indyk and R. Motwani. Approximate nearest neighbors: removing the curse of dimensionality. STOC ’98. [2] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS ’09. [3] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08. [4] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08. [5] J. Wang, S. Kumar and S.-F. Chang. Semi-supervised hashing for large-scale search. PAMI ’12. [6] W. Kong, W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12. [7] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI ’12.
  • 37. AUPRC across different projection methods.
          CIFAR-10 (32 bits)             Reuters-21578 (128 bits)
          SBQ    MQ     DBQ    VBQ       SBQ    MQ     DBQ    VBQ
    SIKH  0.042  0.046  0.047  0.161     0.102  0.112  0.087  0.389
    LSH   0.119  0.091  0.066  0.207     0.276  0.201  0.175  0.538
    BLSI  0.038  0.135  0.111  0.231     0.100  0.030  0.030  0.156
    SH    0.051  0.144  0.111  0.202     0.033  0.028  0.030  0.154
    PCAH  0.036  0.132  0.107  0.219     0.095  0.034  0.027  0.154
    VBQ is an effective quantisation scheme for both image and text datasets. VBQ and a cheap projection (e.g. LSH) can outperform SBQ and an expensive projection (e.g. PCA).
  • 38. AUPRC for LSH across a broad bit range. [Figure: AUPRC against number of bits for SBQ, MQ and VBQ. (a) Reuters-21578, 32 to 256 bits. (b) CIFAR-10, 8 to 64 bits.] VBQ is effective across a wide range of bits. See paper for further results.
  • 40. Summary. Proposed a general data-driven technique to adaptively assign variable bits per LSH hyperplane. Hyperplanes that better preserve the neighbourhood structure are afforded more bits from the bit budget. VBQ substantially increased retrieval performance across standard text and image datasets. Future work: evaluate with an LSH system that uses hash tables for fast retrieval.
  • 41. Thank you for your attention. Sean Moran, sean.moran@ed.ac.uk, www.seanjmoran.com