The search for a new visual search
Mike Ranzinger, Senior Research Engineer @ Shutterstock
[Diagram] The Shutterstock platform: Image, Footage, Music, and Editorial assets; serving single users, teams, companies, and agencies; supporting Create, Edit, Share, and Publish workflows.
• We’re going to explore a new technology that was just released to beta, called “Composition Aware Search” (CAS)
• This involves some key technologies:
 •Convolutional Neural Nets (Vision and NLP)
 •Discriminative Localization
 •Multi-modal Embeddings
 •Dimensionality Reduction
 •Inverted Multi-Index
• Yes, between this presentation and our publicly shared white paper, you should be able to implement this yourself
 •(non-commercially, of course)
Outline
Language and visual domain mismatch
[Two contrasting result images shown side by side] Query: Red Bike
Domain mismatch
• The saying is: “A picture is worth a thousand words”
• Our average query length is 2 words
• Sometimes it’s hard to describe exactly what you’re looking for
• Our users are accustomed to looking through multiple pages of results to find what they were looking for
Image similarity / reverse image search
• Common problem: I have picture X without a license, and I need to get a license for it
 •Perhaps you saw it on social media, and you wanted to share it more officially
• My toy problem: I took this bad picture, find me a good one!
• We don’t use words at all. We communicate through pixels.
Reverse image search
How does it work?
[Diagram] The query image (“my bike”) and every image in our collection (“our bike images”) pass through the same trained CNN, each producing a fixed-length vector. A maximum inner product search between our collection’s vectors and the query vector returns the most similar images.
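To make the mechanics concrete, here is a minimal numpy sketch of that retrieval step, assuming the CNN embeddings are already computed and L2 normalized (the data, sizes, and names are illustrative stand-ins, not our production system):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so the inner product is cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Stand-ins for CNN outputs: 10k collection images, 256-dim embeddings.
collection = l2_normalize(np.random.randn(10_000, 256))
query_vec = l2_normalize(np.random.randn(256))  # embedding of the "my bike" photo

# Maximum inner product search: one matrix-vector product, then take the top k.
scores = collection @ query_vec
top_10 = np.argsort(-scores)[:10]  # indices of the 10 most similar images
```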
• We have a vision model that can produce an N-dimensional vector for a given image.
• Train a language model that maps a query to the vector of the downloaded image.
• Training set: query-to-download pairs.
Multimodal embedding / query language models
Lemur on rock
• Kiros et al., “Unifying visual-semantic embeddings with multimodal neural language models”
• Trained using “Triplet Loss”
 •Let $f(x)$ be the L2 normalized output of the vision model on image $x$
 •Let $g(q)$ be the L2 normalized output of the language model on query $q$
 •Let $q_1$ be the query corresponding to image $x_1$
 •Let $m$ be some margin, $0 < m < 2$
• $L = \max\bigl(0,\; f(x_2) \cdot g(q_1) - f(x_1) \cdot g(q_1) + m\bigr)$
• In words, the dot product between a query and its corresponding image (green) should be greater than the dot product between the query and some unrelated image (red) by some margin $m$.
Multimodal embedding
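A minimal sketch of that triplet loss for a single (query, matching image, unrelated image) triple, assuming the embeddings are already L2 normalized; the names and margin value are illustrative:

```python
import numpy as np

def triplet_loss(f_x1, f_x2, g_q1, m=0.2):
    """L = max(0, f(x2).g(q1) - f(x1).g(q1) + m).

    f_x1: embedding of the image matching query q1
    f_x2: embedding of some unrelated image
    g_q1: embedding of query q1
    All inputs assumed L2 normalized, so each dot product lies in [-1, 1].
    """
    pos = float(np.dot(f_x1, g_q1))  # similarity of the true pair
    neg = float(np.dot(f_x2, g_q1))  # similarity of the mismatched pair
    return max(0.0, neg - pos + m)
```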
• We train the vision model first
• Next, we train the language model.
 •We don’t backprop gradients through the vision model because doing so degrades it
• Once we’ve finished training the language model, we can search for images given a query using MIPS, the same way that we did with reverse image search.
Multimodal embedding
Multimodal embedding search
• Here’s an example of a “fully convolutional” neural network.
• A fully convolutional network is typically a series of convolutions and downsampling operations that ends with a global average pooling (GAP) operation.
• The GAP reduces the final feature maps down to a single vector (one value per feature map).
• We call a position (y, x) in the final feature maps a “spatial feature vector”
Spatial feature vectors
• Like the feature vector produced by the global average pool, spatial vectors also encode information in the same embedding space.
• Importantly, these vectors tend to encode more localized information based on the receptive field of the given neuron.
• We exploit these vectors to build out CAS.
Spatial feature vectors
• Zhou et al. introduced a very important paper titled “Learning Deep Features for Discriminative Localization”
• They introduce the concept of “Class Activation Maps” (CAM), which is effectively a heatmap of the classification strength at each output position before the GAP, for a given class.
Discriminative localization
• Let $f_k(y, x)$ be the activation of unit $k$ at position $(y, x)$ of the last convolutional layer
• $F_k = \frac{1}{Y X} \sum_{y,x} f_k(y, x)$, where $Y X$ is the number of spatial positions
 •The result of the GAP for unit $k$
• For a given class $c$, the input to the softmax is $S_c = \sum_k w_k^c F_k$
 •In words, the dot product between the GAP features and the learned vector for the given class, where $w_k^c$ is the weight for class $c$ for unit $k$
• Let $M_c(y, x) = \sum_k w_k^c f_k(y, x)$ be the class activation map, or in words, the importance of spatial position $(y, x)$ for the classification of class $c$.
 •I’d recommend reading the paper to see the full derivation
Discriminative localization
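A compact numpy sketch of those three formulas, assuming we already have the last convolutional layer’s activations and the softmax weights (all shapes and names are illustrative):

```python
import numpy as np

# Hypothetical shapes: K=256 feature maps on an 8x8 grid, 1000 classes.
feature_maps = np.random.rand(256, 8, 8)    # f_k(y, x)
class_weights = np.random.randn(1000, 256)  # w_k^c

# Global average pool: F_k, one value per feature map.
F = feature_maps.mean(axis=(1, 2))          # shape (256,)

# Softmax inputs: S_c = sum_k w_k^c F_k, then pick the top class.
S = class_weights @ F
c = int(np.argmax(S))

# Class activation map: M_c(y, x) = sum_k w_k^c f_k(y, x).
cam = np.tensordot(class_weights[c], feature_maps, axes=1)  # shape (8, 8)
```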
CAM for highest probability guess, which is “meerkat” with probability 40%.
What it looks like for us
What it looks like for us
Mountain bike, 43%
• Recall that the output of the GAP is
• $F_k = \frac{1}{Y X} \sum_{y,x} f_k(y, x)$
• What if, instead of needing class $c$, we instead use $F_k$ as the target?
• $M(y, x) = \sum_k F_k \, f_k(y, x)$
• Basically, this tells us how close a given spatial vector is to the average vector. One way to interpret this is, “how salient is the spatial vector to the classification?”
Auto-saliency
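The auto-saliency map is the same computation with the GAP vector standing in for the class weights; a self-contained illustrative sketch:

```python
import numpy as np

# f_k(y, x): last conv layer activations, hypothetical shape (K=256, H=8, W=8).
feature_maps = np.random.rand(256, 8, 8)

# GAP vector F_k, then M(y, x) = sum_k F_k f_k(y, x).
F = feature_maps.mean(axis=(1, 2))
saliency = np.tensordot(F, feature_maps, axes=1)  # (8, 8) heatmap
```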
Note that “lemur” isn’t actually a class that the network was trained against. The closest class neighbors are meerkat and koala.
Auto-saliency
Multiple regions within the image may be deemed important.
Auto-saliency
• Why is this important?
 •It allows us to visualize how the network behaves on inputs for classes that it wasn’t explicitly trained on.
• The fact that this works also reveals an open problem for us:
 •In order for the salient vectors to emerge, the non-salient regions of the image must either try to align themselves in the same direction as the salient vector (“dilation”),
 •Or the non-salient regions must reduce their magnitude so as not to bias the salient vector
Auto-saliency
• We have now seen how we can use CAMs, as well as the GAP vectors themselves, to guide the heatmaps.
• Finally, we can look back at the language model we trained earlier.
• $L = \max\bigl(0,\; f(x_2) \cdot g(q_1) - \boldsymbol{f(x_1) \cdot g(q_1)} + m\bigr)$
 •The language model learns to match the direction of the GAP
 •In effect, we can use the language model to generate the class weights for the CAM technique on the fly.
 •I think it’s neat to interpret the language model as a low-rank approximator of the (potentially infinite) classification weight matrix.
Language models as discriminators
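That observation suggests a simple recipe: embed the query, then use it as on-the-fly CAM weights over the spatial feature vectors. A sketch with random stand-ins for the model outputs (function and names are hypothetical):

```python
import numpy as np

def query_heatmap(feature_maps, query_vec):
    """Use a query embedding g(q) as class weights for a CAM-style heatmap.

    feature_maps: (K, H, W) spatial feature vectors from the vision model
    query_vec:    (K,) L2 normalized language model embedding
    Returns (H, W) strengths of the query concept at each position.
    """
    return np.tensordot(query_vec, feature_maps, axes=1)

fmaps = np.random.rand(256, 8, 8)  # stand-in vision model output
q = np.random.randn(256)
q /= np.linalg.norm(q)             # stand-in for g("lemur")
heat = query_heatmap(fmaps, q)
```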
Composition aware search - overview
[Diagram] Images are collected, run through the vision model, and their spatial feature vectors are stored in a spatial index. At query time, the user places “anchors” on a canvas: “Lamp” and “Chair” are anchors, each with a position and a query string that the language model embeds. A texture can also be an anchor, with a position, a size, and a query image.
• Vision Model
 •We are using a variant of the Inception v3 architecture from the paper by Szegedy et al. titled “Rethinking the Inception Architecture for Computer Vision”
 •Notable differences:
 •We are not using batch normalization
 •We are using ELU non-linearities instead of ReLUs
• Language Model
 •We tried to be fancy and use cool tech such as character models and LSTMs
 •The character LSTMs massively overfit on us
 •So we used words, and dropped recurrence altogether in favor of a simpler convolutional language model as described by Collobert et al. in “Natural Language Processing (Almost) from Scratch”
Models
• Let’s look at the query formulation for this, starting with the simple case.
• Let $S(i)$ be the score for image $i$
• Let $\mathbf{Q}$ be the set of anchors, and $\mathbf{q}_j$ be the $j$-th anchor’s L2 normalized (column) vector
• Let $\mathbf{V}_i$ be the set of spatial vectors in image $i$, and $\mathbf{v}_{ip}$ be the $p$-th L2 normalized spatial (column) vector
• Let $w_{jp}$ be a positional weight applied to position $p$ based on the position of anchor $j$
The search problem
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$
The search problem
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$
• $\max_{0 \le p \le P}$: we only care about the largest weighted similarity score. This gives us a single score per query anchor for an image.
• $w_{jp}$: we use this to weight the similarity score based on the relative position of the anchor to the spatial vector.
• $\frac{1}{|\mathbf{Q}|} \sum_j$: take the average over the anchors.
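A direct numpy transcription of that scoring function, assuming the anchor embeddings, spatial vectors, and positional weights are precomputed (array names and sizes are illustrative):

```python
import numpy as np

def score_image(V_i, Q, W_pos):
    """S(i) = (1/|Q|) * sum_j max_p w_jp * (v_ip . q_j).

    V_i:   (P, D) L2 normalized spatial vectors for image i
    Q:     (J, D) L2 normalized anchor vectors q_j
    W_pos: (J, P) positional weights w_jp
    """
    sims = Q @ V_i.T                    # (J, P): v_ip . q_j for every pair
    weighted = W_pos * sims             # apply the positional weights
    return weighted.max(axis=1).mean()  # max over positions, mean over anchors

# Illustrative usage: P=64 positions, D=256 dims, J=2 anchors.
V_i = np.random.randn(64, 256); V_i /= np.linalg.norm(V_i, axis=1, keepdims=True)
Q = np.random.randn(2, 256); Q /= np.linalg.norm(Q, axis=1, keepdims=True)
W_pos = np.random.rand(2, 64)
print(score_image(V_i, Q, W_pos))
```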
Query Canvas
[Figure] Visualization of the positional weight $w$ as a function of row and column on the query canvas, relative to the anchor’s position.
The search problem
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$
• Using the above definition, the size of the index is defined by the following variables:
 • $C$ - the size of the collection
 • $D$ - the dimensionality of the spatial vectors
 • $P$ - the number of spatial positions
• For our (beta) production offering, we have:
 • $C = 10{,}000{,}000$
 • $D = 256$
 • $P = 64$; we use 8 rows and 8 columns
• This requires about 611 GB of space to store the index.
• Algorithm complexity is also $O(C \cdot |\mathbf{Q}| \cdot P \cdot D)$, which is a lot.
Visualizing concepts
Using PCA, we can visualize how the concepts are arranged based on the 2 principal directions.
[Figure] Scatter of spatial vectors with clusters labeled “Cat” and “Background”.
Visualizing concepts
[Figure] Scatter of spatial vectors with clusters labeled “Mountains”, “Cyclists”, and “Sky”.
Visualizing concepts – reduction methods
[Figure] Side-by-side PCA and t-SNE projections of the same spatial vectors; one cluster is labeled “Asphalt”.
Visualizing concepts – reduction methods
• t-SNE was superior at disentangling concepts on a 2D plane
 •Manifold learning technique
 •Popular technique for data visualization
• PCA was still able to do a decent job
 •Linear
• We use PCA because embedding a new point can be computed efficiently with a single GEMM (matrix multiply)
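A small sketch of why that matters operationally: once the basis is fit, embedding new points with PCA is one matrix multiply, whereas t-SNE has no comparably cheap out-of-sample transform. Data and shapes here are illustrative:

```python
import numpy as np

# Fit a PCA basis offline on some D-dimensional vectors.
D, N = 256, 2
X = np.random.randn(1000, D)
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
basis = Vt[:N]  # (N, D) top principal directions

# Embedding new points is a single GEMM: (M, D) @ (D, N) -> (M, N).
new_points = np.random.randn(32, D)
embedded = (new_points - mean) @ basis.T
```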
The search problem (part 2)
• Since we are performing a dimensionality reduction on the spatial vectors for each image, let’s re-define the search problem.
• Let $\mathbf{B}_i \in \mathbb{R}^{N \times D}$ be the orthonormal basis of the PCA for image $i$ such that we preserve $N$ dimensions, and $N \le D$.
• Let $\mathbf{d}_{ip} = \mathbf{B}_i \mathbf{v}_{ip}$ be the reduced dimensionality spatial vector for image $i$ at position $p$.
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{d}_{ip}^{\top} (\mathbf{B}_i \mathbf{q}_j)$$
The search problem (part 2)
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{d}_{ip}^{\top} (\mathbf{B}_i \mathbf{q}_j)$$
• Project the query vector into the reduced dimensionality subspace for the image ($\mathbf{B}_i \mathbf{q}_j$).
• Then compute the dot product between the two vectors in the subspace.
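Extending the earlier scoring sketch to the reduced space; an illustrative transcription of the formula above, not the production implementation:

```python
import numpy as np

def score_image_reduced(D_i, B_i, Q, W_pos):
    """S(i) computed in image i's PCA subspace.

    D_i:   (P, N) reduced spatial vectors d_ip = B_i v_ip
    B_i:   (N, D) orthonormal PCA basis for image i
    Q:     (J, D) L2 normalized anchor vectors q_j
    W_pos: (J, P) positional weights w_jp
    """
    Q_proj = Q @ B_i.T     # (J, N): project each anchor, B_i q_j
    sims = Q_proj @ D_i.T  # (J, P): d_ip . (B_i q_j)
    return (W_pos * sims).max(axis=1).mean()
```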
The search problem (part 2)
Naïve definition
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$
• Requires $C \cdot D \cdot P$ storage space
 •611 GB for 10 million images
• Computation $O(C \cdot |\mathbf{Q}| \cdot P \cdot D)$
 •In practice, $P = 64$, $D = 256$, so $P \cdot D = 16384$
New definition
$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j}^{|\mathbf{Q}|} \max_{0 \le p \le P} \; w_{jp} \, \mathbf{d}_{ip}^{\top} (\mathbf{B}_i \mathbf{q}_j)$$
• Requires $C \cdot N(D + P)$ storage space
• $N \approx 4$
 •48 GB for 10 million images
 •12.7x reduction
• $O(C \cdot |\mathbf{Q}| \cdot N(D + P))$
 • $N(D + P) = 1280$
 •12.8x reduction
The search problem (part 3)
• Now the current computational complexity is:
• $O(C \cdot |\mathbf{Q}| \cdot N(D + P))$
• Importantly, this is still intractable because it still processes every image in the collection.
 •Users typically don’t want to wait roughly 7 seconds for a response
• Our best bet is to formulate the problem such that we only process a tiny fraction of $C$
Inverted index
• Construction:
 •Select a codebook size, $|\mathbf{W}|$
 •Find the $|\mathbf{W}|$ centroids of $C$ using a K-means-like process
 •Each of these centroids is called a “codeword”
 •Assign each vector in $\mathbf{C}$ to its nearest vector in $\mathbf{W}$
• Inference:
 •Find the $k$ nearest codewords in $\mathbf{W}$ to $\mathbf{q}$
 •Either return all of the vectors in the $k$ codewords, or perhaps find the $k'$ nearest vectors to $\mathbf{q}$ within the codewords.
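A compact sketch of that construction and inference, using scikit-learn’s KMeans for the codebook (an assumed dependency for illustration; any k-means implementation works):

```python
import numpy as np
from sklearn.cluster import KMeans

# Construction: cluster the collection into |W| codewords, bucket vectors by cell.
C = np.random.randn(100_000, 64)  # illustrative collection
km = KMeans(n_clusters=256, n_init=10).fit(C)
buckets = {w: np.flatnonzero(km.labels_ == w) for w in range(256)}

def query(q, k=4):
    """Visit only the k nearest codewords, then rank their member vectors."""
    cell_dists = np.linalg.norm(km.cluster_centers_ - q, axis=1)
    cand = np.concatenate([buckets[w] for w in np.argsort(cell_dists)[:k]])
    return cand[np.argsort(np.linalg.norm(C[cand] - q, axis=1))]

ranked = query(np.random.randn(64))
```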
Inverted multi-index
• Introduced by Babenko and Lempitsky at CVPR 2012
• This technique combines Product Quantization with Inverted Indices
• Construction:
 •Split your collection $C$ into $N$ partitions (typically $N = 2$), so each vector is split into sub-vectors (dims $1 \to r$ and dims $r + 1 \to D$)
 •For each partition $M$, find $|\mathbf{W}|$ cluster centers, as with the inverted index
 •For each vector in $C$, assign it to the nearest codeword in each partition independently.
 •This forms a Cartesian product of the codebooks, such that the full codebook is essentially size $|\mathbf{W}|^N$
• Paper: “The Inverted Multi-Index”
Inverted multi-index
• Inference:
 •Sort the codebooks in each partition $M$ based on distance to $\mathbf{q}'$
 •Traverse the $N$ codebooks by visiting the nearest codeword $m$, defined by the sum of the distances from each $\mathbf{q}'$ to each $\mathbf{m}$.
• I strongly recommend reading the paper for this one. It’s hard to explain on a slide.
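A deliberately simplified two-partition sketch: the paper’s multi-sequence algorithm enumerates cells lazily with a priority queue, but for illustration we brute-force rank all cell pairs by their summed half-distances (all sizes and names are assumptions):

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

D, r, W = 64, 32, 16            # dims, split point, per-partition codebook size
C = np.random.randn(50_000, D)  # illustrative collection

# Construction: independent codebooks on each half of every vector.
km1 = KMeans(n_clusters=W, n_init=10).fit(C[:, :r])
km2 = KMeans(n_clusters=W, n_init=10).fit(C[:, r:])
cells = defaultdict(list)       # (w1, w2) cell -> vector ids
for i, (a, b) in enumerate(zip(km1.labels_, km2.labels_)):
    cells[(a, b)].append(i)

def query(q, want=200):
    """Expand the nearest cells (by summed half-distances) until enough candidates."""
    d1 = np.linalg.norm(km1.cluster_centers_ - q[:r], axis=1)
    d2 = np.linalg.norm(km2.cluster_centers_ - q[r:], axis=1)
    order = np.argsort((d1[:, None] + d2[None, :]).ravel())
    out = []
    for flat in order:
        out.extend(cells.get((flat // W, flat % W), []))
        if len(out) >= want:
            break
    return out

candidates = query(np.random.randn(D))
```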
Inverted multi-index
Source: “The Inverted Multi-Index” by Artem Babenko and Victor Lempitsky
Visualization of a set of datapoints and their respective clusters.
The inverted multi-index for CAS
• For the most part, we use the basic formulation of the IMI
 •We use $|\mathbf{W}| = 10{,}000$, which results in 100 million possible codewords
• Except:
 •Each image has $P$ spatial vectors associated with it, so we assign each spatial vector to a cluster independently
 •This is actually the main reason we use the IMI over the regular inverted index: we effectively have $P \cdot C$ vectors to index, and the inverted index doesn’t scale as well.
 •The paper primarily addresses $L_2$ distance, but we use cosine distance
 •Scale the codebook vectors by $\frac{\sqrt{N}}{N}$ such that the magnitude of any set of codewords $m_1, m_2, \cdots, m_N$ is 1
The inverted multi-index for CAS
• We expand clusters for each query term until we reach a fixed number of images
• We then take the set union of the expansions for each query term, and run the previously defined scoring function.
• We look for about 5k images per anchor, so we typically only rank between 0.05% and 0.15% of the collection.
Spatial-semantic image search by visual feature synthesis
Mai et al. introduced the above-titled paper at CVPR 2017. (Figure credit: Mai et al.)
Mai et al. paper
• Joint effort between Portland State University and Adobe Research
• Their problem space is very similar to CAS
• Key differences:
 •Their language model learns to map all of the non-uniformly sized anchors to a single feature vector, and then search proceeds like a standard nearest neighbors query.
 •They train their models using a dataset with object localization information (COCO)
• Basically, if your data has localized labels, their approach is very compelling.
Levels Of Supervision
• Unsupervised: no labeled data (GANs, VAEs, etc.). Where I want to be. LeCun thinks so too.
• Semi-supervised: some labeled data. What best leverages Shutterstock’s data.
• Supervised: classification labels (ILSVRC, etc.). Where CAS is currently.
• Very supervised: classification and localization labels, sometimes even pixel-level segmentation (COCO, etc.). Mai et al.
Challenges to the current system
• The global average pool encourages a couple of different bad behaviors:
 •It “dilates” the salient regions of the image, such that neurons near the salient concept adopt the salient vectors instead of representing the primary concept of their own receptive field
 •It creates a hierarchy of vector magnitudes, such that salient concepts have much larger magnitudes than less salient concepts. This can allow the network to learn less-robust representations of the non-salient image patches.
Thank you!
Mike Ranzinger
mranzinger@shutterstock.com
www.Shutterstock.com/labs/compositionsearch