Visual Search and Question Answering II
1. Liangliang Cao
http://www.llcao.net
UMass (now at Google AI*)
* The research in this talk was done before joining Google/Facebook
Visual Search and Question Answering
Lu Jiang
http://www.lujiang.info/
Google AI
Yannis Kalantidis
http://www.skamalas.com/
Facebook AI*
ICME 2019 Tutorial
July 8th 13:30--17:00
2. I. Overview of Visual Search and Understanding (Liangliang).
II. Visual Representations and Indexing (Yannis)
III. MemexQA (Lu)
Outline
9. Visual Search Applications
Similarity search:
● Given an image as query, show me visually similar images
● Useful tool for commercial photo search & licensing
● Visually congruent native ads
Clustering and deduplication:
● Cluster images of a large collection for browsing
● Personal photo album summarization
● Deduplicate or diversify image search results
Batch search and recommendations:
● Use all photos from a group to recommend photos to the group admin
● Use all photos favorited by a user to get recommendations
● Visual recommendations can be combined with social metadata
10. Basic Ingredients for large-scale search
Representation Learning
Documents/images/videos are represented as vectors
Quantization and Indexing
● Storing high dimensional features could be prohibitive
○ Hashing (bad performance, reconstruction not possible)
○ Quantization (better performance, allows approx. reconstruction)
● Search is feasible only if a very small percentage of the collection is checked → Indexing
12. Some Recent Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
14. The Multi-fiber Unit
Idea: slice the complex residual unit into N parallel and separate units (called fibers), each of which is isolated from the others
15. The Multi-fiber Unit
● A single fiber cannot access and utilize the features learned by the others
● A transistor component facilitates information flow across these fibers
● Setting the number of first-layer output channels to be 4 times smaller reduces the cost by a factor of 2
[Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018]
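The parameter saving from slicing can be illustrated with a quick count (a back-of-envelope sketch, not the authors' code; the layer sizes below are made up for illustration):

```python
def conv_params(c_in, c_out, k=3):
    # Parameters of one k x k convolution layer (bias ignored).
    return c_in * c_out * k * k

def multifiber_params(c_in, c_out, n_fibers, k=3):
    # Channels sliced into n_fibers isolated paths ("fibers"):
    # each fiber only connects c_in/n input to c_out/n output channels.
    assert c_in % n_fibers == 0 and c_out % n_fibers == 0
    return n_fibers * conv_params(c_in // n_fibers, c_out // n_fibers, k)

full = conv_params(64, 64)             # 36864 parameters
sliced = multifiber_params(64, 64, 4)  # 9216: slicing into N fibers cuts cost by N
```

The transistor component on the slide restores cross-fiber information flow; since it can be implemented with inexpensive 1×1 convolutions, the overall saving stays close to the factor N.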
18. Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
19. Reducing computations for attention mechanisms
Incorporating global context
● e.g. the attention mechanisms [Vaswani et al. 2017, Wang et al. 2018]
● Enables interactions between locations over the full coordinate space
● Requires computing and storing a (quadratic) matrix of all input location pairs
Convolutional Neural Networks model local relations
● Operate on the (spatio-temporal) coordinate space grid
● Require stacking multiple layers to capture relations between distant locations
[Vaswani et al. Attention is all you need. NIPS 2017]
[Wang et al. Non-local Neural Networks. CVPR 2018]
20. A²-Nets: Double Attention Networks
Decomposed attention mechanism
Aggregate and propagate features from the entire (spatio-temporal) input space efficiently
● First attention: gather features from the entire space into a compact set through second-order attention pooling
● Second attention: adaptively select and distribute features to each location
[Chen, Kalantidis, et al. A²-Nets: Double Attention Networks. NeurIPS 2018]
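The two attention steps can be sketched in NumPy (a minimal sketch: the 1×1 convolutions of the paper are replaced by random projection matrices, and all shapes are made-up illustrations):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_attention(X, Wa, Wb, Wv):
    """X: (C, HW) flattened feature map; Wa, Wb, Wv: 1x1-conv stand-ins."""
    A = Wa @ X                   # features to gather, (C', HW)
    B = softmax(Wb @ X, axis=1)  # K gathering attention maps over locations
    G = A @ B.T                  # first attention: K global descriptors, (C', K)
    V = softmax(Wv @ X, axis=0)  # per-location distribution over the K descriptors
    return G @ V                 # second attention: redistribute, (C', HW)

rng = np.random.default_rng(0)
C, HW, Cp, K = 32, 14 * 14, 16, 8
X = rng.standard_normal((C, HW))
Z = double_attention(X, rng.standard_normal((Cp, C)),
                     rng.standard_normal((K, C)),
                     rng.standard_normal((K, C)))
assert Z.shape == (Cp, HW)  # never materializes an (HW x HW) pairwise matrix
```

The point of the decomposition: the cost is O(HW·K·C) rather than the quadratic O(HW²·C) of full self-attention, since only K global descriptors are gathered and redistributed.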
22. Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
23. Global context modeling is highly important
● Attention-like mechanisms are becoming standard across ML
A limitation of current global context modeling approaches
● Follow the Gather → Distribute model
● Only focus on delivering information
● Rely on convolutional layers for reasoning
Can we capture and reason on global region interactions efficiently?
Beyond the simple attention mechanism
24. Gather → Reason → Distribute
Can we construct a (latent) space where relations over sets of features scattered over the coordinate space translate to simple feature interactions?
Global Reasoning Networks
[Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
[Figure: Coordinate Space vs. Interaction Space]
25. 1) From Coordinate Space to Interaction Space → Weighted projections
2) Reasoning in Interaction Space → Graph convolutions
3) From Interaction Space (back) to Coordinate Space → Weighted broadcasting
Global Reasoning in Three Steps
26. From Coordinate Space to Interaction Space
● We want to learn a set of projections for (arbitrary) region features
27. From Coordinate Space to Interaction Space
Given a set of input features, compute a projection function with learnable projection weights
28. From Coordinate Space to Interaction Space
[Figure: projection weights b_i map the C×H×W input features into the interaction space]
29. From Coordinate Space to Interaction Space
[Figure: N projection maps over the H×W grid aggregate the input into N feature vectors of dimension C]
30. From Coordinate Space to Interaction Space
● After projection → N feature vectors
31. From Coordinate Space to Interaction Space
● After projection → N feature vectors
● Relations between arbitrary regions → interactions between features
● What is an efficient way of reasoning over feature interactions?
32. Reasoning in Interaction Space
How to model interactions?
● Treat each feature as a node in a fully-connected graph
● Learn the edge weights that correspond to interactions of features
● Graph convolution formulation by [Kipf & Welling]: a state update that multiplies the node features by an N × N (learnt) adjacency matrix, followed by reverse projection
[Kipf & Welling. Semi-supervised classification with graph convolutional networks. ICLR 2017]
33. From Interaction Space to Coordinate Space
● Reverse projection: distribute the updated states back
● Reuse the projection weights
34. Global Reasoning (GloRe) Unit
● Projection: weighted global pooling
● Reasoning: graph convolution
● Reverse projection: weighted broadcasting
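The three steps of the unit can be sketched in NumPy (a minimal sketch under simplifying assumptions: the projection B and adjacency A below are random stand-ins for the quantities the network learns, and the node feature dimension is kept equal to C):

```python
import numpy as np

def glore_unit(X, B, A, Wg):
    """X: (C, L) features over L = H*W locations; B: (N, L) projection weights;
    A: (N, N) learnt adjacency; Wg: (C, C) graph-conv state update."""
    V = B @ X.T                        # project: N node features in interaction space, (N, C)
    V = (np.eye(len(A)) - A) @ V @ Wg  # reason: one graph convolution over the N nodes
    Y = (B.T @ V).T                    # reverse projection: weighted broadcasting, (C, L)
    return X + Y                       # residual unit: output keeps the input's shape

rng = np.random.default_rng(0)
C, L, N = 64, 14 * 14, 8
X = rng.standard_normal((C, L))
out = glore_unit(X, rng.standard_normal((N, L)) / L,
                 rng.standard_normal((N, N)), rng.standard_normal((C, C)) / C)
assert out.shape == X.shape
```

Because reasoning happens over only N nodes instead of all L locations, the graph convolution is cheap, and the residual form makes the unit plug-and-play.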
38. Global Reasoning (GloRe) Unit
What do the learnt projection weights look like?
39. Visualization of the learnt projection weights
40. Global Reasoning Networks
The Global Reasoning (GloRe) unit
● is highly efficient (smaller computational cost than self-attention)
● is a plug-and-play residual unit that can be inserted into CNNs for different tasks
Image classification & action recognition backbone CNNs
● Insert one or more units at different positions
Semantic segmentation
● Insert before the bottleneck
Figure from [Noh et al. ICCV 2015]
41. Ablations on ImageNet
How many blocks to add and where?
How many graph convolutions?
42. Experiments on ImageNet
43. Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
44. [Huang et al. Multi-Scale Dense Networks for Resource Efficient Image Classification, ICLR 2018]
[Chen et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition, ICLR 2019]
Reducing Spatial Redundancy
Many approaches exploit multi-scale inputs
• Recent examples
• Multi-scale DenseNets [Huang et al.]: multi-resolution paths over a DenseNet
• Big-Little Nets [Chen et al.]: multi-resolution paths, synchronizing at every block
• In both cases, the network architecture is altered
Spatial redundancy in feature maps
• ConvNet kernels are highly local
• Some feature maps must contain low-frequency information (smooth and slowly varying)
46. Octave Convolution
Advantages
• Multi-scale processing with effective communication between the low- and high-frequency maps
• Gains in terms of FLOPs
• Gains in terms of memory
• Larger receptive field for low-frequency feature maps
[Figure: the Octave Convolution kernel]
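The FLOPs and memory gains can be estimated with a back-of-envelope count (my own estimate, not the paper's exact figures): a ratio α of the channels is kept at half resolution, i.e. a quarter of the pixels, and the cross-frequency paths are computed at the low resolution.

```python
def octconv_memory_ratio(alpha):
    # Feature-map memory vs. vanilla conv: the alpha low-frequency channels
    # live at half resolution, i.e. a quarter of the pixels.
    return (1 - alpha) + alpha / 4

def octconv_flops_ratio(alpha):
    # FLOPs vs. vanilla conv, summing the four information paths.
    hh = (1 - alpha) ** 2                # high -> high, full resolution
    ll = alpha ** 2 / 4                  # low -> low, quarter of the pixels
    cross = 2 * alpha * (1 - alpha) / 4  # high -> low and low -> high, at low res
    return hh + ll + cross

assert octconv_flops_ratio(0.0) == 1.0  # alpha = 0 recovers vanilla convolution
print(octconv_flops_ratio(0.5))         # 0.4375
print(octconv_memory_ratio(0.5))        # 0.625
```

At α = 0.5 this estimate gives more than 2× fewer FLOPs, consistent with the speedup discussion on the following slides.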
47. import OctConv as conv
Ablation study on ImageNet for varying models and ratios
49. Is the speedup real?
• On CPU (i.e. FB production): reaching (almost) theoretical gains!
• On GPU: an optimized CUDA-level implementation is required
Results for ResNet-50
52. Basic Ingredients for large-scale search
Representation Learning
Documents/images/videos are represented as vectors
Quantization and Indexing
● Storing high dimensional features could be prohibitive
○ Hashing (bad performance, reconstruction not possible)
○ Quantization (better performance, allows approx. reconstruction)
● Search is feasible only if a very small percentage of the collection is checked → Indexing
53. Quantization: k-means
Pros:
● Very high compression
Cons:
● Hard to train for large k
● Performance is good only for large k
Idea: Create a “vocabulary” in high-dimensional space through clustering
Represent each vector with the index of its closest “word”
[MacQueen 1967]
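The vocabulary idea in code (a toy sketch; the codebook below is random, whereas in practice it comes from k-means training):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 256, 128
codebook = rng.standard_normal((k, d))  # the k "words"; k-means centroids in practice

def encode(x):
    # Represent x by the index of its closest word:
    # 128 float32 values (512 bytes) become a single byte for k = 256.
    return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))

def decode(code):
    return codebook[code]               # reconstruction = the word itself

x = rng.standard_normal(d)
code = encode(x)
x_hat = decode(code)                    # coarse approximation of x
```

This makes the pros/cons concrete: compression is extreme (one index per vector), but reconstruction quality depends entirely on how many words k the codebook has.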
54. Quantization: product quantization
Idea: Split the vector into multiple sub-vectors and create a vocabulary for each sub-vector
Represent each feature with the list of indices of its closest words
[Gray, ASSP 1984]
[Jegou, Douze & Schmid, PAMI 2011]
55. Quantization: product quantization
Pros:
● Tunable compression & better reconstruction
● Easy & fast to train: a vocabulary of size k gives you k^m effective "cells" for m sub-vectors
Cons:
● Independence assumption ("fix": PCA)
● Unbalanced partitioning (fix: OPQ)
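Product quantization in a few lines (a toy sketch with random codebooks standing in for the m trained vocabularies):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, d = 4, 256, 128                  # m sub-vectors, a vocabulary of k words each
sub = d // m
codebooks = rng.standard_normal((m, k, sub))

def pq_encode(x):
    # One index per sub-vector: m * k centroids give k**m effective cells.
    parts = x.reshape(m, sub)
    return [int(np.argmin(((codebooks[i] - parts[i]) ** 2).sum(axis=1)))
            for i in range(m)]

def pq_decode(codes):
    # Approximate reconstruction by concatenating the chosen words
    # (not possible with hashing).
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

x = rng.standard_normal(d)
codes = pq_encode(x)                   # 4 bytes instead of 512 (float32)
x_hat = pq_decode(codes)
```

With m = 4 and k = 256 this yields 256⁴ ≈ 4.3 billion effective cells while training only 4 × 256 centroids, which is why PQ scales where plain k-means cannot.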
61. Indexing: multi-index
Idea: Use product quantization for indexing: split into 2 sub-vectors
Pros:
● 2-step quantization: in the second stage one can quantize residuals
● Finer partitioning / smaller residuals
Cons:
● Need to search many cells/posting lists (fix: the multi-sequence algorithm, which traverses neighboring cells fast)
[Babenko & Lempitsky, CVPR 2012]
62. Multi-LOPQ: Searching in a multi-index
● split the query vector
● sort PQ centroids by ascending distance for each subvector
● start at the cell (Q1[0], Q2[0]), the first clusters in each posting list
● for the current cell (Q1[a], Q2[b]), insert both its bottom and right neighbors into a priority queue with priority: dist(xL, Q1[a]) + dist(xR, Q2[b])
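The traversal above can be sketched with a priority queue (a simplified variant of the multi-sequence algorithm that uses a visited set; d1 and d2 stand for the sorted distances of the two query halves to their PQ centroids):

```python
import heapq

def multi_sequence(d1, d2):
    # Visit cells (a, b) of the multi-index in ascending order of
    # d1[a] + d2[b]; d1 and d2 must each be sorted ascending.
    heap = [(d1[0] + d2[0], 0, 0)]      # start at (Q1[0], Q2[0])
    seen = {(0, 0)}
    while heap:
        dist, a, b = heapq.heappop(heap)
        yield (a, b), dist
        # Push the right and bottom neighbours of the current cell.
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(d1) and nb < len(d2) and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (d1[na] + d2[nb], na, nb))

d1, d2 = [0.1, 0.4, 0.9], [0.2, 0.3, 1.0]
cells = [c for c, _ in multi_sequence(d1, d2)]
assert cells[0] == (0, 0)               # the first clusters in each posting list
```

Because both lists are sorted, every neighbor pushed has a distance at least that of the popped cell, so cells are yielded in non-decreasing order of combined distance and the search can stop as soon as enough candidates have been collected.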