Visual Search and Question Answering II
1. Liangliang Cao
http://www.llcao.net
UMass (now at Google AI*)
* The research in this talk was done before joining Google/Facebook
Visual Search and Question Answering
Lu Jiang
http://www.lujiang.info/
Google AI
Yannis Kalantidis
http://www.skamalas.com/
Facebook AI*
ICME 2019 Tutorial
July 8th 13:30--17:00
2. I. Overview of Visual Search and Understanding (Liangliang).
II. Visual Representations and Indexing (Yannis)
III. MemexQA (Lu)
Outline
9. Visual Search Applications
Similarity search:
● Given an image as query, show me visually similar images
● Useful tool for commercial photo search & licensing
● Visually congruent native ads
Clustering and deduplication:
● Cluster images of a large collection for browsing
● Personal photo album summarization
● Deduplicate or diversify image search results
Batch search and recommendations:
● Use all photos from a group to recommend photos to the group admin
● Use all photos favorited by a user to get recommendations
● Visual recommendations can be combined with social metadata
10. Basic Ingredients for large-scale search
Representation Learning
Documents/images/videos are represented as vectors
Quantization and Indexing
● Storing high dimensional features could be prohibitive
○ Hashing (bad performance, reconstruction not possible)
○ Quantization (better performance, allows approx. reconstruction)
● Search is feasible only if a very small percentage of the collection is checked → Indexing
12. Some Recent Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
14. The Multi-fiber Unit
Idea: slice the complex residual unit into N parallel and separate units (called fibers), each of which is isolated from the others
15. The Multi-fiber Unit
● A single fiber cannot access and utilize the features learned by the others
● A transistor component facilitates information flow across these fibers
● Setting the number of first-layer output channels to be 4 times smaller reduces the cost by a factor of 2
[Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018]
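The parameter saving from slicing can be illustrated with a quick count (a back-of-envelope sketch, not the authors' code; the layer sizes below are made up for illustration):

```python
def conv_params(c_in, c_out, k=3):
    # Parameters of one k x k convolution layer (bias ignored).
    return c_in * c_out * k * k

def multifiber_params(c_in, c_out, n_fibers, k=3):
    # Channels sliced into n_fibers isolated paths ("fibers"):
    # each fiber only connects c_in/n input to c_out/n output channels.
    assert c_in % n_fibers == 0 and c_out % n_fibers == 0
    return n_fibers * conv_params(c_in // n_fibers, c_out // n_fibers, k)

full = conv_params(64, 64)             # 36864 parameters
sliced = multifiber_params(64, 64, 4)  # 9216: slicing into N fibers cuts cost by N
```

The transistor component on the slide restores cross-fiber information flow; since it can be implemented with inexpensive 1×1 convolutions, the overall saving stays close to the factor N.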
18. Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
19. Reducing computations for attention mechanisms
Incorporating global context
● e.g. the attention mechanisms [Vaswani et al. 2017, Wang et al. 2018]
● Enables interactions between locations over the full coordinate space
● Requires computing and storing a (quadratic) matrix of all input location pairs
Convolutional Neural Networks model local relations
● Operate on the (spatio-temporal) coordinate space grid
● Require stacking multiple layers to capture relations between distant locations
[Vaswani et al. Attention is all you need. NIPS 2017]
[Wang et al. Non-local Neural Networks. CVPR 2018]
20. A²-Nets: Double Attention Networks
Decomposed attention mechanism
Aggregate and propagate features from the entire (spatio-temporal) input space efficiently
● First attention: gather features from the entire space into a compact set through second-order attention pooling
● Second attention: adaptively select and distribute features to each location
[Chen, Kalantidis, et al. A²-Nets: Double Attention Networks. NeurIPS 2018]
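The two attention steps can be sketched in NumPy (a minimal sketch: the 1×1 convolutions of the paper are replaced by random projection matrices, and all shapes are made-up illustrations):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_attention(X, Wa, Wb, Wv):
    """X: (C, HW) flattened feature map; Wa, Wb, Wv: 1x1-conv stand-ins."""
    A = Wa @ X                   # features to gather, (C', HW)
    B = softmax(Wb @ X, axis=1)  # K gathering attention maps over locations
    G = A @ B.T                  # first attention: K global descriptors, (C', K)
    V = softmax(Wv @ X, axis=0)  # per-location distribution over the K descriptors
    return G @ V                 # second attention: redistribute, (C', HW)

rng = np.random.default_rng(0)
C, HW, Cp, K = 32, 14 * 14, 16, 8
X = rng.standard_normal((C, HW))
Z = double_attention(X, rng.standard_normal((Cp, C)),
                     rng.standard_normal((K, C)),
                     rng.standard_normal((K, C)))
assert Z.shape == (Cp, HW)  # never materializes an (HW x HW) pairwise matrix
```

The point of the decomposition: the cost is O(HW·K·C) rather than the quadratic O(HW²·C) of full self-attention, since only K global descriptors are gathered and redistributed.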
22. Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
23. Global context modeling is highly important
● Attention-like mechanisms are becoming standard across ML
A limitation of current global context modeling approaches
● Follow the Gather → Distribute model
● Only focus on delivering information
● Rely on convolutional layers for reasoning
Can we capture and reason on global region interactions efficiently?
Beyond the simple attention mechanism
24. Gather → Reason → Distribute
Can we construct a (latent) space where relations over sets of features scattered over the coordinate space translate to simple feature interactions?
Global Reasoning Networks
[Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
[Figure: Coordinate Space vs. Interaction Space]
25. 1) From Coordinate Space to Interaction Space → Weighted projections
2) Reasoning in Interaction Space → Graph convolutions
3) From Interaction Space (back) to Coordinate Space → Weighted broadcasting
Global Reasoning in Three Steps
26. From Coordinate Space to Interaction Space
● We want to learn a set of projections for (arbitrary) region features
27. From Coordinate Space to Interaction Space
Given a set of input features, compute a projection function with learnable projection weights
28. From Coordinate Space to Interaction Space
[Figure: projection weights b_i map the C×H×W input features into the interaction space]
29. From Coordinate Space to Interaction Space
[Figure: N projection maps over the H×W grid aggregate the input into N feature vectors of dimension C]
30. From Coordinate Space to Interaction Space
● After projection → N feature vectors
31. From Coordinate Space to Interaction Space
● After projection → N feature vectors
● Relations between arbitrary regions → interactions between features
● What is an efficient way of reasoning over feature interactions?
32. Reasoning in Interaction Space
How to model interactions?
● Treat each feature as a node in a fully-connected graph
● Learn the edge weights that correspond to interactions of features
● Graph convolution formulation by [Kipf & Welling]: a state update that multiplies the node features by an N × N (learnt) adjacency matrix, followed by reverse projection
[Kipf & Welling. Semi-supervised classification with graph convolutional networks. ICLR 2017]
33. From Interaction Space to Coordinate Space
● Reverse projection: distribute the updated states back
● Reuse the projection weights
34. Global Reasoning (GloRe) Unit
● Projection: weighted global pooling
● Reasoning: graph convolution
● Reverse projection: weighted broadcasting
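The three steps of the unit can be sketched in NumPy (a minimal sketch under simplifying assumptions: the projection B and adjacency A below are random stand-ins for the quantities the network learns, and the node feature dimension is kept equal to C):

```python
import numpy as np

def glore_unit(X, B, A, Wg):
    """X: (C, L) features over L = H*W locations; B: (N, L) projection weights;
    A: (N, N) learnt adjacency; Wg: (C, C) graph-conv state update."""
    V = B @ X.T                        # project: N node features in interaction space, (N, C)
    V = (np.eye(len(A)) - A) @ V @ Wg  # reason: one graph convolution over the N nodes
    Y = (B.T @ V).T                    # reverse projection: weighted broadcasting, (C, L)
    return X + Y                       # residual unit: output keeps the input's shape

rng = np.random.default_rng(0)
C, L, N = 64, 14 * 14, 8
X = rng.standard_normal((C, L))
out = glore_unit(X, rng.standard_normal((N, L)) / L,
                 rng.standard_normal((N, N)), rng.standard_normal((C, C)) / C)
assert out.shape == X.shape
```

Because reasoning happens over only N nodes instead of all L locations, the graph convolution is cheap, and the residual form makes the unit plug-and-play.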
38. Global Reasoning (GloRe) Unit
What do the learnt projection weights look like?
39. Visualization of the learnt projection weights
40. Global Reasoning Networks
The Global Reasoning (GloRe) unit
● is highly efficient (smaller computational cost than self-attention)
● is a plug-and-play residual unit that can be inserted into CNNs for different tasks
Image classification & action recognition backbone CNNs
● Insert one or more units at different positions
Semantic segmentation
● Insert before the bottleneck
Figure from [Noh et al. ICCV 2015]
41. Ablations on ImageNet
How many blocks to add and where?
How many graph convolutions?
42. Experiments on ImageNet
43. Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
44. [Huang et al. Multi-Scale Dense Networks for Resource Efficient Image Classification, ICLR 2018]
[Chen et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition, ICLR 2019]
Reducing Spatial Redundancy
Many approaches exploit multi-scale inputs
• Recent examples
• Multi-scale DenseNets [Huang et al.]: multi-resolution paths over a DenseNet
• Big-Little Nets [Chen et al.]: multi-resolution paths, synchronizing at every block
• In both cases, the network architecture is altered
Spatial redundancy in feature maps
• ConvNet kernels are highly local
• Some feature maps must contain low-frequency information (smooth and slowly varying)
46. Octave Convolution
Advantages
• Multi-scale processing with effective communication between the low- and high-frequency maps
• Gains in terms of FLOPs
• Gains in terms of memory
• Larger receptive field for low-frequency feature maps
[Figure: the Octave Convolution kernel]
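The FLOPs and memory gains can be estimated with a back-of-envelope count (my own estimate, not the paper's exact figures): a ratio α of the channels is kept at half resolution, i.e. a quarter of the pixels, and the cross-frequency paths are computed at the low resolution.

```python
def octconv_memory_ratio(alpha):
    # Feature-map memory vs. vanilla conv: the alpha low-frequency channels
    # live at half resolution, i.e. a quarter of the pixels.
    return (1 - alpha) + alpha / 4

def octconv_flops_ratio(alpha):
    # FLOPs vs. vanilla conv, summing the four information paths.
    hh = (1 - alpha) ** 2                # high -> high, full resolution
    ll = alpha ** 2 / 4                  # low -> low, quarter of the pixels
    cross = 2 * alpha * (1 - alpha) / 4  # high -> low and low -> high, at low res
    return hh + ll + cross

assert octconv_flops_ratio(0.0) == 1.0  # alpha = 0 recovers vanilla convolution
print(octconv_flops_ratio(0.5))         # 0.4375
print(octconv_memory_ratio(0.5))        # 0.625
```

At α = 0.5 this estimate gives more than 2× fewer FLOPs, consistent with the speedup discussion on the following slides.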
47. import OctConv as conv
Ablation study on ImageNet for varying models and ratios
49. Is the speedup real?
• On CPU (i.e. FB production): reaching (almost) theoretical gains!
• On GPU: an optimized CUDA-level implementation is required
Results for ResNet-50
52. Basic Ingredients for large-scale search
Representation Learning
Documents/images/videos are represented as vectors
Quantization and Indexing
● Storing high dimensional features could be prohibitive
○ Hashing (bad performance, reconstruction not possible)
○ Quantization (better performance, allows approx. reconstruction)
● Search is feasible only if a very small percentage of the collection is checked → Indexing
53. Quantization: k-means
Pros:
● Very high compression
Cons:
● Hard to train for large k
● Performance is good only for large k
Idea: Create a “vocabulary” in high-dimensional space through clustering
Represent each vector with the index of its closest “word”
[MacQueen 1967]
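The vocabulary idea in code (a toy sketch; the codebook below is random, whereas in practice it comes from k-means training):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 256, 128
codebook = rng.standard_normal((k, d))  # the k "words"; k-means centroids in practice

def encode(x):
    # Represent x by the index of its closest word:
    # 128 float32 values (512 bytes) become a single byte for k = 256.
    return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))

def decode(code):
    return codebook[code]               # reconstruction = the word itself

x = rng.standard_normal(d)
code = encode(x)
x_hat = decode(code)                    # coarse approximation of x
```

This makes the pros/cons concrete: compression is extreme (one index per vector), but reconstruction quality depends entirely on how many words k the codebook has.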
54. Quantization: product quantization
Idea: Split the vector into multiple sub-vectors and create a vocabulary for each sub-vector
Represent each feature with the list of indices of its closest words
[Gray, ASSP 1984]
[Jegou, Douze & Schmid, PAMI 2011]
55. Quantization: product quantization
Pros:
● Tunable compression & better reconstruction
● Easy & fast to train: a vocabulary of size k gives you k^m effective "cells" for m sub-vectors
Cons:
● Independence assumption ("fix": PCA)
● Unbalanced partitioning (fix: OPQ)
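Product quantization in a few lines (a toy sketch with random codebooks standing in for the m trained vocabularies):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, d = 4, 256, 128                  # m sub-vectors, a vocabulary of k words each
sub = d // m
codebooks = rng.standard_normal((m, k, sub))

def pq_encode(x):
    # One index per sub-vector: m * k centroids give k**m effective cells.
    parts = x.reshape(m, sub)
    return [int(np.argmin(((codebooks[i] - parts[i]) ** 2).sum(axis=1)))
            for i in range(m)]

def pq_decode(codes):
    # Approximate reconstruction by concatenating the chosen words
    # (not possible with hashing).
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])

x = rng.standard_normal(d)
codes = pq_encode(x)                   # 4 bytes instead of 512 (float32)
x_hat = pq_decode(codes)
```

With m = 4 and k = 256 this yields 256⁴ ≈ 4.3 billion effective cells while training only 4 × 256 centroids, which is why PQ scales where plain k-means cannot.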
61. Indexing: multi-index
Idea: Use product quantization for indexing: split into 2 sub-vectors
Pros:
● 2-step quantization: in the second stage one can quantize residuals
● Finer partitioning / smaller residuals
Cons:
● Need to search many cells/posting lists (fix: the multi-sequence algorithm, which traverses neighboring cells fast)
[Babenko & Lempitsky, CVPR 2012]
62. Multi-LOPQ: Searching in a multi-index
● split the query vector
● sort PQ centroids by ascending distance for each subvector
● start at the cell (Q1[0], Q2[0]), the first clusters in each posting list
● for the current cell (Q1[a], Q2[b]), insert both its bottom and right neighbors into a priority queue with priority: dist(xL, Q1[a]) + dist(xR, Q2[b])
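The traversal above can be sketched with a priority queue (a simplified variant of the multi-sequence algorithm that uses a visited set; d1 and d2 stand for the sorted distances of the two query halves to their PQ centroids):

```python
import heapq

def multi_sequence(d1, d2):
    # Visit cells (a, b) of the multi-index in ascending order of
    # d1[a] + d2[b]; d1 and d2 must each be sorted ascending.
    heap = [(d1[0] + d2[0], 0, 0)]      # start at (Q1[0], Q2[0])
    seen = {(0, 0)}
    while heap:
        dist, a, b = heapq.heappop(heap)
        yield (a, b), dist
        # Push the right and bottom neighbours of the current cell.
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(d1) and nb < len(d2) and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (d1[na] + d2[nb], na, nb))

d1, d2 = [0.1, 0.4, 0.9], [0.2, 0.3, 1.0]
cells = [c for c, _ in multi_sequence(d1, d2)]
assert cells[0] == (0, 0)               # the first clusters in each posting list
```

Because both lists are sorted, every neighbor pushed has a distance at least that of the popped cell, so cells are yielded in non-decreasing order of combined distance and the search can stop as soon as enough candidates have been collected.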