This document describes research on using region-oriented convolutional neural networks for object retrieval. It discusses using local CNNs like CaffeNet, Fast R-CNN, and SDS to extract visual features from object candidates in images. These features are used to match against query descriptors. Pooled regional features are ranked to retrieve relevant shots. Fine-tuning pre-trained networks on larger datasets like COCO can improve retrieval accuracy. Combining global and local approaches through re-ranking provides an additional boost in performance.
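The re-ranking idea mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rerank`, the `top_n` cutoff, and the convention that a higher local score means a better match are all assumptions made for the example.

```python
def rerank(global_ranking, local_scores, top_n=100):
    """Re-order the head of a global ranking using local (region-based) scores.

    global_ranking: list of shot ids, best first, from the global approach.
    local_scores:   dict shot id -> score from the local approach (assumed:
                    higher is better). Shots without a score sink to the
                    end of the re-ranked head.
    """
    head = global_ranking[:top_n]
    tail = global_ranking[top_n:]
    # sorted() is stable, so ties keep their order from the global ranking
    head = sorted(head, key=lambda s: local_scores.get(s, float("-inf")),
                  reverse=True)
    return head + tail
```

Only the first `top_n` shots are touched, so the more expensive local features need to be computed for just a small part of the collection.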
9. from shallow to deep learning
state of the art
“hand crafted” features: Bag of Words, SIFT, Histograms of gradients
“learned” features: Convolutional Neural Networks (CNNs)
10. why deep learning now?
large datasets, powerful GPUs
...
16. object candidates
Selective Search bounding boxes
Uijlings et al. (Trento), Selective Search for Object Recognition (2013)
MCG segments
Arbeláez et al. (Berkeley), Multiscale Combinatorial Grouping (2014)
17. R-CNN
Girshick et al. (Berkeley), Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
Object Detection network
22. TRECVid Instance Search
local CNNs for instance search
large collection of videos: 464 h
shots: ~470k
frames: 1/4 fps
...in our case, subset of 13k shots (23k frames)
23. a Big Data scenario
24. query descriptors
query set → descriptors
CaffeNet: visual features of the image
Fast R-CNN: visual features of the bbox
SDS: visual features of the region
27. main scheme
frames in 1 shot → object candidates
CaffeNet, Fast R-CNN, SDS: visual features → matching against query descriptors → pooling → ranking
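The matching, pooling and ranking steps of the scheme can be sketched as below. This is a minimal illustration under stated assumptions: the slides do not say which similarity or pooling operator is used, so cosine similarity and max-pooling over the candidates of a shot are assumed here, and the names `shot_score` and `rank_shots` are hypothetical.

```python
import numpy as np

def shot_score(query_desc, candidate_feats):
    """Score one shot against one query descriptor.

    query_desc:      (dim,) visual feature of the query.
    candidate_feats: (n_candidates, dim) features of the object candidates
                     gathered from all frames of the shot.
    """
    q = query_desc / np.linalg.norm(query_desc)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = c @ q              # matching: cosine similarity per candidate (assumed)
    return float(sims.max())  # pooling: keep the best candidate (assumed max)

def rank_shots(query_desc, shots):
    """Rank shot ids, best first. shots: dict shot id -> (n, dim) array."""
    scores = {sid: shot_score(query_desc, feats) for sid, feats in shots.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The same two functions would be run once per network (CaffeNet, Fast R-CNN, SDS), each with its own features, giving one ranking per network.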
44. as a reminder...
Selective Search bounding boxes → Fast R-CNN
Uijlings et al. (Trento), Selective Search for Object Recognition (2013)
MCG segments → SDS
Arbeláez et al. (Berkeley), Multiscale Combinatorial Grouping (2014)
55. about the results
● Although it does not outperform CaffeNet, SDS is good for localization!
conclusions
maybe more suitable for TRECVid localization task?
56. about fine-tuning
● Networks trained on objects, but not on the objects to retrieve
fine-tuning on a larger dataset is clearly the next step
57. about object candidates
● Using only 100 candidates decreases the likelihood of success
... but using a higher number
Fast SDS would be the key
66. interactive: Multi-image aggregation
The query images of a topic were used by taking the minimum distance to each shot.
The best option with SIFT-BoW is averaging, whether of features (Avg-Pooling) or of similarity scores (Sim-Avg).
annex
Zhu et al. (NII), Multi-image aggregation for better visual object retrieval (2014)
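The three aggregation strategies named on this slide can be compared with a small sketch. It is illustrative only: Euclidean distance for the min-distance variant and cosine similarity for the two averaging variants are assumptions, and the function names are hypothetical rather than taken from Zhu et al.

```python
import numpy as np

def l2_normalize(x):
    # Scale vectors (along the last axis) to unit length
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def min_distance(query_feats, shot_feat):
    """Smallest distance between any query image and the shot (lower is better)."""
    return float(np.linalg.norm(query_feats - shot_feat, axis=1).min())

def avg_pooling(query_feats, shot_feat):
    """Avg-Pooling: average the query features first, then compare once."""
    pooled = l2_normalize(query_feats.mean(axis=0))
    return float(pooled @ l2_normalize(shot_feat))

def sim_avg(query_feats, shot_feat):
    """Sim-Avg: compare every query image, then average the similarities."""
    sims = l2_normalize(query_feats) @ l2_normalize(shot_feat)
    return float(sims.mean())
```

Here `query_feats` is an `(n_queries, dim)` array holding one feature per query image of a topic, and `shot_feat` is a single `(dim,)` feature for the shot.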