NetVLAD:
CNN architecture for weakly
supervised place recognition
(CVPR2016)
2016. 06. 09.
Vision&Language
조근희
Visual place recognition
• significant amount of attention in the past years
• computer vision, robotics communities
• motivated by, e.g., applications in autonomous driving,
augmented reality or geo-localizing archival imagery
• still remains extremely challenging
• How can we recognize the same street-corner in the entire
city or on the scale of the entire country despite the fact it
can be captured in different illuminations or change its
appearance over time?
• traditionally cast as an instance retrieval task
• the query image location is estimated using the locations of
the most visually similar images obtained by querying the
large geotagged database
• local invariant features, bag-of-visual-words, VLAD, Fisher vector
• convolutional neural networks
In this work
• investigate whether the performance gap left by
off-the-shelf CNN features on instance-level recognition
can be bridged by CNN representations developed and
trained directly for place recognition
• three main challenges
• First, what is a good CNN architecture for place
recognition?
• Second, how to gather a sufficient amount of
annotated data for training?
• Third, how can we train the developed
architecture in an end-to-end manner tailored for
the place recognition task?
CNN architecture for place
recognition
• convolutional neural network architecture for place
recognition
• aggregates mid-level (conv5) convolutional features
extracted from the entire image into a compact single
vector representation amenable to efficient indexing
• new trainable generalized VLAD layer, NetVLAD
• inspired by the Vector of Locally Aggregated Descriptors
(VLAD) representation that has shown excellent
performance in image retrieval and place recognition
• pluggable into any CNN architecture and amenable
to training via backpropagation.
• aggregated representation is then compressed using
Principal Component Analysis (PCA) to obtain the
final compact descriptor of the image
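The compression step can be sketched as follows. This is a minimal NumPy illustration, assuming PCA with whitening followed by L2-normalization; the function names `fit_pca_whitening` and `apply_pca_whitening` are hypothetical and not taken from the slides.

```python
import numpy as np

def fit_pca_whitening(X, out_dim):
    """Fit PCA whitening on training descriptors X of shape (n, dim)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centred data.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = (S ** 2) / (len(X) - 1)
    # Scale each kept direction so projected data has unit variance.
    P = Vt[:out_dim] / np.sqrt(eigvals[:out_dim, None] + 1e-12)
    return mean, P

def apply_pca_whitening(x, mean, P):
    """Project descriptors and L2-normalize the result (assumed final step)."""
    y = (x - mean) @ P.T
    return y / np.maximum(np.linalg.norm(y, axis=-1, keepdims=True), 1e-12)
```

In practice `out_dim` would be the target size of the compact descriptor, e.g. the 256-D mentioned later in the slides.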
Annotated data for the
training
• to train the architecture for place recognition
• we gather a large dataset of multiple panoramic
images depicting the same place from different
viewpoints over time from the Google Street
View Time Machine
• such data is available, but provides only a weak
form of supervision:
• we know the two panoramas are captured at
approximately similar positions based on their
(noisy) GPS but we don’t know which parts of the
panoramas depict the same parts of the scene
How can we train
• learning procedure for place recognition
• learns parameters of the architecture in an end-to-end
manner tailored for the place recognition task from the
weakly labelled Time Machine imagery
• resulting representation is robust to changes
in viewpoint and lighting conditions
• it simultaneously learns to focus on the relevant parts
of the image, such as building facades and the skyline,
while ignoring confusing elements such as cars and people
that may occur at many different places
Method overview
• place recognition as image retrieval
• query image with unknown location is used to visually
search a large geotagged image database
• locations of top ranked images are used as
suggestions for the location of the query
• notations
• a function f: acts as the “image representation
extractor”
• Ii: a given image
• f(Ii): fixed size vector
• {Ii}: the entire database
• f(q): the query image representation
• simply finding the nearest database image to
the query
• either exactly or through fast approximate
nearest neighbor search
• by sorting images based on the Euclidean
distance d(q, Ii) between f(q) and f(Ii)
• representation is parametrized with a set of
parameters θ and we emphasize this fact by
referring to it as fθ(I)
• Euclidean distance dθ(Ii, Ij) = ||fθ(Ii) − fθ(Ij)||
also depends on the same parameters
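The retrieval step above can be sketched in a few lines. This is a minimal NumPy illustration using exact (brute-force) search; `rank_database` is a hypothetical helper name, and the descriptors are toy values rather than real network outputs.

```python
import numpy as np

def rank_database(query_vec, db_vecs):
    """Sort database images by Euclidean distance to the query descriptor."""
    # d(q, I_i) = ||f(q) - f(I_i)|| for every database image i
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)
    order = np.argsort(dists)
    return order, dists[order]

# Toy example: three database descriptors f(I_i) and one query descriptor f(q).
db = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])
q = np.array([0.0, 1.0])
order, dists = rank_database(q, db)
```

For large databases the exact sort would typically be replaced by an approximate nearest neighbor index, as the slide notes.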
Proposed CNN architecture fθ
• most image retrieval pipelines
• (i) extracting local descriptors
• (ii) pooled in an orderless manner
• motivations
• robustness to lighting and viewpoint changes is
provided by the descriptors themselves
• scale invariance is ensured through extracting
descriptors at multiple scales
• for step (i), we crop the CNN at the last
convolutional layer and view it as a dense
descriptor extractor.
• the output of the last convolutional layer is an
H × W × D map which can be considered as a set of
D-dimensional descriptors extracted at H × W spatial
locations
• for step (ii) we design a new pooling layer
inspired by the Vector of Locally Aggregated
Descriptors (VLAD)
• pools the extracted descriptors into a fixed image
representation and its parameters are learnable via
back-propagation
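Viewing the feature map as a set of local descriptors amounts to a reshape. A minimal NumPy sketch, assuming a channels-last H × W × D layout as written in the slide; the sizes are toy values chosen for illustration.

```python
import numpy as np

H, W, D = 4, 5, 8  # toy sizes, not the real conv5 dimensions
feature_map = np.random.default_rng(0).normal(size=(H, W, D))

# Each spatial location (h, w) contributes one D-dimensional descriptor,
# giving N = H * W descriptors in total.
descriptors = feature_map.reshape(H * W, D)
```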
NetVLAD: A Generalized
VLAD layer (fVLAD)
• Vector of Locally Aggregated Descriptors (VLAD)
• a popular descriptor pooling method for both
instance-level retrieval and image classification
• notations
• {xi}: N D-dimensional local image descriptors given as input
• {ck}: K cluster centres (“visual words”) as VLAD parameters
• V: the output VLAD image representation (K×D-dimensional)
• xi(j): the j-th dimension of the i-th descriptor
• ck(j): the j-th dimension of the k-th cluster centre
• ak(xi): the membership of the descriptor xi to the k-th visual word
• matrix V is then L2-normalized column-wise, converted
into a vector, and finally L2-normalized in its entirety
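With the notation above, the VLAD descriptor is V(j, k) = Σ_i ak(xi) (xi(j) − ck(j)), where ak(xi) is 1 if ck is the closest centre to xi and 0 otherwise. A minimal NumPy sketch of this hard-assignment version follows; the function name `vlad` is hypothetical, and V is stored with one row per cluster, so the column-wise normalization of the slide becomes a row-wise one here.

```python
import numpy as np

def vlad(x, c):
    """x: (N, D) descriptors, c: (K, D) centres -> (K*D,) VLAD vector."""
    # Hard assignment: each descriptor belongs to its closest centre only.
    sq_dists = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    nearest = sq_dists.argmin(axis=1)
    K, D = c.shape
    V = np.zeros((K, D))
    for k in range(K):
        members = x[nearest == k]
        if len(members):
            V[k] = (members - c[k]).sum(axis=0)  # sum of residuals
    # L2-normalize each cluster's residual sum, then the whole vector.
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    v = V.reshape(-1)
    return v / max(np.linalg.norm(v), 1e-12)
```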
• to construct a layer amenable to training via
backpropagation, it is required that the layer’s
operation is differentiable with respect to all its
parameters and the input
• the hard assignment ak(xi) of VLAD is the source of
discontinuity, so we replace it with a soft assignment
of descriptors to multiple clusters
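The soft assignment can be written out as below. This is a reconstruction under the assumption that the weight decays exponentially with squared distance to each centre; α is a positive constant controlling that decay and does not appear in the slide text. Expanding the squares cancels the ‖x_i‖² term, which gives the softmax form with w_k and b_k.

```latex
\bar{a}_k(x_i)
  = \frac{e^{-\alpha \lVert x_i - c_k \rVert^2}}
         {\sum_{k'} e^{-\alpha \lVert x_i - c_{k'} \rVert^2}}
  = \frac{e^{w_k^{\top} x_i + b_k}}
         {\sum_{k'} e^{w_{k'}^{\top} x_i + b_{k'}}},
\qquad
w_k = 2\alpha c_k, \quad b_k = -\alpha \lVert c_k \rVert^2
```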
• final form of the NetVLAD layer
• obtained by plugging the soft-assignment (3) into
the VLAD descriptor (1)
• where {wk}, {bk} and {ck} are sets of trainable
parameters for each cluster k
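Written out, the layer computes V(j, k) = Σ_i softmax_k(wk·xi + bk) (xi(j) − ck(j)). A minimal NumPy sketch of the forward pass only is given here, under the assumption that normalization follows the same two steps as for VLAD; `netvlad_forward` is a hypothetical name, and a trainable version would live in an autodiff framework.

```python
import numpy as np

def netvlad_forward(x, w, b, c):
    """x: (N, D); w: (K, D); b: (K,); c: (K, D) -> (K*D,) descriptor."""
    logits = x @ w.T + b                          # (N, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # soft assignment per descriptor
    # V[k] = sum_i a[i, k] * (x[i] - c[k])
    V = a.T @ x - a.sum(axis=0)[:, None] * c      # (K, D)
    # L2-normalize per cluster, flatten, then L2-normalize the whole vector.
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    v = V.reshape(-1)
    return v / max(np.linalg.norm(v), 1e-12)
```

Setting w = 2αc and b = −α‖c‖² recovers soft-assigned VLAD for a chosen α; in NetVLAD the three parameter sets are trained independently.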
Learning from Time Machine data
• two main challenges
• (i) how to gather enough annotated training data
• possible to obtain large amounts of weakly labelled
imagery depicting the same places over time from the
Google Street View Time Machine
• (ii) what is the appropriate loss for the place
recognition task
• we will design a new weakly supervised triplet ranking
loss that can deal with the incomplete and noisy
position annotations of the Street View Time Machine
imagery
Weak supervision from the
Time Machine
• Google Street View Time Machine
• provides multiple street-level panoramic images taken
at different times at close-by spatial locations on the
map
• precious for learning an image representation for
place recognition
• The same locations are depicted at different times and
seasons, providing the learning algorithm with crucial
information it can use to discover which features are
useful or distracting, and which changes the image
representation should be invariant to, in order to
achieve good place recognition performance
• provides only incomplete and noisy supervision
• for a given training query q
• {p_i^q}: potential positives
• {n_j^q}: definite negatives
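Splitting the database into potential positives and definite negatives from GPS tags can be sketched as follows. This is an illustration only: the function name `split_by_gps` and both radii are assumptions chosen for the example, and positions are taken to be in a local metric frame (metres).

```python
import numpy as np

def split_by_gps(query_xy, db_xy, pos_radius_m=10.0, neg_radius_m=25.0):
    """Return indices of potential positives and definite negatives."""
    dists = np.linalg.norm(db_xy - query_xy, axis=1)
    # Close-by images MAY show the same scene (they can face another way).
    potential_pos = np.flatnonzero(dists <= pos_radius_m)
    # Far-away images certainly do not depict the query's place.
    definite_neg = np.flatnonzero(dists > neg_radius_m)
    return potential_pos, definite_neg
```

Images between the two radii are left out of both sets in this sketch.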
Weakly supervised triplet
ranking loss
• to learn a representation fθ that will optimize
place recognition performance
• goal is to rank a database image Ii∗ from a close-by
location higher than all other far away images
Ii in the database
• we wish the Euclidean distance dθ(q, I) between the
query q and a close-by image Ii∗ to be smaller than the
distance to far away images in the database Ii
• i.e. dθ(q, Ii∗) < dθ(q, Ii), for all images Ii further than a certain
distance from the query on the map.
• next we show how this requirement can be translated
into a ranking loss between training triplets {q, Ii∗, Ii}
• from the Google Street View Time Machine data, we
obtain a training dataset of tuples (q, {p_i^q}, {n_j^q})
• where for each training query image q we have a set of
potential positives {p_i^q} and a set of definite negatives {n_j^q}
• where l is the hinge loss l(x) = max(x, 0), and m is a
constant parameter giving the margin
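The loss equation itself is not present in the slide text. A sketch, assuming it takes the form Lθ = Σ_j l( min_i d²θ(q, potential positive i) + m − d²θ(q, definite negative j) ): only the best-matching potential positive is used, and every negative that is not at least a margin m further away contributes. The name `weak_triplet_loss` and the default margin value are illustrative assumptions.

```python
import numpy as np

def weak_triplet_loss(f_q, f_pos, f_neg, margin=0.1):
    """f_q: (D,), f_pos: (P, D), f_neg: (M, D) image representations."""
    d2_pos = ((f_pos - f_q) ** 2).sum(axis=1)  # squared distances to positives
    d2_neg = ((f_neg - f_q) ** 2).sum(axis=1)  # squared distances to negatives
    best_pos = d2_pos.min()                    # best-matching potential positive
    # Hinge l(x) = max(x, 0), summed over all definite negatives.
    return np.maximum(best_pos + margin - d2_neg, 0.0).sum()
```

Taking the minimum over potential positives is what makes the supervision weak: the loss does not need to know which of the close-by images actually shows the same scene.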
Datasets and evaluation
methodology
• dataset
• Pittsburgh (Pitts250k): contains 250k database
images downloaded from Google Street View
and 24k test queries generated from Street View
• Tokyo 24/7 [79]: contains 76k database images
and 315 query images taken using mobile phone
cameras
• extremely challenging dataset
• TokyoTM: Tokyo 24/7 (= test) and TokyoTM
train/val are all geographically disjoint
• evaluation metric
• The query image is deemed correctly localized if at
least one of the top N retrieved database images is
within d = 25 meters from the ground truth position
of the query
• The percentage of correctly recognized queries (Recall)
is then plotted for different values of N
• For Tokyo 24/7 we follow [79] and perform spatial
non-maximal suppression on ranked database images before
evaluation.
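The Recall@N metric described above can be sketched as follows. This is a minimal NumPy illustration; `recall_at_n` is a hypothetical name, positions are assumed to be in metres in a local frame, and the spatial non-maximal suppression step used for Tokyo 24/7 is not included.

```python
import numpy as np

def recall_at_n(ranked_db_xy, query_xy, n_values=(1, 5, 10), max_dist_m=25.0):
    """ranked_db_xy: (Q, R, 2) positions of the top-R results per query.
    query_xy: (Q, 2) ground-truth query positions."""
    dists = np.linalg.norm(ranked_db_xy - query_xy[:, None, :], axis=2)  # (Q, R)
    hits = dists <= max_dist_m
    # A query counts as localized if any of its top-N results is a hit.
    return {n: float(hits[:, :n].any(axis=1).mean()) for n in n_values}
```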
• implementation details:
• two base architectures, AlexNet and VGG-16, each
extended with Max pooling (fmax) or our NetVLAD (fVLAD) layer
• both are cropped at the last convolutional layer
(conv5), before ReLU
Results and discussion
• “off-the-shelf” networks
• base network cropped at conv5
• Max pooling (fmax), VLAD (fVLAD)
• AlexNet, VGG-16 (pretrained for ImageNet),
Places205
• the state-of-the-art local feature based
compact descriptor
• VLAD pooling with intra-normalization on top of
densely extracted RootSIFTs
• Dimensionality reduction
• Which layers should be trained?
• Importance of Time Machine training
• Qualitative evaluation
Image retrieval
• our best performing network (VGG-16, fVLAD
with whitening down to 256-D) trained
completely on Pittsburgh
• to extract image representations for standard object
and image retrieval benchmarks (Oxford 5k, Paris 6k,
Holidays)
Three principal contributions
• convolutional neural network (CNN)
architecture
• trainable in an end-to-end manner directly for
place recognition
• pluggable into any CNN architecture and
amenable to training via backpropagation
• training procedure, based on a new weakly
supervised ranking loss
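• to learn parameters of the architecture in an
end-to-end manner
• large improvement over non-learnt image representations
and off-the-shelf CNN descriptors on place recognition benchmarks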
Thank you

NetVLAD: CNN architecture for weakly supervised place recognition

  • 1. NetVLAD: CNN architecture for weakly supervised place recognition (CVPR2016) 2016. 06. 09. Vision&Language 조근희
  • 2. Visual place recognition • significant amount of attention in the past years • computer vision, robotics communities • motivated by, e.g., applications in autonomous driving, augmented reality or geo-localizing archival imagery • still remains extremely challenging • How can we recognize the same street-corner in the entire city or on the scale of the entire country despite the fact it can be captured in different illuminations or change its appearance over time? • traditionally cast as an instance retrieval task • the query image location is estimated using the locations of the most visually similar images obtained by querying the large geotagged database • local invariant features,bag-of-visual-words,VLAD, fisher vector • convolution neural network
  • 3. In this work • investigate whether this gap in performance can be bridged by CNN representations developed and trained directly for place recognition • three main challenges • First, what is a good CNN architecture for place recognition? • Second, how to gather sufficient amount of annotated data for the training? • Third, how can we train the developed architecture in an end-to-end manner tailored for the place recognition task?
  • 4. CNN architecture for place recognition • convolutional neural network architecture for place recognition • aggregates mid-level (conv5) convolutional features extracted from the entire image into a compact single vector representation amenable to efficient indexing • new trainable generalized VLAD layer, NetVLAD • inspired by the Vector of Locally Aggregated Descriptors (VLAD) representation that has shown excellent performance in image retrieval and place recognition • pluggable into any CNN architecture and amenable to training via backpropagation. • aggregated representation is then compressed using Principal Component Analysis (PCA) to obtain the final compact descriptor of the image
  • 5. Annotated data for the training • to train the architecture for place recognition • we gather a large dataset of multiple panoramic images depicting the same place from different viewpoints over time from the Google Street View Time Machine • such data is available, but provides only weak form of supervision: • we know the two panoramas are captured at approximately similar positions based on their (noisy) GPS but we don’t know which parts of the panoramas depict the same parts of the scene
  • 6. How can we train • learning procedure for place recognition • learns parameters of the architecture in an end- to-end manner tailored for the place recognition task from the weakly labelled Time Machine imagery • resulting representation is robust to changes in viewpoint and lighting conditions • while simultaneously learns to focus on the relevant parts of the image such as the building facades and the skyline, while ignoring confusing elements such as cars and people that may occur at many different places
  • 7. Method overview • place recognition as image retrieval • query image with unknown location is used to visually search a large geotagged image database • locations of top ranked images are used as suggestions for the location of the query • notations • a function f: acts as the “image representation extractor” • Ii : given an image • f(Ii): fixed size vector • {Ii}: the entire database • f(q): the query image representation
  • 8. • simply finding the nearest database image to the query • either exactly or through fast approximate nearest neighbor search • by sorting images based on the Euclidean distance d(q, Ii) between f(q) and f(Ii) • representation is parametrized with a set of parameters θ and we emphasize this fact by referring to it as fθ(I) • euclidean distance dθ(Ii , Ij ) = ||fθ(Ii) − fθ(Ij)|| also depends on the same parameters
  • 9. Proposed CNN architecture fθ • most image retrieval pipelines • (i) extracting local descriptors • (ii) pooled in an orderless manner • motivations • robustness to lighting and viewpoint changes are provided by the descriptors themselves • scale invariance is ensured through extracting descriptors at multiple scales
  • 10. • for step (i), we crop the CNN at the last convolutional layer and view it as a dense descriptor extractor. • the output of the last convolutional layer is a H × W × D map which can be considered as a set of D- dimensional descriptors extracted at H × W spatial locations • for step (ii) we design a new pooling layer inspired by the Vector Locally Aggregated Descriptors (VLAD) • pools the extracted descriptors into a fixed image representation and its parameters are learnable via back-propagation
  • 11. NetVLAD: A Generalized VLAD layer (fVLAD) • Vector of Locally Aggregated Descriptors (VLAD) • popular descriptor pooling methods for both instance level retrieval and image classification
  • 12. • notations • {xi} : given N D-dimensional local image descriptors as input • {ck}: K cluster centres (“visual words”) as VLAD parameters • V: the output VLAD image representation(K×D-dimensional) • xi(j): the j-th dimensions of the i-th descriptor • ck(j): k-th cluster centre • ak(xi): the membership of the descriptor xi to k-th visual word • matrix V is then L2-normalized column-wise, converted into a vector, and finally L2-normalized in its entirety
  • 13. • to construct a layer amenable to training via backpropagation, it is required that the layer’s operation is differentiable with respect to all its parameters and the input • we replace it with soft assignment of descriptors to multiple clusters
  • 14. • final form of the NetVLAD layer • obtained by plugging the soft-assignment (3) into the VLAD descriptor (1) resulting in • where {wk}, {bk} and {ck} are sets of trainable parameters for each cluster k
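The formula on this slide did not survive extraction. In the form suggested by the listed parameters, the soft assignment is a softmax over the scores wk·xi + bk, and the layer output is V(j, k) = Σi softmaxk(wk·xi + bk) (xi(j) − ck(j)). The sketch below is a forward pass only, in plain Python, and assumes the same two normalization steps as in standard VLAD; the function names and toy values are assumptions for illustration.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def netvlad(descriptors, w, b, centres):
    """Soft-assignment VLAD with per-cluster parameters w[k], b[k], centres[k]."""
    num_clusters, dim = len(centres), len(centres[0])
    residuals = [[0.0] * dim for _ in range(num_clusters)]
    for x in descriptors:
        scores = [sum(w[k][j] * x[j] for j in range(dim)) + b[k]
                  for k in range(num_clusters)]
        weights = softmax(scores)  # soft membership of x in every cluster
        for k in range(num_clusters):
            for j in range(dim):
                residuals[k][j] += weights[k] * (x[j] - centres[k][j])
    residuals = [l2_normalize(r) for r in residuals]
    return l2_normalize([v for r in residuals for v in r])

# With zero weights and biases every descriptor is shared equally by all clusters.
out = netvlad([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0],
              [[0.0, 0.0], [1.0, 1.0]])
```

Because every operation here is differentiable, a framework implementation of the same computation can learn {wk}, {bk} and {ck} by backpropagation.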
  • 17. Learning from Time Machine data • two main challenges • (i) how to gather enough annotated training data • possible to obtain large amounts of weakly labelled imagery depicting the same places over time from the Google Street View Time Machine • (ii) what is the appropriate loss for the place recognition task • we design a new weakly supervised triplet ranking loss that can deal with the incomplete and noisy position annotations of the Street View Time Machine imagery
  • 18. Weak supervision from the Time Machine • Google Street View Time Machine • provides multiple street-level panoramic images taken at different times at close-by spatial locations on the map • precious for learning an image representation for place recognition • the same locations are depicted at different times and seasons, providing the learning algorithm with crucial information it can use to discover which features are useful or distracting, and what changes the image representation should be invariant to, in order to achieve good place recognition performance • provides only incomplete and noisy supervision • for a given training query q • {p^q_i}: potential positives • {n^q_j}: definite negatives
  • 20. Weakly supervised triplet ranking loss • to learn a representation fθ that will optimize place recognition performance • goal is to rank a database image Ii∗ from a close-by location higher than all other far away images Ii in the database • we wish the Euclidean distance dθ(q, Ii∗) between the query q and a close-by image Ii∗ to be smaller than the distance to far away images Ii in the database • i.e. dθ(q, Ii∗) < dθ(q, Ii), for all images Ii further than a certain distance from the query on the map • next we show how this requirement can be translated into a ranking loss between training triplets {q, Ii∗, Ii}
  • 21. • from the Google Street View Time Machine data, we obtain a training dataset of tuples (q, {p^q_i}, {n^q_j}) • where for each training query image q we have a set of potential positives {p^q_i} and the set of definite negatives {n^q_j} • where l is the hinge loss l(x) = max(x, 0), and m is a constant parameter giving the margin
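The loss formula itself is missing from this slide. A form consistent with the surviving text takes the best-matching potential positive (the one closest to the query) and penalizes every definite negative that is not at least a margin m further away: L = Σj l(mini d²(q, p^q_i) + m − d²(q, n^q_j)). The sketch below assumes squared Euclidean distances and uses hypothetical function names; it computes the loss value only, without gradients.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def weakly_supervised_ranking_loss(f_q, f_positives, f_negatives, margin):
    """Hinge ranking loss for one tuple (q, potential positives, definite negatives)."""
    # Only the closest potential positive is used, since the others
    # may not actually depict the same place as the query.
    best_positive = min(squared_distance(f_q, p) for p in f_positives)
    return sum(max(best_positive + margin - squared_distance(f_q, n), 0.0)
               for n in f_negatives)

loss = weakly_supervised_ranking_loss(
    f_q=[0.0, 0.0],
    f_positives=[[1.0, 0.0], [3.0, 0.0]],
    f_negatives=[[2.0, 0.0], [0.5, 0.0]],
    margin=0.1,
)
```

Taking the minimum over potential positives is what lets the loss tolerate the noisy position annotations: a positive that looks nothing like the query simply does not contribute.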
  • 22. Datasets and evaluation methodology • datasets • Pittsburgh (Pitts250k): contains 250k database images downloaded from Google Street View and 24k test queries generated from Street View • Tokyo 24/7 [79]: contains 76k database images and 315 query images taken using mobile phone cameras • extremely challenging dataset • TokyoTM: Tokyo 24/7 (= test) and TokyoTM train/val are all geographically disjoint
  • 23. • evaluation metric • the query image is deemed correctly localized if at least one of the top N retrieved database images is within d = 25 meters from the ground truth position of the query • the percentage of correctly recognized queries (Recall) is then plotted for different values of N • for Tokyo 24/7 we follow [79] and perform spatial non-maximal suppression on ranked database images before evaluation • implementation details: • Max pooling (fmax) and our NetVLAD (fVLAD) layers: AlexNet and VGG-16 • both are cropped at the last convolutional layer (conv5), before ReLU
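The Recall@N metric described above can be sketched as follows. The example assumes positions are already expressed as planar (x, y) coordinates in meters, which is a simplification made here; the function name and the toy data are hypothetical.

```python
import math

def recall_at_n(ranked_positions, query_positions, n, d=25.0):
    """Fraction of queries with at least one top-n retrieved image within d meters."""
    hits = 0
    for ranked, (qx, qy) in zip(ranked_positions, query_positions):
        # ranked holds database image positions, best match first.
        if any(math.hypot(px - qx, py - qy) <= d for px, py in ranked[:n]):
            hits += 1
    return hits / len(query_positions)

queries = [(0.0, 0.0), (0.0, 0.0)]
retrieved = [
    [(100.0, 0.0), (10.0, 0.0)],  # correct match appears at rank 2
    [(50.0, 0.0), (60.0, 0.0)],   # no retrieved image within 25 m
]
recall_1 = recall_at_n(retrieved, queries, n=1)
recall_2 = recall_at_n(retrieved, queries, n=2)
```

Evaluating the function for increasing n gives the points of the recall curve that the slides plot.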
  • 24. Results and discussion • “off-the-shelf” networks • base network cropped at conv5 • Max pooling (fmax), VLAD (fVLAD) • AlexNet, VGG-16 (pretrained on ImageNet), Places205 • the state-of-the-art local feature based compact descriptor • VLAD pooling with intra-normalization on top of densely extracted RootSIFTs • Dimensionality reduction
  • 27. • Which layers should be trained?
  • 28. • Importance of Time Machine training
  • 30. Image retrieval • our best performing network (VGG-16, fVLAD with whitening down to 256-D), trained completely on Pittsburgh • is used to extract image representations for standard object and image retrieval benchmarks (Oxford 5k, Paris 6k, Holidays)
  • 31. Three principal contributions • convolutional neural network (CNN) architecture • trainable in an end-to-end manner directly for place recognition • pluggable into any CNN architecture and amenable to training via backpropagation • training procedure, based on a new weakly supervised ranking loss • to learn parameters of the architecture in an end-to-end manner • large improvement