NetVLAD:
CNN architecture for weakly
supervised place recognition
(CVPR2016)
2016. 06. 09.
Vision&Language
조근희
Visual place recognition
• has received significant attention in recent years in the computer vision and robotics communities
• motivated by, e.g., applications in autonomous driving,
augmented reality or geo-localizing archival imagery
• still remains extremely challenging
• How can we recognize the same street corner within an entire city, or at the scale of an entire country, despite the fact that it can be captured under different illumination or change its appearance over time?
• traditionally cast as an instance retrieval task
• the query image location is estimated using the locations of the most visually similar images obtained by querying a large geotagged database
• local invariant features, bag-of-visual-words, VLAD, Fisher vectors
• convolutional neural networks
In this work
• investigate whether this gap in performance
can be bridged by CNN representations
developed and trained directly for place
recognition
• three main challenges
• First, what is a good CNN architecture for place
recognition?
• Second, how to gather a sufficient amount of
annotated data for training?
• Third, how can we train the developed
architecture in an end-to-end manner tailored for
the place recognition task?
CNN architecture for place
recognition
• convolutional neural network architecture for place
recognition
• aggregates mid-level (conv5) convolutional features
extracted from the entire image into a compact single
vector representation amenable to efficient indexing
• new trainable generalized VLAD layer, NetVLAD
• inspired by the Vector of Locally Aggregated Descriptors
(VLAD) representation that has shown excellent
performance in image retrieval and place recognition
• pluggable into any CNN architecture and amenable
to training via backpropagation.
• aggregated representation is then compressed using
Principal Component Analysis (PCA) to obtain the
final compact descriptor of the image
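
A minimal sketch of this PCA compression step, assuming the aggregated K×D NetVLAD vectors are stacked row-wise in a matrix (the matrix name and sizes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for aggregated NetVLAD descriptors: one K*D-dim row per image
vlad_matrix = np.random.randn(1000, 64 * 512).astype(np.float32)

# Whitened PCA down to a compact descriptor (cf. the 256-D variant used later)
pca = PCA(n_components=256, whiten=True)
compact = pca.fit_transform(vlad_matrix)   # shape: (1000, 256)
```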
Annotated data for the
training
• to train the architecture for place recognition
• we gather a large dataset of multiple panoramic
images depicting the same place from different
viewpoints over time from the Google Street
View Time Machine
• such data is available, but it provides only a weak
form of supervision:
• we know that two panoramas were captured at
approximately the same position based on their
(noisy) GPS, but we do not know which parts of the
panoramas depict the same parts of the scene
How can we train?
• learning procedure for place recognition
• learns parameters of the architecture in an end-
to-end manner tailored for the place recognition
task from the weakly labelled Time Machine
imagery
• resulting representation is robust to changes
in viewpoint and lighting conditions
• while simultaneously learning to focus on the
relevant parts of the image, such as building
facades and the skyline, and ignoring confusing
elements such as cars and people that may occur
at many different places
Method overview
• place recognition as image retrieval
• query image with unknown location is used to visually
search a large geotagged image database
• locations of top ranked images are used as
suggestions for the location of the query
• notations
• f: a function that acts as the “image representation
extractor”
• Ii: a given image
• f(Ii): a fixed-size vector representing Ii
• {Ii}: the entire database
• f(q): the query image representation
• simply finding the nearest database image to
the query
• either exactly or through fast approximate
nearest neighbor search
• by sorting images based on the Euclidean
distance d(q, Ii) between f(q) and f(Ii)
• representation is parametrized with a set of
parameters θ and we emphasize this fact by
referring to it as fθ(I)
• the Euclidean distance dθ(Ii, Ij) = ||fθ(Ii) − fθ(Ij)||
then also depends on the same parameters
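
A minimal sketch of this nearest-neighbour retrieval, assuming the database representations fθ(Ii) are precomputed and stacked in a matrix (all names are illustrative):

```python
import numpy as np

# Stand-in representations: one row per database image, plus a query vector
db_vecs = np.random.randn(10000, 256).astype(np.float32)   # f_theta(I_i)
q_vec = np.random.randn(256).astype(np.float32)            # f_theta(q)

# Euclidean distances d_theta(q, I_i) = ||f(q) - f(I_i)||, then rank the database
dists = np.linalg.norm(db_vecs - q_vec, axis=1)
ranked = np.argsort(dists)   # database indices, nearest first
top_n = ranked[:10]          # their locations serve as suggestions for q's location
```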
Proposed CNN architecture fθ
• most image retrieval pipelines are based on
• (i) extracting local descriptors, which are then
• (ii) pooled in an orderless manner
• motivations
• robustness to lighting and viewpoint changes is
provided by the descriptors themselves
• scale invariance is ensured through extracting
descriptors at multiple scales
• for step (i), we crop the CNN at the last
convolutional layer and view it as a dense
descriptor extractor.
• the output of the last convolutional layer is a H × W ×
D map which can be considered as a set of D-
dimensional descriptors extracted at H × W spatial
locations
• for step (ii), we design a new pooling layer
inspired by the Vector of Locally Aggregated
Descriptors (VLAD)
• it pools the extracted descriptors into a fixed-size image
representation, and its parameters are learnable via
back-propagation
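
A minimal sketch of step (i), cropping VGG-16 at the last convolutional layer (before ReLU) and viewing the output map as local descriptors; this assumes a recent torchvision, and the input size is illustrative:

```python
import torch
import torchvision.models as models

# Crop a pretrained VGG-16 at the last convolutional layer, before its ReLU
vgg = models.vgg16(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-2])

image = torch.randn(1, 3, 480, 640)      # a stand-in input image
conv5 = backbone(image)                  # (1, D, H, W); here D = 512
_, D, H, W = conv5.shape

# The H x W x D map as a set of N = H*W D-dimensional local descriptors
descriptors = conv5.squeeze(0).reshape(D, H * W).t()   # shape: (N, D)
```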
NetVLAD: A Generalized
VLAD layer (fVLAD)
• Vector of Locally Aggregated Descriptors (VLAD)
• a popular descriptor pooling method for both instance-level
retrieval and image classification
• notations
• {xi}: N D-dimensional local image descriptors given as input
• {ck}: K cluster centres (“visual words”), the VLAD parameters
• V: the output VLAD image representation (K×D-dimensional)
• xi(j): the j-th dimension of the i-th descriptor
• ck(j): the j-th dimension of the k-th cluster centre
• ak(xi): the membership of descriptor xi to the k-th visual word
• the matrix V is then L2-normalized column-wise, converted
into a vector, and finally L2-normalized in its entirety
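
In this notation, the hard-assignment VLAD aggregation (equation (1) in the paper) accumulates, for each cluster, the residuals of the descriptors assigned to it:

V(j,k) = \sum_{i=1}^{N} a_k(x_i) \left( x_i(j) - c_k(j) \right)

with ak(xi) = 1 if ck is the closest cluster centre to xi and 0 otherwise.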
• to construct a layer amenable to training via
backpropagation, it is required that the layer’s
operation is differentiable with respect to all its
parameters and the input
• the hard assignment ak(xi) is not differentiable, so we
replace it with a soft assignment of descriptors
to multiple clusters
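
Concretely, the soft assignment (equations (2)–(3) in the paper) is a softmax over distances to the cluster centres; expanding the squares and cancelling the common e^{-\alpha \lVert x_i \rVert^2} term gives a form linear in xi:

\bar{a}_k(x_i) = \frac{e^{-\alpha \lVert x_i - c_k \rVert^2}}{\sum_{k'} e^{-\alpha \lVert x_i - c_{k'} \rVert^2}} = \frac{e^{w_k^\top x_i + b_k}}{\sum_{k'} e^{w_{k'}^\top x_i + b_{k'}}}, \qquad w_k = 2\alpha c_k, \quad b_k = -\alpha \lVert c_k \rVert^2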
• final form of the NetVLAD layer
• obtained by plugging the soft-assignment (3) into
the VLAD descriptor (1), resulting in

V(j,k) = \sum_{i=1}^{N} \frac{e^{w_k^\top x_i + b_k}}{\sum_{k'} e^{w_{k'}^\top x_i + b_{k'}}} \left( x_i(j) - c_k(j) \right)

• where {wk}, {bk} and {ck} are sets of trainable
parameters for each cluster k
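
A minimal PyTorch sketch of such a layer, with the soft assignment implemented as a 1×1 convolution and {wk}, {bk}, {ck} decoupled as described above (an illustrative re-implementation, not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Sketch of a NetVLAD pooling layer (illustrative re-implementation)."""

    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.num_clusters = num_clusters
        # Soft assignment as a 1x1 convolution computing w_k^T x_i + b_k
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1, bias=True)
        # Cluster centres c_k, decoupled from {w_k, b_k} for extra flexibility
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):
        # x: (B, D, H, W) feature map from the cropped base network (conv5)
        B, D, H, W = x.shape
        N = H * W

        # soft-assignment weights: softmax over clusters, shape (B, K, N)
        soft_assign = F.softmax(
            self.assign(x).view(B, self.num_clusters, N), dim=1)

        # local descriptors x_i: (B, N, D)
        x_flat = x.view(B, D, N).permute(0, 2, 1)

        # residuals (x_i - c_k) for every descriptor/cluster pair: (B, K, N, D)
        residuals = x_flat.unsqueeze(1) - self.centroids.unsqueeze(0).unsqueeze(2)

        # V(j,k) = sum_i a_k(x_i) * (x_i(j) - c_k(j)): (B, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=2)

        vlad = F.normalize(vlad, p=2, dim=2)   # intra-normalization per cluster
        vlad = vlad.view(B, -1)                # flatten to K*D
        return F.normalize(vlad, p=2, dim=1)   # final global L2 normalization
```

The layer is appended to the conv5 output of the cropped base network and trained via backpropagation together with the rest of the network.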
Learning from time machine
data
• two main challenges
• (i) how to gather enough annotated training data
• it is possible to obtain large amounts of weakly labelled
imagery depicting the same places over time from the
Google Street View Time Machine
• (ii) what is the appropriate loss for the place
recognition task
• we will design a new weakly supervised triplet ranking
loss that can deal with the incomplete and noisy
position annotations of the Street View Time Machine
imagery
Weak supervision from the
Time Machine
• Google Street View Time Machine
• provides multiple street-level panoramic images taken
at different times at close-by spatial locations on the
map
• precious for learning an image representation for
place recognition
• The same locations are depicted at different times and
seasons, providing the learning algorithm with crucial
information it can use to discover which features are
useful or distracting, and what changes the
image representation should be invariant to, in order to
achieve good place recognition performance
• provides only incomplete and noisy supervision
• for a given training query q
• {p_i^q}: potential positives
• {n_j^q}: definite negatives
Weakly supervised triplet
ranking loss
• to learn a representation fθ that will optimize
place recognition performance
• goal is to rank a database image Ii∗ from a close-
by location higher than all other far away images
Ii in the database
• we wish the Euclidean distance dθ(q, Ii∗) between the
query q and a close-by image Ii∗ to be smaller than the
distance dθ(q, Ii) to far-away images Ii in the database
• i.e. dθ(q, Ii∗) < dθ(q, Ii), for all images Ii further than a certain
distance from the query on the map.
• next we show how this requirement can be translated
into a ranking loss between training triplets {q, Ii∗, Ii}
• from the Google Street View Time Machine data, we
obtain a training dataset of tuples (q, {p_i^q}, {n_j^q})
• where for each training query image q we have a set of
potential positives {p_i^q} and a set of definite negatives {n_j^q}
• the weakly supervised triplet ranking loss for such a tuple is

L_\theta = \sum_j l\left( \min_i d_\theta^2(q, p_i^q) + m - d_\theta^2(q, n_j^q) \right)

• where l is the hinge loss l(x) = max(x, 0), and m is a
constant parameter giving the margin
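
A minimal sketch of this loss for one training tuple, assuming the squared distances to the positives and negatives are already computed (names are illustrative):

```python
import torch

def weak_triplet_ranking_loss(d2_pos, d2_neg, margin=0.1):
    """Sketch of the weakly supervised ranking loss for one tuple.

    d2_pos: squared distances d_theta^2(q, p_i) to potential positives, (P,)
    d2_neg: squared distances d_theta^2(q, n_j) to definite negatives, (Nn,)
    """
    # the best-matching potential positive: min_i d^2(q, p_i)
    best_pos = d2_pos.min()
    # hinge l(x) = max(x, 0) applied per negative, then summed over j
    return torch.clamp(best_pos + margin - d2_neg, min=0.0).sum()

# usage with stand-in distances
loss = weak_triplet_ranking_loss(torch.rand(10), torch.rand(1000))
```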
Datasets and evaluation
methodology
• dataset
• Pittsburgh (Pitts250k): contains 250k database
images downloaded from Google Street View
and 24k test queries generated from Street View
• Tokyo 24/7 [79] contains 76k database images
and 315 query images taken using mobile phone
cameras
• an extremely challenging dataset
• TokyoTM: Tokyo 24/7 (the test set) and the TokyoTM
train/val sets are all geographically disjoint
• evaluation metric
• The query image is deemed correctly localized if at
least one of the top N retrieved database images is
within d = 25 meters from the ground truth position
of the query
• The percentage of correctly recognized queries (Recall)
is then plotted for different values of N
• For Tokyo 24/7 we follow [79] and perform spatial non-
maximal suppression on ranked database images before
evaluation.
• implementation details:
• Max pooling (fmax) and our NetVLAD (fVLAD) layers:
AlexNet and VGG-16
• both are cropped at the last convolutional layer
(conv5), before ReLU
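
A minimal sketch of the recall@N evaluation described above, assuming that for each query we already have the geographic distances (in metres) of its ranked retrieved images (names are illustrative):

```python
import numpy as np

def recall_at_n(ranked_geo_dists, n_values=(1, 5, 10), threshold_m=25.0):
    """Fraction of queries with a top-N result within 25 m of ground truth.

    ranked_geo_dists: one array per query with the geographic distances
    (metres) from the query's ground-truth position to the retrieved
    database images, best-ranked first.
    """
    recalls = {}
    for n in n_values:
        hits = sum(1 for d in ranked_geo_dists if (d[:n] <= threshold_m).any())
        recalls[n] = hits / len(ranked_geo_dists)
    return recalls

# usage with stand-in distances for 3 queries
dists = [np.array([12.0, 80.0]), np.array([40.0, 26.0]), np.array([5.0, 3.0])]
print(recall_at_n(dists, n_values=(1, 2)))
```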
Results and discussion
• “off-the-shelf” networks
• base network cropped at conv5
• Max pooling (fmax), VLAD (fVLAD)
• AlexNet, VGG-16 (pretrained on ImageNet), and
Places205
• the state-of-the-art local-feature-based
compact descriptor
• VLAD pooling with intra-normalization on top of
densely extracted RootSIFTs
• Dimensionality reduction
• Which layers should be trained?
• Importance of Time Machine training
• Qualitative evaluation
Image retrieval
• our best performing network (VGG-16, fVLAD
with whitening down to 256-D) trained
completely on Pittsburgh
• to extract image representations for standard object
and image retrieval benchmarks (Oxford 5k, Paris 6k,
Holidays)
Three principal contributions
• convolutional neural network (CNN)
architecture
• trainable in an end-to-end manner directly for the
place recognition task
• pluggable into any CNN architecture and
amenable to training via backpropagation
• training procedure, based on a new weakly
supervised ranking loss
• to learn parameters of the architecture in an end-
to-end manner
• large improvement over non-learnt image representations and off-the-shelf CNN descriptors
Thank you