This document summarizes research on mobile augmented reality from the Stanford-Nokia Collaboration. It describes a landmark recognition system using "bag of words" feature matching. It explores approaches for feature compression, including CHoG (Compressed Histogram of Gradients) descriptors. It discusses using vocabulary trees and forests for large databases and improving accuracy. It also looks at multi-view matching, 3D modeling from images, and streaming augmented reality while minimizing latency. Future research directions include improved features, matching algorithms, and 3D modeling to enable large-scale urban landmark recognition.
Nokia Augmented Reality
Editor's notes
Only a limited number of distinct Huffman trees exist. The Catalan number gives the count of rooted binary trees with ordered leaves and no cross-overs; counting unique permutations of codeword assignments then enumerates the distinct trees.
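As a minimal sketch, the Catalan numbers referenced here can be computed directly from the closed form C(n) = binom(2n, n) / (n + 1):

```python
from math import comb

def catalan(n: int) -> int:
    # Number of distinct rooted binary trees with n internal nodes
    # (equivalently, n + 1 ordered leaves with no cross-overs).
    return comb(2 * n, n) // (n + 1)

# First few Catalan numbers: 1, 1, 2, 5, 14, 42, ...
print([catalan(n) for n in range(6)])
```

Because the count grows only like C(n), the set of admissible Huffman tree shapes is small enough to index compactly, which is what makes tree-coded descriptors cheap to transmit.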
Winder and Brown (Microsoft Research), "Learning Local Image Descriptors": 64x64 patches; training sets of tourist photographs of the Trevi Fountain and Yosemite Valley (920 images), and a test set of Notre Dame images (500 images).
BoostSSC (Boosting Similarity Sensitive Coding): G. Shakhnarovich, P. Viola, and T. Darrell, "Fast Pose Estimation with Parameter Sensitive Hashing," Proc. ICCV, 2003.
Torralba et al., "Small Codes and Large Image Databases for Recognition," CVPR 2009.
Random projections: C. Yeo, P. Ahammad, and K. Ramchandran, "Rate-Efficient Visual Correspondences Using Random Projections," 2008.
Most retrieval applications require NN search in some form. The descriptors for both SIFT and CHoG were computed from the same set of patches. The VQ-5 bin configuration, the GLOH-9 cell configuration, and Huffman tree coding are used for CHoG, resulting in a 45-dimensional descriptor. We observe that exact nearest-neighbor search is 10x faster for CHoG. Furthermore, CHoG is still 2x faster than SIFT with ANN (eps = 1), which incurs a small error rate of 0.30%. The speed-up results from the lower dimensionality of the CHoG descriptor and the use of lookup tables for fast distance computation.
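A minimal sketch of the lookup-table trick, with hypothetical sizes (32 codewords, 9 cells as in the GLOH-9 configuration, 5-bin histograms as in VQ-5); the system's actual distance measure differs, but the mechanism — precompute all codeword-to-codeword distances once, then reduce each descriptor comparison to a few table lookups — is the same:

```python
import numpy as np

# Hypothetical setup: each descriptor cell's gradient histogram is
# quantized to one of a small set of codewords, so a descriptor is a
# short vector of codeword indices rather than raw floats.
rng = np.random.default_rng(0)
num_codewords = 32
codebook = rng.random((num_codewords, 5))          # 5-bin histograms
codebook /= codebook.sum(axis=1, keepdims=True)

# Precompute the pairwise codeword distance table (here L2 for
# simplicity; any histogram distance works the same way).
lut = np.linalg.norm(codebook[:, None, :] - codebook[None, :, :], axis=2)

def descriptor_distance(a, b):
    # a, b: arrays of 9 codeword indices (one per GLOH-9 cell).
    # The distance is 9 table lookups and a sum -- no per-bin float math.
    return lut[a, b].sum()

q = rng.integers(0, num_codewords, size=9)
db = rng.integers(0, num_codewords, size=9)
d = descriptor_distance(q, db)
```

The table has only num_codewords^2 entries, so it fits easily in cache, which is where the claimed speed-up over per-dimension float arithmetic comes from.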
The scalable vocabulary tree is the data structure at the center of our recognition system. To construct an SVT, first we take every database CD cover and extract robust local features. These features can be SIFT, SURF, or your own favorite type. Then, all the feature descriptors from all the images are represented as vectors in a high-dimensional space. Here, they are shown as 2-dimensional vectors, but in reality, they can be 64-dimensional or 128-dimensional vectors.
To impose some structure on this space, we perform hierarchical k-means clustering, the first step of which is dividing the space into k clusters using regular k-means.
Then we recursively split each large cluster into k smaller clusters, repeating this process until the clusters become sufficiently small. What results from the hierarchical k-means algorithm is a tree structure, where the tree nodes are the cluster centroids and their children are the subcluster centroids.
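The two steps above — k-means on the full descriptor set, then recursive splitting of each cluster — can be sketched as follows; `kmeans` and `build_svt` are illustrative helpers with assumed parameters, not the actual system code:

```python
import numpy as np

def kmeans(points, k, iters=10, rng=None):
    # Plain Lloyd's algorithm (a minimal sketch, not production-grade).
    rng = rng or np.random.default_rng(0)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Final assignment against the converged centers.
    labels = np.argmin(
        np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
    return centers, labels

def build_svt(points, k=4, max_depth=3, min_size=8, rng=None):
    # Recursively split each cluster into k subclusters; recursion stops
    # when a cluster is sufficiently small or the depth budget runs out.
    if max_depth == 0 or len(points) < max(min_size, k):
        return {"children": []}
    centers, labels = kmeans(points, k, rng=rng)
    return {"children": [
        {"center": centers[j],
         **build_svt(points[labels == j], k, max_depth - 1, min_size, rng)}
        for j in range(k)]}
```

In practice the points are 64- or 128-dimensional descriptors rather than the 2-D vectors shown on the slide, and the branching factor k and depth are chosen so the tree has millions of leaves.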
Here is the same tree as on the previous slide, except the tree structure is more apparent. Once we have constructed an SVT on a server, processing an incoming query is straightforward. For every query descriptor, we classify it by traversing the SVT greedily from top to bottom. Suppose the first descriptor follows this nearest-neighbor path. The SVT knows which database images have features associated with every node, so it votes for the two images found on this path. Both the blue nodes and the green nodes vote, but since the blue nodes are more discriminative, their votes count for more. Then another query descriptor goes down a different path and votes for other images, and so on, until all the query descriptors are classified. The final vote tally is a histogram indicating how likely each database image is to be a match.

We notice that when both the query and database images are fronto-parallel, the voting scheme works well and selects the correct database match. This is because similar features are extracted from the query image and the matching database image, leading to their descriptors visiting many of the same nodes in the SVT.
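A toy sketch of the greedy traversal and weighted voting described above; the node layout, the `weight` field (standing in for node discriminativeness, e.g. an IDF-style weight), and the inverted-file `images` lists are all hypothetical:

```python
import numpy as np
from collections import defaultdict

def traverse(node, descriptor, votes):
    # Greedy top-to-bottom traversal: at each level, descend into the
    # child whose centroid is nearest, and let that node vote for every
    # database image in its inverted file, weighted by discriminativeness.
    while node["children"]:
        dists = [np.linalg.norm(descriptor - c["center"])
                 for c in node["children"]]
        node = node["children"][int(np.argmin(dists))]
        for image_id in node["images"]:
            votes[image_id] += node["weight"]

def query(root, descriptors):
    votes = defaultdict(float)   # the final tally: a histogram over images
    for d in descriptors:
        traverse(root, d, votes)
    return max(votes, key=votes.get) if votes else None
```

Deeper nodes carry larger weights, which implements the "blue nodes count for more" behavior: a descriptor landing in a small, selective leaf contributes more evidence than one that only matches a coarse upper-level cluster.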
Performance drops with a single tree, since nodes become less discriminative: fewer features are unique to a particular database image.
Feature extraction is robust against rotation and scale change, but NOT against foreshortening. This is overcome by putting multiple examples into the database that show the object from different angles.
One could put all these views into one vocabulary tree. Instead, distributing the views across parallel trees prevents competition among the features belonging to different views of the same object; views compete only once all the features have been considered. We select the top 25 matches for each SVT based on bin-count similarity, then find the match with the best geometric consistency. The multiview SVT approach is attractive for multi-core servers, since the search through the different trees can run in parallel.
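The parallel per-view search might be sketched like this; the top-25 cutoff and the final geometric-consistency ranking follow the notes above, but `query_tree`, `verify`, and the tree representation are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def query_tree(tree, descriptors):
    # Placeholder scorer: real code would run the SVT voting scheme.
    # Here a "tree" is any callable returning (image_id, score) pairs.
    return sorted(tree(descriptors), key=lambda p: -p[1])[:25]

def multiview_search(trees, descriptors, verify):
    # Search each view's tree concurrently; views compete only after
    # all trees have proposed their top-25 candidates.
    with ThreadPoolExecutor(max_workers=len(trees)) as pool:
        candidate_lists = list(pool.map(
            lambda t: query_tree(t, descriptors), trees))
    candidates = [c for lst in candidate_lists for c in lst]
    # Final ranking by geometric consistency (e.g. RANSAC inlier count).
    return max(candidates, key=lambda c: verify(c[0]))
```

Because each tree is queried independently, the per-view searches map directly onto separate cores, which is the multi-core advantage the notes point out.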