Creation and Optimization of a Logo Recognition System
Haozhi Qi, Owen Richfield, Xiaohui Zeng, Michael Zhao
Academic Mentor: Dr. Albert Ku
Industrial Mentor: Mr. Sun Lin
August 6, 2015
Qi, Richfield, Zeng, Zhao
RIPS-HK: Lenovo
Problem Description
Problem: What if there were an app that could provide a smartphone user with information about a company just by recognizing that company’s logo in an image?
Goal: Create this app.
Outline
Model Introduction
Bag of Features Model
Convolutional Neural Network
Model Testing and Results
Application Demonstration
Conclusions and Future Work
Bag of Features Model
Feature Extraction and Description: SURF
Interest point detection
Rotation- and scale-invariant features
Interest point description
A good representation of the image
SURF: Interest Point Detection
Use the determinant of the Hessian to detect blob-like structures
Use box filters to approximate the second-order derivatives of the Gaussian filter
Take advantage of the integral image
Apply scale-space analysis to choose the appropriate scale for each interest point
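The integral-image trick behind the box filters can be sketched in a few lines of NumPy (a minimal illustration of the idea, not SURF itself): after one precomputation pass, the sum of any rectangular region costs only four lookups.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1] (cumulative sum over rows, then columns)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] in O(1), independent of box size."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 9 + 10 = 30.0
```

This constant-time box sum is what makes evaluating box filters at every position and scale cheap.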
SURF: Interest Point Description
Calculate the dominant orientation based on Haar wavelet responses
Build the descriptor over a 4×4 grid of subregions
Basics of K-means
Clustering method in N-dimensional space
Algorithmic steps:
With a given set of data, choose k cluster centers
Calculate the distance between each data point and each cluster center
Assign each point to the cluster at minimum distance
Recalculate the cluster centers:
$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j$
where $v_i$ is the new center of the ith cluster, $c_i$ is the number of data points in the ith cluster, and $x_j$ is the jth data point in the ith cluster.
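The steps above can be sketched directly in NumPy (an illustrative toy on 2-D points; in the real system the data are 64-dimensional SURF descriptors):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's algorithm, following the steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: with the given data, choose k cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: distance between every data point and every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: cluster points based on minimum distance
        labels = dists.argmin(axis=1)
        # Step 4: recalculate centers: v_i = (1/c_i) * sum of points in cluster i
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
    return centers, labels

# Two well-separated 2-D blobs stand in for descriptor data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
centers, labels = kmeans(X, k=2)
```

With separated blobs like these, each blob ends up in its own cluster and the centers converge near the blob means.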
Bag of Words and Hierarchical K-means
[Figure: feature vectors are quantized by descending a tree of cluster centers; at each level a descriptor moves to its nearest center, until it reaches a leaf node, its visual word]
Bag of Words and Hierarchical K-means
[Figure: histogram of matches per visual word for an example image]
word 1: 3 matches
word 2: 8 matches
word 3: 2 matches
word 4: 5 matches
word 5: 1 match
Inverted File Index
word 1: image 1, image 3, image 5, ...
word 2: image 4, image 9, image 16, ...
word 3: image 4, image 12, image 13, ...
word 4: image 1, image 5, image 7, ...
word 5: image 2, image 3, image 9, ...
word 6: image 7, image 12, image 17, ...
...
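An inverted file index like the one above can be sketched as a plain dictionary from visual word to the set of images containing it (a minimal illustration with toy data, not our production index):

```python
from collections import defaultdict

def build_index(image_words):
    """Map each visual word to the set of images containing it."""
    index = defaultdict(set)
    for image, words in image_words.items():
        for w in words:
            index[w].add(image)
    return index

def query(index, query_words):
    """Vote: count how many query words each candidate image shares,
    then rank candidates by vote count, best first."""
    votes = defaultdict(int)
    for w in query_words:
        for image in index.get(w, ()):
            votes[image] += 1
    return sorted(votes, key=votes.get, reverse=True)

images = {"image 1": {1, 4}, "image 3": {1, 5}, "image 4": {2, 3}}
index = build_index(images)
results = query(index, {1, 4, 5})
```

Only images sharing at least one visual word with the query are ever touched, which is why this beats scanning the whole database.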
Classification: Inverted File Index
Benefit: retrieval via the inverted file is faster than searching every image
Drawback: lack of spatial accuracy
Additional verification is needed to re-rank the retrieved images
Re-ranking of Returned Images
Match the descriptors of the query image to the descriptors of each image in the returned list.
Simple algorithm:
Match each descriptor in the query image to its nearest-neighbor descriptor in the list image.
Compare the L2 norm of the pair to the norm of the query descriptor and every other descriptor in the list image.
If the original norm is significantly smaller, count the pair as a “match”.
Sum the number of “matches” for each list image and divide by the total number of features.
The returned list is then re-ranked based on this “match ratio” and returned to the user.
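This nearest-neighbor distance-ratio check can be sketched as follows (an illustrative NumPy version on tiny 2-D “descriptors”; the 0.7 threshold is a placeholder for “significantly smaller”):

```python
import numpy as np

def match_ratio(query_desc, list_desc, ratio=0.7):
    """Fraction of query descriptors whose nearest neighbor in the list image
    is significantly closer than every other descriptor there."""
    matches = 0
    for q in query_desc:
        dists = np.linalg.norm(list_desc - q, axis=1)  # L2 norm to each list descriptor
        nearest, second = np.partition(dists, 1)[:2]   # two smallest distances
        if nearest < ratio * second:                   # "significantly smaller"
            matches += 1
    return matches / len(query_desc)

query = np.array([[0.0, 0.0], [5.0, 5.0]])
good = np.array([[0.0, 0.1], [5.0, 5.1], [9.0, 9.0]])  # clean, unambiguous matches
bad = np.array([[2.0, 2.0], [2.1, 2.1], [9.0, 9.0]])   # ambiguous matches
reranked = sorted([("bad", bad), ("good", good)],
                  key=lambda item: match_ratio(query, item[1]), reverse=True)
```

Sorting the returned list by this ratio is exactly the re-ranking step: images with many unambiguous descriptor matches rise to the top.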
Convolutional Neural Networks
Convolutional neural networks are neural networks with an additional biological inspiration. Each layer is of one of two basic types: convolution and pooling.
Convolution is the process of convolving an image with a kernel. This idea comes from image processing, where it has been used for tasks like edge detection. Here, we want to learn kernels specific to the data.
Pooling refers to the process of providing a statistical summary of the outputs of several nearby “neurons”, e.g. by taking an average or max.
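Both building blocks fit in a few lines of NumPy (a didactic sketch; like most CNN frameworks, the “convolution” here does not flip the kernel, i.e. it is cross-correlation):

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2D convolution (no flip) of an image with a small kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(img, size=2):
    """Non-overlapping max pooling: summarize each size x size block by its max."""
    h, w = img.shape
    return img[:h // size * size, :w // size * size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.zeros((4, 4))
img[:, 2:] = 1.0                                  # image with one vertical edge
edge = conv2d(img, np.array([[-1.0, 1.0]]))       # responds only at the edge
pooled = max_pool(np.arange(16.0).reshape(4, 4))  # 4x4 summarized as 2x2
```

In a CNN the kernel entries are learned parameters rather than a hand-picked edge detector, which is the point the slide makes.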
Figure: Illustration of the convolution process, from http://www.songho.ca/dsp/convolution/files/conv2d_matrix.jpg.
Implementation and Architecture
For our implementation of CNNs, we used Caffe [?]. We only had around 16,000 images, so we fine-tuned two pre-trained models:
AlexNet [?], the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012.
GoogLeNet [?], the winner of ILSVRC 2014.
Both are provided in Caffe’s Model Zoo, with a file that stores the weights of each model after training on ImageNet.
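Fine-tuning in Caffe is driven by a solver definition plus the pre-trained weights file. A hedged sketch (the file names and hyperparameter values here are illustrative, not the settings we actually used):

```
# solver.prototxt (illustrative values only)
net: "train_val.prototxt"    # network definition; final layer resized to 167 classes
base_lr: 0.001               # small learning rate, since we only fine-tune
lr_policy: "step"
stepsize: 20000
max_iter: 100000
snapshot_prefix: "logo_finetune"
```

Training then starts from the Model Zoo weights rather than from scratch, e.g. `caffe train -solver solver.prototxt -weights bvlc_alexnet.caffemodel`.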
AlexNet
Figure: Image of the AlexNet architecture (from [?]). This also illustrates how the original network was split to train on two GPUs.
GoogLeNet
Figure: Image of the GoogLeNet architecture (from [?]). Deeper than AlexNet, with 12x fewer parameters.
Dataset Construction
We gathered a dataset of logo images for 167 brands using the Bing Search API (on average, 100 images per brand), searching for terms like “<brand>”, “<brand> building”, “<brand> <product>”. One problem we faced was that some downloaded images were mislabeled or irrelevant. We filtered the dataset using two methods:
compute the proportion of matching SIFT descriptors between the downloaded image and a reference image for that brand, and toss the image if it doesn’t meet some threshold
import ManualLabor
Testing the Original Pipeline
Parameter tuning
Cross validation
Parameter Tuning
BoW structure: how to choose the vocabulary size:
words = B^L
B: number of branches; L: number of levels
Too large: lack of generalization, overfitting
Too small: lack of discrimination, mismatches
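With words = B^L, the branching factor and depth of the vocabulary tree fix the vocabulary size; a quick check shows why the branch/level settings reported later land in the half-million range:

```python
# Vocabulary size of a hierarchical k-means tree: words = B ** L
levels = 5
for branches in (10, 14, 15):
    print(f"B={branches}, L={levels}: {branches ** levels:,} words")
```

B = 14 gives 537,824 words and B = 15 gives 759,375, both inside the 500,000 to 800,000 optimum found in testing.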
Parameter Tuning
Vocabulary size
How to choose the number of images returned by the inverted file index search:
accuracy
the computation time of re-ranking
How to choose the number of images shown on the client side:
accuracy
a mobile application: the size of the screen
Parameter Tuning
Vocabulary size
The number of images returned by searching
The number of images shown
Re-ranking: how to determine the weight factor w in the weighted scoring function:
score = w * I + (1 - w) * F
I: number of inliers
F: frequency of the brand among the returned images
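The weighted score is a one-liner; a sketch (the default w is only a placeholder, since w is exactly the parameter being tuned):

```python
def score(inliers, frequency, w=0.5):
    """Weighted re-ranking score: score = w * I + (1 - w) * F.

    inliers   -- I, the number of inliers (spatially consistent matches)
    frequency -- F, how often the brand occurs among the returned images
    w         -- weight factor chosen by testing (0.5 is a placeholder)
    """
    return w * inliers + (1 - w) * frequency
```

At w = 1 the ranking depends only on inliers; at w = 0 only on brand frequency, so tuning w trades off the two signals.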
Parameters for Evaluation
Vocabulary size:
number of branches
number of levels
The number of images returned by searching
The number of images shown
The weight factor w in the weighted function
Calculation of the accuracy:
one correct return gives accuracy = 1
Cross Validation
Randomly divide the data into K equal-sized parts.
Leave out part k, fit the model to the other K−1 parts (combined), and then obtain predictions for the left-out kth part.
This is done in turn for each part k = 1, 2, ..., K, and then the results are combined.
We choose K = 5.
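The procedure above can be sketched generically (an illustration with a toy “model”; `fit` and `evaluate` stand in for training and scoring the real pipeline):

```python
import numpy as np

def k_fold_indices(n, k=5, seed=0):
    """Randomly divide n samples into k (nearly) equal-sized parts."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(fit, evaluate, data, k=5):
    """Leave out part j, fit on the remaining k-1 parts (combined),
    score on part j, for each j; then combine the results."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for j in range(k):
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit(data[train])
        scores.append(evaluate(model, data[folds[j]]))
    return np.mean(scores), np.std(scores)

# Toy stand-in: the "model" is the training mean, scored by negative squared error
data = np.arange(100.0)
mean_score, spread = cross_validate(lambda d: d.mean(),
                                    lambda m, d: -np.mean((d - m) ** 2), data)
```

The standard deviation across folds is the stability measure quoted in the testing summary.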
Testing Result
Test on vocabulary size:
optimal number of words: 500,000 to 800,000
number of branches = 14 or 15
number of levels = 5
Testing Result
With the other parameters fixed, test on:
the weight factor
the number of returned images
the number of images shown on the client side
Testing Summary
Optimal parameter setting:
number of words: 500,000 to 800,000
number of images returned: 15
number of images shown: 6
The stability of the system was also tested:
the standard deviation over 5-fold cross validation ranges from 0.005 to 0.007
Evaluation of the Deep Learning Framework
Cross-validation for AlexNet
Final accuracy (AlexNet):
Top-1 Accuracy: 93.33%
Top-5 Accuracy: 96.73%
Evaluation of the Deep Learning Framework
Cross-validation for GoogLeNet (Top-5 Accuracy)
Evaluation of the Deep Learning Framework
Cross-validation for AlexNet and GoogLeNet
Final accuracy (GoogLeNet):
Top-1 Accuracy: 94.05%
Top-5 Accuracy: 97.39%
Evaluation of the Deep Learning Framework
Final comparison:

                       GoogLeNet   AlexNet   Visual Bag of Words
Accuracy (Top-5)       97.39%      96.73%    87.6%
Preprocessing time     8.47 ms     7.5 ms    6 ms
Classification time    17.7 ms     6.94 ms   24 ms (SURF feature extraction)
Total time*            129 ms      170 ms    281 ms

*Including some system-level operations.
Future Development
There are still things we can do to improve the system:
Enlarge the dataset (currently 167 classes and 16,000 images).
Test different deep learning frameworks.
Combine locally hand-crafted features and globally deep-learned features to achieve better accuracy.
We would like to thank
Mr. Sun Lin and Lenovo-Hong Kong.
Professor Shingyu Leung, Dr. Ku Yin Bon, and the Hong Kong University of Science and Technology.
Professor Susanna Serna and the Institute for Pure and Applied Mathematics.
The National Science Foundation for program funding (Grant DMS #0931852).