3. Introduction
Image dataset construction is the process of collecting relevant images for a given target query.
Significance:
1. Labeled image datasets are the backbone for high-level image understanding tasks;
2. They continuously drive and evaluate the progress of feature design and supervised learning models.
4. Related Works
Manual labelling based methods
Active learning based methods
Automatic learning based methods
5. Manual labelling based methods
Related works:
LabelMe [Russell et al. 2008]
Pascal VOC series [Everingham et al. 2007]
ImageNet [Deng et al. 2009]
CIFAR-10/100 [Krizhevsky 2009]
Advantages: High accuracy
Disadvantages:
Time-consuming and labor-intensive
Scalability is limited
Diversity is limited
6. Active learning based methods
Related works:
Towards scalable dataset construction: An active learning approach [Collins et al. 2008]
Large-scale live active learning: Training object detectors with crawled data and crowds [Vijayanarasimhan and Grauman 2014]
Advantages:
The cost of manual labelling is somewhat reduced
Relatively high accuracy
Disadvantages:
Scalability is limited
Diversity is limited
7. Automatic learning based methods
Related works:
OPTIMOL: Automatic online picture collection via incremental model learning [Li et al. 2010]
Harvesting: Harvesting image databases from the web [Schroff et al. 2011]
Prajna: Towards recognizing whatever you want from images without image labeling [Hua et al. 2015]
Advantages:
Scalability is no longer a problem
Disadvantages:
Low accuracy
Limited diversity
8. Research challenges
Our work focuses on automatic learning based methods.
The major challenges include:
Visual polysemy
Limited diversity
Low accuracy
9. Research challenge: Visual Polysemy
Fig. 1: Visual polysemy. For example, the query “mouse” returns multiple visual senses on the first page of results. The retrieved web images suffer from low precision for any particular visual sense.
10. Research challenge: Limited Diversity
Fig. 2: Images for “airplane” from four different datasets. Each dataset has its own preference in image selection.
11. Research challenge: Low Accuracy
Fig. 3: Due to indexing errors in image search engines, even when we retrieve sense-specific images, some instance-level noise may still be included. The noisy images are marked with red bounding boxes.
12. Motivations
Fig. 4: The average accuracy of the top 500 images returned by Google Image Search, Web Search, and Flickr Search for 10 queries.
13. Motivations
The first few images returned by an image search engine tend to have relatively high accuracy.
Image search engines restrict the number of returned images for each query.
The diversity of the selected images depends not only on the selection mechanism, but also on the diversity of the initial candidate images.
16. Solutions
Our solution:
Fig. 6: Illustration of the process for obtaining multiple textual metadata. The input is a
textual query that we would like to find multiple textual metadata for. The output is a
set of selected textual metadata which will be used for raw image dataset construction.
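As a rough, hedged illustration of this step, the Python sketch below expands a query into metadata-augmented queries by counting co-occurring modifiers in short phrases. The phrase source (an n-gram corpus or search-engine suggestions) and all helper names are assumptions for illustration; the selection procedure in our papers is more elaborate.

```python
from collections import Counter

def discover_textual_metadata(query, phrases, top_k=10):
    """Collect candidate textual metadata (modifiers) for a target query.

    `phrases` is a hypothetical input: short phrases containing the query
    term, e.g. mined from an n-gram corpus or search-engine suggestions.
    """
    counts = Counter()
    for phrase in phrases:
        words = phrase.lower().split()
        if query in words:
            for w in words:
                if w != query:
                    counts[w] += 1  # each co-occurring word is a candidate
    # Keep the most frequent modifiers; a real system would further filter
    # them, e.g. by checking the visual consistency of retrieved images.
    return [w for w, _ in counts.most_common(top_k)]

# Example: expand "airplane" into metadata-augmented search queries.
phrases = ["vintage airplane", "airplane cockpit", "paper airplane",
           "airplane takeoff", "airplane takeoff runway"]
queries = [f"{m} airplane" for m in
           discover_textual_metadata("airplane", phrases, top_k=3)]
```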
17. Solutions
Why:
Solving Visual Polysemy by Allowing Sense-specific Diversity
Improving Scalability of Image Search Engine
Increasing Diversity of the Initial Candidate Images
18. Solutions
Fig. 7: Illustration of the process for obtaining selected images. The input is a set of selected textual metadata. Artificial images, inter-class noisy images, and intra-class noisy images are marked with red, green, and blue bounding boxes, respectively. The output is a group of selected images in which the images correspond to different textual metadata.
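As a minimal sketch of one plausible selection step (not the exact procedure in our papers), the code below drops candidate images whose feature vectors lie far from their metadata group's mean, a simplified stand-in for removing intra-class noisy images; the feature extractor and the keep ratio are assumptions.

```python
import numpy as np

def filter_noisy_images(features, keep_ratio=0.8):
    """Drop likely noisy images within one textual-metadata group.

    `features` is an (n, d) array of image feature vectors, assumed to
    be extracted beforehand (e.g. by a CNN). Images far from the group
    mean are treated as intra-class noise.
    """
    center = features.mean(axis=0)                     # group prototype
    dists = np.linalg.norm(features - center, axis=1)
    n_keep = max(1, int(len(features) * keep_ratio))
    keep = np.argsort(dists)[:n_keep]                  # closest images survive
    return np.sort(keep)

# Example with random stand-in features for 100 candidate images.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 512))
selected_idx = filter_noisy_images(feats, keep_ratio=0.8)
```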
21. Polysemy: Multiple Visual Senses Discovering
Fig. 8: Examples of multiple visual senses discovered by our proposed approach. For example, our approach automatically discovers and distinguishes four senses for “Note”: notes, Galaxy Note, Note tablet, and music note. For “Bass”, it discovers multiple visual senses: bass fish, bass guitar, bass amp, Mr./Mrs. Bass, etc.
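For intuition only, a simplified stand-in for sense discovery is to cluster the visual features of the retrieved images and pick the number of clusters heuristically; the published approach additionally exploits textual cues, and the penalized-inertia model selection below is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_visual_senses(features, max_senses=8):
    """Cluster image features of a polysemous query into visual senses.

    `features` is an (n, d) array of image descriptors. The number of
    senses is chosen by a crude penalized-inertia heuristic.
    """
    best_k, best_score, best_labels = 2, np.inf, None
    for k in range(2, max_senses + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        score = km.inertia_ * (1.0 + 0.05 * k)  # penalize extra clusters
        if score < best_score:
            best_k, best_score, best_labels = k, score, km.labels_
    return best_k, best_labels
```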
22. Polysemy: Classifying Sense-specific Images
Fig. 9: The detailed performance comparison of classification accuracy over 30 categories on the CMU-Poly-30 dataset.
23. Polysemy: Classifying Sense-specific Images
Fig. 10: The detailed performance comparison of classification accuracy over 5 categories on the MIT-ISD dataset.
Tab. 1: The average performance comparison of classification accuracy on the CMU-Poly-30 and MIT-ISD datasets.
24. Polysemy: Re-ranking Search Results
Tab. 2: Web images for polysemous terms were annotated manually. For each term, the number of annotated images, the semantic senses, the visual senses, and their distributions are provided, with core semantic senses marked in boldface.
27. Accuracy: Image Classification Ability Comparison
Fig. 11: The image classification accuracy (%) comparison over 14 categories on the PASCAL VOC 2007 dataset.
Fig. 12: The image classification accuracy (%) comparison over 6 categories on the PASCAL VOC 2007 dataset.
Tab. 4: The average accuracy (%) comparison over 14 and 6 common categories on the PASCAL VOC 2007 dataset.
28. Diversity: Cross-dataset Generalization Ability Comparison
Fig. 13: The cross-dataset generalization ability of various datasets, using a varying number of training images and tested on (a) ImageNet, (b) Optimol, (c) Harvesting, (d) DRID-20, (e) Ours, (f) Average.
29. Application: Fine-grained Visual Categorization
Fine-grained visual recognition aims to distinguish between subordinate categories, such as different birds, flowers, foods, cars, etc.
Fig. 14: Illustration of the two characteristics of fine-grained subcategories: large variance within the same subcategory, as shown in the first row, and small variance among different subcategories, as shown in the second row. Images in (a) birds and (b) cars are from the CUB-200-2011 and Cars-196 datasets, respectively.
30. Challenges:
1) Large variance in the same subcategory;
2) Small variance among different subcategories.
Solutions:
1) Strongly supervised methods;
2) Weakly supervised methods;
3) Webly supervised methods.
31. Strongly supervised methods: require manually labeled
bounding boxes or part annotations during training.
Fig. 15: Bounding boxes and part annotations for fine-grained visual categorization.
32. Weakly supervised methods: require image-level labels
during training.
Fig. 16: Image-level labeling for fine-grained visual categorization.
33. Webly supervised methods: require no human intervention during training.
Fig. 17: A snapshot of retrieved web images for “Bobolink”. Label noise is marked with red and blue bounding boxes.
34. Our work focuses on webly supervised methods.
The major problems include:
Label noise (low accuracy)
Domain mismatch (limited diversity)
35. Fig. 18: Label noise and domain mismatch problems in web data for fine-grained categorization. Due to incorrect labeling, the web images collected for the query “Fish Crow” may contain noise (e.g., the “Purple Finch” image marked with a red bounding box). Directly leveraging these web images without noise removal may lead to classifying the test image “Purple Finch” as “Fish Crow”. In addition, the collected web images may span multiple domains (e.g., natural, sketch, and cartoon images). Even when the web images collected for “Fish Crow” contain no noise, the domain gap between the test image “Bobolink” and the sketch-domain web images collected for “Fish Crow” means that directly leveraging the web images without domain mismatch alleviation may still lead to classifying the test image “Bobolink” as “Fish Crow”.
36. Fig. 19: The architecture of our proposed DDN model. The input of our model is a set of “bags”, each of which consists of multiple images (“instances”). The bag of images is first fed into the backbone net (e.g., VGG16) to generate the feature map. The feature map then goes into a fully connected layer followed by a ReLU, generating a feature vector N. This feature vector is subsequently sent into two branches. The upper branch is our proposed Attention Block, which consists of a fully connected layer followed by a tanh layer and another fully connected layer. After the Attention Block, we obtain the instance probability vector A. In the bottom branch, the feature vector N is multiplied by the output of the Attention Block, A, and then fed into a fully connected layer with a Sigmoid layer to calculate the probability of the bag being positive. Finally, the bottom branch adopts the Negative Bernoulli Log Loss, while the upper branch leverages our proposed Attentive Focal Loss. The weighted sum of the two branch losses is used as the final loss function for DDN.
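A minimal PyTorch sketch of this two-branch wiring is given below. The backbone is replaced by a single embedding layer, hidden sizes are assumptions, and the Attentive Focal Loss on the instance branch is omitted (only the bag branch's negative Bernoulli log-likelihood is shown), so this illustrates the structure rather than reproducing the published DDN.

```python
import torch
import torch.nn as nn

class DDNSketch(nn.Module):
    """Two-branch attention model over a bag of instances (a sketch)."""

    def __init__(self, in_dim=4096, feat_dim=512, attn_dim=128):
        super().__init__()
        # Stand-in for backbone + FC + ReLU producing feature vector N.
        self.embed = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Attention Block: FC -> tanh -> FC, one score per instance.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        # Bag classifier: FC -> Sigmoid on the attention-pooled feature.
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, bag):                  # bag: (num_instances, in_dim)
        n = self.embed(bag)                  # instance features N
        a = torch.softmax(self.attention(n), dim=0)     # instance weights A
        pooled = torch.sum(a * n, dim=0, keepdim=True)  # N weighted by A
        bag_prob = self.classifier(pooled)   # probability the bag is positive
        return bag_prob.squeeze(), a.squeeze()

def bag_loss(bag_prob, label):
    """Negative Bernoulli log-likelihood for the bag branch."""
    return -(label * torch.log(bag_prob + 1e-8)
             + (1 - label) * torch.log(1 - bag_prob + 1e-8))
```

The final training objective would then be a weighted sum of this bag-level loss and an instance-level loss computed from the attention scores, mirroring the two branches described in the caption.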
37. Fig. 20: The architecture of the instance-level model. Feature vector N is sent into the upper branch, which consists of a fully connected layer followed by a tanh layer and another fully connected layer. After the Attention Block, we obtain the instance probability vector. Finally, the upper branch leverages the Attentive Focal Loss.
38. Fig. 21: The architecture of the bag-level model. Feature vector N is sent into the bottom branch, where it is multiplied by the output of the Attention Block and then fed into a fully connected layer with a Sigmoid layer to calculate the probability of the bag being positive. Finally, the bottom branch adopts the Negative Bernoulli Log Loss.
39. Tab. 5: Fine-grained ACA (%) results on CUB200 and Stanford Dogs. Tr-B/P indicates whether bounding boxes or part annotations are required in the training stage. Data indicates whether the training data is manually labeled (anno.) or collected from the web (web). The best result is marked in red, the second best in blue, and the third best in bold.
40. Publications
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Li Liu, Fan Zhu, Dongxiang Zhang, and Heng Tao Shen. "Towards Automatic Construction of Diverse, High-quality Image Dataset", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019.
▪ Yazhou Yao, Zeren Sun, Fumin Shen, Li Liu, Limin Wang, Fan Zhu, Lizhong Ding, Gangshan Wu, and Ling Shao. "Dynamically Visual Disambiguation of Keyword-based Image Search", International Joint Conference on Artificial Intelligence (IJCAI), 2019.
▪ Yazhou Yao, Fumin Shen, Jian Zhang, Li Liu, Zhenmin Tang, and Ling Shao. "Extracting Privileged Information for Enhancing Classifier Learning", IEEE Transactions on Image Processing (TIP), 2018.
▪ Yazhou Yao, Fumin Shen, Jian Zhang, Li Liu, Zhenmin Tang, and Ling Shao. "Extracting Multiple Visual Senses for Web Learning", IEEE Transactions on Multimedia (TMM), 2018.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Wankou Yang, Xian-Sheng Hua, and Zhenmin Tang. "Extracting Privileged Information from Untagged Corpora for Classifier Learning", International Joint Conference on Artificial Intelligence (IJCAI), 2018.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Wankou Yang, Pu Huang, and Zhenmin Tang. "Discovering and Distinguishing Multiple Visual Senses for Polysemous Words", AAAI Conference on Artificial Intelligence (AAAI), 2018.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Xian-Sheng Hua, Jingsong Xu, and Zhenmin Tang. "Exploiting Web Images for Dataset Construction: A Domain Robust Approach", IEEE Transactions on Multimedia (TMM), 2017.
▪ Yazhou Yao, Xian-Sheng Hua, Fumin Shen, Jian Zhang, and Zhenmin Tang. "A Domain Robust Approach for Image Dataset Construction", ACM International Conference on Multimedia (ACM MM), 2016.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Xian-Sheng Hua, Jingsong Xu, and Zhenmin Tang. "Automatic Image Dataset Construction with Multiple Textual Metadata", IEEE International Conference on Multimedia and Expo (ICME), 2016.