3. Introduction
Image dataset construction is the process of collecting relevant images for a given target query.
Significance:
1. Labeled image datasets are the backbone for high-level image understanding tasks;
2. They continuously drive and evaluate the progress of feature design and supervised learning models.
4. Related Works
Manual labelling based methods
Active learning based methods
Automatic learning based methods
5. Manual labelling based methods
Related works:
LabelMe [Russell et al. 2008]
Pascal VOC series [Everingham et al. 2007]
ImageNet [Deng et al. 2009]
CIFAR-10/100 [Krizhevsky 2009]
Advantages: High accuracy
Disadvantages:
Time-consuming and labor-intensive
Scalability is limited
Diversity is limited
6. Active learning based methods
Related works:
Towards scalable dataset construction: An active learning approach [Collins et al. 2008]
Large-scale live active learning: Training object detectors with crawled data and crowds [Vijayanarasimhan and Grauman 2014]
Advantages:
The cost of manual labelling is somewhat reduced
Relatively high accuracy
Disadvantages:
Scalability is limited
Diversity is limited
7. Automatic learning based methods
Related works:
OPTIMOL: Automatic online picture collection via incremental model learning [Li et al. 2010]
Harvesting: Harvesting image databases from the web [Schroff et al. 2011]
Prajna: Towards recognizing whatever you want from images without image labeling [Hua et al. 2015]
Advantages:
Scalability is no longer a problem
Disadvantages:
Low accuracy
Limited diversity
8. Research challenges
Our work focuses on automatic learning based methods.
The major challenges include:
Visual polysemy
Limited diversity
Low accuracy
9. Research challenge: Visual Polysemy
Fig. 1: Visual polysemy. For example, the query “mouse” returns multiple visual senses on the first page of results. The retrieved web images suffer from low precision for any particular visual sense.
10. Research challenge: Limited Diversity
Fig. 2: Images for “airplane” from four different datasets. Each dataset has its own preference in image selection.
11. Research challenge: Low Accuracy
Fig. 3: Due to indexing errors in image search engines, even when we retrieve sense-specific images, some instance-level noise may still be included. The noisy images are marked with red bounding boxes.
12. Motivations
Fig. 4: The average accuracy of the top 500 images returned by Google Image Search, Web Search, and Flickr Search for 10 queries.
13. Motivations
The first few images returned by an image search engine tend to have relatively high accuracy.
Image search engines restrict the number of returned images for each query.
The diversity of the selected images depends not only on the selection mechanism, but also on the diversity of the initial candidate images.
16. Solutions
Our solution:
Fig. 6: Illustration of the process for obtaining multiple textual metadata. The input is a
textual query that we would like to find multiple textual metadata for. The output is a
set of selected textual metadata which will be used for raw image dataset construction.
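As a rough, hedged illustration of this step, the Python sketch below expands a query into metadata-augmented queries by counting co-occurring modifiers in short phrases. The phrase source (an n-gram corpus or search-engine suggestions) and all helper names are assumptions for illustration; the selection procedure in our papers is more elaborate.

```python
from collections import Counter

def discover_textual_metadata(query, phrases, top_k=10):
    """Collect candidate textual metadata (modifiers) for a target query.

    `phrases` is a hypothetical input: short phrases containing the query
    term, e.g. mined from an n-gram corpus or search-engine suggestions.
    """
    counts = Counter()
    for phrase in phrases:
        words = phrase.lower().split()
        if query in words:
            for w in words:
                if w != query:
                    counts[w] += 1  # each co-occurring word is a candidate
    # Keep the most frequent modifiers; a real system would further filter
    # them, e.g. by checking the visual consistency of retrieved images.
    return [w for w, _ in counts.most_common(top_k)]

# Example: expand "airplane" into metadata-augmented search queries.
phrases = ["vintage airplane", "airplane cockpit", "paper airplane",
           "airplane takeoff", "airplane takeoff runway"]
queries = [f"{m} airplane" for m in
           discover_textual_metadata("airplane", phrases, top_k=3)]
```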
17. Solutions
Why:
Solving Visual Polysemy by Allowing Sense-specific Diversity
Improving Scalability of Image Search Engine
Increasing Diversity of the Initial Candidate Images
18. Solutions
Fig. 7: Illustration of the process for obtaining selected images. The input is a set of selected textual metadata. Artificial images, inter-class noisy images, and intra-class noisy images are marked with red, green, and blue bounding boxes, respectively. The output is a group of selected images in which the images correspond to different textual metadata.
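As a minimal sketch of one plausible selection step (not the exact procedure in our papers), the code below drops candidate images whose feature vectors lie far from their metadata group's mean, a simplified stand-in for removing intra-class noisy images; the feature extractor and the keep ratio are assumptions.

```python
import numpy as np

def filter_noisy_images(features, keep_ratio=0.8):
    """Drop likely noisy images within one textual-metadata group.

    `features` is an (n, d) array of image feature vectors, assumed to
    be extracted beforehand (e.g. by a CNN). Images far from the group
    mean are treated as intra-class noise.
    """
    center = features.mean(axis=0)                     # group prototype
    dists = np.linalg.norm(features - center, axis=1)
    n_keep = max(1, int(len(features) * keep_ratio))
    keep = np.argsort(dists)[:n_keep]                  # closest images survive
    return np.sort(keep)

# Example with random stand-in features for 100 candidate images.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 512))
selected_idx = filter_noisy_images(feats, keep_ratio=0.8)
```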
21. Polysemy: Multiple Visual Senses Discovering
Fig. 8: Examples of multiple visual senses discovered by our proposed approach. For example, our approach automatically discovers and distinguishes four senses for “Note”: notes, Galaxy Note, Note tablet, and music note. For “Bass”, it discovers multiple visual senses: bass fish, bass guitar, bass amp, Mr./Mrs. Bass, etc.
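For intuition only, a simplified stand-in for sense discovery is to cluster the visual features of the retrieved images and pick the number of clusters heuristically; the published approach additionally exploits textual cues, and the penalized-inertia model selection below is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_visual_senses(features, max_senses=8):
    """Cluster image features of a polysemous query into visual senses.

    `features` is an (n, d) array of image descriptors. The number of
    senses is chosen by a crude penalized-inertia heuristic.
    """
    best_k, best_score, best_labels = 2, np.inf, None
    for k in range(2, max_senses + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        score = km.inertia_ * (1.0 + 0.05 * k)  # penalize extra clusters
        if score < best_score:
            best_k, best_score, best_labels = k, score, km.labels_
    return best_k, best_labels
```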
22. Polysemy: Classifying Sense-specific Images
Fig. 9: The detailed performance comparison of classification accuracy over 30 categories on the CMU-Poly-30 dataset.
23. Polysemy: Classifying Sense-specific Images
Fig. 10: The detailed performance comparison of classification accuracy over 5 categories on the MIT-ISD dataset.
Tab. 1: The average performance comparison of classification accuracy on the CMU-Poly-30 and MIT-ISD datasets.
24. Polysemy: Re-ranking Search Results
Tab. 2: Web images for polysemous terms were annotated manually. For each term, the number of annotated images, the semantic senses, the visual senses, and their distributions are provided, with core semantic senses marked in boldface.
27. Accuracy: Image Classification Ability Comparison
Fig. 11: The image classification accuracy (%) comparison over 14 categories on the PASCAL VOC 2007 dataset.
Fig. 12: The image classification accuracy (%) comparison over 6 categories on the PASCAL VOC 2007 dataset.
Tab. 4: The average accuracy (%) comparison over 14 and 6 common categories on the PASCAL VOC 2007 dataset.
28. Diversity: Cross-dataset Generalization Ability Comparison
Fig. 13: The cross-dataset generalization ability of various datasets, using a varying number of training images and tested on (a) ImageNet, (b) Optimol, (c) Harvesting, (d) DRID-20, (e) Ours, (f) Average.
29. Application: Fine-grained Visual Categorization
Fine-grained visual recognition aims to distinguish between subordinate categories, such as different birds, flowers, foods, cars, etc.
Fig. 14: Illustration of the two characteristics of fine-grained subcategories: large variance within the same subcategory, as shown in the first row, and small variance among different subcategories, as shown in the second row. Images in (a) birds and (b) cars are from the CUB-200-2011 and Cars-196 datasets, respectively.
30. Challenges:
1) Large variance in the same subcategory;
2) Small variance among different subcategories.
Solutions:
1) Strongly supervised methods;
2) Weakly supervised methods;
3) Webly supervised methods.
31. Strongly supervised methods: require manually labeled
bounding boxes or part annotations during training.
Fig. 15: Bounding boxes and part annotations for fine-grained visual categorization.
32. Weakly supervised methods: require image-level labels
during training.
Fig. 16: Image-level labeling for fine-grained visual categorization.
33. Webly supervised methods: require no human intervention during training.
Fig. 17: A snapshot of retrieved web images for “Bobolink”. Label noise is marked with red and blue bounding boxes.
34. Our work focuses on webly supervised methods.
The major problems include:
Label noise (low accuracy)
Domain mismatch (limited diversity)
35. Fig. 18: Label noise and domain mismatch problems in web data for fine-grained categorization. Due to incorrect labeling, the web images collected for the query “Fish Crow” may contain noise (e.g., the “Purple Finch” image marked with a red bounding box). Directly leveraging these web images without noise removal may lead to classifying the test image “Purple Finch” as “Fish Crow”. In addition, the collected web images may span multiple domains (e.g., natural, sketch, and cartoon images). Even when the web images collected for “Fish Crow” contain no noise, the domain gap between the test image “Bobolink” and the sketch-domain web images collected for “Fish Crow” means that directly leveraging the web images without domain mismatch alleviation may still lead to classifying the test image “Bobolink” as “Fish Crow”.
36. Fig. 19: The architecture of our proposed DDN model. The input of our model is a set of “bags”, each of which consists of multiple images (“instances”). The bag of images is first fed into the backbone net (e.g., VGG16) to generate the feature map. The feature map then goes into a fully connected layer followed by a ReLU, generating a feature vector N. This feature vector is subsequently sent into two branches. The upper branch is our proposed Attention Block, which consists of a fully connected layer followed by a tanh layer and another fully connected layer. After the Attention Block, we obtain the instance probability vector A. In the bottom branch, the feature vector N is multiplied by the output of the Attention Block, A, and then fed into a fully connected layer with a Sigmoid layer to calculate the probability of the bag being positive. Finally, the bottom branch adopts the Negative Bernoulli Log Loss, while the upper branch leverages our proposed Attentive Focal Loss. The weighted sum of the two branch losses is used as the final loss function for DDN.
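A minimal PyTorch sketch of this two-branch wiring is given below. The backbone is replaced by a single embedding layer, hidden sizes are assumptions, and the Attentive Focal Loss on the instance branch is omitted (only the bag branch's negative Bernoulli log-likelihood is shown), so this illustrates the structure rather than reproducing the published DDN.

```python
import torch
import torch.nn as nn

class DDNSketch(nn.Module):
    """Two-branch attention model over a bag of instances (a sketch)."""

    def __init__(self, in_dim=4096, feat_dim=512, attn_dim=128):
        super().__init__()
        # Stand-in for backbone + FC + ReLU producing feature vector N.
        self.embed = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Attention Block: FC -> tanh -> FC, one score per instance.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        # Bag classifier: FC -> Sigmoid on the attention-pooled feature.
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, bag):                  # bag: (num_instances, in_dim)
        n = self.embed(bag)                  # instance features N
        a = torch.softmax(self.attention(n), dim=0)     # instance weights A
        pooled = torch.sum(a * n, dim=0, keepdim=True)  # N weighted by A
        bag_prob = self.classifier(pooled)   # probability the bag is positive
        return bag_prob.squeeze(), a.squeeze()

def bag_loss(bag_prob, label):
    """Negative Bernoulli log-likelihood for the bag branch."""
    return -(label * torch.log(bag_prob + 1e-8)
             + (1 - label) * torch.log(1 - bag_prob + 1e-8))
```

The final training objective would then be a weighted sum of this bag-level loss and an instance-level loss computed from the attention scores, mirroring the two branches described in the caption.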
37. Fig. 20: The architecture of the instance-level model. Feature vector N is sent into the upper branch, which consists of a fully connected layer followed by a tanh layer and another fully connected layer. After the Attention Block, we obtain the instance probability vector. Finally, the upper branch leverages the Attentive Focal Loss.
38. Fig. 21: The architecture of the bag-level model. Feature vector N is sent into the bottom branch, where it is multiplied by the output of the Attention Block and then fed into a fully connected layer with a Sigmoid layer to calculate the probability of the bag being positive. Finally, the bottom branch adopts the Negative Bernoulli Log Loss.
39. Tab. 5: Fine-grained ACA (%) results on CUB200 and Stanford Dogs. Tr-B/P indicates whether bounding boxes or part annotations are required in the training stage. Data indicates whether the training data is manually labeled (anno.) or collected from the web (web). The best result is marked in red, the second best in blue, and the third best in bold.
40. Publications
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Li Liu, Fan Zhu, Dongxiang Zhang, and Heng Tao Shen. "Towards Automatic Construction of Diverse, High-quality Image Dataset", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019.
▪ Yazhou Yao, Zeren Sun, Fumin Shen, Li Liu, Limin Wang, Fan Zhu, Lizhong Ding, Gangshan Wu, and Ling Shao. "Dynamically Visual Disambiguation of Keyword-based Image Search", International Joint Conference on Artificial Intelligence (IJCAI), 2019.
▪ Yazhou Yao, Fumin Shen, Jian Zhang, Li Liu, Zhenmin Tang, and Ling Shao. "Extracting Privileged Information for Enhancing Classifier Learning", IEEE Transactions on Image Processing (TIP), 2018.
▪ Yazhou Yao, Fumin Shen, Jian Zhang, Li Liu, Zhenmin Tang, and Ling Shao. "Extracting Multiple Visual Senses for Web Learning", IEEE Transactions on Multimedia (TMM), 2018.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Wankou Yang, Xian-Sheng Hua, and Zhenmin Tang. "Extracting Privileged Information from Untagged Corpora for Classifier Learning", International Joint Conference on Artificial Intelligence (IJCAI), 2018.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Wankou Yang, Pu Huang, and Zhenmin Tang. "Discovering and Distinguishing Multiple Visual Senses for Polysemous Words", AAAI Conference on Artificial Intelligence (AAAI), 2018.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Xian-Sheng Hua, Jingsong Xu, and Zhenmin Tang. "Exploiting Web Images for Dataset Construction: A Domain Robust Approach", IEEE Transactions on Multimedia (TMM), 2017.
▪ Yazhou Yao, Xian-Sheng Hua, Fumin Shen, Jian Zhang, and Zhenmin Tang. "A Domain Robust Approach for Image Dataset Construction", ACM International Conference on Multimedia (ACM MM), 2016.
▪ Yazhou Yao, Jian Zhang, Fumin Shen, Xian-Sheng Hua, Jingsong Xu, and Zhenmin Tang. "Automatic Image Dataset Construction with Multiple Textual Metadata", IEEE International Conference on Multimedia and Expo (ICME), 2016.