Multimedia Content Access: Algorithms and Systems IV (SPIE Electronic Imaging 2010). January 2010.
http://eprints.soton.ac.uk/268496/
This paper proposes a new technique for auto-annotation and semantic retrieval based upon the idea of linearly mapping an image feature space to a keyword space. The new technique is compared to several related techniques, and a number of salient points about each are discussed and contrasted. The paper also discusses how these techniques might actually scale to a real-world retrieval problem, and demonstrates this through a case study of a semantic retrieval technique being used on a real-world data-set (with a mix of annotated and unannotated images) from a picture library.
Semantic Retrieval and Automatic Annotation: Linear Transformations, Correlation and Semantic Spaces
1. Semantic Retrieval and
Automatic Annotation
Linear Transformations, Correlation and
Semantic Spaces
Jonathon Hare & Paul Lewis
School of Electronics and Computer Science
University of Southampton
2. Introduction and Motivation
• Introduce a new, simple linear-transform based
annotation/retrieval technique
• Compare against a number of similar existing
techniques for automatic annotation & semantic
retrieval that:
• Represent images by a fixed length histogram (of
visual-term occurrences)
• Optionally use SVD for noise reduction
• Are deterministic (no randomness)
• Are (relatively) computationally efficient
• Reflect on real-world performance
3. Singular Value Decomposition
• SVD can be used to filter noise by producing a rank-k
estimate of the original data matrix
• The rank-k estimate is optimal in the least-squares
sense
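The rank-k estimate is straightforward to sketch with NumPy (a toy illustration on synthetic data, not the experimental code):

```python
import numpy as np

# Rank-k approximation of a data matrix via the truncated SVD.
# By the Eckart-Young theorem this is the optimal rank-k estimate
# in the least-squares (Frobenius-norm) sense.
def rank_k_estimate(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example: a rank-1 matrix plus a little noise is well recovered at k=1,
# because the discarded singular values carry mostly noise.
rng = np.random.default_rng(0)
A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0]) + 0.01 * rng.standard_normal((3, 2))
A1 = rank_k_estimate(A, 1)
```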
4. Nomenclature
• F is a visual-term occurrence matrix
(columns represent images, rows visual-
terms)
• W is a keyword occurrence matrix
(columns represent images, rows keywords)
5. Technique: Linear Transform
Assume that visual-term occurrence vectors can be
related to keyword occurrence vectors by a simple
linear transformation, T.
FT=W
T can be estimated using the pseudo-inverse (calculated
using the SVD, which allows noise reduction) given a
training set with known F and W, then unknown W*
can be calculated from F* (from unannotated images)
and T.
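A toy NumPy sketch of the technique (synthetic data; rows here are images, i.e. the transpose of the occurrence matrices in the nomenclature slide, so that the product F T = W lines up):

```python
import numpy as np

# Sketch of the linear-transform technique on synthetic data.
# Rows are images: F is images x visual-terms, W is images x keywords,
# and T maps the visual-term space to the keyword space.
rng = np.random.default_rng(1)
F_train = rng.random((40, 6))            # training visual-term histograms
T_true = rng.random((6, 3))              # hidden mapping (toy ground truth)
W_train = F_train @ T_true               # training keyword occurrences

# Estimate T from the training set using the pseudo-inverse.
# np.linalg.pinv is itself computed via the SVD; rcond discards the
# smallest singular values, which is where noise reduction comes in.
T_est = np.linalg.pinv(F_train, rcond=1e-10) @ W_train

# Annotate unseen images: predict keyword vectors from features alone.
F_test = rng.random((5, 6))
W_pred = F_test @ T_est
```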
6. Technique: Semantic Spaces
• Based around the factorisation [F; W] = TD, where the visual-term matrix F is stacked on top of the keyword matrix W
• Calculated using truncated SVD
• Rows of T represent coordinates of the features
and words in a vector space
• Columns of D represent coordinates of images in
the same space
• Similar objects have similar locations in the space, so
it is possible to rank images on their distance to a
given word
Hare, J. S., Lewis, P. H., Enser, P. G. B., and Sandom, C. J., “A Linear-Algebraic Technique with an
Application in Semantic Image Retrieval,” in CIVR 2006, Sundaram, H., Naphade, M., Smith, J. R., and
Rui, Y., eds., LNCS 4071, 31–40, Springer (2006).
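A toy sketch of the semantic-space construction in NumPy (tiny synthetic data; scaling the coordinates by the singular values is a common LSA-style choice, assumed here rather than taken from the paper):

```python
import numpy as np

# Stack the visual-term and keyword occurrence matrices
# (rows = terms/words, columns = images) and take a truncated SVD.
F = np.array([[1, 1, 0, 0],      # visual term a
              [0, 0, 1, 1]])     # visual term b
W = np.array([[1, 1, 0, 0]])     # keyword "w" (co-occurs with term a)
A = np.vstack([F, W])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_word_coords = U[:, :k] * s[:k]     # rows: terms then words in the space
image_coords = Vt[:k, :].T * s[:k]      # rows: images in the same space

# Rank images against the keyword by cosine similarity in the space:
# images that co-occur with the word land close to it.
w = term_word_coords[-1]
sims = image_coords @ w / (
    np.linalg.norm(image_coords, axis=1) * np.linalg.norm(w) + 1e-12)
ranking = np.argsort(-sims)
```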
7. Technique: Correlation
• Pan et al. defined four techniques for building
translation tables between visual terms and keywords
[i.e. the elements of the table/matrix represent
p(wᵢ, fⱼ)].
• The Corr method used WFᵀ to build the table
• The Cos method used the cosine of the occurrence vectors wᵢ and fⱼ
• The SVDCorr and SVDCos methods filtered the
tables from the Corr and Cos methods reducing
the rank using the SVD
Pan, J.-Y., Yang, H.-J., Duygulu, P., and Faloutsos, C., “Automatic image captioning,” IEEE International Conference
on Multimedia and Expo 2004 (ICME ’04), Vol. 3 (27-30 June 2004).
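The four tables can be sketched in NumPy under the nomenclature slide's conventions (synthetic data; the rank k and the small ϵ guard are illustrative choices, not Pan et al.'s settings):

```python
import numpy as np

# W is keywords x images and F is visual-terms x images, so the
# product W @ F.T gives a keywords x visual-terms co-occurrence table.
rng = np.random.default_rng(2)
F = rng.random((6, 20))                  # visual-term occurrences
W = (rng.random((3, 20)) > 0.5) * 1.0    # keyword occurrences

corr = W @ F.T                           # Corr: raw co-occurrence table

# Cos: cosine of each keyword's image-occurrence vector w_i with each
# visual term's occurrence vector f_j.
cos = corr / (np.linalg.norm(W, axis=1)[:, None]
              * np.linalg.norm(F, axis=1)[None, :] + 1e-12)

# SVDCorr / SVDCos: filter the tables with a rank-k SVD estimate.
def svd_filter(table, k):
    U, s, Vt = np.linalg.svd(table, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

svd_corr = svd_filter(corr, 2)
svd_cos = svd_filter(cos, 2)
```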
8. Technique Summary
Technique       | Variables                                   | Notes
Transform       | feature-weighting, dimensionality reduction | Words independent
Corr, Cos       | feature-weighting                           | Words independent
SVDCorr, SVDCos | feature-weighting, dimensionality reduction | Words independent
Semantic Space  | feature-weighting, dimensionality reduction | Inter-word dependencies
9. Image Features
• Two types of visual-term feature considered:
• Segmented-blob based (using shape, colour, texture
descriptors) [500 terms]
• Quantised DCT-based [500 terms]
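A toy sketch of how local descriptors become a fixed-length visual-term histogram (the vocabulary here is random, standing in for one learnt by clustering training descriptors, e.g. with k-means; the slides' vocabularies have 500 terms, this one 8):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = rng.random((8, 16))              # 8 visual terms, 16-dim descriptors

def visual_term_histogram(descriptors, vocab):
    # Assign each local descriptor to its nearest vocabulary entry...
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    terms = d2.argmin(axis=1)
    # ...and count occurrences to get the fixed-length histogram.
    return np.bincount(terms, minlength=len(vocab))

# e.g. quantised DCT blocks or segmented-region descriptors of one image
descriptors = rng.random((50, 16))
hist = visual_term_histogram(descriptors, vocab)
```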
10. Experimental Protocol
• 5000 image Corel data-set
• 4000 training images
• 500 validation images (for optimising reduced rank)
• 500 test images
• Two weighting types: unweighted and IDF
• Evaluation performed as a hypothetical retrieval
experiment
• Unannotated test images retrieved using each
word in turn as a query
• Mean-average precision used for comparison
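The mean-average-precision measure used above can be sketched as follows (toy relevance judgements, not the Corel results):

```python
import numpy as np

# For each query word the test images are ranked; average precision is
# the mean of the precision values at each relevant rank. mAP is the
# mean of these scores over all query words.
def average_precision(ranked_relevance):
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# Two hypothetical query words: relevance of test images in ranked order.
rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]
mean_ap = float(np.mean([average_precision(r) for r in rankings]))
```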
12. Real-world performance
• ~20% mAP might sound low, but in practice many
queries work quite well (reasonable initial
precision, though it drops off quickly)
• Choice of image features is very important
• It would be difficult to learn the concept of “sun”
from grey-level SIFT features!
• See the paper for some more reflection on real-world
performance...
13. Conclusions
• We have described a set of auto-annotation/semantic
retrieval algorithms
• Performance is less than the state-of-the-art, but this is
partially explained by the use of different image
features (see our MIR 2010 paper)
• However, the methods:
• Are computationally inexpensive (although the cost
grows with the amount of training data)
• Are deterministic, and don’t rely on algorithms such
as EM which might get stuck in local minima/maxima