A Multidimensional Approach in Content-based Multimedia Information Retrieval System

Indra Budi, Zainal A. Hasibuan, Gema P. Mindara
Faculty of Computer Science
University of Indonesia
Depok, Indonesia
indra@cs.ui.ac.id, zhasibua@cs.ui.ac.id, gema.parasti@ui.ac.id

Albaar Rubhasy
Department of Computer System
STMIK Indonesia
Jakarta, Indonesia
albaar.rubhasy@stmik-indonesia.ac.id

Abstract- In this digital era, digital multimedia information is heavily used and is growing very rapidly due to the development of the Internet. Users therefore demand more effective content-based multimedia information retrieval systems (CBMMIRS). The major challenge in this research area is that a multimedia document comprises more than one type of content (i.e. text, image, audio). To address this challenge, many works have focused on developing indexing techniques that can accommodate multiple multimedia object representations, also known as object features. However, most experiments use only one kind of collection, for example a collection of WWW pages, a video collection, or an image collection. In this paper, we propose a multidimensional approach that accommodates semantic indexing of various multimedia contents across different multimedia collections, since different multimedia documents may share similar information. The architecture comprises three components: (1) a collection manager (which manages the multimedia document repository); (2) an indexer (which handles multimedia concept detection and indexing); and (3) a query processor (which deals with queries and search results). Our hypothesis is that the more complete a document is (that is, the more feature spaces it is indexed in), the more relevant it is, and the higher it should be ranked in the search results.

Keywords- CBMMIRS, multimedia information retrieval, multidimensional approach

I. INTRODUCTION
With the development of the Internet, the use of digital multimedia information (including audio, video, images and graphics) is growing rapidly and plays an important role in modern life. Most multimedia files are published and distributed in various formats via social media on the Internet, for instance Facebook1, Flickr2, YouTube3, and so forth. As a result, there is an explosion of digital multimedia objects, and users demand more efficient yet accurate content-based multimedia information retrieval systems (CBMMIRS).
Due to the large and varied digital multimedia collections, a text-based retrieval system is considered inefficient in terms of both the human labor it requires and the precision it achieves. Therefore, in the early 1980s, content-based information retrieval (CBIR) was introduced to overcome these disadvantages. However, by nature a multimedia document may consist of more than one type of content, for example text, images, video and audio. Thus, in the late 1990s, a novel approach emerged that combines text-based and content-based retrieval methods in order to boost CBMMIRS performance. Many authors describe this technique as multimodal information retrieval, in which the system indexes and retrieves using various object representations or modalities, such as text, color, texture, etc. Nevertheless, in many papers the authors used only one type of multimedia collection, such as TRECVID for video collections [2], MIRFLICKR for image collections [1], WIKIPEDIA-MM for World Wide Web (WWW) page collections [3], and so forth.
In this paper, we propose a multidimensional approach that accommodates both the heterogeneity of multimedia collections and the variety of multimedia contents (i.e. textual, visual, and audio). The goal of this approach is to achieve completeness of information, meaning that the most relevant information should be available in many types of content. Although this approach may be fruitful, there is a constraint on the number of object features that can be applied. Excessive use of object features in indexing may lead to poor performance due to the well-known ‘curse of dimensionality’ problem [4]. As the dimensionality of the feature space increases, the performance of indexing algorithms degrades. Research has shown that when the dimensionality is above 10, performance is no better than a simple sequential scan [5].
This paper explores a multidimensional approach in CBMMIRS. The rest of the paper is organized as follows. Section II reviews related work. Section III focuses on the multidimensional approach in CBMMIRS, using high-dimensional feature spaces with various types of collections. Section IV concludes the paper and discusses the future work that will be conducted.

1 http://www.facebook.com
2 http://www.flickr.com
3 http://www.youtube.com
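
The dimensionality constraint raised in the Introduction can be illustrated with a small experiment. The Python sketch below is only an illustration under stated assumptions, not the analysis reported in [5]: the uniform random data, the sample size of 500 points, and the helper name relative_contrast are all choices made here for demonstration. It measures how much farther the farthest point is from a query than the nearest point, relative to the nearest distance. As the dimensionality grows, this relative contrast tends to shrink, which leaves distance-based index structures with less of the search space to prune.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for one random query measured
    against n_points uniform random points in the unit cube [0, 1]^dim.

    The uniform data and the default sample size are illustrative
    assumptions, not values taken from the paper or from [5].
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Euclidean distance between the query and the sampled point.
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        contrast = relative_contrast(dim)
        print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

Running the script prints one line per dimensionality. With uniform data the printed contrast should fall as the dimension rises, although the exact values depend on the seed and the sample size.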
II. RELATED WORKS
The building blocks of a CBMMIRS comprise three essential processes: (1) multimedia feature extraction; (2) concept detection; and (3) indexing. Each of these processes is discussed in the following parts.

A. Multimedia Feature Extraction
Feature extraction is one of the major tasks that determine the performance of a CBMMIRS [6]. Many techniques are available to generate representations of multimedia content, which may comprise a combination of text, visual (i.e. image), and audio content. Below we review a few state-of-the-art feature extraction techniques for three different types of multimedia content.

1) Textual Feature Extraction
The foundation of text indexing was laid by Salton and McGill with the popular tf-idf scheme [7]. This technique chooses a basic vocabulary of “terms” or “words” and counts the number of occurrences of each term in a document. This term frequency count is then compared with an inverse document frequency count. As a result, the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers. However, the dimensionality reduction achieved by this scheme is considered insignificant. The most prominent approach to this issue is latent semantic indexing (LSI). LSI uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection [8]. Later, a major breakthrough was introduced by Hofmann with the probabilistic LSI (pLSI) model. This approach models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics” [9]. However, both models (LSI and pLSI) are based on the “bag of words” assumption that the order of words in a document can be ignored. To obtain a mixture model that captures the exchangeability of both words and documents, the latent Dirichlet allocation (LDA) model was proposed [10]. To date, this model is widely used in IR research. Next, we discuss image feature extraction.

2) Image Feature Extraction
There are many ways to represent images as feature vectors. The traditional method uses the image histogram. This method was successfully implemented in a large-scale gallery and museum project in Europe [11]. However, it discards all information about the spatial distribution of color and reduces the efficiency of the signature, which is a major flaw [12]. Other techniques were then studied, using color, texture, shape, and many other features. Nevertheless, most of them cannot overcome a central challenge in image feature extraction: describing an image reliably even when it is obstructed, rotated, and so forth. As a result, using such image features in CBMMIRS is no longer ideal. Most recent works use the scale-invariant feature transform (SIFT), which is based on common grounds and has been applied successfully in many projects [11, 13]. SIFT detects points in an image and provides descriptions of them, which yields more information than other feature-based methods. Several detection criteria are used: local contrast, local maxima/minima of certain functions (e.g. Laplacian, gradient, etc.) and a threshold over a curvature function (e.g. Harris, Hessian, etc.). Next, we briefly discuss audio feature extraction.

3) Audio Feature Extraction
Many works have focused on the analysis of structured audio such as speech or music. Only a few systems have been proposed to analyze unstructured audio. One popular model is the mel-frequency cepstral coefficients (MFCC). MFCC features are modeled on the shape of the overall spectrum, which makes them more suitable for modeling single sound sources. An environmental sound, on the other hand, comprises more than one sound source. To tackle this issue, the matching pursuit (MP) technique was proposed. MP provides an efficient way of selecting a small basis set that produces meaningful features as well as a flexible representation [14]. It is potentially invariant to background noise and can capture characteristics of the signal where MFCC fails. This concludes our discussion of multimedia feature extraction for three different types of multimedia content. In the next part, we focus on audio-visual concept detection techniques.

B. Audio Visual Concept Detection
Multimedia concept detection is considered one way of reducing the semantic gap. Reference [15] provides an example of a detection model that links each topic with one or more visual concepts, known as the Visual Concept Detection Task (VCDT) concepts. However, most works have focused only on visual concepts, and few on audio-visual concept detection. An example that uses both visual and audio content can be found in [16]. In that work, the authors provide an approach to semantically detect concepts in a video collection. However, the audio is only classified into speech and instrumental, rather than being used to detect environmental sounds. This issue needs to be explored more thoroughly in order to improve a CBMMIRS's understanding of the concepts present in a multimedia document.

C. Multimedia Concept-based Indexing
The multimedia concept-based, or semantic-based, indexing approach depends on the fusion of concepts, for which many works use a kernel-based classifier (e.g. a support vector machine, or SVM). Basically, two fusion strategies are available: early fusion and late fusion. Early fusion integrates the different modalities: the features from the different modalities are fused first.
The search algorithm then executes on the resulting fused representation. Late fusion, on the other hand, characterizes multimedia content using multiple features separately. Under this scheme, the different rankings can be combined, a step referred to as data fusion or rank aggregation. It is nonetheless possible and promising to merge the two schemes, with early fusion based on low- or intermediate-level features and late fusion merging unimodal classification scores of high-level features [17].

III. THE PROPOSED MULTIDIMENSIONAL APPROACH
We observe that many works in the CBMMIR research area involve just one type of multimedia collection, for example a video or an image collection for a content-based video or image retrieval system, respectively. Here we propose a different approach that involves different kinds of features from various types of multimedia collections (e.g. video, image, WWW, and other types of collections) in order to achieve completeness of information. Inspired by [18], which uses three different components of documents to improve retrieval performance, we propose a similar multidimensional strategy applied to multimedia documents, which also have several types of components. The proposed multidimensional approach is depicted in Figure 1.

[Figure 1: block diagram, from top to bottom: User Interface (Multimedia Concept-based Query); Query Processor (Multimedia Concept-based Matching Process, Multimedia Search Results); Indexer (Multimedia Concept-based Index, Multimedia Concept-based Indexing, Multimedia Concept Detection Process, Training Dataset, and Text, Image, and Audio Feature Extraction); Collection Manager (Digital Object Repository, Metadata Repository, and Video, Image, WWW, and Other Collections).]
Figure 1. Proposed multidimensional approach in CBMMIRS

Fig. 1 shows the proposed multidimensional CBMMIRS architecture, which is adapted from [19]. The system comprises three components, as follows:
• Collection Manager (CM): This component is in charge of collecting and managing multimedia documents from the various types of multimedia document collections that are to be indexed, searched, and retrieved by the Indexer and Query Processor. The documents from different types of collections, such as video, image, WWW, and other multimedia collections, are stored in a repository along with their metadata, which provide information about the documents. CM also includes an administrator user interface so that the administrator can manage the document collections.
• Indexer (IX): This component is responsible for generating and maintaining the data structures, called indexes, that each represent one type of multimedia document feature (i.e. text, image, and audio) in order to provide search capabilities. IX uses the documents collected by CM for the indexing processes. The indexing process involves a feature extraction method for each type of feature, as follows: (1) for text feature extraction, we suggest using LDA; (2) for image feature extraction, we use SIFT; (3) for audio feature extraction, we intend to use the MP technique. In our system design, the indexing process involves multimedia concept-based indexing, which depends on the robustness of the multimedia concept detection method.
• Query Processor (QP): This component is responsible for handling queries and search results. QP provides a user interface for multimedia concept-based queries. The concept-based query interface differs from search tools such as Google4 in that it allows users to resolve the naming heterogeneity that occurs when an identical concept is described using different terms.

4 http://www.google.com

The research issues that may arise in our work are stated below:
• Feature extraction techniques. The extraction techniques mentioned earlier, such as LDA, SIFT, and MP, are state-of-the-art feature extraction methods. Nevertheless, finding the ‘right combination’ is one of the main problems. Which features of a multimedia object to choose, and which extraction technique to prefer for each feature in order to increase CBMMIRS performance, remains a challenging research area.
• Multimedia concept detection method. Many works have been done to automatically detect multimedia concepts. However, the generic concept of a multimedia object, including audio-visual collections, has not been explored comprehensively. Standardized visual concepts are available, such as Wiki concepts and Visual Concept Detection topics. In contrast, no standardized concepts for audio are in place. Multimedia concept detection methods therefore need to be explored further.
This further exploration is needed to accommodate the audio-visual features of a multimedia object.
• Multimedia concept-based matching process. In the matching process, we propose a different way of ranking retrieved documents. Our hypothesis is that the more complete a document is, that is, the more feature spaces it is available in, the more relevant it is. Such documents should therefore be weighted more heavily in order to raise their rank. This hypothesis has to be tested in experiments that will be performed in the next phase of this work.
• Multimedia concept-based query interface. As stated earlier, a concept-based query interface differs from general search tools. The research issue here is to minimize the ambiguity of different terms that refer to a similar concept.

IV. CONCLUSION AND FUTURE WORKS
This paper proposes a multidimensional approach in CBMMIRS that can accommodate various types of multimedia object features (i.e. text, image, and audio) in numerous multimedia document collections. In our design, the system comprises three components: (1) a collection manager (responsible for storing multimedia document collections); (2) an indexer (responsible for extracting and indexing document features so that they can be searched by users); and (3) a query processor (responsible for managing queries and search results). We also identify a few research issues in these three CBMMIRS components. Nevertheless, further experiments need to be conducted, not only to test retrieval performance but also to test our hypothesis that the more complete a document is (that is, the more feature spaces it is indexed in), the more relevant it is compared with documents indexed in only one feature space. Such documents should thus be placed at the top of the search results.

ACKNOWLEDGMENT
This paper was fully supported by DRPM UI Research Grant under contract Number 1198/SK/R/UI/2010 (research project on Indonesian e-Cultural Heritage and Natural History Framework).

REFERENCES
[1] M. J. Huiskes and M. S. Lew, “The MIR Flickr Retrieval Evaluation”, MIR ’08 Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, ACM New York, USA, 2008, ISBN: 978-1-60558-312-9, DOI: 10.1145/1460096.1460104.
[2] A. F. Smeaton, P. Over, and W. Kraaij, “Evaluation Campaigns and TRECVid”, MIR ’06 Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, ACM New York, USA, 2006, ISBN: 1-59593-495-2, DOI: 10.1145/1178677.1178722.
[3] A. Popescu, T. Tsikrika, and J. Kludas, “Overview of the Wikipedia Retrieval Task at ImageCLEF 2010”, Working Notes of the ImageCLEF 2010 Lab, Padua, Italy, 2010.
[4] N. Rasiwasia, J. C. Pereira, E. Coviello, and G. Doyle, “A New Approach to Cross-Modal Multimedia Retrieval”, Proceedings of the International Conference on Multimedia, October 25-29, 2010, ACM New York, USA, ISBN: 978-1-60558-933-6, DOI: 10.1145/1873951.18739870.
[5] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces”, Proceedings of the 24th VLDB Conference, New York, USA, 1998, pp. 194–205.
[6] M. M. Rahman, B. C. Desai, and P. Bhattacharya, “A Feature Level Fusion in Similarity Matching to Content-based Image Retrieval”, Information Fusion, 2006.
[7] G. Salton and M. McGill, “Introduction to Modern Information Retrieval”, McGraw-Hill, 1983.
[8] S. Deerwester, S. Dumais, T. T. Landauer, G. Furnas, and R. Harshman, “Indexing by Latent Semantic Analysis”, Journal of the American Society of Information Science, 41(6):391-407, 1990.
[9] T. Hofmann, “Probabilistic Latent Semantic Indexing”, Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[10] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation”, Journal of Machine Learning Research 3, 2003, pp. 993-1022.
[11] P. H. Lewis, K. Martinez, F. S. Abas, M. Faizal, A. Fauzi, S. C. Y. Chan, M. J. Addis, M. J. Boniface, P. Grimwood, A. Stevenson, C. Lahanier, and J. Stevenson, “An Integrated Content and Metadata Based Retrieval System for Art”, Journal IEEE Transactions on Image Processing, vol. 13, March 2004, pp. 302-313.
[12] E. Valle, M. Cord, and S. Philipp-Foliguet, “Content-based Retrieval of Images for Cultural Institutions using Local Descriptors”, Proceedings of Geometric Modelling and Imaging - New Trends - GMAI 2006, London, England, July 05–06, 2006, DOI: 10.1109/GMAI.2006.16.
[13] M. Kampel, R. Huber-Mörk, and M. Zaharieva, “Image-Based Retrieval and Identification of Ancient Coins”, Journal IEEE Intelligent Systems, Vol. 24 Issue 2, March 2009, IEEE Educational Activities Department, Piscataway, NJ, USA, pp. 26-34, DOI: 10.1109/MIS.2009.29.
[14] S. Chu, S. Narayan, and C.-C. J. Kuo, “Environmental Sound Recognition Using MP-based Features”, Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 2008.
[15] Z. Zhao and H. Glotin, “Concept Content Based Wikipedia WEB Image Retrieval using CLEF VCDT 2008”.
[16] M. Rautiainen, T. Seppänen, J. Penttilä, and J. Peltola, “Detecting Semantic Concepts from Video Using Temporal Gradients and Audio Classification”.
[17] S. Ayache, G. Quénot, and J. Gensel, “Classifier Fusion for SVM-Based Multimedia Semantic Indexing”.
[18] Z. A. Hasibuan, “Multi Dimensions Concept-based Information Retrieval System”, Proceedings of ALL/ACH 2000 Conference, Glasgow, UK, 2000.
[19] Z. A. Hasibuan, A. Kurniawan, and R. Budiarto, “Multi-Format Concept-Based Information Retrieval using Data Grid”, Journal of Advanced Computing and Applications Vol. 1 No. 1, 2009, pp. 1-11.