Application of Topic Segmentation in Audiovisual Information Retrieval

Petra Galuščáková
galuscakova@ufal.mff.cuni.cz

Information Retrieval
● Finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large
collections (usually stored on computers) [Manning, 21]
● Audiovisual Information Retrieval
- Documents to retrieve in audiovisual format
- Harder navigation
● Dependency on segmentation
- We want to minimize the user's effort and retrieve the
exact start point
- Especially audio and audiovisual data
-> we need precise segmentation
- Eskevich [6] reports significantly better IR results with the
TextTiling segmentation algorithm than with the C99
segmentation algorithm

Topic Segmentation
● Segment
● Coherent part of data
● Definition depends on the application – e.g., news
stories, paragraphs in text
● Hierarchical/linear structure
● Audiovisual recordings
● No given text structure
● Needs to be segmented into sentences first

Topic Segmentation in Text
● Automatic Speech Recognition for transformation of audio track into text
● Errors in transcripts could influence segmentation
● Malioutov et al. [20] show that the evaluation of segmentation algorithms differs
depending on whether manual or automatic transcripts are used
● Hsueh and Moore [12] show that, despite the word recognition errors (WER equal
to 39.1%), their segmentation systems did not perform significantly worse on ASR
transcripts than on reference transcripts.
– An ASR system is likely to misrecognize different occurrences of a word in the
same way
– Using more features than the ASR output alone could reduce the impact of
recognition errors

Systems for Topic Segmentation
● Lexical Cohesion Based
● TextTiling [10], C99 [3], LCSeg [8], MinCut [19], Dotplot [16], IClustSeg [26],
TextLec [29], DivSeg [31], NM09 [24], U00 [33], JSeg [1], Transeg [17], LCP [15],
LSITiling [A9], TopSeg [11]
● Features Based
● [12], [7], PLSA [14] – Decision trees [25, 32], Maximum Entropy, SVM [14]
● Generative Models
● HMM [13, 32], BayesSeg, U00 [4, 33]

Lexical Cohesion
● Cohesion
- The sentences "stick together" to function as a whole [23]
- Achieved through back-reference, conjunction, and semantic word relations
● Division according to Halliday and Hasan [9]:
● Reiteration:
– Reiteration with identity of reference:
1. Mary bit into a peach. 2. Unfortunately the peach wasn't ripe.
– Reiteration without identity of reference:
1. Mary ate some peaches. 2. She likes peaches very much.
– Reiteration by means of superordinate (subordinate, and synonyms):
1. Mary ate a peach. 2. She likes fruit.
● Collocation:
– Systematic semantic relation (systematically classifiable):
1. Mary likes green apples. 2. She does not like red ones.
– Nonsystematic semantic relation (not systematically classifiable):
1. Mary spent three hours in the garden yesterday. 2. She was digging
potatoes.

Systems for Topic Segmentation - C99
● C99 [3]
● Based on the cosine measure of sentence pairs
– Similarity is computed between sentences x and y, where f_{i,j} denotes the frequency of word j in
sentence i
– Similarity values are used to build the similarity matrix [17]
– The rank matrix is then built from the similarity matrix
● Each value in the similarity matrix is replaced by its rank in the local
region. The rank is the number of neighbouring elements with a lower
similarity value [3]
– Finally, clustering is applied
● Iteratively searching for the maximum density of submatrices in the rank matrix
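
The first two C99 steps described above (cosine similarity matrix, then local ranking) can be sketched as follows. This is a minimal illustration, not Choi's implementation: the function names, the `radius` parameter and the use of raw (unnormalized) rank counts are assumptions made for the example, and the final clustering step is omitted.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity between two sentences given as token lists."""
    fx, fy = Counter(x), Counter(y)  # word frequencies f_{x,j}, f_{y,j}
    dot = sum(fx[w] * fy[w] for w in fx)
    norm = math.sqrt(sum(v * v for v in fx.values()) *
                     sum(v * v for v in fy.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Pairwise cosine similarity of all sentences."""
    return [[cosine(a, b) for b in sentences] for a in sentences]


def rank_matrix(sim, radius=1):
    """Replace each value by the number of neighbours (within `radius`,
    a hypothetical parameter) that have a strictly lower similarity."""
    n = len(sim)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for a in range(max(0, i - radius), min(n, i + radius + 1)):
                for b in range(max(0, j - radius), min(n, j + radius + 1)):
                    if (a, b) != (i, j) and sim[a][b] < sim[i][j]:
                        ranks[i][j] += 1
    return ranks
```

Ranking makes the values comparable across regions of the matrix, since only the local ordering of similarities is kept rather than their absolute size.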

Systems for Topic Segmentation - TextTiling
● Based on lexical repetition
● Uses cosine measure
● A window of fixed length is being gradually slid through the text, and information
about word overlap between the left and right part of the window is converted into
digital signal. [10]
● The graph is then smoothed
● Shape of the post-processed signal is used to determine segment breaks.
● High similarity values, implying that the adjacent blocks cohere well, tend to form
peaks, whereas low similarity values, indicating a potential boundary between tiles,
create valleys. [10]
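
The sliding-window idea above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Hearst's algorithm: the `window` and `radius` parameters and all function names are hypothetical, smoothing is a plain moving average, and boundaries are taken at simple local minima instead of using depth scores.

```python
import math
from collections import Counter


def block_similarity(left, right):
    """Cosine similarity of the word counts in two adjacent blocks."""
    fl, fr = Counter(left), Counter(right)
    dot = sum(fl[w] * fr[w] for w in fl)
    norm = math.sqrt(sum(v * v for v in fl.values()) *
                     sum(v * v for v in fr.values()))
    return dot / norm if norm else 0.0


def gap_scores(tokens, window=4):
    """Similarity between the left and right half of a window slid over the text.
    scores[k] belongs to the gap before token index k + window."""
    return [block_similarity(tokens[g - window:g], tokens[g:g + window])
            for g in range(window, len(tokens) - window + 1)]


def smooth(scores, radius=1):
    """Moving-average smoothing of the similarity signal."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def valleys(scores):
    """Indices of local minima, i.e. candidate segment boundaries."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] <= scores[i + 1]]
```

For example, `valleys(smooth(gap_scores(tokens, window=2)))` returns positions in the score list; adding `window` to a position gives the token index at which a new tile would start.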

Systems for Topic Segmentation – Features Based
● Text
● Lexical features
- Cue words and n-grams (now, okay, let’s, um, so, good night, ...) [12, 28]
- Distribution of nouns [7]
● Contextual Features:
- Dialogue act type [12]
- Speaker role (e.g., project manager, marketing expert)
- Tense, aspect [24]
● Vocabulary
- Word groups (months, days, country names, named entities, ...)
- POS tags
- Pronoun (Does the sentence contain a pronoun?), Numbers (segment of a
specific length), Is this sentence part of a conversation, i.e. does this sentence
contain “direct speech”? [12]
- Interlocutors mention agenda items (e.g., presentation, meeting) or content words
more often when initiating a new discussion. [12]

Systems for Topic Segmentation – Features Based
● Text
● According to Hsueh [12] interlocutors do the following more often than usual at
segment boundaries: start speaking before they are ready, give information, elicit
an assessment of what has been said so far, or act to smooth social functioning
and make the group happier
● Lexical Chains [2, 14]
- Does the word appear in the next few sentences?
- Does the word appear in the next few words?
- Does the word appear in the previous few sentences?
- Does the word appear in the previous few words?
- Does the word appear in the previous few sentences but not in the next few
sentences?
- Does the word begin the preceding sentence?
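
Some of the lexical-chain questions above can be turned into binary features, for example as in the sketch below. The function name, the `span` parameter (how many sentences count as "the next few") and the feature names are assumptions made for illustration; they do not come from [2] or [14].

```python
def chain_features(sentences, s_idx, w_idx, span=2):
    """Binary lexical-chain features for the word sentences[s_idx][w_idx].

    `sentences` is a list of token lists; `span` (hypothetical) is the
    number of neighbouring sentences that are inspected on each side.
    """
    word = sentences[s_idx][w_idx]
    next_sents = sentences[s_idx + 1:s_idx + 1 + span]
    prev_sents = sentences[max(0, s_idx - span):s_idx]
    in_next = any(word in s for s in next_sents)
    in_prev = any(word in s for s in prev_sents)
    return {
        "in_next_sentences": in_next,
        "in_prev_sentences": in_prev,
        "in_prev_not_next": in_prev and not in_next,
        # True only if a preceding sentence exists and starts with the word
        "begins_prev_sentence": s_idx > 0 and sentences[s_idx - 1][:1] == [word],
    }
```

A word that appeared earlier but does not reappear afterwards (`in_prev_not_next`) suggests that a chain is ending, which is the kind of evidence a feature-based segmenter could use.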

Systems for Topic Segmentation – Features Based
● Audio:
● Conversational Features [12]
- Amount of overlapping speech
- Speaker activity change [24]
● Prosodic Features [12]
- Fundamental frequency F0 – maximum, mean F0, patterns across the
boundary [32]
- Energy, energy at multiple points (e.g., the first and last 100 and 200 ms, the
first and last quarter, the first and second half)
- Pitch contour (relative to the speaker’s baseline [32]) – pitch is less robust [30]
- Rate of speech (number of words and the number of syllables spoken per
second)
- Silence [1]
- Duration of pauses [30], vowels [1], final vowels and final rhymes [32]
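
Two of the features listed above, pause duration and rate of speech, can be derived from word-level time stamps such as those produced by an ASR system. The sketch below assumes a hypothetical input format of `(token, start_sec, end_sec)` tuples; the function and feature names are illustrative only.

```python
def pause_and_rate_features(words):
    """Pause and speech-rate features for one stretch of speech.

    `words` is a list of (token, start_sec, end_sec) tuples in time order
    (an assumed input format, e.g. taken from an ASR word alignment).
    """
    if not words:
        return {"max_pause": 0.0, "mean_pause": 0.0, "words_per_second": 0.0}
    # Silence between the end of one word and the start of the next
    pauses = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(words, words[1:])]
    duration = words[-1][2] - words[0][1]
    return {
        "max_pause": max(pauses, default=0.0),
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
        "words_per_second": len(words) / duration if duration > 0 else 0.0,
    }
```

F0 and energy features would need access to the audio signal itself and are therefore not covered by this sketch.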

Segmentation Using Audio Information
● Segment is likely to start with higher pitched sounds and a lower rate of speech
● Tendency of speakers to reset pitch at the start of a new major unit - final fall in pitch
associated with the ends of such units [30]
● Slowing down toward the ends of units [30]
● Topic shifts often occur after a pause of relatively long duration [12]

Systems for Topic Segmentation – Features Based
● Video:
● Color similarity
– Based on histogram
● Motion similarity
– Pixel comparison
– Especially frontal shots, hand movements [12]
– Gestural features (eye gaze behaviour) [5], face similarity
● Bag of Visual Words
● Interlocutors do not move around a lot when a new discussion is brought up [12]
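
Histogram-based color similarity between two frames can be sketched as below. The slides do not say which histogram or which comparison is used, so the joint RGB histogram, the `bins` parameter and histogram intersection are all assumptions chosen for illustration.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram of a frame.

    `pixels` is a list of (r, g, b) tuples with values in 0..255;
    `bins` (hypothetical parameter) is the number of bins per channel.
    """
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each channel value 0..255 to a bin index 0..bins-1
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; 1 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A low similarity between neighbouring frames is a common cue for a shot change, which could then serve as one visual feature among others.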

Systems for Topic Segmentation – Features Based
● Hsu et al. [11] create new features as combinations of other features
● They show that the most useful features are the anchor face and pauses
● According to Hsueh [12], lexical features must be combined with other features, in
particular conversational features (i.e., lexical cohesion, overlap, pause, speaker
change)

Fusion
● Llinas [18] defines fusion as an information process that associates, correlates and
combines data and information from single or multiple sensors or sources to achieve
refined estimates of parameters, characteristics, events and behaviors
● Given many sources of information and context, how do we best “interpret”
the data? [22]
● Levels of fusion
● Early fusion strategy
- All modalities are "concatenated into one"
- Only one decision is taken over the concatenated input
● Intermediate fusion strategy
- E.g., creating various feature vectors, which are finally processed by an HMM
● Late fusion strategy
- Each source is processed individually by a specific recognizer
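
The difference between early and late fusion can be illustrated with the sketch below. The classifiers are passed in as plain functions that map a feature list to a boundary score; these functions, the weighted averaging used to combine the late-fusion decisions and all names are assumptions made for the example.

```python
def early_fusion(text_feats, audio_feats, video_feats, classifier):
    """Concatenate all modalities and take one decision over the joint vector."""
    return classifier(text_feats + audio_feats + video_feats)


def late_fusion(feature_sets, recognizers, weights=None):
    """Run one recognizer per source, then combine the individual scores.

    The weighted average used here is an assumed combination rule.
    """
    scores = [rec(feats) for rec, feats in zip(recognizers, feature_sets)]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))


def mean_score(features):
    """Toy stand-in for a trained classifier: the mean of the feature values."""
    return sum(features) / len(features)
```

With the toy `mean_score` classifier, `early_fusion([1.0], [0.0, 0.0], [1.0], mean_score)` weights every feature equally, whereas `late_fusion` weights every source equally, so a modality with many features does not dominate the decision.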

Our Approach - Objectives
● Segments should be further processed by an IR system
● Usable on several systems – MediaEval Competition Data and Dialogy corpus
● Applicable to various types of recordings: news data and dialogs
● Language independent – should work at least with English and Czech data
● Small amount of training data for given type of recordings
● Training data exists for other types of recordings (e.g., TDT corpus – available in LDC,
Malach)
● Possible to integrate user feedback (in Dialogy corpus)

Our Approach - Solution
● Should be feature based – one of the features could be the
output of a cohesion-based algorithm (TextTiling)
● Should incorporate all types of information (textual, audio
and visual)
● Should use fusion for mixing these different sources
● In visual track - shot detection should be used
● Active learning could help to incorporate user feedback

References
● [1] Katarina Bartkova: How far can prosodic cues help in word segmentation? In Proceedings of the 3rd International Conference on
Speech Prosody SP2006, 2006
● [2] Doug Beeferman, Adam Berger, John Lafferty: Statistical models for text segmentation, Journal Machine Learning - Special issue on
natural language learning archive Volume 34 Issue 1-3, Feb. 1999, Pages 177 – 210, 1999
● [3] Freddy Y. Y. Choi: Advances in domain independent linear text segmentation, Proceedings of the 1st Meeting of the North American
Chapter of the Association for Computational Linguistics (ANLP-NAACL-00). pp. 26–33, 2000
● [4] Jacob Eisenstein, Regina Barzilay: Bayesian Unsupervised Topic Segmentation, Proceeding EMNLP '08 Proceedings of the
Conference on Empirical Methods in Natural Language Processing, Pages 334-343, 2008
● [5] Jacob Eisenstein, Regina Barzilay, Randall Davis: Gestural Cohesion for Topic Segmentation, ACL 2008: 852-860, 2008
● [6] Maria Eskevich, Gareth J. F. Jones: DCU at MediaEval 2011: Rich Speech Retrieval. MediaEval 2011
● [7] Martin Franz, Bhuvana Ramabhadran, Todd Ward, Michael Picheny: Automated Transcription and Topic Segmentation of Large
Spoken Archives, In Proceedings of Eurospeech, 2003
● [8] Michel Galley, Kathleen Mckeown: Discourse Segmentation of Multi-Party Conversation, in 41st Annual Meeting of ACL, 2003
● [9] M. A. K. Halliday, Ruqaiya Hasan: Cohesion in English, 1976
● [10] Marti A. Hearst: TextTiling: A Quantitative Approach to Discourse Segmentation, Technical Report, 1993
● [11] Winston Hsu, Shih-fu Chang, Chih-wei Huang, Lyndon Kennedy, Ching-yung Lin, Giridharan Iyengar: Discovery and Fusion of Salient
Multi-modal Features towards News Story Segmentation, In IS&T/SPIE Electronic Imaging, 2004
● [12] Pei-yun Hsueh, Johanna D. Moore: Combining Multiple Knowledge Sources for Dialogue Segmentation in Multimedia Archives. ACL
2007, 2007.

References
● [13] Minwoo Jeong, Ivan Titov: Multi-document Topic Segmentation, Proceeding CIKM '10 Proceedings of the 19th ACM international
conference on Information and knowledge management, Pages 1119-1128, 2010
● [14] David Kauchak, Francine Chen: Feature-Based Segmentation of Narrative Documents, Proceeding FeatureEng '05 Proceedings of
the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing, Pages 32-39, 2005
● [15] Hideki Kozima: Text Segmentation Based On Similarity Between Words, Proceeding ACL '93 Proceedings of the 31st annual
meeting on Association for Computational Linguistics, Pages 286-288, 1993
● [16] Niraj Kumar, Piyush Rai, Chandrika Pulla and C.V. Jawahar: Video Scene Segmentation with a Semantic Similarity, Proceedings of
5th Indian International Conference on Artificial Intelligence (IICAI 2011), 14-16 December, 2011, Bangalore, India, 2011.
● [17] Alexandre Labadié, Violaine Prince: Lexical and semantic methods in inner text topic segmentation: A comparison between c99
and Transeg, Proceeding NLDB '08 Proceedings of the 13th international conference on Natural Language and Information Systems:
Applications of Natural Language to Information Systems, Pages 347 – 349, 2008
● [18] James Llinas, Christopher Bowman, Galina Rogova, Alan Steinberg, and Frank White: Revisiting the JDL Data Fusion Model II, In
P. Svensson and J. Schubert Eds., Proceedings of the Seventh International Conference on Information Fusion FUSION 2004, 2004
● [19] Igor Malioutov, Regina Barzilay: Minimum Cut Model for Spoken Lecture Segmentation, Proceeding ACL-44 Proceedings of the
21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational
Linguistics, Pages 25-32, 2006
● [20] Igor Malioutov, Alex Park, Regina Barzilay, James Glass: Making Sense of Sound: Unsupervised Topic Segmentation over
Acoustic Input, In Proceedings, ACL, 2007
● [21] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval, 2008
● [22] Stéphane Marchand-Maillet: Multimedia Information Retrieval, Promise Winter School, 2012
● [23] Jane Morris, Graeme Hirst: Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure

References
● [24] John Niekrasz, Johanna Moore: Participant Subjectivity and Involvement as a Basis for Discourse Segmentation, Proceeding
SIGDIAL '09 Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and
Dialogue, Pages 54-61, 2009
● [25] Rebecca J. Passonneau, Diane J. Litman: Discourse Segmentation by Human and Automated Means, Journal Computational
Linguistics Volume 23 Issue 1, March 1997, Pages 103-139, 1997
● [26] Raúl Abella Pérez, José Eladio Medina Pagola: An Incremental Text Segmentation by Clustering Cohesion, Proceeding CIARP'10
Proceedings of the 15th Iberoamerican congress conference on Progress in pattern recognition, image analysis, computer vision, and
applications, Pages 261-268, 2010
● [27] Lev Pevzner, Marti A. Hearst: A Critique and Improvement of an Evaluation Metric for Text Segmentation, Journal Computational
Linguistics, Volume 28 Issue 1, March 2002, Pages 19-36, 2002
● [28] Jay M. Ponte, W. Bruce Croft: Text Segmentation by Topic, In Proceedings of the First European Conference on Research and
Advanced Technology for Digital Libraries, 1997
● [29] Laritza Hernández Rojas, José E. Medina Pagola: A Novel Method of Segmentation by Topic Using Lower Windows and Lexical
Cohesion, Proceeding CIARP'07 Proceedings of the Congress on pattern recognition 12th Iberoamerican conference on Progress in
pattern recognition, image analysis and applications Pages 724-733, 2007
● [30] Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, Gökhan Tür: Prosody-Based Automatic Segmentation of Speech into
Sentences and Topics, Journal Speech Communication - Special issue on accessing information in spoken audio archive Volume 32
Issue 1-2, Sept. 2000, Pages 127 – 154, 2000
● [31] Fei Song, William M. Darling, Adnan Duric, Fred W. Kroon: An Iterative Approach to Text Segmentation, Proceeding ECIR'11
Proceedings of the 33rd European conference on Advances in information retrieval, Pages 629-640, 2011
● [32] Gökhan Tür, Andreas Stolcke, Dilek H. Tür, Elizabeth Shriberg: Integrating Prosodic and Lexical Cues for Automatic Topic
Segmentation, Comput. Linguist., Vol. 27, No. 1. pp. 31-57, 2001
● [33] Masao Utiyama, Hitoshi Isahara: A Statistical Model for Domain-Independent Text Segmentation, In Proceedings of the 9th
Conference of the European Chapter of the Association for Computational Linguistics, 2001
