TOPIC DETECTION &
                        TRACKING




                         Omid Dadgar


Tuesday, June 1, 2010
Background
        Topic Detection and Tracking (TDT) is a fairly new area of
        research in IR, developed over the past 7 years.

        It began during 1996 and 1997 with a Pilot Study
        conducted to explore various approaches and
        establish a performance baseline.

        It was followed by TDT2, on which this presentation is
        primarily based.



Background
         • Since TDT2 in 1998 there have been several
           open evaluations of TDT and progress has
           been made.

         • TDT2 however is important as it was the
           first major step in TDT after the pilot study
           and established the foundation for further
           work.


Background
         – To solve the TDT challenges, researchers are
           looking for robust, accurate, fully automatic
           algorithms that are source, medium, domain, and
           language independent.




Goals
         – To develop automatic techniques for finding
           topically related material in streams of data. This
           could be valuable in a wide variety of applications
           where efficient and timely information access is
           important (e.g., CNN or Yahoo News).
         – It would be very helpful if computers were able to
            map out data automatically: finding story
            boundaries, determining what stories go with one
            another, and discovering when something new
            (unforeseen) has happened.



Introduction
         • Purpose: To develop technologies for retrieval and
             automatic organization of broadcast news and newswire
             stories, and to evaluate their performance.
         • Corpus: TDT2 processing addresses multiple sources of
            information, including newswire (text) and broadcast news
            (speech).
         • The information is modeled as a sequence of stories. These
            stories provide information on many topics.




Introduction
         • "Topic" is defined in a special way specifically for
           TDT research. For the purposes of this project,
           topics refer to specific events or activities, such as
           the crash of a China Airlines airplane in Taipei,
           Taiwan on February 16, 1998, and encompass all
           facts, events and activities that are directly related
           to them. Here is the definition of topic and a few
           other essential terms, as used in TDT research:




Terms
         • TOPIC- A topic is an event or activity, along with
           all directly related events and activities.


         • EVENT- An event is something that happens at
           some specific time and place, and the unavoidable
           consequences. Specific elections, accidents,
           crimes and natural disasters are examples of
           events.




• ACTIVITY- An activity is a connected set of
           actions that have a common focus or purpose.
           Specific campaigns, investigations, and disaster
           relief efforts are examples of activities.


         • STORY- A story is a newswire article or a
           segment of a news broadcast with a coherent news
           focus. It must contain at least two independent,
           declarative clauses.




• Definition of topic: A seminal event or
             activity, along with all directly related
             events and activities.
         • A story is “on topic” if it is directly connected
           to the associated event.
         • TDT explores techniques for detecting the
           appearance of new topics and for tracking
           their reappearance and evolution.



TDT2 vs. Pilot Study
           In 1998, TDT2 addressed the same three core
           tasks (segmentation, detection, and tracking).

           Evaluation procedures were modified.

           The volume and variety of data and the number of target topics
             were expanded.

           TDT2 attacked the problems introduced by imperfect,
           machine-generated transcripts of audio data.




Corpus
      • Linguistic Data Consortium (LDC) undertook the corpus
      creation efforts for TDT2
      • TDT2 Corpus contains data from
         – Newswire: Associated Press WorldStream, New
           York Times News Services

         – Radio: Voice of America World News, Public
           Radio International The World




Corpus cont.

         – Television: CNN Headline News, ABC
         World News Tonight
         • Collection rate: 300 stories/day and 5 hours of digital
           recordings/day; totals: 54,000 stories and 630 hours of
           audio
         • For newswire sources, each story is clearly
           delimited by the newswire format


Corpus cont.
             For audio sources, segmentation of the broadcast
             news consists of a two-pass procedure:

             First pass: LDC staff inserted story boundaries
             and identified no-story segments

             Second pass: annotators confirmed or adjusted
             existing story boundaries



Corpus cont.
         • The audio sources were provided in three forms:

          – The sampled data audio signal

         – A manual transcription of the speech

         – An automatic transcription of the speech, produced by
           an automatic speech recognizer (ASR)



The TDT2 Corpus Cont.
        • Audio source transcriptions include both news and non-news
        stories. Each story was labeled as “News”, “Miscellaneous”, or
        “Untranscribed”.
           – Stories marked as “News” were used
        • LDC defined 100 topics based upon a random sample of stories from the
        six sources, January to June 1998
          – Each topic was defined in terms of a three-part
        identification (what/where/when)




Example Topic
         Title: Mountain Hikers Lost
               – WHAT: 35 or 40 young Mountain Hikers
                were lost in an avalanche in France
                around the 20th of January.
               – WHERE: Orres, France
               – WHEN: January 4, 1998




Corpus cont.
         – Annotation staff worked with daily news files; each story
            was labeled “yes”, “brief” or “no”
         • TDT2 topics are based on an assumption that news stories
            are about events
             – TDT2 Event is an activity that happens at a
             specific place and time and all of its necessary
             causes and unavoidable consequences
             – Rules of interpretation specify the scope of related events
             also to be considered part of the same topic




Corpus cont.
         TDT2 topic definition was a collaborative process,
          with annotators negotiating the scope
              – The randomly selected story was often neither
             the best nor even a good representative of the
             seminal event. Annotators researched each
             event elsewhere in the news
              – In response to changes in the real world, new
             stories were reevaluated and the topics modified.



Organization of the TDT2 Corpus
        The TDT2 Corpus was divided into three parts for research management purposes
        – Training set: the data may be used without limit for research purposes
        – Development test set: the data will be available for testing TDT algorithms
        – Evaluation test set: the data will be reserved for the final formal evaluation of performance




                                      Organization of the TDT2 Corpus



The Three Tasks
      • The input to the TDT2 project is a stream of stories.
      This stream may not be pre-segmented
      into stories, and the topics may not be known to
      the system.
      • The three technical tasks are the segmentation of a
      news source into stories, the tracking of known
      topics, and the detection of unknown topics.


Segmentation

            – Segmenting the stream of data into constituent stories;
              applies to audio (radio and TV) sources.

           – Segmentation must be performed as the data is
             being processed. The deferral period is a primary task
             parameter.

            – Story segmentation performance depends on the forms of
              the source and on the deferral period.




Segmentation cont.

                 Three source conditions:
                 ♦ Manual transcription
                 ♦ Automatic transcription
                 ♦ Sampled data signal
                 Decision deferral periods:
                 ♦ Transcription in text form (words):
                   100, 1,000, 10,000
                 ♦ Sampled data in audio form (seconds):
                   30, 300, 3,000




Tracking
             Associating incoming stories with topics that are known to
             the system. A topic is “known” by its association with the
             stories that discuss it.
             A set of training stories is identified for each topic. The
             system may train on the target topic by using all of the
             stories in the corpus
             A goal of topic tracking is to keep track of the topics
             users are interested in. The user therefore spends less time
             searching large amounts of data in newswire, WWW-based
             news, and broadcast news (BN).




Tracking cont
             Performance depends on the form of the source, on the
             number of training stories for the topic, and on whether
             story boundaries are provided to the system
              ◊ Three source conditions:
               ♦ Newswire text and a manual transcription of the audio
                sources
               ♦ Newswire text and the automatic transcription of
                  the audio sources
               ♦ Newswire text and the sampled data signal
                  representing the audio sources
              ◊ Five different training conditions (number of training stories):
                 1, 2, 4, 8, 16
              ◊ Two story boundary conditions:
                  given, not given
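
The tracking task can be illustrated with a minimal sketch. It is not the method of any TDT2 participant: it assumes a plain bag-of-words profile summed over the training stories and a cosine-similarity threshold, and the function names, the threshold of 0.3, and the example stories are all made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased term-frequency vector for one story."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_topic(training_stories):
    """Sum the vectors of the training stories into one topic profile."""
    profile = Counter()
    for story in training_stories:
        profile.update(bag_of_words(story))
    return profile


def track(profile, stream, threshold=0.3):
    """Yield an (story, on_topic) decision for each incoming story, in order."""
    for story in stream:
        yield story, cosine(profile, bag_of_words(story)) >= threshold


# Hypothetical data: two training stories (the "2" training condition)
# followed by a two-story stream with given story boundaries.
training = [
    "avalanche buries hikers in french alps",
    "rescuers search for hikers after avalanche",
]
stream = [
    "avalanche rescuers find hikers",
    "stock markets close higher",
]
profile = train_topic(training)
decisions = [on_topic for _, on_topic in track(profile, stream)]
```

Here the first stream story shares several terms with the profile and is marked on topic, while the second shares none. Changing the number of stories passed to `train_topic` corresponds to the training conditions listed above.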
Detection
         – Detecting and tracking topics not previously known to the
            system.
            – Identifying topics as defined by their association with the
             stories that discuss them
            – Detection uses a whole (2-month) sub-corpus as input
            – Performance depends on the form of the source, on the
             maximum delay allowed before topic detection decisions
             must be output, and on whether story boundaries are provided.
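
As a rough illustration of detection, the sketch below groups a stream into topics with single-pass clustering: each story joins the most similar existing topic, or opens a new one when nothing is similar enough. This is an assumed, simplified technique rather than a system from the evaluation; it decides immediately (no deferral period), and the threshold and example stories are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def detect(stream, threshold=0.3):
    """Assign each story a topic id, opening a new topic when nothing matches."""
    topics = []  # one accumulated term-frequency profile per detected topic
    labels = []
    for story in stream:
        vec = Counter(story.lower().split())
        scores = [cosine(profile, vec) for profile in topics]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            topics.append(Counter(vec))  # first story of a previously unknown topic
            best = len(topics) - 1
        else:
            topics[best].update(vec)  # fold the story into the matched topic
        labels.append(best)
    return labels


# Hypothetical three-story stream with given story boundaries.
stream = [
    "avalanche buries hikers in french alps",
    "stock markets close higher",
    "rescuers search for hikers after avalanche",
]
labels = detect(stream)
```

The first and third stories end up in the same topic and the second in its own. A system allowed a longer deferral period could look at later stories before committing each label, which this sketch does not attempt.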




Detection cont.
         ◊ Three source conditions:
           ♦ Newswire text and a manual transcription of the audio
             sources
           ♦ Newswire text and the automatic transcription of
              the audio sources
           ♦ Newswire text and the sampled data signal
             representing the audio sources
           ◊ Three different decision deferral periods (in terms of number
                of source files)




Evaluation
        • The general TDT evaluation will be in terms of
        classical detection theory

          – Misses (Type II errors): the target is not detected
        when it is present
           – False alarms (Type I errors): the target is
        falsely detected when it is not present

        • These error probabilities are combined into a
         single detection cost Cdet


C_Det = C_Miss · P_Miss · P_target + C_FA · P_FA · P_non-target

         C_Miss and C_FA are the costs of a miss and a false alarm, respectively.
         P_Miss and P_FA are the conditional probabilities of a miss and a
             false alarm, respectively.

         P_target and P_non-target
             are the a priori target probabilities

             (the a priori probability of a story being on some given topic or not).

         (P_target = 1 - P_non-target)
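
The cost formula can be made concrete with a short sketch. The function and variable names are illustrative, and the cost and prior values used in the example (both costs 1.0, target prior 0.02) are assumptions chosen for the demonstration, since the slides do not state the values used in the evaluation.

```python
def error_rates(decisions, truth):
    """Return (p_miss, p_fa) from parallel lists of system decisions and reference labels."""
    targets = sum(truth)
    non_targets = len(truth) - targets
    misses = sum(1 for d, t in zip(decisions, truth) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


# Hypothetical reference labels and system decisions for ten stories.
truth = [True, True, False, False, False, False, False, False, False, False]
decisions = [True, False, True, False, False, False, False, False, False, False]

p_miss, p_fa = error_rates(decisions, truth)  # one of two targets missed, one of eight non-targets flagged
cost = detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.02)
```

Because on-topic stories are rare, a small target prior weights the false-alarm term heavily: in this example the single false alarm contributes far more to the cost than the missed story.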




Participants
      • Sponsor: DARPA
      • Research sites: BBN, CMU, Dragon, GE, IBM,
        SRI, UMass, UPenn, UIowa, UMd
      • Corpus: Collection, Annotation, Transcription,
        Dissemination: LDC
      • Automatic Transcription: Dragon
      • Evaluation: NIST




PARTICIPANTS
                   Eleven research sites participated in NIST’s 1998 TDT2 evaluation




                            1998 TDT Evaluation Task Site Participation
                        * Submitted after the December 21, 1998 deadline



Story Segmentation Results
          • Five research sites participated in the story segmentation task
          • Segmentation costs achieved by the participants for ASR transcriptions and
            manual transcriptions




                                      1998 TDT2 Primary Tracking Systems

           Observation: the lowest cost on ASR text was 0.14, achieved by CMU.
           Dragon’s performance improved on manual transcriptions (0.11).



Decision Deferral Periods
           The period defines the amount of future material a segmentation system
           can use before making a decision




         Observation: Extended decision deferral periods were helpful for SRI, but not for others.
                        CMU used a 100-word deferral period and had the lowest cost.


Topic Tracking Results
        Eight research sites ran a primary system on the required evaluation, which was to
        track topics from both Newswire and ASR sources, using 4 training stories per topic




                                      1998 TDT2 Primary Tracking Systems
           BBN achieved the lowest cost, 0.0056, which corresponds to missing 14% of on-topic stories and
           falsely detecting 0.2% of the off-topic stories


Effect of Number of Training Stories
             The tracking evaluation varied the number of training stories per topic




                            Effect of topic training performance on tracking


                   Performance was better when systems were presented with four training
                   stories rather than one, with an average of 38% relative improvement




Effect of Automatic Segmentation on Tracking
       Replaces the given story boundaries in the ASR texts with the output of an
       automatic story segmentation algorithm.
       Presents a fully automated topic tracking system from newswire and broadcast
       news audio source




Topic Detection Results
       The required evaluation was to detect topics in the newswire+ASR source transcripts,
       deferring decisions for up to 10 source files, and using the given reference story boundaries




                               1998 TDT2 Primary Detection System

         IBM’s detection cost of 0.0042 corresponds to missing 20% of the documents
         and falsely including 0.07% of the documents
         Detection performance improved slightly for the manual transcriptions


Effect of Decision Deferral on Detection
              The detection evaluation supported several decision deferral periods




                                 Effect of Decision Deferral on Detection



              Small improvement with extended decision deferral periods (an average of
               7% relative improvement)


Effect of Automatic Segmentation on Detection
      The detection costs were computed by dividing the corpus into two sets
      – Broadcast news “audio source” transcripts
      – Newswire “text source” after mapping the reference topic to the system-defined topics




                               Effect of Automatic Segmentation on Detection




Conclusion and Further Work

       • The first TDT2 Benchmark test was
       successfully completed and involved eleven
        research sites.
       • Errors introduced by ASR appear to
       affect tracking and detection.
       • Automatic segmentation of ASR text degrades
       tracking and detection more than ASR errors
       alone.

Conclusion and Further Work cont.

         • Decision deferral periods appear to be useful
           for detection, more so than for segmentation

         • Since TDT2 in 1998 there have been 4 open
         evaluations




Further Work
         • Other tasks have been added to the core
           three tasks of segmentation, tracking and
           detection.
          • Further work has looked at monitoring
           streams of news in multiple languages (e.g.,
           Mandarin) and media: newswire, radio,
           television, web sites, or some future
           combination.


Questions



Thank you



Tuesday, June 1, 2010

Weitere ähnliche Inhalte

Was ist angesagt?

Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)9866825059
 
The vector space model
The vector space modelThe vector space model
The vector space modelpkgosh
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval ModelsNisha Arankandath
 
Data visualization
Data visualizationData visualization
Data visualizationSushil kasar
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalDustin Smith
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrievalNanthini Dominique
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean modelVaibhav Khanna
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social mediaJeremiah Fadugba
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slideMohd Iqbal Al-farabi
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxShivaVemula2
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 

Was ist angesagt? (20)

Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Link prediction
Link predictionLink prediction
Link prediction
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval Models
 
06 Community Detection
06 Community Detection06 Community Detection
06 Community Detection
 
Data visualization
Data visualizationData visualization
Data visualization
 
Automatic indexing
Automatic indexingAutomatic indexing
Automatic indexing
 
Data mining
Data miningData mining
Data mining
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Text Mining
Text MiningText Mining
Text Mining
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social media
 
Ontology Learning
Ontology LearningOntology Learning
Ontology Learning
 
Multimedia Information Retrieval
Multimedia Information RetrievalMultimedia Information Retrieval
Multimedia Information Retrieval
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slide
 
Data mining
Data miningData mining
Data mining
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 

Andere mochten auch

Latent Semantics & Social Interaction
Latent Semantics & Social InteractionLatent Semantics & Social Interaction
Latent Semantics & Social Interactionfridolin.wild
 
Utilizing temporal information in topic detection and tracking
Utilizing temporal information in topic detection and trackingUtilizing temporal information in topic detection and tracking
Utilizing temporal information in topic detection and trackingGeorge Ang
 
Topic detection and tracking
Topic detection and trackingTopic detection and tracking
Topic detection and trackingGeorge Ang
 
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...Ly Nguyen
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and trackingGeorge Ang
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Generating Storylines (Literature Survey)
Generating Storylines (Literature Survey)Generating Storylines (Literature Survey)
Generating Storylines (Literature Survey)Anunaya
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Sebastian Ruder
 
Capitalizing on Machine Reading to Engage Bigger Data
Capitalizing on Machine Reading to Engage Bigger DataCapitalizing on Machine Reading to Engage Bigger Data
Capitalizing on Machine Reading to Engage Bigger DataShalin Hai-Jew
 

Andere mochten auch (9)

Latent Semantics & Social Interaction
Latent Semantics & Social InteractionLatent Semantics & Social Interaction
Latent Semantics & Social Interaction
 
Utilizing temporal information in topic detection and tracking
Utilizing temporal information in topic detection and trackingUtilizing temporal information in topic detection and tracking
Utilizing temporal information in topic detection and tracking
 
Topic detection and tracking
Topic detection and trackingTopic detection and tracking
Topic detection and tracking
 
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...
Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term ...
 
Simple semantics in topic detection and tracking
Simple semantics in topic detection and trackingSimple semantics in topic detection and tracking
Simple semantics in topic detection and tracking
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Generating Storylines (Literature Survey)
Generating Storylines (Literature Survey)Generating Storylines (Literature Survey)
Generating Storylines (Literature Survey)
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
Capitalizing on Machine Reading to Engage Bigger Data
Capitalizing on Machine Reading to Engage Bigger DataCapitalizing on Machine Reading to Engage Bigger Data
Capitalizing on Machine Reading to Engage Bigger Data
 

Ähnlich wie Topic detection & tracking

Topic Tracking for Punjabi Language
Topic Tracking for Punjabi LanguageTopic Tracking for Punjabi Language
Topic Tracking for Punjabi LanguageCSEIJJournal
 
Detecting Patterns in News Media Content
Detecting Patterns in News Media ContentDetecting Patterns in News Media Content
Detecting Patterns in News Media ContentIlias Flaounas
 
Knowledge Sharing and Education
Knowledge Sharing and EducationKnowledge Sharing and Education
Knowledge Sharing and EducationKaitlin Thaney
 
A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)
A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)
A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)sugiuralab
 
Study of Social Network Sites in crises
Study of Social Network Sites in crisesStudy of Social Network Sites in crises
Study of Social Network Sites in crisesPablo Acuña
 
How the Live Web Feels about Events
How the Live Web Feels about EventsHow the Live Web Feels about Events
How the Live Web Feels about EventsGeorge Valkanas
 

Ähnlich wie Topic detection & tracking (8)

Topic Tracking for Punjabi Language
Topic Tracking for Punjabi LanguageTopic Tracking for Punjabi Language
Topic Tracking for Punjabi Language
 
Detecting Patterns in News Media Content
Detecting Patterns in News Media ContentDetecting Patterns in News Media Content
Detecting Patterns in News Media Content
 
Television News Search and Analysis with Lucene/Solr
Television News Search and Analysis with Lucene/SolrTelevision News Search and Analysis with Lucene/Solr
Television News Search and Analysis with Lucene/Solr
 
Knowledge Sharing and Education
Knowledge Sharing and EducationKnowledge Sharing and Education
Knowledge Sharing and Education
 
Type 2 fuzzy ontology ahmadchan
Type 2 fuzzy ontology ahmadchanType 2 fuzzy ontology ahmadchan
Type 2 fuzzy ontology ahmadchan
 
A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)
A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)
A Thin Stretchable Interface for Tangential Force Measurement (UIST 2012)
 
Study of Social Network Sites in crises
Study of Social Network Sites in crisesStudy of Social Network Sites in crises
Study of Social Network Sites in crises
 
How the Live Web Feels about Events
How the Live Web Feels about EventsHow the Live Web Feels about Events
How the Live Web Feels about Events
 

Mehr von George Ang

Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)George Ang
 

Topic detection & tracking

  • 1. TOPIC DETECTION & TRACKING Omid Dadgar Tuesday, June 1, 2010
  • 2. Background Topic detection and tracking (TDT) is a fairly new area of research in IR, developed over the past 7 years. It began during 1996 and 1997 with a Pilot Study conducted to explore various approaches and establish a performance baseline. The Pilot Study was followed by TDT2, which this presentation is primarily based on.
  • 3. Background • Since TDT2 in 1998 there have been several open evaluations of TDT, and progress has been made. • TDT2, however, is important because it was the first major step in TDT after the pilot study and established the foundation for further work.
  • 4. Background – To solve the TDT challenges, researchers are looking for robust, accurate, fully automatic algorithms that are source, medium, domain, and language independent.
  • 5. Goals – To develop automatic techniques for finding topically related material in streams of data. This could be valuable in a wide variety of applications where efficient and timely information access is important (e.g. CNN or Yahoo News). – It would be very helpful if computers were able to map out data automatically: finding story boundaries, determining which stories go with one another, and discovering when something new (unforeseen) has happened.
  • 6. Introduction • Purpose: To develop technologies for retrieval and automatic organization of broadcast news and newswire stories, and to evaluate their performance. • Corpus: TDT2 processing addresses multiple sources of information, including newswire (text) and broadcast news (speech). • The information is modeled as a sequence of stories. These stories provide information on many topics.
  • 7. Introduction • "Topic" is defined in a special way specifically for TDT research. For the purposes of this project, topics refer to specific events or activities, such as the crash of a China Airlines airplane in Taipei, Taiwan on February 16, 1998, and encompass all facts, events and activities that are directly related to them. Here is the definition of topic and a few other essential terms, as used in TDT research:
  • 8. Terms • TOPIC: A topic is an event or activity, along with all directly related events and activities. • EVENT: An event is something that happens at some specific time and place, along with its unavoidable consequences. Specific elections, accidents, crimes and natural disasters are examples of events.
  • 9. • ACTIVITY: An activity is a connected set of actions that have a common focus or purpose. Specific campaigns, investigations, and disaster relief efforts are examples of activities. • STORY: A story is a newswire article or a segment of a news broadcast with a coherent news focus. A story must contain at least two independent, declarative clauses.
  • 10. • Definition of topic: A seminal event or activity, along with all directly related events and activities. • A story is “on topic” if it is directly connected to the associated event. • TDT techniques are explored for detecting the appearance of new topics and for tracking their reappearance and evolution.
  • 11. TDT2 vs. Pilot Study In 1998, TDT2 addressed the same three core tasks (segmentation, detection, and tracking). Evaluation procedures were modified. The volume and variety of data and the number of target topics were expanded. TDT2 attacked the problems introduced by imperfect, machine-generated transcripts of audio data.
  • 12. Corpus • The Linguistic Data Consortium (LDC) undertook the corpus creation efforts for TDT2. • The TDT2 corpus contains data from – Newswire: Associated Press WorldStream, New York Times News Service – Radio: Voice of America World News, Public Radio International The World
  • 13. Corpus cont. – Television: CNN Headline News, ABC World News Tonight • The corpus comprises 300 stories/day, 5 hours of digital recordings/day, 54,000 stories, and 630 hours of audio. • For newswire sources, each story is clearly delimited by the newswire format.
  • 14. Corpus cont. For audio sources, segmentation of the broadcast news consisted of a two-pass procedure. First pass: LDC staff inserted story boundaries and identified non-story segments. Second pass: annotators confirmed or adjusted the existing story boundaries.
  • 15. Corpus cont. • The audio sources were provided in three forms: – The sampled data audio signal – A manual transcription of the speech – An automatic transcription of the speech produced by an automatic speech recognizer (ASR).
  • 16. The TDT2 Corpus cont. • Audio source transcriptions include both news and non-news stories. Each story was labeled “News”, “Miscellaneous”, or “Untranscribed”. – Only stories marked as News were used. • LDC defined 100 topics based upon a random sample of the six sources from January to June 1998. – Each topic was defined in terms of a three-part identification (what/where/when).
  • 17. Example Topic Title: Mountain Hikers Lost – WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January. – WHERE: Orres, France – WHEN: January 4, 1998
  • 18. Corpus cont. – Annotation staff worked with daily news files; each story was labeled “yes”, “brief”, or “no”. • TDT2 topics are based on the assumption that news stories are about events. – A TDT2 event is an activity that happens at a specific place and time, along with all of its necessary causes and unavoidable consequences. – Rules of interpretation specify the scope of related events that are also considered part of the same topic.
  • 19. Corpus cont. TDT2 topic definition was a collaborative process, with annotators negotiating the scope. – The randomly selected story was often neither the best nor even a good representative of the seminal event, so annotators researched each event elsewhere in the news. – In response to changes in the real world, new stories were reevaluated and the topics modified.
  • 20. Organization of the TDT2 Corpus The TDT2 corpus was divided into three parts for research management purposes: – Training set: the data may be used without limit for research purposes – Development test set: the data will be available for testing TDT algorithms – Evaluation test set: the data will be reserved for the final formal evaluation of performance
  • 21. The Three Tasks • The input to the TDT2 project is a stream of stories. This stream may not be pre-segmented into stories, and the topics may not be known to the system. • The three technical tasks are the segmentation of a news source into stories, the tracking of known topics, and the detection of unknown topics.
  • 22. Segmentation – Segmenting the stream of data into its constituent stories; this applies to audio (radio and TV) sources. – Segmentation must be performed as the data is being processed. The deferral period is a primary task parameter. – Story segmentation performance depends on the form of the source and on the deferral period.
  • 23. Segmentation cont. Three source conditions: ♦ Manual transcription ♦ Automatic transcription ♦ Sampled data signal. Decision deferral periods: ♦ Transcription in text form (words): 100, 1,000, 10,000 ♦ Sampled data in audio form (seconds): 30, 300, 3,000
  • 24. Tracking Associating incoming stories with topics that are known to the system. A topic is “known” by its association with the stories that discuss it. A set of training stories is identified for each topic. The system may train on the target topic by using all of the stories in the corpus. A goal of topic tracking is to keep track of the topics users are interested in. The user therefore spends less time searching large amounts of data in newswire, WWW-based news and broadcast news (BN).
  • 25. Tracking cont. Performance depends on the form of the source, on the number of training stories for the topic, and on whether story boundaries are provided to the system. ◊ Three source conditions: ♦ Newswire text and a manual transcription of the audio sources ♦ Newswire text and the automatic transcription of the audio sources ♦ Newswire text and the sampled data signal representing the audio sources ◊ Five different training conditions (number of training stories): 1, 2, 4, 8, 16 ◊ Two story boundary conditions: given, not given
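The tracking task described above (a topic known only through a handful of training stories, and a yes/no decision for each incoming story) can be illustrated with a minimal sketch. It is not the method of any TDT2 participant: the bag-of-words representation, the averaged centroid, the cosine similarity, and the threshold value of 0.3 are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping


def bag_of_words(text: str) -> Counter:
    """Lowercase, whitespace-tokenized term counts (an assumed representation)."""
    return Counter(text.lower().split())


def centroid(training_stories: List[str]) -> Dict[str, float]:
    """Average term counts over the training stories for one topic."""
    if not training_stories:
        return {}
    total: Counter = Counter()
    for story in training_stories:
        total.update(bag_of_words(story))
    n = len(training_stories)
    return {term: count / n for term, count in total.items()}


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_on_topic(story: str, topic: Mapping[str, float], threshold: float = 0.3) -> bool:
    """Tracking decision: 'yes' when the story is similar enough to the topic.

    The threshold is an assumed, untuned value.
    """
    return cosine(bag_of_words(story), topic) >= threshold


# Two training stories stand in for one of the 1/2/4/8/16 training conditions.
topic = centroid(["plane crash in taipei", "taipei plane crash investigation"])
print(is_on_topic("plane crash taipei", topic))
print(is_on_topic("election results announced", topic))
```

Raising the threshold trades false alarms for misses, which is the trade-off the evaluation slides quantify with a single detection cost.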
  • 26. Detection – Detecting and tracking topics not previously known to the system. – Identifying topics as defined by their association with the stories that discuss them. – Detection uses a whole (2 month) sub-corpus as input. – Performance depends on the form of the source, on the maximum delay allowed before topic detection decisions must be output, and on whether story boundaries are provided.
  • 27. Detection cont. ◊ Three source conditions: ♦ Newswire text and a manual transcription of the audio sources ♦ Newswire text and the automatic transcription of the audio sources ♦ Newswire text and the sampled data signal representing the audio sources ◊ Three different decision deferral periods (in terms of number of source files)
  • 28. Evaluation • The general TDT evaluation will be in terms of classical detection theory – Type I error “misses”: the target is not detected when it is present – Type II error “false alarms”: the target is falsely detected when it is not present • These error probabilities are combined into a single detection cost, C_Det.
  • 29. C_Det = C_Miss · P_Miss · P_target + C_FA · P_FA · P_non-target, where C_Miss and C_FA are the costs of a miss and a false alarm respectively, P_Miss and P_FA are the conditional probabilities of a miss and a false alarm respectively, and P_target and P_non-target are the a priori target probabilities (the a priori probability of a story being on some given topic or not), with P_target = 1 - P_non-target.
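The detection cost defined above can be computed directly from a system's yes/no decisions. The sketch below is only illustrative: the function names are hypothetical, and the default values for C_Miss, C_FA and P_target are assumed placeholders, since the slides do not state the values used in the evaluation.

```python
from typing import Iterable, Tuple


def error_rates(decisions: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (P_miss, P_fa) from (is_on_topic, system_said_yes) pairs."""
    targets = misses = non_targets = false_alarms = 0
    for on_topic, said_yes in decisions:
        if on_topic:
            targets += 1
            if not said_yes:
                misses += 1        # target present but not detected
        else:
            non_targets += 1
            if said_yes:
                false_alarms += 1  # target absent but detected
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


def detection_cost(
    p_miss: float,
    p_fa: float,
    c_miss: float = 1.0,     # assumed cost of a miss
    c_fa: float = 1.0,       # assumed cost of a false alarm
    p_target: float = 0.02,  # assumed a priori probability of a target
) -> float:
    """C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


# Toy decisions: 2 on-topic stories (1 missed), 4 off-topic stories (1 false alarm).
toy = [(True, True), (True, False), (False, False),
       (False, True), (False, False), (False, False)]
p_miss, p_fa = error_rates(toy)
print(p_miss, p_fa, detection_cost(p_miss, p_fa))
```

Because P_target is small, a given false alarm rate contributes far more to the cost than the same miss rate, which is why low-cost systems can tolerate miss rates in the tens of percent while keeping false alarms well under one percent.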
  • 30. Participants • Sponsor: DARPA • Research sites: BBN, CMU, Dragon, GE, IBM, SRI, UMass, UPenn, UIowa, UMD • Corpus collection, annotation, transcription, and dissemination: LDC • Automatic transcription: Dragon • Evaluation: NIST
  • 31. Participants Eleven research sites participated in NIST’s 1998 TDT2 evaluation. Table: 1998 TDT Evaluation Task Site Participation (* = submitted after the December 21, 1998 deadline)
  • 32. Story Segmentation Results • Five research sites participated in the story segmentation task. • Segmentation costs achieved by the participants for ASR transcriptions and manual transcriptions. Figure: 1998 TDT2 Primary Tracking Systems. Observation: the lowest cost on ASR text was 0.14, achieved by CMU. Dragon’s performance improved on manual transcription (0.11).
  • 33. Decision Deferral Periods The period defines the amount of future material a segmentation system can use before making a decision. Observation: extended decision deferral periods were helpful for SRI, but not for others. CMU, which had the lowest cost, used a deferral period of 100 words to make its decisions.
  • 34. Topic Tracking Results Eight research sites ran a primary system on the required evaluation, which was to track topics from both newswire and ASR sources, using 4 training stories per topic. Figure: 1998 TDT2 Primary Tracking Systems. BBN achieved the lowest cost, 0.0056, which corresponds to missing 14% of the on-topic stories and falsely detecting 0.2% of the off-topic stories.
  • 35. Effect of Number of Training Stories The tracking evaluation supported a varied number of training stories. Figure: Effect of topic training performance on tracking. Performance was better when systems were presented with four training stories rather than one, with an average relative improvement of 38%.
  • 36. Effect of Automatic Segmentation on Tracking This condition replaces the given story boundaries in the ASR texts with the output of an automatic story segmentation algorithm. It represents a fully automated topic tracking system for newswire and broadcast news audio sources.
  • 37. Topic Detection Results The required evaluation was to detect topics in the newswire+ASR source transcripts, deferring decisions for up to 10 source files and using the given reference story boundaries. Figure: 1998 TDT2 Primary Detection System. IBM’s detection cost of 0.0042 corresponds to missing 20% of the documents and falsely including 0.07% of the documents. Detection performance improved slightly for the manual transcriptions.
  • 38. Effect of Decision Deferral on Detection The detection evaluation supported several decision deferral periods. Figure: Effect of Decision Deferral on Detection. Extended decision deferral periods gave a small improvement (an average relative improvement of 7%).
  • 39. Effect of Automatic Segmentation on Detection The detection costs have been computed by dividing the corpus into two sets, after mapping the reference topics to the system-defined topics: – Broadcast news “audio source” transcripts – Newswire “text source”. Figure: Effect of Automatic Segmentation on Detection.
  • 40. Conclusion and Further Work • The first TDT2 benchmark test was successfully completed and involved eleven research sites. • Errors introduced by ASR appear to affect tracking and detection. • Automatic segmentation of ASR text degrades tracking and detection more than ASR errors alone.
  • 41. Conclusion and Further Work cont. • Decision deferral periods appear to be useful for detection, more so than for segmentation. • Since TDT2 in 1998 there have been 4 open evaluations.
  • 42. Further Work • Other tasks have been added to the three core tasks of segmentation, tracking and detection. • Further work has looked at monitoring streams of news in multiple languages (e.g. Mandarin) and media: newswire, radio, television, web sites or some future combination.