Simple Semantics in Topic Detection and Tracking

Juha Makkonen, Helena Ahonen-Myka, and Marko Salmenkivi
Introduction



• Topic Detection and Tracking (TDT) focuses on organizing
  news documents
• Tasks: splitting documents into stories, spotting new stories,
  tracking the development of an event, and grouping together
  stories describing the same event
• A TDT system runs on-line, without knowledge of incoming
  stories
• Short-duration events cause the vocabulary to change
Introduction (cont.)


• Use semantic classes, i.e., groups of terms that have a
  similar meaning: locations, proper names, temporal
  expressions, and general terms
• Similarity metric is applied class-wise: compare names in one
  document with names in the other, the locations in one
  document with locations in the other, etc.
• Allows a semantic similarity between terms rather than binary
  string matching
• Results in a vector of similarity measures, which is combined
  via weighted sum to produce a yes/no decision
Topic Detection and Tracking

• Compilation of on-line news and transcribed broadcasts from
  one or more sources and one or more languages
• TDT consists of five tasks:
    1. Topic tracking monitors news streams for stories discussing
       a given target topic
    2. First story detection makes binary decisions on whether a
       document discusses a new, previously unreported topic
    3. Topic detection forms topic-based clusters
    4. Link detection determines whether two documents discuss the
       same topic
    5. Story segmentation finds boundaries for cohesive text
       fragments
• TDT presents unique challenges: on-line, few assumptions,
  small number of documents, changing vocabulary
Definitions



• An event is a unique thing that happens at some specific
  time and place
     • This definition neglects events with long timelines, escalating
       directions, or no tight spatio-temporal constraints
• A topic is an event or activity, along with all related events or
  activities
     • A topic is a set of documents that relate strongly to each
       other via a seminal event
Document Representation




• Four types of terms: locations, temporal expressions, names,
  and general terms
• Introduces simple semantics since all terms in a given type are
  compared
Event Vector




• Semantic classes are assigned to the basic questions of a news
  article: who, what, when, where
    • Called NAMES, TERMS, TEMPORALS, and LOCATIONS
• An event vector is formed by combining multiple semantic
  classes
Event Vector (cont.)

  TERMS       palestinian      prime minister   appoint
  LOCATIONS   Ramallah         West Bank
  NAMES       Yasser Arafat    Mahmoud Abbas
  TEMPORALS   Wednesday

An example event vector for an AP news article starting
"RAMALLAH, West Bank — Palestinian leader Yasser Arafat
appointed his longtime deputy Mahmoud Abbas as prime minister
Wednesday..."
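
As a rough illustration, an event vector like this one could be held in a small record with one field per semantic class. The `EventVector` class and its field names are hypothetical and not part of the original system; the values follow the example article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventVector:
    """One sub-vector per semantic class (hypothetical container)."""
    terms: List[str] = field(default_factory=list)      # what
    locations: List[str] = field(default_factory=list)  # where
    names: List[str] = field(default_factory=list)      # who
    temporals: List[str] = field(default_factory=list)  # when


example = EventVector(
    terms=["palestinian", "prime minister", "appoint"],
    locations=["Ramallah", "West Bank"],
    names=["Yasser Arafat", "Mahmoud Abbas"],
    temporals=["Wednesday"],
)
```

Keeping the classes in separate fields is what makes the class-wise comparison possible: names are only ever compared with names, locations with locations, and so on.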
Comparing Event Vectors




• Comparison is done class-wise, i.e., via corresponding
  sub-vectors of two event representations
• Similarity metric can be different for each class
     • Use a weighted sum of the similarity measures for the final
       binary decision
• Results in a vector $v = (v_1, v_2, v_3, v_4) \in \mathbb{R}^4$
Similarity for NAMES and TERMS
• Use term frequency-inverse document frequency (TF-IDF) weights
• Let $T = \{t_1, t_2, \ldots, t_n\}$ denote the terms and
  $D = \{d_1, d_2, \ldots, d_m\}$ denote the documents. Then the
  weight $w : T \times D \to \mathbb{R}$ is defined as
  $$w(t, d) = f(t, d) \cdot \log \frac{|D|}{g(t)},$$
  where $f : T \times D \to \mathbb{N}$ is the number of occurrences
  of term $t$ in document $d$, $|D|$ is the total number of
  documents, and $g : T \to \mathbb{N}$ is the number of documents in which
  term $t$ occurs (i.e., the document frequency of term $t$).
• The similarity of two sub-vectors $X_k$ and $Y_k$ of semantic class
  $k$ is the cosine of the two:
  $$\sigma(X_k, Y_k) = \frac{\sum_{i=1}^{|k|} w(t_i, X_k) \cdot w(t_i, Y_k)}{\sqrt{\sum_{i=1}^{|k|} w(t_i, X_k)^2} \cdot \sqrt{\sum_{i=1}^{|k|} w(t_i, Y_k)^2}},$$
  where $|k|$ is the number of terms in semantic class $k$.
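
The weighting and the cosine can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary representation of sub-vectors, and the toy document frequencies are assumptions, not taken from the slides.

```python
import math
from collections import Counter


def tfidf(doc_tokens, doc_freq, n_docs):
    """w(t, d) = f(t, d) * log(|D| / g(t)) for every term of one document."""
    counts = Counter(doc_tokens)
    return {t: f * math.log(n_docs / doc_freq[t]) for t, f in counts.items()}


def cosine(x, y):
    """Cosine of two sparse weight vectors stored as {term: weight} dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty sub-vector matches nothing
    return dot / (norm_x * norm_y)


# Toy document frequencies in a 10-document corpus (made up for illustration).
doc_freq = {"arafat": 2, "abbas": 1, "sharon": 5}
x = tfidf(["arafat", "abbas", "arafat"], doc_freq, n_docs=10)
y = tfidf(["arafat", "sharon"], doc_freq, n_docs=10)
score = cosine(x, y)
```

Here the two NAMES sub-vectors share only one name, so `score` lies strictly between 0 and 1.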
Similarity for TEMPORALS
• Time intervals are mapped to a global calendar that defines a
  timeline and unit conversion
• Temporal similarity is based on comparing the intervals of
  each document. Let $T$ be the global timeline and $x \subseteq T$ a
  time interval with start and end points $x_s$ and $x_e$. The similarity
  between two intervals $x$ and $y$ is
  $$\mu_t(x, y) = \frac{2\,\Delta([x_s, x_e] \cap [y_s, y_e])}{\Delta(x_s, x_e) + \Delta(y_s, y_e)},$$
  where $\Delta$ is the duration of the interval in days.
• For TEMPORAL vectors $X = \{x_1, x_2, \ldots, x_n\}$ and
  $Y = \{y_1, y_2, \ldots, y_m\}$, find for each interval its maximum
  similarity to the intervals of the other vector. The similarity is
  the average of all these maxima, i.e.,
  $$\sigma_t(X, Y) = \frac{\sum_{i=1}^{n} \max(\mu_t(x_i, Y)) + \sum_{j=1}^{m} \max(\mu_t(X, y_j))}{m + n},$$
  where $\max(\mu_t(x_i, Y))$ is the largest $\mu_t(x_i, y)$ over all $y \in Y$.
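
A small Python sketch of the interval overlap measure and the averaged maxima. The representation of an interval as a `(start, end)` pair of dates, the inclusive day count, and the example dates are assumptions made for illustration.

```python
from datetime import date


def duration(start, end):
    """Days covered by the closed interval [start, end].

    Assumption: both end points count, so a single day has duration 1.
    """
    return (end - start).days + 1


def interval_sim(x, y):
    """2 * overlap / (duration of x + duration of y) for (start, end) pairs."""
    overlap = max(0, duration(max(x[0], y[0]), min(x[1], y[1])))
    return 2 * overlap / (duration(*x) + duration(*y))


def temporal_sim(X, Y):
    """Average of every interval's best match in the other document."""
    if not X or not Y:
        return 0.0  # assumption: no temporal expressions, no similarity
    best_x = sum(max(interval_sim(x, y) for y in Y) for x in X)
    best_y = sum(max(interval_sim(x, y) for x in X) for y in Y)
    return (best_x + best_y) / (len(X) + len(Y))


# Illustrative intervals: one week, and a single day inside that week.
week = (date(2003, 3, 17), date(2003, 3, 23))
day = (date(2003, 3, 19), date(2003, 3, 19))
```

With these values `interval_sim(week, day)` is 2·1/(7+1) = 0.25, which shows how a vague expression ("this week") is penalised against a precise one even when they overlap.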
Similarity for LOCATIONS
• Locations are split into a five-level hierarchy
    • Continent, region, country, administrative region, and city
    • Administrative region can be replaced by mountains, seas,
      lakes, or rivers
    • Represented by a tree
• Similarity between two locations $x$ and $y$ is based on the
  length of their common path:
  $$\mu_s(x, y) = \frac{\lambda(x \cap y)}{\lambda(x) + \lambda(y)},$$
  where $\lambda(x)$ is the length of the path from the root to the
  element $x$.
• The spatial similarity between two LOCATION vectors
  $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ is
  $$\sigma_s(X, Y) = \frac{\sum_{i=1}^{n} \max(\mu_s(x_i, Y)) + \sum_{j=1}^{m} \max(\mu_s(X, y_j))}{m + n},$$
  where $\max(\mu_s(x_i, Y))$ is the largest $\mu_s(x_i, y)$ over all $y \in Y$.
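
The path-based measure can be sketched as follows. Representing a location as a tuple of node names from the continent level down is an assumption, as are the example paths; the formula is applied exactly as written on the slide.

```python
def common_path_len(x, y):
    """Number of leading tree nodes shared by two root-to-node paths."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def location_sim(x, y):
    """Common path length divided by the sum of the two path lengths."""
    return common_path_len(x, y) / (len(x) + len(y))


def spatial_sim(X, Y):
    """Average of every location's best match in the other document."""
    if not X or not Y:
        return 0.0  # assumption: no locations, no similarity
    best_x = sum(max(location_sim(x, y) for y in Y) for x in X)
    best_y = sum(max(location_sim(x, y) for x in X) for y in Y)
    return (best_x + best_y) / (len(X) + len(Y))


# Illustrative paths: continent, region, country, administrative region, city.
helsinki = ("Europe", "Northern Europe", "Finland", "Uusimaa", "Helsinki")
tampere = ("Europe", "Northern Europe", "Finland", "Pirkanmaa", "Tampere")
```

Two cities in the same country share three of five levels, giving 3/(5+5) = 0.3. Note that with the formula taken literally, two identical locations score 0.5 rather than 1.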
Topic Detection and Tracking Algorithms


• Class-wise comparison of two event vectors produces
  a vector $v = (v_1, v_2, v_3, v_4) \in \mathbb{R}^4$
• Similarity is based on a weighted linear sum of the class-wise
  similarities: $\langle w, v \rangle$
• The simplest algorithm uses a hyperplane, $\psi(v) = \langle w, v \rangle + b$,
  and a perceptron to learn $w$ and $b$
• Data is typically not linearly separable, so transform $v$ to a
  higher-dimensional space and use a perceptron to learn a
  hyperplane there
     • Define $\varphi : \mathbb{R}^4 \to \mathbb{R}^{15}$ that expands $v$ into its power set
       (four components have 15 non-empty subsets)
     • Then the hyperplane is $\psi(v) = \langle w, \varphi(v) \rangle + b$
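
A hedged sketch of the expansion and the perceptron. The slides do not say how each subset of components becomes a feature; taking the product of the components in each non-empty subset is an assumption here, as are the function names and the toy training pairs.

```python
from itertools import combinations
from math import prod


def phi(v):
    """Map v in R^4 to R^15: one product per non-empty subset (assumed form)."""
    return [prod(subset)
            for size in range(1, len(v) + 1)
            for subset in combinations(v, size)]


def psi(w, v, b):
    """Signed score of v relative to the hyperplane in the expanded space."""
    return sum(wi * xi for wi, xi in zip(w, phi(v))) + b


def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron updates on (v, label) pairs, label in {+1, -1}."""
    w, b = [0.0] * 15, 0.0
    for _ in range(epochs):
        for v, label in samples:
            if label * psi(w, v, b) <= 0:  # misclassified: move the hyperplane
                w = [wi + lr * label * xi for wi, xi in zip(w, phi(v))]
                b += lr * label
    return w, b


# Toy training data: one on-topic and one off-topic similarity vector.
on_topic = [0.9, 0.8, 0.9, 0.7]
off_topic = [0.1, 0.2, 0.1, 0.0]
w, b = train_perceptron([(on_topic, 1), (off_topic, -1)])
```

The product features let the classifier reward combinations such as "names and locations both match", which a plain weighted sum of the four similarities cannot express.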
Topic Tracking Algorithm



topic ← buildVector(); decision ← ()
For each new document d
  doc ← buildVector(d)
  v ← ()
  For each semantic class c
    v[c] ← sim_c(doc_c, topic_c)
  If (⟨w, φ(v)⟩ + b ≥ 0)
    decision[d] ← 'YES'
  else
    decision[d] ← 'NO'
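
The tracking loop could look roughly like this in Python. Everything beyond the control flow is an assumption: a Jaccard overlap stands in for the class-specific similarity functions, `phi` uses subset products, and the weights are hand-picked rather than learned.

```python
from itertools import combinations
from math import prod

CLASSES = ("terms", "locations", "names", "temporals")


def phi(v):
    """Expand v into one product per non-empty subset (assumed form)."""
    return [prod(s) for k in range(1, len(v) + 1) for s in combinations(v, k)]


def jaccard(a, b):
    """Stand-in similarity used for every class in this sketch."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def track(doc, topic, w, b, sim=jaccard):
    """Return 'YES' if doc falls on the topic side of the hyperplane."""
    v = [sim(doc[c], topic[c]) for c in CLASSES]
    score = sum(wi * xi for wi, xi in zip(w, phi(v))) + b
    return "YES" if score >= 0 else "NO"


topic = {"terms": ["appoint"], "locations": ["Ramallah"],
         "names": ["Mahmoud Abbas"], "temporals": ["Wednesday"]}
other = {"terms": ["earthquake"], "locations": ["Ankara"],
         "names": [], "temporals": ["Monday"]}
w, b = [1.0] * 15, -1.0  # hand-picked for illustration, not learned

decisions = [track(d, topic, w, b) for d in (topic, other)]
```

A document identical to the topic vector is accepted, while one with no overlap in any class scores `b` alone and is rejected.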
First Story Detection Algorithm

topics ← (); decision ← ()
For each new document d
  doc ← buildVector(d)
  max ← 0; max_topic ← 0
  For each topic in topics
    For each semantic class c
      v[c] ← sim_c(doc_c, topic_c)
    If (⟨w, φ(v)⟩ + b ≥ max)
      max ← ⟨w, φ(v)⟩ + b
      max_topic ← topic
  If (max < θ)
    decision[d] ← 'first-story'
  else
    decision[d] ← max_topic
  add(topics, doc)
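
A compact Python sketch of the same loop. The scoring function is passed in as a parameter; the Jaccard overlap used in the example is a stand-in, not the class-wise score of the slides. One deliberate deviation: the running maximum starts at negative infinity instead of 0, so that negative scores are handled as well.

```python
def jaccard(a, b):
    """Stand-in score for the example below."""
    return len(a & b) / len(a | b) if a | b else 0.0


def first_story_detection(docs, score, theta):
    """Label each document 'first-story' or with the index of its best match."""
    topics, decisions = [], []
    for doc in docs:
        best_score, best_topic = float("-inf"), None
        for idx, topic in enumerate(topics):
            s = score(doc, topic)
            if s >= best_score:
                best_score, best_topic = s, idx
        # The threshold test runs once per document, after all topics are seen.
        decisions.append("first-story" if best_score < theta else best_topic)
        topics.append(doc)
    return decisions


docs = [
    {"arafat", "abbas", "appoint"},
    {"arafat", "abbas", "minister"},
    {"earthquake", "turkey"},
]
decisions = first_story_detection(docs, jaccard, theta=0.3)
```

The second document overlaps enough with the first to be linked to it, while the first and third are flagged as first stories.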
Experiments




• Text corpus contains 60,000 documents from two on-line
  newspapers, two TV broadcasts, and two radio broadcasts
• Automatic term extraction combined with automata and
  gazetteer to improve performance
Topic Tracking Results


Method        C_det   (C_det)_norm  P_miss  P_fa    p       r       F1
Cosine        0.0058  0.0720        0.0100  0.0470  0.2361  0.7900  0.2927
Weighted Sum  0.0471  0.5214        0.1818  0.0668  0.1646  0.8181  0.2741

Table: Using (C_det)_norm (p = precision, r = recall)


Method        C_det   (C_det)_norm  P_miss  P_fa    p       r       F1
Cosine        0.0524  0.6553        0.2582  0.0097  0.5297  0.7481  0.5481
Weighted Sum  0.0849  1.0621        0.4242  0.0015  0.8636  0.5758  0.6910

Table: Using F1 (p = precision, r = recall)
First-Story Detection Results


Method        C_det   (C_det)_norm  P_miss  P_fa    p       r       F1
Cosine        0.0033  0.0414        0.0000  0.0414  0.4583  1.0000  0.6386
Weighted Sum  0.0036  0.0446        0.0000  0.0446  0.4400  1.0000  0.6111

Table: Using (C_det)_norm (p = precision, r = recall)


Method        C_det   (C_det)_norm  P_miss  P_fa    p       r       F1
Cosine        0.0381  0.4768        0.1818  0.0223  0.5625  0.8181  0.6667
Weighted Sum  0.0558  0.6977        0.2727  0.0159  0.6154  0.7272  0.6667

Table: Using F1 (p = precision, r = recall)
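
The cost columns appear to follow the usual TDT detection cost, C_det = C_miss · P_miss · P_target + C_fa · P_fa · (1 − P_target), normalised by min(C_miss · P_target, C_fa · (1 − P_target)). The slides do not state the constants. The defaults below (C_miss = 1, C_fa = 0.1, P_target = 0.2) are an assumption, chosen only because they reproduce the Cosine row of the F1 table for first-story detection, so treat them as a guess rather than the authors' setting.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.2):
    """Weighted sum of miss and false-alarm rates (constants are assumed)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.2):
    """Detection cost divided by the cost of the better trivial system."""
    floor = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / floor


# P_miss and P_fa from the Cosine row of the first-story F1 table.
c_det = detection_cost(0.1818, 0.0223)
c_norm = normalized_cost(0.1818, 0.0223)
```

Under these assumed constants `c_det` rounds to 0.0381 and `c_norm` to 0.4768. A normalised cost of 1 corresponds to a system that always answers "yes" or always answers "no", whichever is cheaper.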
Discussion




• In topic tracking, performance degrades because the similarity
  measures lack a vagueness factor
    • For example, a match on Asia and a match on Washington produce
      the same similarity score; the score does not account for
      the indefiniteness of the terms
• Including a posteriori approaches that examine all the data
  and the labels might improve performance
Conclusions



• Paper presents a topic detection and tracking algorithm based
  on semantic classes
• Comparison is class-wise
• Created geographical and temporal ontologies
• Semantic augmentation degraded performance, especially in
  topic tracking
    • Partially due to inadequate spatial and temporal similarity
      functions

Weitere ähnliche Inhalte

Was ist angesagt?

Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringishmecse13
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural NetworksNatan Katz
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big DataChristian Robert
 
Sampling strategies for Sequential Monte Carlo (SMC) methods
Sampling strategies for Sequential Monte Carlo (SMC) methodsSampling strategies for Sequential Monte Carlo (SMC) methods
Sampling strategies for Sequential Monte Carlo (SMC) methodsStephane Senecal
 
Coordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerCoordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerChristian Robert
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
An Altering Distance Function in Fuzzy Metric Fixed Point Theorems
An Altering Distance Function in Fuzzy Metric Fixed Point TheoremsAn Altering Distance Function in Fuzzy Metric Fixed Point Theorems
An Altering Distance Function in Fuzzy Metric Fixed Point Theoremsijtsrd
 

Was ist angesagt? (12)

08 clustering
08 clustering08 clustering
08 clustering
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 
Sampling strategies for Sequential Monte Carlo (SMC) methods
Sampling strategies for Sequential Monte Carlo (SMC) methodsSampling strategies for Sequential Monte Carlo (SMC) methods
Sampling strategies for Sequential Monte Carlo (SMC) methods
 
Coordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerCoordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like sampler
 
PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...
PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...
PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
An Altering Distance Function in Fuzzy Metric Fixed Point Theorems
An Altering Distance Function in Fuzzy Metric Fixed Point TheoremsAn Altering Distance Function in Fuzzy Metric Fixed Point Theorems
An Altering Distance Function in Fuzzy Metric Fixed Point Theorems
 

Ähnlich wie Simple semantics in topic detection and tracking

Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdfHabtamu100
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationPaul Houle
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...National Institute of Informatics
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptxthenmozhip8
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptpepe3059
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Satyam Saxena
 
General Quadrature Equation
General Quadrature EquationGeneral Quadrature Equation
General Quadrature EquationMrinalDev1
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGGeorge Simov
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
FiniteElementNotes
FiniteElementNotesFiniteElementNotes
FiniteElementNotesMartin Jones
 
UnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.pptUnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.pptRamanamurthy Banda
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspectiveankurpandeyinfo
 

Ähnlich wie Simple semantics in topic detection and tracking (20)

Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
Ir models
Ir modelsIr models
Ir models
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly Information
 
Lec 4,5
Lec 4,5Lec 4,5
Lec 4,5
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
General Quadrature Equation
General Quadrature EquationGeneral Quadrature Equation
General Quadrature Equation
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
Lec1
Lec1Lec1
Lec1
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 
Ir 08
Ir   08Ir   08
Ir 08
 
Words in space
Words in spaceWords in space
Words in space
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Bn
BnBn
Bn
 
FiniteElementNotes
FiniteElementNotesFiniteElementNotes
FiniteElementNotes
 
UnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.pptUnSupervised Machincs4811-ch23a-clustering.ppt
UnSupervised Machincs4811-ch23a-clustering.ppt
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspective
 

Mehr von George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 

Mehr von George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Kürzlich hochgeladen

11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdf11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
Political-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptxPolitical-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptxSasikiranMarri
 
Geostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.pptGeostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.pptUsmanKaran
 
Emerging issues in migration policies.ppt
Emerging issues in migration policies.pptEmerging issues in migration policies.ppt
Emerging issues in migration policies.pptNandinituteja1
 
14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdf14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdf16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
lok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptxlok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptxdigiyvbmrkt
 
15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdf15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdf12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdfFIRST INDIA
 
Foreign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptxForeign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptxunark75
 
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...The Lifesciences Magazine
 
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road ConnectivityTransforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivitynarsireddynannuri1
 
Power in International Relations (Pol 5)
Power in International Relations (Pol 5)Power in International Relations (Pol 5)
Power in International Relations (Pol 5)ssuser583c35
 
13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdf13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdfFIRST INDIA
 

Kürzlich hochgeladen (14)

11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdf11042024_First India Newspaper Jaipur.pdf
11042024_First India Newspaper Jaipur.pdf
 
Political-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptxPolitical-Ideologies-and-The-Movements.pptx
Political-Ideologies-and-The-Movements.pptx
 
Geostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.pptGeostrategic significance of South Asian countries.ppt
Geostrategic significance of South Asian countries.ppt
 
Emerging issues in migration policies.ppt
Emerging issues in migration policies.pptEmerging issues in migration policies.ppt
Emerging issues in migration policies.ppt
 
14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdf14042024_First India Newspaper Jaipur.pdf
14042024_First India Newspaper Jaipur.pdf
 
16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdf16042024_First India Newspaper Jaipur.pdf
16042024_First India Newspaper Jaipur.pdf
 
lok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptxlok sabha Elections in india- 2024 .pptx
lok sabha Elections in india- 2024 .pptx
 
15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdf15042024_First India Newspaper Jaipur.pdf
15042024_First India Newspaper Jaipur.pdf
 
12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdf12042024_First India Newspaper Jaipur.pdf
12042024_First India Newspaper Jaipur.pdf
 
Foreign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptxForeign Relation of Pakistan with Neighboring Countries.pptx
Foreign Relation of Pakistan with Neighboring Countries.pptx
 
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
Mitochondrial Fusion Vital for Adult Brain Function and Disease Understanding...
 
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road ConnectivityTransforming Andhra Pradesh: TDP's Legacy in Road Connectivity
Transforming Andhra Pradesh: TDP's Legacy in Road Connectivity
 
Power in International Relations (Pol 5)
Power in International Relations (Pol 5)Power in International Relations (Pol 5)
Power in International Relations (Pol 5)
 
13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdf13042024_First India Newspaper Jaipur.pdf
13042024_First India Newspaper Jaipur.pdf
 

Simple semantics in topic detection and tracking

  • 1. Simple Semantics in Topic Detection and Tracking Juha Makkonen, Helena Anonen-Myka, and Marko Salmenkivi
  • 2. Introduction • Topic Detection and Tracking (TDT) focuses on organizing news documents • Split documents into stories, spotting new stories, tracking development of an event, and grouping together stories describing the same event • A TDT systems runs on-line without knowledge of incoming stories • Short duration events cause changing vocabulary
  • 3. Introduction (cont.) • Use semantic classes, groups consisting of terms that have similar meaning: location, proper names, temporal expressions, and general terms • Similarity metric is applied class-wise: compare names in one document with names in the other, the locations in one document with locations in the other, etc. • Allows a semantic similarity between terms rather than binary string matching • Results in a vector of similarity measures, which is combined via weighted sum to produce a yes/no decision
  • 4. Topic Detection and Tracking • Compilation of on-line news and transcribed broadcasts from one or more sources and one or more languages • TDT consists of five tasks: 1. Topic tracking monitors news streams for stories discussing given target topic 2. First story detection makes binary decisions on whether a document discusses a new, previously unreported topic 3. Topic detection forms topic-based clusters 4. Link detection determines whether two documents discuss the same topic 5. Story segmentation finds boundaries for cohesive text fragments • TDT presents unique challenges: on-line, few assumptions, small number of documents, changing vocabulary
  • 5. Definitions • An event is an unique thing that happens at some specific time and place • Definition neglects events with either long timelines, escalating directions, or lack of tight spatio-temporal constraints • A topic is an event or activity, along with all related events or activities • A topic is a set of documents that related strongly to each other via a seminal event
  • 6. Document Representation • Four types of terms: locations, temporal expressions, names, and general terms • Introduces simple semantics since all terms in a given type are compared
  • 7. Event Vector • Semantic classes are are assigned to basic questions in news article: who, what, when, where • Called NAMES, TERMS, TEMPORALS, and LOCATIONS • An event vector is formed by combining multiple semantic classes
  • 8. Event Vector TERMS palenstinian prime minister appoint LOCATIONS Ramallah West Bank NAMES Yassar Arafat Mahmmoud Abbas TEMPORALS Wendesday An example event vector for AP news article starting ”RAMALLAH, West Bank — Palestinian leader Yassar Arafat appointed his longtime deputy Mahmoud Abbas as prime minister Wednesday...”
  • 9. Comparing Event Vectors • Comparison is done class-wise, i.e, via corresponding sub-vectors of two event representations • Similarity metric can be different for each class • Use a weighed sum of the similarity measures for final binary decision • Results in a vector in v = {v1 , v2 , v3 , v4 } ∈ R4
  • 10. Similarity for NAMES and TERMS • Use the term-frequency inverted document frequency • Let T = {t1 , t2 , . . . , tn } denote the terms, D = {d1 , d2 , . . . , tm } denote the documents. Then, the weight w : T × D → R is defined as: |D| w (t, d) = f (t, d) · log , g (t) where f : T × D → N represents the number of occurrences of term t in document d, |D| is the total number of documents, and g : T → N is number of documents in which term t occurs (i.e., the document frequency of term t). • The similarity of two sub-vectors Xk and Yk of semantic class k is based on the cosine of the two: |k| i=1 w (ti , Xk ) · w (ti , Yk ) σ (Xk , Yk ) = |k| 2 |k| 2 i=1 w (ti , Xk ) · i=1 w (ti , Yk ) where |k| is the number of terms in semantic class k.
  • 11. Similarity for TEMPORALS • Time intervals are mapped to a global calendar that defines a time-line and unit conversion • Temporal similarity is based on comparison of intervals of each document. Let T be the global timeline, x ⊆ T be a time interval with start- and end-points, xs and xe . Similarity between two intervals is 2∆([xs , xe ] ∩ [ys , ye ]) µt (x, y ) = ∆(xs , xe ) + ∆(ys , ye ) where ∆ is the duration of the interval in days. • For each pair of intervals from TEMPORAL vectors X = {x1 , x2 , . . . , xn } and Y = {y1 , y2 , . . . , yn }, determine the maximum value. The similarity is the average of all these maxima, i.e., n m i=1 max (µs (xi , Y )) + j=1 max (µs (X , yj )) σs (X , Y ) = m+n
  • 12. Similarity for LOCATIONS
      • Locations are split into a five-level hierarchy: continent, region, country, administrative region, and city
      • The administrative region can be replaced by mountains, seas, lakes, or rivers
      • The hierarchy is represented by a tree
      • The similarity between two locations $x$ and $y$ is based on the length of their common path:
        $$ \mu_s(x, y) = \frac{\lambda(x \cap y)}{\lambda(x) + \lambda(y)}, $$
        where $\lambda(x)$ is the length of the path from the root to the element $x$.
      • The spatial similarity between two LOCATION vectors $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ is
        $$ \sigma_s(X, Y) = \frac{\sum_{i=1}^{n} \max_{1 \le j \le m} \mu_s(x_i, y_j) + \sum_{j=1}^{m} \max_{1 \le i \le n} \mu_s(x_i, y_j)}{m + n} $$
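A small sketch of the common-path measure follows. It assumes each location is stored as a tuple of node names from the top of the tree down to the location itself, that $\lambda$ counts the nodes on that path, and that $\lambda(x \cap y)$ is the length of the longest shared prefix. The example paths are invented for illustration and are not taken from the authors' ontology.

```python
def common_path_length(x, y):
    """Number of leading nodes shared by two root-to-location paths."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def mu_s(x, y):
    """Common path length divided by the sum of both path lengths."""
    total = len(x) + len(y)
    return common_path_length(x, y) / total if total else 0.0


def sigma_s(X, Y):
    """Average of the best match of every location in X and in Y."""
    if not X or not Y:
        return 0.0
    best_x = sum(max(mu_s(x, y) for y in Y) for x in X)
    best_y = sum(max(mu_s(x, y) for x in X) for y in Y)
    return (best_x + best_y) / (len(X) + len(Y))


# Hypothetical paths: continent, region, country, administrative region, city.
helsinki = ("Europe", "Northern Europe", "Finland", "Uusimaa", "Helsinki")
espoo = ("Europe", "Northern Europe", "Finland", "Uusimaa", "Espoo")
sweden = ("Europe", "Northern Europe", "Sweden")

city_sim = mu_s(helsinki, espoo)        # 4 shared nodes / (5 + 5)
vector_sim = sigma_s([helsinki], [espoo, sweden])
```

With the formula exactly as transcribed on the slide, two identical five-level paths score 5 / (5 + 5) = 0.5 rather than 1, so the measure ranges over [0, 0.5] in this sketch.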
  • 13. Topic Detection and Tracking Algorithms • Class-wise comparison of two event vectors produces a vector $v = (v_1, v_2, v_3, v_4) \in \mathbb{R}^4$ • Similarity is based on a weighted linear sum of the class-wise similarities: $\langle w, v \rangle$ • The simplest algorithm uses a hyperplane, $\psi(v) = \langle w, v \rangle + b$, and a perceptron to learn $w$ and $b$ • The data is typically not linearly separable, so transform $v$ to a higher-dimensional space and use a perceptron to learn a hyperplane there • Define $\varphi : \mathbb{R}^4 \to \mathbb{R}^{15}$ that expands $v$ into its powerset (one coordinate per non-empty subset of the four components, $2^4 - 1 = 15$) • The hyperplane is then $\psi(v) = \langle w, \varphi(v) \rangle + b$
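The expansion and the perceptron can be sketched as below. The slides only say that $\varphi$ expands $v$ into its powerset; this sketch assumes each of the 15 coordinates is the product of the components in one non-empty subset, which is one common reading and may differ from the authors' construction. The training loop is the textbook perceptron rule, and the four training samples are invented.

```python
from itertools import combinations
from math import prod


def phi(v):
    """One product per non-empty subset of the components of v (assumed form)."""
    return [prod(subset)
            for r in range(1, len(v) + 1)
            for subset in combinations(v, r)]


def psi(w, b, v):
    """Decision function <w, phi(v)> + b."""
    return sum(wi * xi for wi, xi in zip(w, phi(v))) + b


def train_perceptron(samples, epochs=100, lr=1.0):
    """samples: list of (v, label) pairs with label in {+1, -1}."""
    w = [0.0] * len(phi(samples[0][0]))
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for v, label in samples:
            if label * psi(w, b, v) <= 0:          # misclassified: update
                w = [wi + lr * label * xi for wi, xi in zip(w, phi(v))]
                b += lr * label
                errors += 1
        if errors == 0:                            # all samples separated
            break
    return w, b


# Toy class-wise similarity vectors: same-topic pairs (+1), others (-1).
samples = [
    ((0.9, 0.8, 0.7, 0.9), +1),
    ((0.8, 0.9, 0.9, 0.6), +1),
    ((0.1, 0.2, 0.0, 0.1), -1),
    ((0.2, 0.1, 0.1, 0.0), -1),
]
w, b = train_perceptron(samples)
```

For four input components the expansion has 4 + 6 + 4 + 1 = 15 coordinates, matching the target space $\mathbb{R}^{15}$ on the slide.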
  • 14. Topic Tracking Algorithm
      topic ← buildVector()
      For each new document d
          doc ← buildVector(d)
          v ← (); decision ← ()
          For each semantic class c
              v[c] ← sim_c(doc_c, topic_c)
          If (⟨w, φ(v)⟩ + b ≥ 0)
              decision ← 'YES'
          else
              decision ← 'NO'
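The tracking loop can be sketched in Python as follows. To keep the example self-contained, it swaps in stand-ins that are not from the slides: a Jaccard set overlap replaces all four class-wise similarity functions, and a hand-set mean-minus-threshold score replaces the learned $\langle w, \varphi(v) \rangle + b$. The names `track`, `jaccard`, and `score` are hypothetical.

```python
CLASSES = ("TERMS", "LOCATIONS", "NAMES", "TEMPORALS")


def jaccard(a, b):
    """Stand-in similarity: set overlap (not the class-specific metrics)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def track(topic, documents, similarity, score):
    """Return a 'YES'/'NO' decision for every incoming document.

    topic, documents: event vectors as dicts {class name: list of items}
    similarity: dict {class name: function(doc_part, topic_part) -> float}
    score: function(v) -> float, standing in for <w, phi(v)> + b
    """
    decisions = []
    for doc in documents:
        v = [similarity[c](doc[c], topic[c]) for c in CLASSES]
        decisions.append("YES" if score(v) >= 0 else "NO")
    return decisions


topic = {"TERMS": ["prime minister", "appoint"], "LOCATIONS": ["Ramallah"],
         "NAMES": ["Abbas"], "TEMPORALS": ["Wednesday"]}
on_topic = {"TERMS": ["prime minister", "appoint", "deputy"],
            "LOCATIONS": ["Ramallah"], "NAMES": ["Abbas", "Arafat"],
            "TEMPORALS": ["Wednesday"]}
off_topic = {"TERMS": ["election"], "LOCATIONS": ["Helsinki"],
             "NAMES": [], "TEMPORALS": ["Monday"]}

similarity = {c: jaccard for c in CLASSES}


def score(v):
    """Hand-set stand-in for the learned hyperplane."""
    return sum(v) / len(v) - 0.5


decisions = track(topic, [on_topic, off_topic], similarity, score)
```

Passing the similarity functions and the scoring function as arguments keeps the loop itself independent of how each semantic class is compared.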
  • 15. First Story Detection Algorithm
      topics ← (); decision ← ()
      For each new document d
          doc ← buildVector(d)
          max ← 0; max_topic ← 0
          For each topic in topics
              For each semantic class c
                  v[c] ← sim_c(doc_c, topic_c)
              If (⟨w, φ(v)⟩ + b ≥ max)
                  max ← ⟨w, φ(v)⟩ + b
                  max_topic ← topic
          If (max < θ)
              decision[d] ← 'first-story'
          else
              decision[d] ← max_topic
          add(topics, doc)
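A compact sketch of the first-story loop is given below. It assumes a positive threshold θ, adds every processed document to the list of topics (one reading of where `add(topics, doc)` belongs), and uses a single hypothetical callable `psi(doc, topic)` in place of the class-wise comparison followed by $\langle w, \varphi(v) \rangle + b$. The toy `overlap` score and documents are invented.

```python
def first_story_detection(documents, psi, theta):
    """Label each document 'first-story' or with the index of its best topic.

    psi: function(doc, topic) -> float, standing in for <w, phi(v)> + b
    theta: decision threshold, assumed to be positive
    """
    topics = []
    decisions = []
    for doc in documents:
        best, best_topic = 0.0, None
        for index, topic in enumerate(topics):
            s = psi(doc, topic)
            if s >= best:
                best, best_topic = s, index
        decisions.append("first-story" if best < theta else best_topic)
        topics.append(doc)
    return decisions


def overlap(doc, topic):
    """Toy score: Jaccard overlap of two term sets (an assumption)."""
    return len(doc & topic) / len(doc | topic) if doc | topic else 0.0


docs = [
    {"arafat", "abbas", "prime", "minister"},
    {"abbas", "prime", "minister", "cabinet"},
    {"earthquake", "turkey"},
]
decisions = first_story_detection(docs, overlap, theta=0.3)
```

The first document has nothing to be compared with and is therefore a first story; the second overlaps strongly with it and is assigned to topic 0; the third matches nothing seen so far.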
  • 16. Experiments • The text corpus contains 60,000 documents from two on-line newspapers, two TV broadcasts, and two radio broadcasts • Automatic term extraction is combined with automata and a gazetteer to improve performance
  • 17. Topic Tracking Results
      Method         C_det    (C_det)_norm   P_miss   P_fa     p        r        F1
      Cosine         0.0058   0.0720         0.0100   0.0470   0.2361   0.7900   0.2927
      Weighted Sum   0.0471   0.5214         0.1818   0.0668   0.1646   0.8181   0.2741
      Table: Using (C_det)_norm

      Method         C_det    (C_det)_norm   P_miss   P_fa     p        r        F1
      Cosine         0.0524   0.6553         0.2582   0.0097   0.5297   0.7481   0.5481
      Weighted Sum   0.0849   1.0621         0.4242   0.0015   0.8636   0.5758   0.6910
      Table: Using F1
  • 18. First-Story Detection Results
      Method         C_det    (C_det)_norm   P_miss   P_fa     p        r        F1
      Cosine         0.0033   0.0414         0.0000   0.0414   0.4583   1.0000   0.6386
      Weighted Sum   0.0036   0.0446         0.0000   0.0446   0.4400   1.0000   0.6111
      Table: Using (C_det)_norm

      Method         C_det    (C_det)_norm   P_miss   P_fa     p        r        F1
      Cosine         0.0381   0.4768         0.1818   0.0223   0.5625   0.8181   0.6667
      Weighted Sum   0.0558   0.6977         0.2727   0.0159   0.6154   0.7272   0.6667
      Table: Using F1
  • 19. Discussion • In topic tracking, performance degrades because the similarity measures lack a vagueness factor • For example, a match on the term Asia and a match on the term Washington produce the same similarity score, which does not account for the indefiniteness of the terms • Including a posteriori approaches that examine all the data and the labels might improve performance
  • 20. Conclusions • The paper presents a topic detection and tracking algorithm based on semantic classes • Comparison is class-wise • Geographical and temporal ontologies were created • Semantic augmentation degraded performance, especially in topic tracking • This is partially due to inadequate spatial and temporal similarity functions