Corpus-based and Knowledge-based Measures of Text Semantic Similarity

Rada Mihalcea and Courtney Corley
Department of Computer Science
University of North Texas
{rada,corley}@cs.unt.edu

Carlo Strapparava
Istituto per la Ricerca Scientifica e Tecnologica
ITC – irst
strappa@itc.it

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper presents a method for measuring the semantic similarity of texts, using corpus-based and knowledge-based measures of similarity. Previous work on this problem has focused mainly on either large documents (e.g. text classification, information retrieval) or individual words (e.g. synonymy tests). Given that a large fraction of the information available today, on the Web and elsewhere, consists of short text snippets (e.g. abstracts of scientific documents, image captions, product descriptions), in this paper we focus on measuring the semantic similarity of short texts. Through experiments performed on a paraphrase data set, we show that the semantic similarity method outperforms methods based on simple lexical matching, resulting in up to 13% error rate reduction with respect to the traditional vector-based similarity metric.

Introduction

Measures of text similarity have been used for a long time in applications in natural language processing and related areas. One of the earliest applications of text similarity is perhaps the vectorial model in information retrieval, where the document most relevant to an input query is determined by ranking documents in a collection in reversed order of their similarity to the given query (Salton & Lesk 1971). Text similarity has also been used for relevance feedback and text classification (Rocchio 1971), word sense disambiguation (Lesk 1986; Schutze 1998), and more recently for extractive summarization (Salton et al. 1997), and methods for automatic evaluation of machine translation (Papineni et al. 2002) or text summarization (Lin & Hovy 2003). Measures of text similarity were also found useful for the evaluation of text coherence (Lapata & Barzilay 2005).

With few exceptions, the typical approach to finding the similarity between two text segments is to use a simple lexical matching method, and produce a similarity score based on the number of lexical units that occur in both input segments. Improvements to this simple method have considered stemming, stop-word removal, part-of-speech tagging, longest subsequence matching, as well as various weighting and normalization factors (Salton & Buckley 1997). While successful to a certain degree, these lexical similarity methods cannot always identify the semantic similarity of texts. For instance, there is an obvious similarity between the text segments I own a dog and I have an animal, but most of the current text similarity metrics will fail in identifying any kind of connection between these texts.
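To make this failure mode concrete, here is a minimal sketch (our illustration, not part of the original paper) of the lexical matching score just described; it finds almost no overlap between the dog/animal pair despite their obvious relatedness.

```python
def lexical_overlap(text1, text2):
    """Lexical matching baseline: similarity from the number of
    word types occurring in both input segments (Dice coefficient)."""
    w1, w2 = set(text1.lower().split()), set(text2.lower().split())
    return 2 * len(w1 & w2) / (len(w1) + len(w2))

# Only "i" is shared; the related pairs own/have and dog/animal contribute nothing.
print(lexical_overlap("I own a dog", "I have an animal"))  # 0.25
```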
There is a large number of word-to-word semantic similarity measures, using approaches that are either knowledge-based (Wu & Palmer 1994; Leacock & Chodorow 1998) or corpus-based (Turney 2001). Such measures have been successfully applied to language processing tasks such as malapropism detection (Budanitsky & Hirst 2001), word sense disambiguation (Patwardhan, Banerjee, & Pedersen 2003), and synonym identification (Turney 2001). For text-based semantic similarity, perhaps the most widely used approaches are the approximations obtained through query expansion, as performed in information retrieval (Voorhees 1993), or the latent semantic analysis method (Landauer, Foltz, & Laham 1998) that measures the similarity of texts by exploiting second-order word relations automatically acquired from large text collections.

A related line of work consists of methods for paraphrase recognition, which typically seek to align sentences in comparable corpora (Barzilay & Elhadad 2003; Dolan, Quirk, & Brockett 2004), or paraphrase generation using distributional similarity applied on paths of dependency trees (Lin & Pantel 2001) or using bilingual parallel corpora (Bannard & Callison-Burch 2005). These methods target the identification of paraphrases in large documents, or the generation of paraphrases starting with an input text, without necessarily providing a measure of their similarity. The recently introduced textual entailment task (Dagan, Glickman, & Magnini 2005) is also related to some extent; however, textual entailment targets the identification of a directional inferential relation between texts, which is different from textual similarity, and hence entailment systems are not overviewed here.

In this paper, we suggest a method for measuring the semantic similarity of texts by exploiting the information that can be drawn from the similarity of the component words. Specifically, we describe two corpus-based and six knowledge-based measures of word semantic similarity, and show how they can be used to derive a text-to-text similarity metric. We show that this measure of text semantic similarity outperforms the simpler vector-based similarity approach, as evaluated on a paraphrase recognition task.
Text Semantic Similarity

Measures of semantic similarity have traditionally been defined between words or concepts, and much less between text segments consisting of two or more words. The emphasis on word-to-word similarity metrics is probably due to the availability of resources that specifically encode relations between words or concepts (e.g. WordNet), and the various testbeds that allow for their evaluation (e.g. TOEFL or SAT analogy/synonymy tests). Moreover, the derivation of a text-to-text measure of similarity starting with a word-based semantic similarity metric may not be straightforward, and consequently most of the work in this area has considered mainly applications of the traditional vectorial model, occasionally extended to n-gram language models.

Given two input text segments, we want to automatically derive a score that indicates their similarity at the semantic level, thus going beyond the simple lexical matching methods traditionally used for this task. Although we acknowledge the fact that a comprehensive metric of text semantic similarity should also take into account the structure of the text, we take a first rough cut at this problem and attempt to model the semantic similarity of texts as a function of the semantic similarity of the component words. We do this by combining metrics of word-to-word similarity and word specificity into a formula that is a potentially good indicator of the semantic similarity of the two input texts.

The following section provides details on eight different corpus-based and knowledge-based measures of word semantic similarity. In addition to the similarity of words, we also take into account the specificity of words, so that we can give a higher weight to a semantic matching identified between two specific words (e.g. collie and sheepdog), and give less importance to the similarity measured between generic concepts (e.g. get and become). While the specificity of words is already measured to some extent by their depth in the semantic hierarchy, we are reinforcing this factor with a corpus-based measure of word specificity, based on distributional information learned from large corpora.

The specificity of a word is determined using the inverse document frequency (idf) introduced by Sparck-Jones (1972), defined as the total number of documents in the corpus divided by the total number of documents including that word. The idf measure was selected based on previous work that theoretically proved the effectiveness of this weighting approach (Papineni 2001). In the experiments reported here, document frequency counts are derived starting with the British National Corpus – a 100 million word corpus of modern English including both spoken and written genres.
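The idf computation itself is straightforward; the following is a small sketch (ours) over a toy stand-in corpus, following the ratio definition given above (the paper derives its counts from the British National Corpus):

```python
def build_idf(documents):
    """idf(w) = total #documents / #documents containing w (Sparck-Jones 1972),
    as defined in the text; many implementations additionally take the log."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc.lower().split()):
            df[word] = df.get(word, 0) + 1
    return {w: n_docs / c for w, c in df.items()}

# Toy corpus for illustration only.
docs = ["the defendant walked into the court",
        "the lawyer turned his back",
        "green plants next to the house"]
idf = build_idf(docs)
print(idf["the"], idf["defendant"])  # frequent words get low weight (1.0), specific words high (3.0)
```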
Given a metric for word-to-word similarity and a measure of word specificity, we define the semantic similarity of two text segments T1 and T2 using a metric that combines the semantic similarities of each text segment in turn with respect to the other text segment. First, for each word w in the segment T1 we try to identify the word in the segment T2 that has the highest semantic similarity (maxSim(w, T2)), according to one of the word-to-word similarity measures described in the following section. Next, the same process is applied to determine the most similar word in T1 starting with words in T2. The word similarities are then weighted with the corresponding word specificity, summed up, and normalized with the length of each text segment. Finally, the resulting similarity scores are combined using a simple average. Note that only open-class words and cardinals can participate in this semantic matching process. As done in previous work on text similarity using vector-based models, all function words are discarded.

The similarity between the input text segments T1 and T2 is therefore determined using the following scoring function:

    sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in \{T_1\}} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in \{T_1\}} idf(w)} + \frac{\sum_{w \in \{T_2\}} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in \{T_2\}} idf(w)} \right)    (1)

This similarity score has a value between 0 and 1, with a score of 1 indicating identical text segments, and a score of 0 indicating no semantic overlap between the two segments.

Note that the maximum similarity is sought only within classes of words with the same part-of-speech. The reason behind this decision is that most of the word-to-word knowledge-based measures cannot be applied across parts-of-speech, and consequently, for the purpose of consistency, we imposed the "same word-class" restriction on all the word-to-word similarity measures. This means that, for instance, the most similar word to the noun flower within the text "There are many green plants next to the house" will be sought among the nouns plant and house, and will ignore the words with a different part-of-speech (be, green, next). Moreover, for those parts-of-speech for which a word-to-word semantic similarity cannot be measured (e.g. some knowledge-based measures are not defined among adjectives or adverbs), we use instead a lexical match measure, which assigns a maxSim of 1 for identical occurrences of a word in the two text segments.
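A minimal sketch of Equation 1 follows (our illustration; the word_sim and idf arguments are placeholders for any of the word-to-word measures and the idf table described above, and the part-of-speech restriction and lexical-match fallback are omitted for brevity):

```python
def text_similarity(t1, t2, word_sim, idf):
    """Equation 1: the average of two directional scores, each the
    idf-weighted mean of the best match maxSim(w, T) for every word w."""
    def directional(src, tgt):
        num = sum(max(word_sim(w, w2) for w2 in tgt) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den
    return 0.5 * (directional(t1, t2) + directional(t2, t1))

# Usage with any word-to-word measure normalized to [0, 1]:
# score = text_similarity(tokens1, tokens2, pmi_ir, idf)
```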
                                                                    tent semantic analysis (Landauer, Foltz, & Laham 1998).
is applied to determine the most similar word in T1 starting
with words in T2 . The word similarities are then weighted          Pointwise Mutual Information The pointwise mutual
with the corresponding word specificity, summed up, and              information using data collected by information retrieval
Pointwise Mutual Information  The pointwise mutual information using data collected by information retrieval (PMI-IR) was suggested by Turney (2001) as an unsupervised measure for the evaluation of the semantic similarity of words. It is based on word co-occurrence using counts collected over very large corpora (e.g. the Web). Given two words w1 and w2, their PMI-IR is measured as:

    PMI\text{-}IR(w_1, w_2) = \log_2 \frac{p(w_1 \& w_2)}{p(w_1) \cdot p(w_2)}    (2)

which indicates the degree of statistical dependence between w1 and w2, and can be used as a measure of the semantic similarity of w1 and w2. From the four different types of queries suggested by Turney (2001), we are using the NEAR query (co-occurrence within a ten-word window), which is a balance between accuracy (results obtained on synonymy tests) and efficiency (number of queries to be run against a search engine). Specifically, the following query is used to collect counts from the AltaVista search engine:

    p_{NEAR}(w_1 \& w_2) \simeq \frac{hits(w_1\ NEAR\ w_2)}{WebSize}    (3)

With p(w_i) approximated as hits(w_i)/WebSize, the following PMI-IR measure is obtained:

    \log_2 \frac{hits(w_1\ AND\ w_2) \cdot WebSize}{hits(w_1) \cdot hits(w_2)}    (4)

In a set of experiments based on TOEFL synonymy tests (Turney 2001), the PMI-IR measure using the NEAR operator accurately identified the correct answer (out of four synonym choices) in 72.5% of the cases, which exceeded by a large margin the score obtained with latent semantic analysis (64.4%), as well as the average non-English college applicant (64.5%). Since Turney (2001) performed evaluations of synonym candidates for one word at a time, the WebSize value was irrelevant in the ranking. In our application instead, it is not only the ranking of the synonym candidates that matters (for the selection of maxSim in Equation 1), but also the true value of PMI-IR, which is needed for the overall calculation of the text-to-text similarity metric. We approximate the value of WebSize as 7 × 10^11, which is the value used by Chklovski and Pantel (2004) in co-occurrence experiments involving Web counts.
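For illustration, a minimal sketch of Equation 4 (ours; the hit counts are placeholders for search-engine results, which we do not fetch here):

```python
import math

WEB_SIZE = 7e11  # approximation used in the paper, following Chklovski and Pantel (2004)

def pmi_ir(hits_pair, hits_w1, hits_w2):
    """Equation 4: PMI-IR from page counts. hits_pair is the count returned
    by the co-occurrence query (the paper uses AltaVista's NEAR operator,
    i.e. co-occurrence within a ten-word window)."""
    if hits_pair == 0 or hits_w1 == 0 or hits_w2 == 0:
        return 0.0
    return math.log2((hits_pair * WEB_SIZE) / (hits_w1 * hits_w2))
```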
                                                                                                       length
Latent Semantic Analysis  Another corpus-based measure of semantic similarity is latent semantic analysis (LSA), proposed by Landauer (1998). In LSA, term co-occurrences in a corpus are captured by means of a dimensionality reduction operated by a singular value decomposition (SVD) on the term-by-document matrix T representing the corpus. For the experiments reported here, we run the SVD operation on the British National Corpus.

SVD is a well-known operation in linear algebra, which can be applied to any rectangular matrix in order to find correlations among its rows and columns. In our case, SVD decomposes the term-by-document matrix T into three matrices T = U Σ_k V^T, where Σ_k is the diagonal k × k matrix containing the k singular values of T, σ_1 ≥ σ_2 ≥ ... ≥ σ_k, and U and V are column-orthogonal matrices. When the three matrices are multiplied together, the original term-by-document matrix is re-composed. Typically we can choose k' ≪ k, obtaining the approximation T ≃ U Σ_{k'} V^T.

LSA can be viewed as a way to overcome some of the drawbacks of the standard vector space model (sparseness and high dimensionality). In fact, the LSA similarity is computed in a lower dimensional space, in which second-order relations among terms and texts are exploited. The similarity in the resulting vector space is then measured with the standard cosine similarity. Note also that LSA yields a vector space model that allows for a homogeneous representation (and hence comparison) of words, word sets, and texts.

The application of the LSA word similarity measure to text semantic similarity is done using Equation 1, which roughly amounts to the pseudo-document text representation for LSA computation, as described by Berry (1992). In practice, each text segment is represented in the LSA space by summing up the normalized LSA vectors of all the constituent words, using also a tf.idf weighting scheme.
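A small sketch of the LSA pipeline just described (our illustration using numpy; the truncation rank k and the inputs are assumptions, not the paper's setup):

```python
import numpy as np

def lsa_space(term_doc, k):
    """Truncated SVD of the term-by-document matrix: T ~= U_k Sigma_k V_k^T.
    Rows of U_k * Sigma_k serve as k-dimensional word vectors."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k] * s[:k]

def text_vector(word_vectors, word_ids, tfidf):
    """Pseudo-document: sum of the tf.idf-weighted, normalized word vectors."""
    vec = np.zeros(word_vectors.shape[1])
    for w, weight in zip(word_ids, tfidf):
        v = word_vectors[w]
        vec += weight * v / (np.linalg.norm(v) + 1e-12)
    return vec

def cosine(a, b):
    """Similarity of two pseudo-documents in the reduced LSA space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```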
Knowledge-based Measures

There are a number of measures that were developed to quantify the degree to which two words are semantically related using information drawn from semantic networks – see e.g. (Budanitsky & Hirst 2001) for an overview. We present below several measures found to work well on the WordNet hierarchy. All these measures assume as input a pair of concepts, and return a value indicating their semantic relatedness. The six measures below were selected based on their observed performance in other language processing applications, and for their relatively high computational efficiency.

We conduct our evaluation using the following word similarity metrics: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, and Jiang & Conrath. Note that all these metrics are defined between concepts, rather than words, but they can be easily turned into a word-to-word similarity metric by selecting for any given pair of words those two meanings that lead to the highest concept-to-concept similarity.[1] We use the WordNet-based implementation of these metrics, as available in the WordNet::Similarity package (Patwardhan, Banerjee, & Pedersen 2003). We provide below a short description for each of these six metrics.

[1] This is similar to the methodology used by (McCarthy et al. 2004) to find similarities between words and senses starting with a concept-to-concept similarity measure.

The Leacock & Chodorow (Leacock & Chodorow 1998) similarity is determined as:

    Sim_{lch} = -\log \frac{length}{2 \cdot D}    (5)

where length is the length of the shortest path between two concepts using node-counting, and D is the maximum depth of the taxonomy.

The Lesk similarity of two concepts is defined as a function of the overlap between the corresponding definitions, as provided by a dictionary. It is based on an algorithm proposed by Lesk (1986) as a solution for word sense disambiguation. The application of the Lesk similarity measure is not limited to semantic networks, and it can be used in conjunction with any dictionary that provides word definitions.

The Wu and Palmer (Wu & Palmer 1994) similarity metric measures the depth of two given concepts in the WordNet taxonomy, and the depth of the least common subsumer (LCS), and combines these figures into a similarity score:

    Sim_{wup} = \frac{2 \cdot depth(LCS)}{depth(concept_1) + depth(concept_2)}    (6)

The measure introduced by Resnik (Resnik 1995) returns the information content (IC) of the LCS of two concepts:

    Sim_{res} = IC(LCS)    (7)

where IC is defined as:

    IC(c) = -\log P(c)    (8)

and P(c) is the probability of encountering an instance of concept c in a large corpus.

The next measure we use in our experiments is the metric introduced by Lin (Lin 1998), which builds on Resnik's measure of similarity, and adds a normalization factor consisting of the information content of the two input concepts:

    Sim_{lin} = \frac{2 \cdot IC(LCS)}{IC(concept_1) + IC(concept_2)}    (9)

Finally, the last similarity metric considered is Jiang & Conrath (Jiang & Conrath 1997):

    Sim_{jnc} = \frac{1}{IC(concept_1) + IC(concept_2) - 2 \cdot IC(LCS)}    (10)

Note that all the word similarity measures are normalized so that they fall within a 0–1 range. The normalization is done by dividing the similarity score provided by a given measure by the maximum possible score for that measure.
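As a hedged illustration of the sense-maximization step described in footnote 1, the sketch below uses NLTK's WordNet interface (an assumed substitute for the Perl WordNet::Similarity package actually used in the paper), shown here with the Wu & Palmer metric:

```python
from nltk.corpus import wordnet as wn

def max_concept_similarity(word1, word2, pos=wn.NOUN):
    """Word-to-word similarity from a concept-to-concept metric: take the two
    senses that maximize the score. Wu & Palmer shown here; NLTK also exposes
    lch_similarity, res_similarity, lin_similarity, and jcn_similarity."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=pos):
        for s2 in wn.synsets(word2, pos=pos):
            score = s1.wup_similarity(s2) or 0.0
            best = max(best, score)
    return best  # wup is already in the 0-1 range

print(max_concept_similarity("lawyer", "attorney"))  # shared senses give the maximum score
```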
                                                                         pus were automatically collected from thousands of news
A Walk-Through Example

The application of the text similarity measure is illustrated with an example. Given two text segments, as shown below, we want to determine a score that reflects their semantic similarity. For illustration purposes, we restrict our attention to one corpus-based measure – the PMI-IR metric implemented using the AltaVista NEAR operator.

Text Segment 1: When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him.
Text Segment 2: When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.

Starting with each of the two text segments, and for each open-class word, we determine the most similar word in the other text segment, according to the PMI-IR similarity measure. As mentioned earlier, a semantic similarity is sought only between words with the same part-of-speech. Table 1 shows the word similarity scores and the word specificity (idf) starting with the first text segment.

Text 1       Text 2       maxSim   idf
defendant    defendant    1.00     3.93
lawyer       attorney     0.89     2.64
walked       walked       1.00     1.58
court        courthouse   0.60     1.06
victims      courthouse   0.40     2.11
supporters   crowd        0.40     2.15
turned       turned       1.00     0.66
backs        backs        1.00     2.41

Table 1: Word similarity scores and word specificity (idf)

Next, using Equation 1, we combine the word similarities and their corresponding specificity, and determine the semantic similarity of the two texts as 0.80. This similarity score correctly identifies the paraphrase relation between the two text segments (using the same threshold of 0.50 as used throughout all the experiments reported in this paper). Instead, a cosine similarity score based on the same idf weights will result in a score of 0.46, thereby failing to find the paraphrase relation.

Although there are a few words that occur in both text segments (e.g. defendant, or turn), there are also words that are not identical, but closely related, e.g. lawyer found similar to attorney, or supporters which is related to crowd. Unlike traditional similarity measures based on lexical matching, our metric takes into account the semantic similarity of these words, resulting in a more precise measure of text similarity.

Evaluation and Results

To test the effectiveness of the text semantic similarity measure, we use it to automatically identify if two text segments are paraphrases of each other. We use the Microsoft paraphrase corpus (Dolan, Quirk, & Brockett 2004), consisting of 4,076 training and 1,725 test pairs, and determine the number of correctly identified paraphrase pairs in the corpus using the text semantic similarity measure as the only indicator of paraphrasing. The paraphrase pairs in this corpus were automatically collected from thousands of news sources on the Web over a period of 18 months, and were subsequently labeled by two human annotators who determined if the two sentences in a pair were semantically equivalent or not. The agreement between the human judges who labeled the candidate paraphrase pairs in this data set was measured at approximately 83%, which can be considered an upper bound for an automatic paraphrase recognition task performed on this data set.

For each candidate paraphrase pair in the test set, we first evaluate the text semantic similarity metric using Equation 1, and then label the candidate pair as a paraphrase if the similarity score exceeds a threshold of 0.5. Note that this is an unsupervised experimental setting, and therefore the training data is not used in the experiments.

Baselines

For comparison, we also compute two baselines: (1) a random baseline created by randomly choosing a true (paraphrase) or false (not paraphrase) value for each text pair; and (2) a vector-based similarity baseline, using a cosine similarity measure as traditionally used in information retrieval, with tf.idf weighting.
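A minimal sketch of this decision rule and of the evaluation measures reported below (our illustration, assuming a list of (score, gold_label) pairs produced by Equation 1):

```python
def evaluate(scored_pairs, threshold=0.5):
    """Label a pair as paraphrase when sim(T1, T2) > threshold, then score
    against the gold labels. scored_pairs: iterable of (similarity, is_paraphrase)."""
    tp = fp = tn = fn = 0
    for score, gold in scored_pairs:
        pred = score > threshold
        if pred and gold: tp += 1
        elif pred and not gold: fp += 1
        elif not pred and gold: fn += 1
        else: tn += 1
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```

The combined metric discussed below simply averages the per-pair scores of the individual measures before applying the same threshold.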
Results

We evaluate the results in terms of accuracy, representing the number of correctly identified true or false classifications in the test data set. We also measure precision, recall and F-measure, calculated with respect to the true values in the test data. Table 2 shows the results obtained. Among all the individual measures of similarity, the PMI-IR measure was found to perform the best, although the difference with respect to the other measures is small.

Metric         Acc.   Prec.   Rec.   F
Semantic similarity (corpus-based)
PMI-IR         69.9   70.2    95.2   81.0
LSA            68.4   69.7    95.2   80.5
Semantic similarity (knowledge-based)
J&C            69.3   72.2    87.1   79.0
L&C            69.5   72.4    87.0   79.0
Lesk           69.3   72.4    86.6   78.9
Lin            69.3   71.6    88.7   79.2
W&P            69.0   70.2    92.1   80.0
Resnik         69.0   69.0    96.4   80.4
Combined       70.3   69.6    97.7   81.3
Baselines
Vector-based   65.4   71.6    79.5   75.3
Random         51.3   68.3    50.0   57.8

Table 2: Text similarity for paraphrase identification

In addition to the individual measures of similarity, we also evaluate a metric that combines several similarity measures into a single figure, using a simple average. We include all similarity measures, for an overall final accuracy of 70.3%, and an F-measure of 81.3%.

The improvement of the semantic similarity metrics over the vector-based cosine similarity was found to be statistically significant in all the experiments, using a paired t-test (p < 0.001).

Discussion and Conclusions

As it turns out, incorporating semantic information into measures of text similarity increases the likelihood of recognition significantly over the random baseline and over the vector-based cosine similarity baseline, as measured in a paraphrase recognition task. The best performance is achieved using a method that combines several similarity metrics into one, for an overall accuracy of 70.3%, representing a significant 13.8% error rate reduction with respect to the vector-based cosine similarity baseline. Moreover, if we were to take into account the upper bound of 83% established by the inter-annotator agreement achieved on this data set (Dolan, Quirk, & Brockett 2004), the error rate reduction over the baseline appears even more significant.

In addition to performance, we also tried to gain insights into the applicability of the semantic similarity measures, by finding their coverage on this data set. On average, among approximately 18,000 word similarities identified in this corpus, about 14,500 are due to lexical matches, and 3,500 are due to semantic similarities, which indicates that about 20% of the relations found between text segments are based on semantics, in addition to lexical identity.

Despite the differences among the various word-to-word similarity measures (corpus-based vs. knowledge-based, definitional vs. link-based), the results are surprisingly similar. To determine if the similar overall results are due to a similar behavior on the same subset of the test data (presumably an "easy" subset that can be solved using measures of semantic similarity), or if the different measures cover in fact different subsets of the data, we calculated the Pearson correlation factor among all the similarity measures. As seen in Table 3, there is in fact a high correlation among several of the knowledge-based measures, indicating an overlap in their behavior. Although some of these metrics are divergent in what they measure (e.g. Lin versus Lesk), it seems that the fact that they are applied in a context lessens the differences observed when applied at the word level. Interestingly, the Resnik measure has a low correlation with the other knowledge-based measures, and a somewhat higher correlation with the corpus-based metrics, which is probably due to the data-driven information content used in the Resnik measure (although Lin and Jiang & Conrath also use the information content, they have an additional normalization factor that makes them behave differently). Perhaps not surprisingly, the corpus-based measures are only weakly correlated with the knowledge-based measures and among themselves, with LSA having the smallest correlation with the other metrics.

An interesting example is represented by the following two text segments, where only the Resnik measure and the two corpus-based measures manage to identify the paraphrase, because of a higher similarity found between systems and PC, and between technology and processor.

Text Segment 1: Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.
Text Segment 2: Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor.

There are also cases where almost all the semantic similarity measures fail, and instead the simpler cosine similarity has a better performance. This is mostly the case for the negative (not paraphrase) examples in the test data, where the semantic similarities identified between words increase the overall text similarity above the threshold of 0.5. For instance, the following text segments were falsely marked as paraphrases by all but the cosine similarity and the Resnik measure:

Text Segment 1: The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.
Text Segment 2: The man was trapped about 250 feet from the shore, right at the edge of the falls.

The small variations between the accuracies obtained with the corpus-based and knowledge-based measures also suggest that both data-driven and knowledge-rich methods have their own merits, leading to a similar performance. Corpus-based methods have the advantage that no hand-made resources are needed and, apart from the choice of an appropriate and large corpus, they raise no problems related to the completeness of the resources. On the other hand, knowledge-based methods can encode fine-grained information. This difference can be observed in terms of precision and recall. In fact, while precision is generally higher with knowledge-based measures, corpus-based measures give in general better performance in recall.
          Vect   PMI-IR   LSA    J&C    L&C    Lesk   Lin    W&P    Resnik
Vect      1.00   0.84     0.44   0.61   0.63   0.60   0.61   0.50   0.65
PMI-IR           1.00     0.58   0.67   0.68   0.65   0.67   0.58   0.64
LSA                       1.00   0.42   0.44   0.42   0.43   0.34   0.41
J&C                              1.00   0.98   0.97   0.99   0.91   0.45
L&C                                     1.00   0.98   0.98   0.87   0.46
Lesk                                           1.00   0.96   0.86   0.43
Lin                                                   1.00   0.88   0.44
W&P                                                          1.00   0.34
Resnik                                                              1.00

Table 3: Pearson correlation among similarity measures
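Such a correlation matrix can be computed directly from the per-pair similarity scores; a one-function sketch (ours, using numpy) follows:

```python
import numpy as np

def correlation_matrix(scores):
    """scores: one row per measure (Vect, PMI-IR, LSA, ...), one column per
    test pair. Returns the matrix of pairwise Pearson correlations, as in Table 3."""
    return np.corrcoef(np.asarray(scores))
```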


Although our method relies on a bag-of-words approach, as it turns out the use of measures of semantic similarity improves significantly over the traditional lexical matching metrics. We are nonetheless aware that a bag-of-words approach ignores many of the important relationships in sentence structure, such as dependencies between words, or roles played by the various arguments in the sentence. Future work will consider the investigation of more sophisticated representations of sentence structure, such as first order predicate logic or semantic parse trees, which should allow for the implementation of more effective measures of text semantic similarity.

References

Bannard, C., and Callison-Burch, C. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.
Barzilay, R., and Elhadad, N. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Berry, M. 1992. Large-scale sparse singular value computations. International Journal of Supercomputer Applications 6(1).
Budanitsky, A., and Hirst, G. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources.
Chklovski, T., and Pantel, P. 2004. VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Dagan, I.; Glickman, O.; and Magnini, B. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL Workshop.
Dolan, W.; Quirk, C.; and Brockett, C. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics.
Jiang, J., and Conrath, D. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics.
Landauer, T. K.; Foltz, P.; and Laham, D. 1998. Introduction to latent semantic analysis. Discourse Processes 25.
Lapata, M., and Barzilay, R. 2005. Automatic evaluation of text coherence: Models and representations. In Proceedings of the 19th International Joint Conference on Artificial Intelligence.
Leacock, C., and Chodorow, M. 1998. Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic Lexical Database. The MIT Press.
Lesk, M. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference 1986.
Lin, C., and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference.
Lin, D., and Pantel, P. 2001. Discovery of inference rules for question answering. Natural Language Engineering 7(3).
Lin, D. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning.
McCarthy, D.; Koeling, R.; Weeds, J.; and Carroll, J. 2004. Finding predominant senses in untagged text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Papineni, K. 2001. Why inverse document frequency? In Proceedings of the North American Chapter of the Association for Computational Linguistics, 25–32.
Patwardhan, S.; Banerjee, S.; and Pedersen, T. 2003. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics.
Resnik, P. 1995. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence.
Rocchio, J. 1971. Relevance feedback in information retrieval. Prentice Hall, Inc., Englewood Cliffs, New Jersey.
Salton, G., and Buckley, C. 1997. Term weighting approaches in automatic text retrieval. In Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann Publishers.
Salton, G., and Lesk, M. 1971. Computer evaluation of indexing and text processing. Prentice Hall, Inc., Englewood Cliffs, New Jersey. 143–180.
Salton, G.; Singhal, A.; Mitra, M.; and Buckley, C. 1997. Automatic text structuring and summarization. Information Processing and Management 2(32).
Schutze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24(1):97–124.
Sparck-Jones, K. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1):11–21.
Turney, P. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001).
Voorhees, E. 1993. Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the 16th Annual International ACM SIGIR Conference.
Wu, Z., and Palmer, M. 1994. Verb semantics and lexical selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

The CLUES database: automated search for linguistic cognatesThe CLUES database: automated search for linguistic cognates
The CLUES database: automated search for linguistic cognates
 
Barzilay & Lapata 2008 presentation
Barzilay & Lapata 2008 presentationBarzilay & Lapata 2008 presentation
Barzilay & Lapata 2008 presentation
 
L1803058388
L1803058388L1803058388
L1803058388
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
 
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systems
 
IRJET-Semantic Similarity Between Sentences
IRJET-Semantic Similarity Between SentencesIRJET-Semantic Similarity Between Sentences
IRJET-Semantic Similarity Between Sentences
 
Semantic Similarity Between Sentences
Semantic Similarity Between SentencesSemantic Similarity Between Sentences
Semantic Similarity Between Sentences
 

Recently uploaded

Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 

Recently uploaded (20)

YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 

Corpus-based and Knowledge-based Measures of Text Semantic Similarity

parable corpora (Barzilay & Elhadad 2003; Dolan, Quirk, & Brockett 2004), or paraphrase generation using distributional similarity applied on paths of dependency trees (Lin & Pantel 2001) or using bilingual parallel corpora (Barnard & Callison-Burch 2005). These methods target the identification of paraphrases in large documents, or the generation of paraphrases starting with an input text, without necessarily providing a measure of their similarity. The recently introduced textual entailment task (Dagan, Glickman, & Magnini 2005) is also related to some extent; however, textual entailment targets the identification of a directional inferential relation between texts, which is different from textual similarity, and hence entailment systems are not overviewed here.

In the vectorial model, the document most relevant to an input query is determined by ranking the documents in a collection in reversed order of their similarity to the given query (Salton & Lesk 1971). Text similarity has also been used for relevance feedback and text classification (Rocchio 1971), word sense disambiguation (Lesk 1986; Schutze 1998), and more recently for extractive summarization (Salton et al. 1997), and for the automatic evaluation of machine translation (Papineni et al. 2002) and text summarization (Lin & Hovy 2003). Measures of text similarity were also found useful for the evaluation of text coherence (Lapata & Barzilay 2005). With few exceptions, the typical approach to finding the similarity between two text segments is to use a simple lexical matching method, and produce a similarity score based on the number of lexical units that occur in both input segments. Improvements to this simple method have considered stemming, stop-word removal, part-of-speech tagging, longest subsequence matching, as well as various weighting and normalization factors (Salton & Buckley 1997).

In this paper, we suggest a method for measuring the semantic similarity of texts by exploiting the information that can be drawn from the similarity of the component words. Specifically, we describe two corpus-based and six knowledge-based measures of word semantic similarity, and show how they can be used to derive a text-to-text similarity metric. We show that this measure of text semantic similarity outperforms the simpler vector-based similarity approach, as evaluated on a paraphrase recognition task.

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Text Semantic Similarity

Measures of semantic similarity have been traditionally defined between words or concepts, and much less between text segments consisting of two or more words. The emphasis on word-to-word similarity metrics is probably due to the availability of resources that specifically encode relations between words or concepts (e.g. WordNet), and the various testbeds that allow for their evaluation (e.g. TOEFL or SAT analogy/synonymy tests). Moreover, the derivation of a text-to-text measure of similarity starting with a word-based semantic similarity metric may not be straightforward, and consequently most of the work in this area has considered mainly applications of the traditional vectorial model, occasionally extended to n-gram language models.

Given two input text segments, we want to automatically derive a score that indicates their similarity at the semantic level, thus going beyond the simple lexical matching methods traditionally used for this task. Although we acknowledge the fact that a comprehensive metric of text semantic similarity should also take into account the structure of the text, we take a first rough cut at this problem and attempt to model the semantic similarity of texts as a function of the semantic similarity of the component words. We do this by combining metrics of word-to-word similarity and word specificity into a formula that is a potentially good indicator of the semantic similarity of the two input texts.

The following section provides details on eight different corpus-based and knowledge-based measures of word semantic similarity. In addition to the similarity of words, we also take into account the specificity of words, so that we can give a higher weight to a semantic matching identified between two specific words (e.g. collie and sheepdog), and give less importance to the similarity measured between generic concepts (e.g. get and become). While the specificity of words is already measured to some extent by their depth in the semantic hierarchy, we are reinforcing this factor with a corpus-based measure of word specificity, based on distributional information learned from large corpora.

The specificity of a word is determined using the inverse document frequency (idf) introduced by Sparck-Jones (1972), defined as the total number of documents in the corpus divided by the total number of documents including that word. The idf measure was selected based on previous work that theoretically proved the effectiveness of this weighting approach (Papineni 2001). In the experiments reported here, document frequency counts are derived starting with the British National Corpus – a 100 million word corpus of modern English including both spoken and written genres.

Given a metric for word-to-word similarity and a measure of word specificity, we define the semantic similarity of two text segments T1 and T2 using a metric that combines the semantic similarities of each text segment in turn with respect to the other text segment. First, for each word w in the segment T1 we try to identify the word in the segment T2 that has the highest semantic similarity (maxSim(w, T2)), according to one of the word-to-word similarity measures described in the following section. Next, the same process is applied to determine the most similar word in T1 starting with words in T2. The word similarities are then weighted with the corresponding word specificity, summed up, and normalized with the length of each text segment. Finally, the resulting similarity scores are combined using a simple average. Note that only open-class words and cardinals can participate in this semantic matching process. As done in previous work on text similarity using vector-based models, all function words are discarded.

The similarity between the input text segments T1 and T2 is therefore determined using the following scoring function:

  sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in \{T_1\}} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in \{T_1\}} idf(w)} + \frac{\sum_{w \in \{T_2\}} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in \{T_2\}} idf(w)} \right)   (1)

This similarity score has a value between 0 and 1, with a score of 1 indicating identical text segments, and a score of 0 indicating no semantic overlap between the two segments.
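To make Equation 1 concrete, here is a minimal Python sketch (not the authors' implementation); the (word, part-of-speech) input format, the word_sim callback, and the idf dictionary are assumptions of this sketch, standing in for any of the word-to-word measures and the BNC-derived idf values described below.

```python
def text_similarity(t1, t2, word_sim, idf):
    """Equation 1: symmetric, idf-weighted text-to-text similarity.

    t1, t2   -- lists of (word, pos) pairs (open-class words only,
                function words already removed)
    word_sim -- word-to-word similarity in [0, 1]; expected to return
                1.0 for identical words (the lexical match fallback)
    idf      -- dict mapping a word to its inverse document frequency
    """
    def directed(src, dst):
        # For each word in src, find the best-matching same-POS word
        # in dst, weight the match by idf, and normalize by idf mass.
        num = den = 0.0
        for w, pos in src:
            scores = [word_sim(w, w2) for w2, pos2 in dst if pos2 == pos]
            num += max(scores, default=0.0) * idf.get(w, 0.0)
            den += idf.get(w, 0.0)
        return num / den if den else 0.0

    # Combine the two directions with a simple average.
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```

A score above the 0.5 threshold used throughout the paper would then be taken as evidence of a paraphrase.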
Note that the maximum similarity is sought only within classes of words with the same part-of-speech. The reason behind this decision is that most of the word-to-word knowledge-based measures cannot be applied across parts-of-speech, and consequently, for the purpose of consistency, we imposed the "same word-class" restriction on all the word-to-word similarity measures. This means that, for instance, the most similar word to the noun flower within the text "There are many green plants next to the house" will be sought among the nouns plant and house, and will ignore the words with a different part-of-speech (be, green, next). Moreover, for those parts-of-speech for which a word-to-word semantic similarity cannot be measured (e.g. some knowledge-based measures are not defined among adjectives or adverbs), we use instead a lexical match measure, which assigns a maxSim of 1 for identical occurrences of a word in the two text segments.

Semantic Similarity of Words

There is a relatively large number of word-to-word similarity metrics that were previously proposed in the literature, ranging from distance-oriented measures computed on semantic networks, to metrics based on models of distributional similarity learned from large text collections. From these, we chose to focus our attention on two corpus-based metrics and six knowledge-based metrics, selected mainly for their observed performance in other natural language processing applications.

Corpus-based Measures

Corpus-based measures of word semantic similarity try to identify the degree of similarity between words using information exclusively derived from large corpora. In the experiments reported here, we considered two metrics, namely: (1) pointwise mutual information (Turney 2001), and (2) latent semantic analysis (Landauer, Foltz, & Laham 1998).
Pointwise Mutual Information. Pointwise mutual information using data collected by information retrieval (PMI-IR) was suggested by Turney (2001) as an unsupervised measure for the evaluation of the semantic similarity of words. It is based on word co-occurrence counts collected over very large corpora (e.g. the Web). Given two words w1 and w2, their PMI-IR is measured as:

  PMI\text{-}IR(w_1, w_2) = \log_2 \frac{p(w_1 \& w_2)}{p(w_1) \cdot p(w_2)}   (2)

which indicates the degree of statistical dependence between w1 and w2, and can be used as a measure of the semantic similarity of w1 and w2. From the four different types of queries suggested by Turney (2001), we are using the NEAR query (co-occurrence within a ten-word window), which is a balance between accuracy (results obtained on synonymy tests) and efficiency (number of queries to be run against a search engine). Specifically, the following approximation is used to collect counts from the AltaVista search engine:

  p_{NEAR}(w_1 \& w_2) \approx \frac{hits(w_1\ NEAR\ w_2)}{WebSize}   (3)

With p(w_i) approximated as hits(w_i)/WebSize, the following PMI-IR measure is obtained:

  \log_2 \frac{hits(w_1\ NEAR\ w_2) \cdot WebSize}{hits(w_1) \cdot hits(w_2)}   (4)

In a set of experiments based on TOEFL synonymy tests (Turney 2001), the PMI-IR measure using the NEAR operator accurately identified the correct answer (out of four synonym choices) in 72.5% of the cases, which exceeded by a large margin the score obtained with latent semantic analysis (64.4%), as well as the average non-English college applicant (64.5%). Since Turney (2001) performed evaluations of synonym candidates for one word at a time, the WebSize value was irrelevant in the ranking. In our application instead, it is not only the ranking of the synonym candidates that matters (for the selection of maxSim in Equation 1), but also the true value of PMI-IR, which is needed for the overall calculation of the text-to-text similarity metric. We approximate the value of WebSize as 7 × 10^11, which is the value used by Chklovski (2004) in co-occurrence experiments involving Web counts.
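Equations 2–4 reduce to a ratio of hit counts. A minimal sketch of Equation 4, assuming a hypothetical hits() count function (AltaVista's NEAR operator is no longer available, so any co-occurrence count within a ten-word window over a large corpus could stand in):

```python
import math

WEB_SIZE = 7e11  # the WebSize estimate used in the paper

def pmi_ir(w1, w2, hits):
    """Equation 4: PMI-IR from raw hit counts.

    hits -- hypothetical function: hits(w) returns the count for a
    single word, hits(w1, w2) the count of co-occurrences within a
    ten-word window (the NEAR query).
    """
    joint = hits(w1, w2)
    if joint == 0 or hits(w1) == 0 or hits(w2) == 0:
        return 0.0  # no co-occurrence evidence
    return math.log2(joint * WEB_SIZE / (hits(w1) * hits(w2)))
```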
We tion instead, it is not only the ranking of the synonym candi- use the WordNet-based implementation of these metrics, as dates that matters (for the selection of maxSim in Equation available in the WordNet::Similarity package (Patwardhan, 1), but also the true value of PMI-IR, which is needed for Banerjee, & Pedersen 2003). We provide below a short de- the overall calculation of the text-to-text similarity metric. scription for each of these six metrics. We approximate the value of W ebSize to 7x1011 , which is the value used by Chklovski (2004) in co-occurrence exper- The Leacock & Chodorow (Leacock & Chodorow 1998) iments involving Web counts. similarity is determined as: length Latent Semantic Analysis Another corpus-based mea- Simlch = − log (5) 2∗D sure of semantic similarity is the latent semantic analysis (LSA) proposed by Landauer (1998). In LSA, term co- where length is the length of the shortest path between two occurrences in a corpus are captured by means of a dimen- concepts using node-counting, and D is the maximum depth sionality reduction operated by a singular value decomposi- of the taxonomy. tion (SVD) on the term-by-document matrix T representing The Lesk similarity of two concepts is defined as a function the corpus. For the experiments reported here, we run the of the overlap between the corresponding definitions, as pro- SVD operation on the British National Corpus. vided by a dictionary. It is based on an algorithm proposed SVD is a well-known operation in linear algebra, which by Lesk (1986) as a solution for word sense disambiguation. can be applied to any rectangular matrix in order to find cor- The application of the Lesk similarity measure is not limited relations among its rows and columns. In our case, SVD to semantic networks, and it can be used in conjunction with decomposes the term-by-document matrix T into three ma- any dictionary that provides word definitions. trices T = UΣk VT where Σk is the diagonal k × k matrix The Wu and Palmer (Wu & Palmer 1994) similarity met- containing the k singular values of T, σ1 ≥ σ2 ≥ . . . ≥ σk , ric measures the depth of two given concepts in the Word- and U and V are column-orthogonal matrices. When the three matrices are multiplied together the original term-by- 1 This is similar to the methodology used by (McCarthy et al. document matrix is re-composed. Typically we can choose 2004) to find similarities between words and senses starting with a k k obtaining the approximation T UΣk VT . concept-to-concept similarity measure.
Knowledge-based Measures

There are a number of measures that were developed to quantify the degree to which two words are semantically related using information drawn from semantic networks – see e.g. (Budanitsky & Hirst 2001) for an overview. We present below several measures found to work well on the WordNet hierarchy. All these measures assume as input a pair of concepts, and return a value indicating their semantic relatedness. The six measures below were selected based on their observed performance in other language processing applications, and for their relatively high computational efficiency.

We conduct our evaluation using the following word similarity metrics: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, and Jiang & Conrath. Note that all these metrics are defined between concepts, rather than words, but they can be easily turned into a word-to-word similarity metric by selecting, for any given pair of words, those two meanings that lead to the highest concept-to-concept similarity (this is similar to the methodology used by McCarthy et al. (2004) to find similarities between words and senses starting with a concept-to-concept similarity measure). We use the WordNet-based implementation of these metrics, as available in the WordNet::Similarity package (Patwardhan, Banerjee, & Pedersen 2003). We provide below a short description of each of these six metrics.

The Leacock & Chodorow (Leacock & Chodorow 1998) similarity is determined as:

  Sim_{lch} = -\log \frac{length}{2 \cdot D}   (5)

where length is the length of the shortest path between two concepts using node-counting, and D is the maximum depth of the taxonomy.

The Lesk similarity of two concepts is defined as a function of the overlap between the corresponding definitions, as provided by a dictionary. It is based on an algorithm proposed by Lesk (1986) as a solution for word sense disambiguation. The application of the Lesk similarity measure is not limited to semantic networks, and it can be used in conjunction with any dictionary that provides word definitions.

The Wu and Palmer (Wu & Palmer 1994) similarity metric measures the depth of two given concepts in the WordNet taxonomy, and the depth of the least common subsumer (LCS), and combines these figures into a similarity score:

  Sim_{wup} = \frac{2 \cdot depth(LCS)}{depth(concept_1) + depth(concept_2)}   (6)

The measure introduced by Resnik (Resnik 1995) returns the information content (IC) of the LCS of two concepts:

  Sim_{res} = IC(LCS)   (7)

where IC is defined as:

  IC(c) = -\log P(c)   (8)

and P(c) is the probability of encountering an instance of concept c in a large corpus.

The next measure we use in our experiments is the metric introduced by Lin (Lin 1998), which builds on Resnik's measure of similarity, and adds a normalization factor consisting of the information content of the two input concepts:

  Sim_{lin} = \frac{2 \cdot IC(LCS)}{IC(concept_1) + IC(concept_2)}   (9)

Finally, the last similarity metric considered is Jiang & Conrath (Jiang & Conrath 1997):

  Sim_{jnc} = \frac{1}{IC(concept_1) + IC(concept_2) - 2 \cdot IC(LCS)}   (10)

Note that all the word similarity measures are normalized so that they fall within a 0–1 range. The normalization is done by dividing the similarity score provided by a given measure by the maximum possible score for that measure.
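Equations 5–10 are simple functions of path length, depth, and information content. A direct transcription in Python (a sketch: the path lengths, depths, and concept probabilities are assumed to be supplied by a WordNet interface such as WordNet::Similarity, and each score would still need to be divided by that measure's maximum to map it to the 0–1 range):

```python
import math

def sim_lch(length, max_depth):
    """Equation 5: Leacock & Chodorow, from the shortest path length
    (node counting) and the maximum taxonomy depth D."""
    return -math.log(length / (2.0 * max_depth))

def sim_wup(depth_lcs, depth_c1, depth_c2):
    """Equation 6: Wu & Palmer, from concept depths and the depth of
    the least common subsumer (LCS)."""
    return 2.0 * depth_lcs / (depth_c1 + depth_c2)

def ic(p):
    """Equation 8: information content from a concept's corpus probability."""
    return -math.log(p)

def sim_res(ic_lcs):
    """Equation 7: Resnik -- the information content of the LCS."""
    return ic_lcs

def sim_lin(ic_lcs, ic_c1, ic_c2):
    """Equation 9: Lin -- Resnik's measure with a normalization factor."""
    return 2.0 * ic_lcs / (ic_c1 + ic_c2)

def sim_jnc(ic_lcs, ic_c1, ic_c2):
    """Equation 10: Jiang & Conrath."""
    denom = ic_c1 + ic_c2 - 2.0 * ic_lcs
    return float("inf") if denom == 0 else 1.0 / denom  # identical concepts have zero distance
```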
A Walk-Through Example

The application of the text similarity measure is illustrated with an example. Given the two text segments below, we want to determine a score that reflects their semantic similarity. For illustration purposes, we restrict our attention to one corpus-based measure – the PMI-IR metric implemented using the AltaVista NEAR operator.

Text Segment 1: When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs on him.
Text Segment 2: When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.

Starting with each of the two text segments, and for each open-class word, we determine the most similar word in the other text segment, according to the PMI-IR similarity measure. As mentioned earlier, a semantic similarity is sought only between words with the same part-of-speech. Table 1 shows the word similarity scores and the word specificity (idf) starting with the first text segment.

  Text 1       Text 2       maxSim   idf
  defendant    defendant    1.00     3.93
  lawyer       attorney     0.89     2.64
  walked       walked       1.00     1.58
  court        courthouse   0.60     1.06
  victims      courthouse   0.40     2.11
  supporters   crowd        0.40     2.15
  turned       turned       1.00     0.66
  backs        backs        1.00     2.41

Table 1: Word similarity scores and word specificity (idf)

Next, using Equation 1, we combine the word similarities and their corresponding specificity, and determine the semantic similarity of the two texts as 0.80. This similarity score correctly identifies the paraphrase relation between the two text segments (using the same threshold of 0.50 as used throughout all the experiments reported in this paper). Instead, a cosine similarity score based on the same idf weights will result in a score of 0.46, thereby failing to find the paraphrase relation.

Although there are a few words that occur in both text segments (e.g. defendant, or turn), there are also words that are not identical, but closely related, e.g. lawyer is found similar to attorney, and supporters is related to crowd. Unlike traditional similarity measures based on lexical matching, our metric takes into account the semantic similarity of these words, resulting in a more precise measure of text similarity.

Evaluation and Results

To test the effectiveness of the text semantic similarity measure, we use it to automatically identify whether two text segments are paraphrases of each other. We use the Microsoft paraphrase corpus (Dolan, Quirk, & Brockett 2004), consisting of 4,076 training and 1,725 test pairs, and determine the number of correctly identified paraphrase pairs in the corpus using the text semantic similarity measure as the only indicator of paraphrasing. The paraphrase pairs in this corpus were automatically collected from thousands of news sources on the Web over a period of 18 months, and were subsequently labeled by two human annotators who determined if the two sentences in a pair were semantically equivalent or not. The agreement between the human judges who labeled the candidate paraphrase pairs in this data set was measured at approximately 83%, which can be considered an upper bound for an automatic paraphrase recognition task performed on this data set.

For each candidate paraphrase pair in the test set, we first evaluate the text semantic similarity metric using Equation 1, and then label the candidate pair as a paraphrase if the similarity score exceeds a threshold of 0.5. Note that this is an unsupervised experimental setting, and therefore the training data is not used in the experiments.

Baselines

For comparison, we also compute two baselines: (1) a random baseline, created by randomly choosing a true (paraphrase) or false (not paraphrase) value for each text pair; and (2) a vector-based similarity baseline, using a cosine similarity measure as traditionally used in information retrieval, with tf.idf weighting.

Results

We evaluate the results in terms of accuracy, representing the number of correctly identified true or false classifications in the test data set. We also measure precision, recall, and F-measure, calculated with respect to the true values in the test data.
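The unsupervised decision rule and the evaluation metrics fit in a few lines; a sketch, assuming a hypothetical list of (text1, text2, gold_label) test pairs and any text_similarity function such as the one outlined earlier:

```python
def evaluate(pairs, text_similarity, threshold=0.5):
    """Label a pair as a paraphrase when its similarity exceeds the
    threshold, then score the predictions against the gold labels."""
    tp = fp = tn = fn = 0
    for t1, t2, gold in pairs:  # gold is True for a paraphrase pair
        predicted = text_similarity(t1, t2) > threshold
        if predicted and gold:
            tp += 1
        elif predicted:
            fp += 1
        elif gold:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f
```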
Table 2 shows the results obtained. Among all the individual measures of similarity, the PMI-IR measure was found to perform the best, although the difference with respect to the other measures is small.

  Metric         Acc.   Prec.   Rec.   F
  Semantic similarity (corpus-based)
  PMI-IR         69.9   70.2    95.2   81.0
  LSA            68.4   69.7    95.2   80.5
  Semantic similarity (knowledge-based)
  J&C            69.3   72.2    87.1   79.0
  L&C            69.5   72.4    87.0   79.0
  Lesk           69.3   72.4    86.6   78.9
  Lin            69.3   71.6    88.7   79.2
  W&P            69.0   70.2    92.1   80.0
  Resnik         69.0   69.0    96.4   80.4
  Combined       70.3   69.6    97.7   81.3
  Baselines
  Vector-based   65.4   71.6    79.5   75.3
  Random         51.3   68.3    50.0   57.8

Table 2: Text similarity for paraphrase identification

In addition to the individual measures of similarity, we also evaluate a metric that combines several similarity measures into a single figure, using a simple average. We include all similarity measures, for an overall final accuracy of 70.3% and an F-measure of 81.3%. The improvement of the semantic similarity metrics over the vector-based cosine similarity was found to be statistically significant in all the experiments, using a paired t-test (p < 0.001).

Discussion and Conclusions

As it turns out, incorporating semantic information into measures of text similarity increases the likelihood of recognition significantly over the random baseline and over the vector-based cosine similarity baseline, as measured in a paraphrase recognition task. The best performance is achieved using a method that combines several similarity metrics into one, for an overall accuracy of 70.3%, representing a significant 13.8% error rate reduction with respect to the vector-based cosine similarity baseline. Moreover, if we were to take into account the upper bound of 83% established by the inter-annotator agreement achieved on this data set (Dolan, Quirk, & Brockett 2004), the error rate reduction over the baseline appears even more significant.

In addition to performance, we also tried to gain insight into the applicability of the semantic similarity measures, by finding their coverage on this data set. On average, among the approximately 18,000 word similarities identified in this corpus, about 14,500 are due to lexical matches, and 3,500 are due to semantic similarities, which indicates that about 20% of the relations found between text segments are based on semantics, in addition to lexical identity.

Despite the differences among the various word-to-word similarity measures (corpus-based vs. knowledge-based, definitional vs. link-based), the results are surprisingly similar.
To determine whether the similar overall results are due to a similar behavior on the same subset of the test data (presumably an "easy" subset that can be solved using measures of semantic similarity), or whether the different measures in fact cover different subsets of the data, we calculated the Pearson correlation factor among all the similarity measures. As seen in Table 3, there is in fact a high correlation among several of the knowledge-based measures, indicating an overlap in their behavior. Although some of these metrics are divergent in what they measure (e.g. Lin versus Lesk), it seems that the fact that they are applied in a context lessens the differences observed when they are applied at the word level. Interestingly, the Resnik measure has a low correlation with the other knowledge-based measures, and a somewhat higher correlation with the corpus-based metrics, which is probably due to the data-driven information content used in the Resnik measure (although Lin and Jiang & Conrath also use the information content, they have an additional normalization factor that makes them behave differently). Perhaps not surprisingly, the corpus-based measures are only weakly correlated with the knowledge-based measures and among themselves, with LSA having the smallest correlation with the other metrics.

          Vect   PMI-IR  LSA    J&C    L&C    Lesk   Lin    W&P    Resnik
  Vect    1.00   0.84    0.44   0.61   0.63   0.60   0.61   0.50   0.65
  PMI-IR         1.00    0.58   0.67   0.68   0.65   0.67   0.58   0.64
  LSA                    1.00   0.42   0.44   0.42   0.43   0.34   0.41
  J&C                           1.00   0.98   0.97   0.99   0.91   0.45
  L&C                                  1.00   0.98   0.98   0.87   0.46
  Lesk                                        1.00   0.96   0.86   0.43
  Lin                                                1.00   0.88   0.44
  W&P                                                       1.00   0.34
  Resnik                                                           1.00

Table 3: Pearson correlation among similarity measures

An interesting example is represented by the following two text segments, where only the Resnik measure and the two corpus-based measures manage to identify the paraphrase, because of a higher similarity found between systems and PC, and between technology and processor.

Text Segment 1: Gateway will release new Profile 4 systems with the new Intel technology on Wednesday.
Text Segment 2: Gateway's all-in-one PC, the Profile 4, also now features the new Intel processor.

There are also cases where almost all the semantic similarity measures fail, and instead the simpler cosine similarity has a better performance. This is mostly the case for the negative (not paraphrase) examples in the test data, where the semantic similarities identified between words increase the overall text similarity above the threshold of 0.5. For instance, the following text segments were falsely marked as paraphrases by all but the cosine similarity and the Resnik measure:

Text Segment 1: The man wasn't on the ice, but trapped in the rapids, swaying in an eddy about 250 feet from the shore.
Text Segment 2: The man was trapped about 250 feet from the shore, right at the edge of the falls.

The small variations between the accuracies obtained with the corpus-based and knowledge-based measures also suggest that both data-driven and knowledge-rich methods have their own merits, leading to a similar performance. Corpus-based methods have the advantage that no hand-made resources are needed and, apart from the choice of an appropriate and large corpus, they raise no problems related to the completeness of the resources. On the other hand, knowledge-based methods can encode fine-grained information. This difference can be observed in terms of precision and recall: while precision is generally higher with the knowledge-based measures, the corpus-based measures give in general better performance in recall.
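The correlation analysis of Table 3 can be reproduced by scoring every test pair with two measures and correlating the two score vectors; a minimal sketch of the Pearson factor, with the score lists assumed to come from any two of the measures above:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-pair
    similarity scores produced by two different measures."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```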
Although our method relies on a bag-of-words approach, as it turns out, the use of measures of semantic similarity improves significantly over the traditional lexical matching metrics. We are nonetheless aware that a bag-of-words approach ignores many of the important relationships in sentence structure, such as dependencies between words, or the roles played by the various arguments in the sentence. Future work will consider the investigation of more sophisticated representations of sentence structure, such as first order predicate logic or semantic parse trees, which should allow for the implementation of more effective measures of text semantic similarity.

References

Barnard, C., and Callison-Burch, C. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.
Barzilay, R., and Elhadad, N. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Berry, M. 1992. Large-scale sparse singular value computations. International Journal of Supercomputer Applications 6(1).
Budanitsky, A., and Hirst, G. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources.
Chklovski, T., and Pantel, P. 2004. VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Dagan, I.; Glickman, O.; and Magnini, B. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL Workshop.
Dolan, W.; Quirk, C.; and Brockett, C. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics.
Jiang, J., and Conrath, D. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics.
Landauer, T. K.; Foltz, P.; and Laham, D. 1998. Introduction to latent semantic analysis. Discourse Processes 25.
Lapata, M., and Barzilay, R. 2005. Automatic evaluation of text coherence: Models and representations. In Proceedings of the 19th International Joint Conference on Artificial Intelligence.
Leacock, C., and Chodorow, M. 1998. Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic Lexical Database. The MIT Press.
Lesk, M. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference 1986.
Lin, C., and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference.
Lin, D., and Pantel, P. 2001. Discovery of inference rules for question answering. Natural Language Engineering 7(3).
Lin, D. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning.
McCarthy, D.; Koeling, R.; Weeds, J.; and Carroll, J. 2004. Finding predominant senses in untagged text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Papineni, K. 2001. Why inverse document frequency? In Proceedings of the North American Chapter of the Association for Computational Linguistics, 25–32.
Patwardhan, S.; Banerjee, S.; and Pedersen, T. 2003. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics.
Resnik, P. 1995. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence.
Rocchio, J. 1971. Relevance feedback in information retrieval. Prentice Hall, Englewood Cliffs, New Jersey.
Salton, G., and Buckley, C. 1997. Term weighting approaches in automatic text retrieval. In Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann Publishers.
Salton, G., and Lesk, M. 1971. Computer evaluation of indexing and text processing. Prentice Hall, Englewood Cliffs, New Jersey. 143–180.
Salton, G.; Singhal, A.; Mitra, M.; and Buckley, C. 1997. Automatic text structuring and summarization. Information Processing and Management 2(32).
Schutze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24(1):97–124.
Sparck-Jones, K. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1):11–21.
Turney, P. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001).
Voorhees, E. 1993. Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the 16th Annual International ACM SIGIR Conference.
Wu, Z., and Palmer, M. 1994. Verb semantics and lexical selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.