Extending ranking with interword spacing analysis

Maria Carmela Daniele, Claudio Carpineto and Andrea Bernardini
Overview

I.    Word weighting based on interword spacing: σp
II.  Extension of quantistic weight through corpora analysis: σ*
III.  σ* application to ranking
IV.  Experiments
V.    Selective application of quantistic and frequentistic metrics based on:

      a)    Document’s length

      b)    Query hardness
Word weighting based on spacing between term occurrences: σp

•  A research branch that has developed over the last decade.

•  It follows studies of the energy levels of disordered quantum
  systems; the approach was introduced by Ortuño et al. (2002).

•  Keywords are extracted from the distances between a term’s
  occurrences in a document, independently of any term-frequency
  analysis of the document.

•  Let’s see this in more detail…
Reference Scenario

•  As in a quantum system, the terms in a
   document are subject to attraction/
   repulsion phenomena, which are stronger
   between relevant terms than between
   common words.

•  Reference document: Charles Darwin’s
   “The Origin of Species”

•  In practice:

       •  Relevant words tend to cluster in
          documents (e.g. “INSTINCT”)

       •  Common words such as “THE” are
          distributed uniformly
Definition of σp

•     Weighting method definition based on probability distributions of distances

•     A more efficient method characterized by Standard Deviation:

      A    great   scientist   must     be     a   good   teacher   and   a        good   researcher
       1    2         3            4     5     6    7          8     9    10        11       12

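 •    For the term “a” we get X = {1, 6, 10} and D = {0, 5, 4, 2} (d_i = x_{i+1} − x_i, with the
      first and last word positions, 1 and 12, used as the boundaries x_0 and x_{n+1}), and:

      \[ s = \sqrt{\frac{1}{n-1} \sum_{i=0}^{n} \left( (x_{i+1} - x_i) - \mu \right)^2} \]

 •    Normalizing with respect to the mean value µ:

      \[ \sigma_p = \frac{s}{\mu} \]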
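The example sentence can be checked with a short Python sketch. It assumes that the boundaries x_0 and x_{n+1} are the first and last word positions (the choice that reproduces D = {0, 5, 4, 2}), that µ is the mean of all n+1 distances, and that the sum is divided by n − 1 as on the slide; the function name `sigma_p` is illustrative only.

```python
import math

def sigma_p(positions, first, last):
    """Normalized standard deviation of the gaps between a term's occurrences.

    Assumption: `first` and `last` (first and last word positions of the
    text) serve as the boundaries x_0 and x_{n+1}.
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two occurrences")
    points = [first] + sorted(positions) + [last]
    # d_i = x_{i+1} - x_i for i = 0..n
    distances = [b - a for a, b in zip(points, points[1:])]
    mu = sum(distances) / len(distances)
    s = math.sqrt(sum((d - mu) ** 2 for d in distances) / (n - 1))
    return distances, s / mu

words = "a great scientist must be a good teacher and a good researcher".split()
positions = [i for i, w in enumerate(words, start=1) if w == "a"]
distances, sigma = sigma_p(positions, first=1, last=len(words))
print(positions, distances, sigma)
```

A term spread evenly through the text yields nearly equal distances and therefore a small value, while a clustered term yields a large one.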
Extension of quantistic weighting through
corpora analysis: σ*

•     We propose to modify the original metric with a factor σf based on the
      variance of term frequencies (Salton 1975). The factor σf is analogous to σp
      and has a twofold goal:
     1.  Penalize rare words: they can often be regarded as ‘noise’ in real
          document collections, yet σp tends to overestimate them;
     2.  Reward words that make it possible to better discriminate a document
          from the rest of the collection, a feature that quantistic weighting
          lacks.



        with

        \[ s_f(w) = \frac{1}{ND} \cdot \sum_{i=1}^{n} \left( f_i(w) - \mu_f \right)^2 \]
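As a rough illustration of the frequency-based factor, the sketch below computes s_f(w) for a toy collection. It assumes that f_i(w) is the frequency of w in document i (0 where the word is absent), that µ_f is the mean of those frequencies, and that ND equals the number of documents summed over; none of these readings is confirmed by the slide, and the function name is hypothetical.

```python
def s_f(frequencies):
    """Variance-style spread of a word's frequency across documents.

    Assumptions: frequencies[i] is f_i(w) and ND == len(frequencies).
    """
    nd = len(frequencies)
    mu_f = sum(frequencies) / nd
    return sum((f - mu_f) ** 2 for f in frequencies) / nd

uniform = s_f([2, 2, 2, 2])  # same frequency in every document
rare = s_f([1, 0, 0, 0])     # a single occurrence in the whole collection
bursty = s_f([8, 0, 0, 0])   # frequent, but concentrated in one document
print(uniform, rare, bursty)
```

Under these assumptions, a rare word and a uniformly spread word both receive a low value, while a word that is frequent in only a few documents receives a high one, which matches the two goals listed above.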
Comparison between quantistic and
frequentistic metrics
•     Tf-Idf (with and without stop words) is used as the frequency-based metric
•     σp and σ* are used for the quantistic weighting
•     Reference document: the King James Bible
•     Idf and σf require a collection; we compute them on the TREC WT10g collection
     Rank           Tf-Idf               Tf-Idf*               σp                 σ*

      1     unto                lord               jesus               jesus
      2     shall               god                christ              saul
      3     lord                absalom            paul                absalom
      4     thou                son                peter               jephthah
      5     thy                 king               disciples           jubile
      6     thee                behold             faith               ascendeteh
      7     him                 man                john                abimelech
      8     god                 judah              david               elias
      9     his                 land               saul                joab
      10    hath                men                gospel              haman
Application of σ* to ranking (1)

•  Using the σ* metric, it is possible to rank a collection of documents
   against a query q




•  Because the quantistic and frequentistic weighting metrics have
   complementary features, we would like to combine the two.
Application of σ* to ranking (2)

•    The combined metric is obtained through a
         linear combination of the Okapi BM25 and σ* metrics


•    A prerequisite for the linear combination is that the two scores lie in a
     similar range

•    The scores are therefore normalized through:




•  The scores are combined by:
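The normalization and combination formulas are not preserved in this text. The following is a minimal sketch only: it assumes min-max normalization (the slides may use a different scheme) and a weight α for which α = 1 gives pure BM25 and α = 0 gives pure σ*, as stated in the experiments. All function names and toy scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def combine(bm25, sigma_star, alpha):
    """Linear combination: alpha * BM25 + (1 - alpha) * sigma*."""
    b, q = min_max(bm25), min_max(sigma_star)
    docs = set(b) | set(q)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * q.get(d, 0.0) for d in docs}

# Toy scores for three documents (made up for illustration).
bm25 = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
sigma_star = {"d1": 0.4, "d2": 1.6, "d3": 1.0}

combined = combine(bm25, sigma_star, alpha=0.8)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Normalizing first keeps either metric from dominating the sum merely because its raw scores are on a larger scale.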
Experiments (1)


       Collections:
        Web Track: about 1,690,000 documents
        Robust Track: more than 500,000 documents

      •  Evaluation measure: MAP (mean average precision)

      •  Lucene with the BM25 extension by Perez-Iglesias
Experiments (2)
•    The quantistic metric alone does not work well:

     Collection               Topics                BM25              σ*             BM25+σ*

     WT10g                   501-550                0.143            0.057            0.153
     Robust               301-450,601-700           0.195            0.089            0.203
•    In our experiments, the combined method significantly improves on the performance of
     classical IR methods.

•    We let the parameter α vary over the range [0,1]: the two endpoints coincide with
     BM25 (α = 1) and σ* (α = 0), respectively.

•    The results suggest that the method is sufficiently robust: there is a range of α values over
     which the combined method performs well.
     α                 1        0.9     0.8      0.7      0.6     0.5     0.4     0.3      0.2      0.1     0

     MAP (WT10g)      .1436    .1469   .1537    .1535    .1501   .1379   .1222   .096     .0819    .0679   .0547
     MAP (Robust)     .1954    .2033   .2031    .1983    .1673   .1549   .1428   .1203    .1075    .0967   .0898
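To make the robustness claim concrete, the first MAP row of the table above (whose endpoints match the WT10g results) can be scanned for the α values at which the combination beats the BM25 baseline. The values are copied from the table; the variable names are illustrative.

```python
alphas = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
map_scores = [.1436, .1469, .1537, .1535, .1501, .1379,
              .1222, .096, .0819, .0679, .0547]

# alpha = 1 corresponds to plain BM25, so it serves as the baseline.
baseline = map_scores[alphas.index(1.0)]

# alpha values for which the combined method beats the baseline.
better = [a for a, m in zip(alphas, map_scores) if m > baseline]

# alpha with the highest MAP.
best_alpha = max(zip(map_scores, alphas))[1]
print(better, best_alpha)
```

On this row the combination beats BM25 for every α between 0.6 and 0.9, so the gain does not depend on a single finely tuned value.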
Query by query analysis

[Figure: average precision (AvP, 0.0 to 1.0) by query number (1 to 49) for BM25, σ* and BM25+σ*]
Selective application of quantistic and
frequentistic techniques


   1.  Use predictors of query difficulty to choose which
     metric to apply (rationale: the quantistic method
     should be better on difficult queries)

   2.  Use document length to choose which metric to
     apply (rationale: the quantistic method should be
     better for long documents)
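The second strategy could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the hard threshold, its value of 1000 words and the function name are all hypothetical, and both scores are assumed to be normalized to a common range beforehand.

```python
def selective_score(bm25_score, sigma_star_score, doc_length, length_threshold=1000):
    """Pick one metric per document based on its length in words.

    Assumption: documents with at least `length_threshold` words count as
    "long" and are scored with sigma*; all others are scored with BM25.
    The default threshold is an arbitrary placeholder.
    """
    if doc_length >= length_threshold:
        return sigma_star_score
    return bm25_score

long_doc = selective_score(0.7, 0.2, doc_length=5000)
short_doc = selective_score(0.7, 0.2, doc_length=300)
print(long_doc, short_doc)
```

The first strategy would have the same shape, with a query difficulty predictor in place of the document length.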
Query hardness (1)

•    We used two well-known query performance predictors:



     •    Simplified Clarity Score




     •    σ1
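The predictor formulas are not preserved in this text. As a rough sketch, the Simplified Clarity Score is commonly defined as SCS = Σ_{w∈Q} P(w|Q) · log2(P(w|Q) / P(w|C)), where P(w|Q) is the relative frequency of w in the query and P(w|C) its relative frequency in the collection. Whether the slides use exactly this form is an assumption, and the function name, argument names and toy counts below are hypothetical.

```python
import math
from collections import Counter

def simplified_clarity_score(query_terms, collection_tf, collection_tokens):
    """SCS of a query given collection term counts (assumed definition)."""
    query_tf = Counter(query_terms)
    query_len = len(query_terms)
    score = 0.0
    for term, qtf in query_tf.items():
        p_query = qtf / query_len
        p_coll = collection_tf.get(term, 0) / collection_tokens
        if p_coll == 0:
            continue  # term unseen in the collection: skipped in this sketch
        score += p_query * math.log2(p_query / p_coll)
    return score

# Toy collection of 100 tokens in which "a" occurs 10 times and "b" once.
scs = simplified_clarity_score(["a", "b"], {"a": 10, "b": 1}, 100)
print(scs)
```

Queries made of terms that are rare in the collection obtain a higher score.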
Query hardness (2)

•  WT10g with the σ1 predictor; Robust with the SCS predictor

•  Predictor values on the x-axis

•  MAP values on the y-axis (for both BM25 and σ*)

[Figure: two panels, WT10g (MAP vs. σ1) and Robust (MAP vs. SCS), each plotting BM25 and σ* with linear trend lines]
Document Length (1)


•  Why use document length? Because the quantistic method works
  better with long texts.

                                 BM25             σ*

     Relevant Retrieved           1544           3729
     Relevant NOT Retrieved       4239           2115
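The counts above can be turned into a recall figure per method. This is a small arithmetic sketch, which assumes that the two counts in each column together cover all relevant documents considered for that method (the column totals differ slightly, and the slide does not say why).

```python
counts = {
    "BM25": {"retrieved": 1544, "not_retrieved": 4239},
    "sigma*": {"retrieved": 3729, "not_retrieved": 2115},
}

# recall = relevant retrieved / (relevant retrieved + relevant not retrieved)
recall = {
    method: c["retrieved"] / (c["retrieved"] + c["not_retrieved"])
    for method, c in counts.items()
}

for method, value in recall.items():
    print(f"{method}: {value:.3f}")
```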
Document Length (2)

•  Collection: WT10g

•  X-axis: document length, expressed in number of words

•  Y-axis: cumulative percentage of relevant documents (retrieved in blue,
   not retrieved in red)

[Figure: one panel each for σ* and BM25]
Conclusions on using a selective application of
frequentistic and quantistic weighting



•  Selecting the metric by query hardness did not work.



•  Selecting the metric by document length was more promising.
Conclusions and future work
•    Definition of an extended quantistic weighting method through corpora
     analysis.

•    Integration of quantistic and frequentistic ranking methods.

•    A linear combination yielded a significant performance improvement over
     the classical frequentistic method.

•    Selective application: query hardness not useful, document length useful.

•    The method could be applied to other information retrieval tasks, e.g.:
        •    Document summarization: creating a short version of a text
        •    Query expansion: expanding the query phrase (e.g. with synonyms)
        •    Search result clustering: grouping results into clusters
Conclusions



   Thanks for listening!
       questions?

Maria daniele

  • 1. Extending ranking with interword spacing analysis Maria Carmela Daniele, Claudio Carpineto and Andrea Bernardini
  • 2. Overview I.  Word weighting based on interword spacing: σp II.  Extension of quantistic weight through corpora analysis: σ* III.  σ* application to ranking IV.  Experiments V.  Selective application of quantistic and frequentistic metrics based on: a)  Document’s length b)  Query hardness
  • 3. Words  weighting  based  on  spacing  between   term  occurrences: σp •  Research branch evolved in the last decade. •  Follow studies on energy level of statistical system formed by irregular quantum, created by Ortuño et al (2002) •  Keyword extraction based on distances between term’s occurrences in a document, regardless of terms frequency analysis of the document. •  Let’s see in more detail…
  • 4. Reference Scenario   Similar to quantistic system, terms in a document are subject to an attraction/ repulsion phenomena, that is stronger between relevant terms compared to common words.   Reference Document: Charles Darwin’s “The Origin of Species”   In practice:   Relevant words tend to cluster in documents ( ie: “INSTINCT”)   Common words like “THE” are distributed uniformly
  • 5. Definition of σp •  Weighting method definition based on probability distributions of distances •  A more efficient method characterized by Standard Deviation: A great scientist must be a good teacher and a good researcher 1 2 3 4 5 6 7 8 9 10 11 12 •  For term “a” we get: X={1,6,10}, D = {0,5,4,2} (di = xi+1- xi), and: 1 n 2 s= (( ∑ x i +1 −x i − µ n −1 i=0 ) ) •  Normalizing with respect to the mean value: €
  • 6. Extension of quantistic weighting through corpora analysis: σ* •  We propose to modify the original metric with a factor σf based on the variance of term frequencies (Salton 1975). The factor σf is analogous to σp and it has a twofold goal: 1.  Penalize rare words, because they can be often seen as ‘noise’ in real collection of documents, while they tend to be overestimated using σp ; 2.  Reward words that make it possible to better discriminate a document from the rest of the collection. This feature is lacking in quantistic weighting n 1 2 with s f (w) = ND i=1 i ( ⋅ ∑ f (w) − µ f ) €
  • 7. Comparison between quantistic and frequentistic metrics •  Using Tf-Idf (with and without stop words) for the metric on the frequencies •  Using σp e σ* for the quantistic weighting •  Reference Document: “The Bible” of The King James •  To calculate Idf e σf that require the collection, we use WT10g Trec collection Rank Tf-Idf Tf-Idf* σp σ* 1 unto lord jesus jesus 2 shall god christ saul 3 lord absalom paul absalom 4 thou son peter jephthah 5 thy king disciples jubile 6 thee behold faith ascendeteh 7 him man john abimelech 8 god judah david elias 9 his land saul joab 10 hath men gospel haman
  • 8. Application of σ* to ranking (1) •  The σ* metric can be used to rank a collection of documents against a query q •  Because quantistic and frequentistic weighting have complementary features, we combine the two metrics.
  • 9. Application of σ* to ranking (2) •  The combined metric is a linear combination of the Okapi BM25 and σ* metrics •  A prerequisite for the linear combination is that the two scores lie in similar ranges •  The scores are therefore normalized first and then combined
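The normalization and combination formulas are not preserved in this transcript, so the following Python sketch only illustrates the general idea. It assumes min-max normalization per query (a common choice, not confirmed by the slides) and a convex combination with weight α, where α = 1 keeps only BM25 and α = 0 keeps only σ*, which is the convention suggested by the experiment results. All function names are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(bm25_scores, sigma_star_scores, alpha):
    """alpha * BM25 + (1 - alpha) * sigma*, on normalized scores."""
    nb = min_max(bm25_scores)
    ns = min_max(sigma_star_scores)
    docs = set(nb) | set(ns)
    # Documents missing from one list contribute 0 for that metric.
    return {d: alpha * nb.get(d, 0.0) + (1 - alpha) * ns.get(d, 0.0)
            for d in docs}


bm25 = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
sigma_star = {"d1": 0.2, "d2": 0.8, "d3": 0.5}
combined = combine(bm25, sigma_star, alpha=0.5)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Normalizing per query keeps either metric from dominating the sum simply because its raw scores are larger.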
  • 10. Experiments (1) •  Collections: Web Track (WT10g): about 1,690,000 documents; Robust Track: more than 500,000 documents •  Evaluation measure: MAP (mean average precision) •  Lucene with the BM25 extension created by Perez-Iglesias
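For reference, MAP can be computed as in the short Python sketch below, which uses the standard definition: the average precision of a query is the mean of the precision values at the ranks of its relevant documents, and MAP is the mean over queries. The function names and the toy data are illustrative only and unrelated to the experiments reported here.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list given the set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute 0.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, judgments):
    """Mean of the per-query average precision.

    `rankings` maps query id -> ranked doc ids,
    `judgments` maps query id -> set of relevant doc ids.
    """
    if not rankings:
        return 0.0
    return sum(average_precision(rankings[q], judgments.get(q, set()))
               for q in rankings) / len(rankings)


rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d2", "d1"]}
judgments = {"q1": {"d1", "d3"}, "q2": {"d1"}}
print(mean_average_precision(rankings, judgments))
```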
  • 11. Experiments (2) •  The quantistic metric alone does not work well:
      Collection  Topics            BM25   σ*     BM25+σ*
      WT10g       501-550           0.143  0.057  0.153
      Robust      301-450, 601-700  0.195  0.089  0.203
    •  Combining the quantistic method with BM25 significantly improves on the performance of the classical IR method •  We let the parameter α vary in the range [0,1]: the extremes α = 1 and α = 0 coincide with BM25 alone and σ* alone, respectively •  The results suggest that the method is sufficiently robust, because the combined method performs well over a range of α values:
      α             1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1    0
      MAP (WT10g)   .1436  .1469  .1537  .1535  .1501  .1379  .1222  .096   .0819  .0679  .0547
      MAP (Robust)  .1954  .2033  .2031  .1983  .1673  .1549  .1428  .1203  .1075  .9674  .0898
  • 12. Query by query analysis [Figure: average precision (AvP, y-axis) per query (query number, x-axis) for BM25, σ* and BM25+σ*]
  • 13. Selective application of quantistic and frequentistic techniques 1.  Use query difficulty predictors to choose which metric to apply (rationale: the quantistic method should be better on difficult queries) 2.  Use document length to choose which metric to apply (rationale: the quantistic method should be better for long documents)
  • 14. Query hardness (1) •  We used two well-known query performance predictors: •  Simplified Clarity Score (SCS) •  σ1
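As an illustration of the first predictor, the Python sketch below computes the Simplified Clarity Score as it is commonly defined in the query performance prediction literature: the sum over query terms of P(w|Q) · log2(P(w|Q) / P(w|C)), where P(w|Q) is the relative frequency of the term in the query and P(w|C) its relative frequency in the collection. The function signature is hypothetical, and skipping query terms that never occur in the collection is an assumption of this sketch. The slides do not state which exact variant was used.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_tf, collection_tokens):
    """SCS = sum_w P(w|Q) * log2(P(w|Q) / P(w|C)).

    `collection_tf` maps term -> number of occurrences in the collection,
    `collection_tokens` is the total number of tokens in the collection.
    """
    if not query_terms:
        return 0.0
    query_len = len(query_terms)
    score = 0.0
    for term, qtf in Counter(query_terms).items():
        ctf = collection_tf.get(term, 0)
        if ctf == 0:
            continue  # assumption: ignore terms unseen in the collection
        p_query = qtf / query_len
        p_coll = ctf / collection_tokens
        score += p_query * math.log2(p_query / p_coll)
    return score


print(simplified_clarity_score(["darwin", "instinct"],
                               {"darwin": 50, "instinct": 20, "the": 90000},
                               1_000_000))
```

Queries made of terms that are rare in the collection get a higher score, which is usually read as a sign of an easier, more specific query.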
  • 15. Query hardness (2) [Figure, two panels: WT10g with the σ1 predictor (with linear trend lines for BM25 and σ*); Robust with the SCS predictor. x-axis: predictor value; y-axis: MAP, for both BM25 and σ*]
  • 16. Document Length (1) •  Why use document length? Because the quantistic method works better with long texts
                              BM25   σ*
      Relevant retrieved      1544   3729
      Relevant not retrieved  4239   2115
  • 17. Document Length (2) •  Collection: WT10g •  [Figure, two panels: σ* and BM25] •  x-axis: document length in number of words •  y-axis: cumulative percentage of relevant documents (retrieved in blue, not retrieved in red)
  • 18. Conclusions on using a selective application of frequentistic and quantistic weighting •  Query hardness did not work. •  Using document length was more promising
  • 19. Conclusions and future work •  Definition of an extended quantistic weighting method through corpora analysis •  Integration of quantistic and frequentistic ranking methods •  A linear combination showed a significant improvement in performance over the classical frequentistic method •  Selective application: query hardness not useful, document length useful •  The method could be applied to other Information Retrieval tasks, e.g.: •  Document summarization: creating a short version of a text •  Query expansion: expanding the query phrase (e.g. using synonyms) •  Search result clustering: grouping results into clusters
  • 20. Conclusions •  Thanks for listening! Questions?