BM25 Scoring for Lucene:
From Academia to Industry

             Yuval Feinstein
             Answers Corporation




              Apache Lucene EuroCon 2010 Meetup
              Prague, May 2010
Overview

       Answers.com
       A Relevance problem
       BM25F - a possible solution
       Joaquin’s Implementation
       Productization
       Future directions




Answers.com

       Mission - Provide best answers about anything.
       A popular web site (according to comScore,
        March 2010):
          #33 worldwide, with 75.8 million unique users
          #18 in US, with 51.2 million unique users
       WikiAnswers – community Q&A site (UGC)
       ReferenceAnswers – editorial content
       Atlas – internal search engine
       Implicit search example: find similar questions
Similar Questions




Case 31136




Enter BM25F

   Query Q = (t_1, t_2, …, t_m)
   Document D
   Term frequency tf_i

   $$\mathrm{similarity}(Q, D) = \sum_{t_i \in Q \cap D} w_i \, tf_i$$

   How much should tf_i influence similarity?
   Determine similarity by choosing weights
   BM25F: saturation, soft length normalization, idf
    weights and field weights.
Saturation

   [Figure: "Frequency Saturation". x-axis: term frequency tf (0 to 30); y-axis: saturated weight tf/(2+tf) (0 to 1).]

 Replace tf by tf/(k1+tf)
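
The saturation step can be tried out numerically. The following is a minimal sketch, not code from the talk; the function name `saturate` is hypothetical, and the default k1 = 2 is taken from the plot label tf/(2+tf).

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Map a raw term frequency to the saturated weight tf / (k1 + tf).

    k1 = 2.0 mirrors the plot label tf/(2+tf); it is a tunable parameter.
    """
    if tf < 0 or k1 <= 0:
        raise ValueError("tf must be >= 0 and k1 must be > 0")
    return tf / (k1 + tf)


if __name__ == "__main__":
    # The weight grows quickly for small tf and flattens out below 1.
    for tf in (0, 1, 2, 5, 10, 30):
        print(f"tf={tf:2d}  weight={saturate(tf):.3f}")
```

Each additional repetition of a term adds less weight than the previous one, and the weight never reaches 1.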
Soft Length Normalization

   [Figure: "length normalization". x-axis: document length (0 to 30); y-axis: normalized frequency (0 to 2).]

 Replace tf by

   $$tf' = \frac{tf}{(1 - b) + b \, \frac{dl}{avdl}}$$
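
A small sketch of soft length normalization, assuming dl is the document length and avdl the average document length in the collection. The function name `normalize_tf` and the default b = 0.75 are assumptions for illustration, not values given in the slides.

```python
def normalize_tf(tf: float, dl: float, avdl: float, b: float = 0.75) -> float:
    """Return tf' = tf / ((1 - b) + b * dl / avdl).

    b = 0 switches length normalization off; b = 1 normalizes fully.
    The default b = 0.75 is an assumed, commonly quoted starting point.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if dl <= 0 or avdl <= 0:
        raise ValueError("dl and avdl must be positive")
    return tf / ((1.0 - b) + b * dl / avdl)
```

A document of exactly average length keeps its term frequency unchanged; longer documents are damped and shorter ones boosted, by an amount controlled by b.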
Inverse Document Frequency (IDF)

   [Figure: "IDF weighting". x-axis: number of docs with term, n_i (0 to 120); y-axis: IDF weight w_i (0 to 2.5).]

   $$w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}$$
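
The IDF weight can be sketched as below. The function name `idf` is hypothetical, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Return log((N - n_i + 0.5) / (n_i + 0.5)).

    n_docs is N, the collection size; doc_freq is n_i, the number of
    documents containing the term. The natural log is an assumption.
    """
    if not 0 <= doc_freq <= n_docs:
        raise ValueError("doc_freq must lie in [0, n_docs]")
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
```

Rare terms get large weights. With this form of the formula, a term that occurs in more than half of the documents gets a negative weight, which an implementation may want to clamp or otherwise handle.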
Field Weights




     Every field has a different b (length verbosity parameter) and a different v
     (field value parameter)
The BM25F Formula

 Field weighting:

   $$\widetilde{tf}_i = \sum_{s=1}^{S} v_s \, \frac{tf_{si}}{B_s}$$

 Field length normalization:

   $$B_s = (1 - b_s) + b_s \, \frac{sl_s}{avsl}$$

 Saturation and IDF:

   $$w_i^{BM25F} = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \, w_i^{IDF}$$
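
The three steps of the formula can be combined into one term weight. This is a sketch under stated assumptions, not the library's code: the names `FieldParams` and `bm25f_term_weight` are hypothetical, the average length is taken per field, and k1 = 2 is borrowed from the saturation plot.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FieldParams:
    weight: float      # v_s, the field value parameter
    b: float           # b_s, the field length parameter
    avg_length: float  # average length of this field (assumed per field)


def bm25f_term_weight(
    field_tf: Dict[str, float],
    field_len: Dict[str, float],
    params: Dict[str, FieldParams],
    idf: float,
    k1: float = 2.0,
) -> float:
    """Weight of one query term in one document, following the slide."""
    pseudo_tf = 0.0
    for name, tf in field_tf.items():
        if tf == 0:
            continue  # a field without the term contributes nothing
        p = params[name]
        # Field length normalization: B_s = (1 - b_s) + b_s * sl_s / avsl
        norm = (1.0 - p.b) + p.b * field_len[name] / p.avg_length
        # Field weighting: accumulate v_s * tf_si / B_s
        pseudo_tf += p.weight * tf / norm
    # Saturation and IDF
    return pseudo_tf / (k1 + pseudo_tf) * idf
```

Note that saturation is applied once, to the combined pseudo-frequency, rather than separately per field.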
Joaquin’s Implementation

        Joaquín Pérez Iglesias of UNED, Madrid, Spain
         implemented a BM25F library for Lucene,
         with the class BM25BooleanQuery
        Algorithm:
          Collect documents with query terms
          Score individual terms using BM25F
          Combine scores using addition to get Boolean query
           score
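
The three algorithm steps can be sketched over a toy in-memory inverted index. This is an illustration only and does not reproduce BM25BooleanQuery; the names `score_boolean_query`, `postings` and `term_score` are hypothetical, and the per-term scorer is passed in so that any BM25F implementation can be plugged in.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Toy inverted index: term -> {doc_id: term frequency}
Postings = Dict[str, Dict[int, int]]


def score_boolean_query(
    query_terms: Sequence[str],
    postings: Postings,
    term_score: Callable[[str, int, int], float],
    top_n: int = 10,
) -> List[Tuple[int, float]]:
    """Collect matching documents, score each term, and sum the scores."""
    scores: Dict[int, float] = defaultdict(float)
    # Assumption: a term repeated in the query is counted once.
    for term in set(query_terms):
        # Step 1: collect documents containing the term.
        for doc_id, tf in postings.get(term, {}).items():
            # Steps 2 and 3: score the term and add it to the document total.
            scores[doc_id] += term_score(term, doc_id, tf)
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

For example, passing `lambda term, doc_id, tf: tf / (2 + tf)` as `term_score` ranks documents by summed saturated term frequency alone.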




BM25F Usefulness for Our Case

        Short texts
        Term repetitions hurt relevance for short texts
        Want to combine different fields (in the future,
         different information sources)

        Initial experiments showed good relevance, but…




Feeling Safe to make Changes

        How can we be sure not to break anything?



        Added Unit Tests
        (This is almost a Lucene standard, but not in
         Academia…)




Production Challenges –
     Performance

     Can this library handle 10M queries daily?
     Initial Runtimes:


     Scoring                   Average runtime (ms)   Median runtime (ms)
     Standard Lucene scoring   161                    119
     BM25F                     273                    209
     Difference                68%                    75%

Improving Performance

     Addressed using:
      Benchmarking

      Profiling

      Refactoring, to give


     Scoring                   Average runtime (ms)   Median runtime (ms)
     Standard Lucene scoring   93                     65
     BM25F                     92                     70
     Difference                -1%                    8%
Production Challenges –
Robustness

   Lots of users lead to strange inputs, e.g.:
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs

   Addressed using more careful tokenization
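
The slides do not say what the more careful tokenization consisted of. The sketch below is one possible reading, with every name and limit (`safe_tokenize`, `MAX_TOKENS`, `MAX_TOKEN_LEN`) chosen here for illustration: keep only runs of letters and digits, and cap both the token length and the number of tokens.

```python
import re
from typing import List

# Runs of letters and digits; punctuation and underscores act as separators.
TOKEN_RE = re.compile(r"[^\W_]+")
MAX_TOKENS = 32      # hypothetical cap on tokens per query
MAX_TOKEN_LEN = 64   # hypothetical cap on characters per token


def safe_tokenize(text: str) -> List[str]:
    """Tokenize user input defensively, dropping punctuation-only noise."""
    tokens: List[str] = []
    for match in TOKEN_RE.finditer(text.lower()):
        token = match.group()
        if len(token) <= MAX_TOKEN_LEN:
            tokens.append(token)
        if len(tokens) == MAX_TOKENS:
            break
    return tokens
```

With this sketch, the slash-only and smiley inputs above produce no tokens at all, so the caller can return an empty result without building a query.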
Production Challenges –
Integration and Interoperability

   Needs data not currently in Lucene index:
     Average Field Lengths
     Document-level IDF
   We calculated the first externally and
    approximated the second using longest field IDF

   Library does not play nicely with others – not
    recursive
   BM25 Library supports BooleanQuery, not
    phrases, prefix, etc.
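
Computing average field lengths outside the index could look roughly like this. It is a sketch under assumptions: documents are modelled as dictionaries from field name to token list, the average is taken over the documents that actually have the field, and the function name `average_field_lengths` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def average_field_lengths(docs: Iterable[Dict[str, List[str]]]) -> Dict[str, float]:
    """Return the mean token count per field over a document collection."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for doc in docs:
        for field, tokens in doc.items():
            totals[field] += len(tokens)
            counts[field] += 1
    # Assumption: average over documents that contain the field.
    return {field: totals[field] / counts[field] for field in totals}
```

The resulting values would be supplied to the scorer as the per-field average lengths and need to be recomputed as the collection changes.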
Remember case 31136?



Well, she’s mostly pleased…

   BM25 runs in our production environment
   Supporting tens of millions of queries daily
Future Work

        LUCENE-2091 – Our suggested contrib patch
        LUCENE-2392 – Current work on making Lucene
         scoring more flexible, to incorporate BM25 as well
         as other models
        We want to incorporate BM25 scoring into Solr
        Could this be faster as well?




References

   Integrating the Probabilistic Model BM25/BM25F
    into Lucene – Joaquin Perez Iglesias
   The Probabilistic Relevance Framework: BM25
    and Beyond – Stephen Robertson and Hugo
    Zaragoza
   Working Effectively with Legacy Code – Michael
    Feathers
