Tweaking the Base Score:
Lucene/Solr Similarities Explained
Demo: github.com/sematext/activate/tree/master/2019
More info: sematext.com/blog/search-relevance-solr-elasticsearch-similarity
Radu Gheorghe
Rafał Kuć
www.sematext.com
Agenda
BM25 - Best Match: the default
DFR - Divergence From Randomness framework
DFI - Divergence From Independence
IB - Information-Based models
LM - Language Models
Custom similarity
Putting it all together
TF*IDF
You know, for historical reasons
BM25 - the TF part
freq / (freq + k1 * (1 - b + b * dl / avgdl))
Best for Most 😁
BM25 tunables
freq / (freq + k1 * (1 - b + b * dl / avgdl))
k1 - controls how quickly the score saturates as freq grows
b - controls doc length normalization
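A minimal Java sketch of the TF part above, to make the effect of the two tunables visible. The class and method names are made up for illustration and are not Lucene API; 1.2 and 0.75 are Lucene's default values for k1 and b.

```java
// Hypothetical helper, not part of Lucene: the BM25 TF part from the slide.
final class Bm25TfSketch {
    static final double DEFAULT_K1 = 1.2;  // Lucene's default k1
    static final double DEFAULT_B = 0.75;  // Lucene's default b

    // freq: term frequency in the document
    // dl: document (field) length, avgdl: average field length
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        // b == 0 disables length normalization, b == 1 applies it fully
        double lengthNorm = 1 - b + b * dl / avgdl;
        // approaches 1 as freq grows; a larger k1 slows the approach
        return freq / (freq + k1 * lengthNorm);
    }
}
```

With dl == avgdl and the defaults, tf(1, ...) is about 0.45 and tf(1000, ...) is about 0.999, so each extra occurrence helps less than the previous one. Setting b to 0 makes the result independent of dl.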
BM25 demo
yes, that’s how we look
when we give demos
BM25
Good default. You can tune the weight of freq and docLength.
Divergence From Randomness
Basic Model: G, I(n), I(ne), I(F)
After Effect: L, B
Normalization: H1, H2, H3, Z, none
Divergence From Randomness - H1
tf * c * avgFieldLength / docFieldLength
No normalization, and H1 with c == 1, 3, 5, 7
Divergence From Randomness - H2
tf * log2(1 + c * (avgFieldLength / docFieldLength))
No normalization, and H2 with c == 1, 3, 5, 7
Divergence From Randomness - Z
tf * (avgFieldLength / docFieldLength)^z
No normalization, and Z with z == 0.1, 0.2, 0.3, 0.4
Divergence From Randomness - H3
(tf + mu * ((totalTermFreq + 1) / (#fieldTokens + 1))) / (docFieldLength + mu) * mu
No normalization, and H3 with mu == 1, 3, 5, 7
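The four DFR length normalizations can be compared side by side in plain Java. This is a sketch only: the class and method names are invented for illustration, and H3 is written in the Dirichlet-prior form (tf + mu * p) / (docLen + mu) * mu, which is how I understand Lucene's NormalizationH3; treat that as an assumption.

```java
// Hypothetical helper, not Lucene API: normalized term frequency (tfn)
// for the DFR normalizations H1, H2, Z and H3.
final class DfrNormalizationSketch {
    static double normH1(double tf, double c, double avgLen, double docLen) {
        return tf * c * avgLen / docLen;
    }

    static double normH2(double tf, double c, double avgLen, double docLen) {
        return tf * log2(1 + c * avgLen / docLen);
    }

    static double normZ(double tf, double z, double avgLen, double docLen) {
        return tf * Math.pow(avgLen / docLen, z);
    }

    // Assumed Dirichlet-prior form: tf is smoothed with the term's
    // collection-wide probability p, weighted by mu.
    static double normH3(double tf, double mu, double totalTermFreq,
                         double fieldTokens, double docLen) {
        double p = (totalTermFreq + 1) / (fieldTokens + 1);
        return (tf + mu * p) / (docLen + mu) * mu;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}
```

For example, normH1(2, 1, 100, 200) gives 1.0: with c == 1, a document twice the average length has its tf halved, while H2 and Z dampen the same length ratio more gently.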
DFR demo
Only one, I promise
DFR
Framework. Tunable: choose algorithm and tune parameters for both IDF* and docLength.
* generic name for importance of this term
Divergence From Independence
expected frequency:
docLength * totalTermFrequency / numberOfFieldTokens
DFI: Standardized
(actual - expected)/sqrt(expected)
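The expected frequency and the standardized measure fit in a few lines of Java. A sketch with invented names, following only the two formulas on the slides; whatever Lucene's DFISimilarity does on top of the measure is left out.

```java
// Hypothetical helper, not Lucene API: the DFI building blocks from the slides.
final class DfiStandardizedSketch {
    // How often the term would occur in this document if occurrences
    // were spread evenly over all tokens of the field.
    static double expected(double docLength, double totalTermFrequency,
                           double numberOfFieldTokens) {
        return docLength * totalTermFrequency / numberOfFieldTokens;
    }

    // Standardized divergence of the actual frequency from the expected one.
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }
}
```

For a 100-token document and a term with 50 occurrences among 10000 field tokens, the expected frequency is 0.5, so 3 actual occurrences give a measure of about 3.54. A term that occurs no more often than expected gets a measure of zero or less.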
DFI demo
Oh, but don’t remove stopwords*!
1) arbitrarily chops field length
2) stopwords aren’t always stopwords ;)
DFI
Simple. Parameterless. Flexible: works well with various datasets.
Information Based
how much information do we get from this term?
Distribution: Log-Logistic, Smoothed Power-Law
Lambda: DF, TTF
Normalization: H1, H2, H3, Z, none
Information Based - Log-Logistic
log(tfn / lambda + 1)
lambda: 0.1 (red), 0.3 (black), 0.8 (blue)
Information Based - Retrieval Function
the average of the document information brought by each query term
Information Based - Retrieval Function - DF
number of matching documents
(docFrequency + 1) / (numberOfDocuments + 1)
Information Based - Retrieval Function - TTF
total number of term occurrences
(totalTermFrequency + 1) / (numberOfDocuments + 1)
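To see how the lambda choice feeds the distribution, here is a small Java sketch. Names are invented for illustration. It assumes the log-logistic form log(tfn / lambda + 1), which is the same as -log(lambda / (tfn + lambda)); tfn stands for the term frequency after length normalization.

```java
// Hypothetical helper, not Lucene API: IB lambdas and the log-logistic score.
final class IbLogLogisticSketch {
    // Lambda based on the number of matching documents.
    static double lambdaDf(double docFrequency, double numberOfDocuments) {
        return (docFrequency + 1) / (numberOfDocuments + 1);
    }

    // Lambda based on the total number of term occurrences.
    static double lambdaTtf(double totalTermFrequency, double numberOfDocuments) {
        return (totalTermFrequency + 1) / (numberOfDocuments + 1);
    }

    // Assumed log-logistic form: rarer terms (smaller lambda) score higher
    // for the same normalized term frequency.
    static double logLogistic(double tfn, double lambda) {
        return Math.log(tfn / lambda + 1);
    }
}
```

A term matching 9 of 99 documents gets lambdaDf == 0.1. For the same tfn, that term scores higher than one with lambda 0.8, which is the ordering the three lambda values on the slide illustrate.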
IB demo
IB
Framework, like DFR. Even has the same normalization options. But newer and, in the paper, better.
Language Models
probability of a term being our term:
totalTermFreq / totalFieldTokens
Language Models: Jelinek-Mercer
log(((1 - λ) * tf / docLength) / (λ * probability))
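A Java sketch of Jelinek-Mercer smoothing, with invented names. One assumption beyond the slide: the code adds 1 inside the log, which is how I understand Lucene's LMJelinekMercerSimilarity keeps scores non-negative. Drop the "1 +" to get the ratio exactly as shown above.

```java
// Hypothetical helper, not Lucene API: Jelinek-Mercer language model score.
final class JelinekMercerSketch {
    // Probability of picking this term from the whole field, as on the slide.
    static double collectionProbability(double totalTermFreq, double totalFieldTokens) {
        return totalTermFreq / totalFieldTokens;
    }

    // lambda must be in (0, 1]; smaller values trust the document more,
    // larger values lean on the collection-wide probability.
    static double score(double tf, double docLength, double lambda, double probability) {
        double documentPart = (1 - lambda) * tf / docLength;
        double collectionPart = lambda * probability;
        // "1 +" is an assumption: it makes a term with tf == 0 score exactly 0.
        return Math.log(1 + documentPart / collectionPart);
    }
}
```

With tf = 1, docLength = 100, lambda = 0.5 and a collection probability of 0.005, the ratio is 2, so the score is log(3), about 1.10. Raising lambda lowers the score, which is the tunable part.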
LM demo
feat. Jelinek-Mercer
LM
Two probabilistic models. Similar approach to DFI, but tunable.
Custom Similarity
compute a similarity score using custom code
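Custom Similarity - Activate Similarity Factory
import org.apache.lucene.search.similarities.Similarity;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.SimilarityFactory;

// Solr factory that hands out the custom similarity
public class ActivateSimilarityFactory extends SimilarityFactory {
    private volatile Similarity similarity;

    @Override
    public void init(SolrParams params) {
        super.init(params);
    }

    @Override
    public Similarity getSimilarity() {
        // created lazily on first use
        if (similarity == null) {
            similarity = new ActivateSimilarity();
        }
        return similarity;
    }
}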
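Custom Similarity - Similarity
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimilarity extends Similarity {
    // every field gets the same norm, so document length is ignored
    @Override
    public long computeNorm(FieldInvertState state) { return 1; }

    // boost and index statistics are ignored as well
    @Override
    public SimScorer scorer(float boost,
            CollectionStatistics collectionStats, TermStatistics... termStats) {
        return new ActivateSimScorer();
    }
}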
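Custom Similarity - SimScorer
import org.apache.lucene.search.similarities.Similarity;

public class ActivateSimScorer extends Similarity.SimScorer {
    // the score is the raw term frequency; the norm is ignored
    @Override
    public float score(float freq, long norm) {
        return freq;
    }
}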
Custom Similarity demo
Custom
When you need something special, like disregarding term frequency.
Multiple similarities demo
THANK YOU

Activate 2019: Tweaking the Base Score: Lucene/Solr Similarities Explained