Using SweetSpotSimilarity for Solr Fulltext Indexing


The deck discusses using SweetSpotSimilarity for Solr fulltext indexing. SweetSpotSimilarity changes how document length affects scoring: fields whose token count falls inside a configured "sweet spot" range all get the same length norm of 1, and the norm falls off for lengths outside the range at a rate set by a steepness parameter. The example Solr configuration sets the sweet spot to 1,000 to 20,000 tokens with a steepness of 0.5.


Using SweetSpotSimilarity for
Solr Fulltext Indexing
(A Public Service Message)
Jay Luker
SAO/NASA Astrophysics Data System
http://adsabs.harvard.edu/

From http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/search/Similarity.html
[Figure: the Lucene scoring formula, annotated in three parts]
● Score for a particular result
● Buncha stuff you probably ought to read up on.
● norm(t,d) - "encapsulates a few (indexing time) boost and length factors"
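For orientation, the practical scoring function on that javadoc page can be written roughly as below. This is a sketch transcribed from the Lucene 2.9 Similarity javadoc, not copied from the slide, so check it against the linked page. The last factor, norm(t,d), is the one this deck is concerned with.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot
  t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
```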

norm(t,d)
Includes...
● Document boost - e.g. <doc boost="2.5">
● Field boost - e.g. <field boost="3.0">
and what we're concerned with...
● lengthNorm(field) - computed at index time based on the number of tokens in the field of the input document.
These factors, multiplied together, make up the norm(t,d) for a given document.
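Written out in the notation of the same javadoc (the notation is the javadoc's, not the slide's), the product looks roughly like this, where the last factor runs over every field instance f in document d that has the same name as the field of term t:

```latex
\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathit{field})
  \cdot \prod_{f \in d,\ f \text{ named as } t} f.\mathrm{getBoost}()
```

Because the value is computed and stored at index time (encoded into a single byte, so with some loss of precision), a change to lengthNorm only takes effect after reindexing.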

lengthNorm(String fieldName, int numTokens)
From the javadoc:
"Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small."
Translation:
SHORTER DOCUMENTS SCORE HIGHER
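A minimal sketch of what that means in practice, assuming the default implementation is 1/sqrt(numTokens) as in Lucene's DefaultSimilarity. The class and method names here are made up for illustration and the code does not depend on Lucene.

```java
public class DefaultLengthNormSketch {

    // Assumed default length norm: 1 / sqrt(numTokens).
    // Hypothetical stand-in for the real Similarity method; fieldName is omitted.
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        int[] lengths = {100, 1000, 10000};
        for (int n : lengths) {
            // The norm shrinks as the field gets longer.
            System.out.printf("numTokens=%d -> lengthNorm=%.4f%n", n, lengthNorm(n));
        }
    }
}
```

A 100-token field gets a norm of 0.1 and a 10,000-token field gets 0.01, so with everything else equal the short field scores ten times higher.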

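SweetSpotSimilarity (lucene/contrib/misc/...) changes this ...

$$\mathrm{lengthNorm}(L) = \frac{1}{\sqrt{L}}$$

to this ...

$$\mathrm{lengthNorm}(L) = \frac{1}{\sqrt{\mathit{steepness} \cdot \big(|L - \mathit{min}| + |L - \mathit{max}| - (\mathit{max} - \mathit{min})\big) + 1}}$$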

min/max = your "sweet spot" range. Lengths within this range compute to a constant, i.e., 1.
steepness = how sharply the curve rises to and falls away from the sweet spot "plateau".
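To see the plateau and the falloff in numbers, here is a small standalone sketch of the SweetSpotSimilarity formula using the values from this deck's configuration (min 1000, max 20000, steepness 0.5). The class and method names are hypothetical and the code does not call Lucene. Inside the range, |L-min| + |L-max| equals max-min, so the expression under the square root is 1. Outside it, the bracketed term is twice the distance d to the nearest edge, so with steepness 0.5 the norm works out to 1/sqrt(d+1).

```java
public class SweetSpotLengthNormSketch {

    // lengthNorm(L) = 1 / sqrt(steepness * (|L-min| + |L-max| - (max-min)) + 1)
    static float sweetSpotLengthNorm(int length, int min, int max, float steepness) {
        // Zero inside [min, max]; twice the distance to the nearest edge outside it.
        double outside = Math.abs(length - min) + Math.abs(length - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * outside + 1.0));
    }

    public static void main(String[] args) {
        int min = 1000;
        int max = 20000;
        float steepness = 0.5f;
        int[] lengths = {500, 999, 1000, 5000, 20000, 20001, 20100};
        for (int len : lengths) {
            System.out.printf("L=%d -> lengthNorm=%.4f%n",
                    len, sweetSpotLengthNorm(len, min, max, steepness));
        }
    }
}
```

Every length from 1000 through 20000 prints 1.0000, and lengths one token outside either edge (999 and 20001) get the same reduced value, since the curve is symmetric around the plateau.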

[Figure: term counts for all of ADS's searchable fulltext since 01/2000]

In schema.xml:
<similarity class="org.ads.solr.SweetSpotSimilarityFactory">
  <str name="min">1000</str>
  <str name="max">20000</str>
  <str name="steepness">0.5</str>
</similarity>

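package org.ads.solr;

// Imports assume the Lucene 2.9/3.x contrib/misc jar and a Solr release of the same era.
import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.Similarity;
import org.apache.solr.schema.SimilarityFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SweetSpotSimilarityFactory extends SimilarityFactory {
    // Log under this class's own name rather than SolrResourceLoader's
    public static final Logger log =
        LoggerFactory.getLogger(SweetSpotSimilarityFactory.class);

    @Override
    public Similarity getSimilarity() {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        // params holds the <str> values from schema.xml; getInt/getFloat
        // return null for a missing param, so all three must be set
        int max = this.params.getInt("max");
        int min = this.params.getInt("min");
        float steepness = this.params.getFloat("steepness");
        log.info("max: " + max);
        log.info("min: " + min);
        log.info("steepness: " + steepness);
        // yuck! hardcoded field settings for now
        // (the final argument is discountOverlaps)
        sim.setLengthNormFactors("body", min, max, steepness, true);
        return sim;
    }
}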

Thanks!
Further reading:
"Lucene and Juru at TREC 2007: 1-Million Queries Track"
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
Also, check out our Blacklight beta search!
http://labs.adsabs.harvard.edu/fulltext

