SlideShare a Scribd company logo
1 of 46
Hacking Lucene for
Custom Search Results
Doug Turnbull
OpenSource Connections
OpenSource Connections
Hello
Me
@softwaredoug
dturnbull@o19s.com
Us
OpenSource Connections
@o19s
http://o19s.com
- Trusted Advisors in Search, Discovery &
Analytics
OpenSource Connections
Related Resources
• Code for this talk can be found here:
o https://github.com/o19s/lucene-query-example
• Blog Articles:
o Build Your Own Custom Lucene Query and Scorer:
http://www.opensourceconnections.com/2014/01
/20/build-your-own-custom-lucene-query-and-
scorer/
o Using Custom Score Query for Custom Solr/Lucene
Scoring:
http://www.opensourceconnections.com/2014/03
/12/using-customscorequery-for-custom-
solrlucene-scoring/
OpenSource Connections
Tough Search Problems
• We have demanding users!
OpenSource Connections
Switch these two!
Tough Search Problems
• Demanding users!
OpenSource Connections
WRONG!
Make search do
what is in my head!
Tough Search Problems
• Our Eternal Problem:
o Customers don’t care about the technology
field of Information Retrieval: they just want
results
o BUT we are constrained by the tech!
OpenSource Connections
This is how
a search
engine
works!
In one ear Out the other
/dev/null
Satisfying User Expectations
• Easy: The Search Relevancy Game:
o Solr query operations (boosts, etc)
o Analysis of query/index to enhance matching
• Medium: Forget this, lets write some Java
o Solr query parsers. Reuse existing Lucene Queries to get
closer to user needs
OpenSource Connections
That Still Didn’t Work
• Look at him, he’s angrier than ever!
• For the toughest problems, we’ve made
search complex and brittle
• WHACK-A-MOLE:
o Fix one problem, cause another
o We give up,
OpenSource Connections
Next Level
• Hard: Custom Lucene Scoring – implement a query
and scorer to explicitly control matching and
scoring
OpenSource Connections
This is the Nuclear Option!
Shameless Plug
• How do we know if we’re making progress?
OpenSource Connections
• Quepid! – our search test driven workbench
o http://quepid.com
Lucene Lets Review
• At some point we wrote a Lucene index to a
directory
• Boilerplate (open up the index):
OpenSource Connections
Directory d = new RAMDirectory();
IndexReader ir = DirectoryReader.open(d);
IndexSearcher is = new IndexSearcher(ir);
Boilerplate setup of:
• Directory Lucene’s handle to
the FS
• IndexReader – Access to
Lucene’s data structures
• IndexSearcher – use index
searcher to perform search
Lucene Lets Review
• Queries:
• Queries That Combine Queries
OpenSource Connections
Make a Query and Search!
• TermQuery: basic term
search for a field
Term termToFind = new Term("tag", "space");
TermQuery spaceQ = new TermQuery(termToFind);
termToFind = new Term("tag", "star-trek");
TermQuery starTrekQ = new TermQuery(termToFind);
BooleanQuery bq = new BooleanQuery();
BooleanClause bClause = new BooleanClause(spaceQ, Occur.MUST);
BooleanClause bClause2 = new BooleanClause(starTrekQ, Occur.SHOULD);
bq.add(bClause);
bq.add(bClause2);
Lucene Lets Review
• Query responsible for specifying search behavior
• Both:
o Matching – what documents to include in the results
o Scoring – how relevant is a result to the query by assigning
a score
OpenSource Connections
Lucene Queries, 30,000 ft view
OpenSource Connections
LuceneQuery
IndexReader
Find
next
Match
IndexSearcher
Aka, “not really accurate, but what to tell your boss to not confuse them”
Next Match Plz
Here ya go
Score That Plz
Calc.
score
Score of last doc
First Stop CustomScoreQuery
• Wrap a query but override its score
OpenSource Connections
CustomScoreQuery
LuceneQuery
Find
next
Match
Calc.
score
CustomScoreProvider
Rescore doc
New Score
Next Match Plz
Here ya go
Score That Plz
Score of last doc
Result:
- Matching Behavior unchanged
- Scoring completely overriden
A chance to reorder results of a Lucene
Query by tweaking scoring
How to use?
• Use a normal Lucene query for matching
Term t = new Term("tag", "star-trek");
TermQuery tq = new TermQuery(t);
• Create & Use a CustomQueryScorer for scoring that
wraps the Lucene query
CountingQuery ct = new CountingQuery(tq);
OpenSource Connections
Implementation
• Extend CustomScoreQuery, provide a
CustomScoreProvider
OpenSource Connections
protected CustomScoreProvider getCustomScoreProvider(
AtomicReaderContext context) throws IOException {
return new CountingQueryScoreProvider("tag", context);
}
(boilerplate omitted)
Implementation
• CustomScoreProvider rescores each doc with
IndexReader & docId
OpenSource Connections
// Give all docs a score of 1.0
public float customScore(int doc, float subQueryScore, float
valSrcScores[]) throws IOException {
return (float)(1.0f); // New Score
}
Implementation
• Example: Sort by number of terms in a field
OpenSource Connections
// Rescores by counting the number of terms in the field
public float customScore(int doc, float subQueryScore, float
valSrcScores[]) throws IOException {
IndexReader r = context.reader();
Terms tv = r.getTermVector(doc, _field);
TermsEnum termsEnum = null;
termsEnum = tv.iterator(termsEnum);
int numTerms = 0;
while((termsEnum.next()) != null) {
numTerms++;
}
return (float)(numTerms); // New Score
}
CustomScoreQuery, Takeaway
• SIMPLE!
o Relatively few gotchas or bells & whistles (we will see lots of
gotchas)
• Limited
o No tight control on what matches
• If this satisfies your requirements: You should get off
the train here
OpenSource Connections
Lucene Circle Back
• I care about overriding scoring
o CustomScoreQuery
• I need to control custom scoring and matching
o Custom Lucene Queries!
OpenSource Connections
Example – Backwards Query
• Search for terms backwards!
o Instead of banana, lets create a query that finds ananab
matches and scores the document (5.0)
o But lets also match forward terms (banana), but with a
lower score (1.0)
• Disclaimer: its probably possible to do this with
easier means!
https://github.com/o19s/lucene-query-example/
OpenSource Connections
Lucene Queries, 30,000 ft view
OpenSource Connections
LuceneQuery
IndexReader
Find
next
Match
IndexSearcher
Aka, “not really accurate, but what to tell your boss to not confuse them”
Next Match Plz
Here ya go
Score That Plz
Calc.
score
Score of last doc
Anatomy of Lucene Query
OpenSource Connections
LuceneQuery
Weight
Scorer
A Tale Of Three Classes:
• Queries Create Weights:
• Query-level stats for this
search
• Think “IDF” when you
hear weights
• Weights Create Scorers:
• Heavy Lifting, reports
matches and returns a
score
Weight & Scorer are inner classes of Query
Next Match Plz
Here ya go
Score That Plz
Score of last doc
Find
next
Match
Calc.
score
Backwards Query Outline
OpenSource Connections
class BacwkardsQuery {
class BackwardsScorer {
// matching & scoring functionality
}
class BackwardsWeight {
// query normalization and other “global” stats
public Scorer scorer(AtomicReaderContext context, …)
}
public Weight createWeight(IndexSearcher)
}
How are these used?
OpenSource Connections
Query q = new BackwardsQuery();
idxSearcher.search(q);
This Setup Happens:
When you do:
Weight w = q.createWeight(idxSearcher);
normalize(w);
foreach IndexReader idxReader:
Scorer s = w.scorer(idxReader);
Important to know how Lucene is calling your code
Weight
OpenSource Connections
Weight w = q.createWeight(idxSearcher);
normalize(w);
What should we do with our weight?
IndexSearcher Level Stats
- Notice we pass the IndexSearcher when we create the weight
- Weight tracks IndexSearcher level statistics used for scoring
Query Normalization
- Weight also participates in query normalization
Remember – its your Weight! Weight can be a no-op and just create searchers
Weight & Query Normalization
OpenSource Connections
Query Normalization – an optional little ritual to take your Weight
instance through:
float v = weight.getValueForNormalization();
float norm = getSimilarity().queryNorm(v);
weight.normalize(norm, 1.0f);
What I think my weight is
Normalize that weight
against global statistics
Pass back the normalized stats
Weight & Query Normalization
• For TermQuery:
o The result of all this ceremony is the IDF (inverse document
frequency of the term).
• This code is fairly abstract
o All three steps are pluggable, and can be totally ignored
OpenSource Connections
float v = weight.getValueForNormalization();
float norm = getSimilarity().queryNorm(v);
weight.normalize(norm, 1.0f);
BackwardsWeight
• Custom Weight that completely ignores query
normalization:
OpenSource Connections
@Override
public float getValueForNormalization() throws IOException {
return 0.0f;
}
@Override
public void normalize(float norm, float topLevelBoost) {
// no-op
}
Weights make Scorers!
• Scorers Have Two Jobs:
o Match! – iterator interface over matching results
o Score! – score the current match
OpenSource Connections
@Override
public Scorer scorer(AtomicReaderContext context, boolean
scoreDocsInOrder, boolean topScorer, Bits acceptDocs) throws
IOException {
return new BackwardsScorer(...);
}
Scorer as an iterator
Inherits the following from DocsEnum:
• nextDoc()
o Next match
• advance(int docId) –
o Seek to the specified docId
• docID()
o Id of the current document we’re on
Oh and, from Scorer
• score()
o Score the current docID()
OpenSource Connections
DocsEnum
Scorer
DocIdSetIterator
In other words…
• Remember THIS?
OpenSource Connections
LuceneQuery
IndexReader
Find
next
Match
IndexSearcher
Next Match Plz
Here ya go
Score That Plz
Calc.
score
Score of curr doc
LuceneQueryLuceneScorer
…Actually…
nextDoc()
score()
Scorer == Engine of the Query
What would nextDoc look like
• Remember search is an inverted index
o Much like a book index
o Fields -> Terms -> Documents!
OpenSource Connections
IndexReader == our handle to inverted index:
• Much like an index. Given term, return list
of doc ids
• TermsEnum:
• Enumeration of terms (actual logical
index of terms)
• DocsEnum
• Enum. of corresponding docIDs (like
list of pages next to term)
What would nextDoc look like?
OpenSource Connections
IndexReader
Find
next
Match
Calc.
score
LuceneScorerfinal TermsEnum termsEnum =
reader.terms(term.field()).iterator(null);
termsEnum.seekExact(term.bytes(), state);
• TermsEnum to lookup info for a Term:
DocsEnum docs = termsEnum.docs(acceptDocs, null);
• Each term has a DocsEnum that lists the
docs that contain this term:
What would nextDoc look like?
OpenSource Connections
IndexReader
Find
next
Match
Calc.
score
LuceneScorer
@Override
public int nextDoc() throws IOException {
return docs.nextDoc();
}
• Wrapping this enum, now I can return matches
for this term!
• You’ve just implemented TermQuery!
BackwardsScorer nextDoc
• Later, when creating a Scorer. Get a handle to
DocsEnum for our backwards term:
OpenSource Connections
public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder,
boolean topScorer, Bits acceptDocs) throws IOException {
Term bwdsTerm = BackwardsQuery.this.backwardsTerm;
TermsEnum bwdsTerms = context.reader().terms(bwdsTerm.field()).iterator(null);
bwdsTerms.seekExact(bwdsTerm.bytes());
DocsEnum bwdsDocs = bwdsTerms.docs(acceptDocs, null);
• Recall our Query has a Backwards Term (ananab):
public BackwardsQuery(String field, String term) {
backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString());
...
}
Terrifying and verbose Lucene speak for:
1. Seek to term in field via TermsEnum
2. Give me a DocsEnum of matching docs
BackwardsScorer nextDoc
• Our scorer has bwdDocs and fwdDocs, our nextDoc
just walks both:
OpenSource Connections
@Override
public int nextDoc() throws IOException {
int currDocId = docID();
// increment one or both
if (currDocId == backwardsScorer.docID()) {
backwardsScorer.nextDoc();
}
if (currDocId == forwardsScorer.docID()) {
forwardsScorer.nextDoc();
}
return docID();
}
Scorer for scores!
• Score is easy! Implement score,
do whatever you want!
OpenSource Connections
IndexReader
Find
next
Match
Calc.
score
LuceneScorer@Override
public float score() throws
IOException {
return 1.0f;
}
We call docID() in nextDoc()
BackwardsScorer Score
• Recall, match a backwards term (ananab)score =
5.0, fwd term (banana) score = 1.0
• We hook into docID, update score based on
current posn
OpenSource Connections
@Override
public int docID() {
int backwordsDocId = backwardsScorer.docID();
int forwardsDocId = forwardsScorer.docID();
if (backwordsDocId <= forwardsDocId && backwordsDocId != NO_MORE_DOCS) {
currScore = BACKWARDS_SCORE;
return backwordsDocId;
} else if (forwardsDocId != NO_MORE_DOCS) {
currScore = FORWARDS_SCORE;
return forwardsDocId;
}
return NO_MORE_DOCS;
}
Currently positioned on a
bwds doc, set currScore to 5.0
Currently positioned on a fwd
doc, set currScore to 1.0
BackwardsScorer Score
• For completeness sake, here’s our
score:
OpenSource Connections
@Override
public float score() throws
IOException {
return currScore;
}
IndexReader
Find
next
Match
Calc.
score
LuceneScorer
So many gotchas!
• Ultimate POWER! But You will have weird bugs:
o Do all of your searches return the results of your first query?
• In Query Implement hashCode and equals
o Weird/Random Test Failures
• Test using LuceneTestCase to ferret out common Lucene bugs
o Randomized testing w/ different codecs etc
o IndexReader methods have a certain ritual and very specific
rules, (enums must be primed, etc)
OpenSource Connections
Extras
• Query rewrite method
o Optional, recognize you are a complex query, turn yourself
into a simpler one
• BooleanQuery with 1 clause -> return just one clause
• Weight has optional explain
o Useful for debugging in Solr
o Pretty straight-forward API
OpenSource Connections
Conclusions!
• These are nuclear options!
o You can achieve SO MUCH before
you get here (at much less
complexity)
o There’s certainly a way to do what
you’ve seen without this level of
control
• Fun way to learn about Lucene!
OpenSource Connections
QUESTIONS?
OpenSource Connections
Feel free to contact me
dturnbull@opensourceconnections.com
@softwaredoug
OpenSource Connections

More Related Content

What's hot

하이브 최적화 방안
하이브 최적화 방안하이브 최적화 방안
하이브 최적화 방안Teddy Choi
 
Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersDataWorks Summit
 
Obscure Go Optimisations
Obscure Go OptimisationsObscure Go Optimisations
Obscure Go OptimisationsBryan Boreham
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101Nick Dimiduk
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera, Inc.
 
Extending Druid Index File
Extending Druid Index FileExtending Druid Index File
Extending Druid Index FileNavis Ryu
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBDocker, Inc.
 
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話Yahoo!デベロッパーネットワーク
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariDataWorks Summit
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureDan McKinley
 
Analisadores de protocolo: comparação e uso
Analisadores de protocolo: comparação e usoAnalisadores de protocolo: comparação e uso
Analisadores de protocolo: comparação e usoJerônimo Medina Madruga
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
Apache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureApache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureHortonworks
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationDataWorks Summit
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 

What's hot (20)

An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
하이브 최적화 방안
하이브 최적화 방안하이브 최적화 방안
하이브 최적화 방안
 
Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size Matters
 
Obscure Go Optimisations
Obscure Go OptimisationsObscure Go Optimisations
Obscure Go Optimisations
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera cluster
 
Extending Druid Index File
Extending Druid Index FileExtending Druid Index File
Extending Druid Index File
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
 
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with Ambari
 
Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive Hortonworks Technical Workshop: Interactive Query with Apache Hive
Hortonworks Technical Workshop: Interactive Query with Apache Hive
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds Architecture
 
Analisadores de protocolo: comparação e uso
Analisadores de protocolo: comparação e usoAnalisadores de protocolo: comparação e uso
Analisadores de protocolo: comparação e uso
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Apache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureApache Ambari: Past, Present, Future
Apache Ambari: Past, Present, Future
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 

Similar to Hacking Lucene for Custom Search Results

Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document ClassificationAlessandro Benedetti
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginsearchbox-com
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Dev buchan leveraging
Dev buchan leveragingDev buchan leveraging
Dev buchan leveragingBill Buchan
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 

Similar to Hacking Lucene for Custom Search Results (20)

Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document Classification
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component plugin
 
Android Database
Android DatabaseAndroid Database
Android Database
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Dev buchan leveraging
Dev buchan leveragingDev buchan leveraging
Dev buchan leveraging
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Test box bdd
Test box bddTest box bdd
Test box bdd
 
Linq to sql
Linq to sqlLinq to sql
Linq to sql
 

More from OpenSource Connections

How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessOpenSource Connections
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019OpenSource Connections
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullOpenSource Connections
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonOpenSource Connections
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...OpenSource Connections
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajOpenSource Connections
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...OpenSource Connections
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlOpenSource Connections
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesOpenSource Connections
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...OpenSource Connections
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...OpenSource Connections
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...OpenSource Connections
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...OpenSource Connections
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...OpenSource Connections
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah ViaOpenSource Connections
 

More from OpenSource Connections (20)

Encores
EncoresEncores
Encores
 
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Hacking Lucene for Custom Search Results

  • 1. Hacking Lucene for Custom Search Results Doug Turnbull OpenSource Connections OpenSource Connections
  • 2. Hello Me @softwaredoug dturnbull@o19s.com Us OpenSource Connections @o19s http://o19s.com - Trusted Advisors in Search, Discovery & Analytics OpenSource Connections
  • 3. Related Resources • Code for this talk can be found here: o https://github.com/o19s/lucene-query-example • Blog Articles: o Build Your Own Custom Lucene Query and Scorer: http://www.opensourceconnections.com/2014/01 /20/build-your-own-custom-lucene-query-and- scorer/ o Using Custom Score Query for Custom Solr/Lucene Scoring: http://www.opensourceconnections.com/2014/03 /12/using-customscorequery-for-custom- solrlucene-scoring/ OpenSource Connections
  • 4. Tough Search Problems • We have demanding users! OpenSource Connections Switch these two!
  • 5. Tough Search Problems • Demanding users! OpenSource Connections WRONG! Make search do what is in my head!
  • 6. Tough Search Problems • Our Eternal Problem: o Customers don’t care about the technology field of Information Retrieval: they just want results o BUT we are constrained by the tech! OpenSource Connections This is how a search engine works! In one ear Out the other /dev/null
  • 7. Satisfying User Expectations • Easy: The Search Relevancy Game: o Solr query operations (boosts, etc) o Analysis of query/index to enhance matching • Medium: Forget this, lets write some Java o Solr query parsers. Reuse existing Lucene Queries to get closer to user needs OpenSource Connections
  • 8. That Still Didn’t Work • Look at him, he’s angrier than ever! • For the toughest problems, we’ve made search complex and brittle • WHACK-A-MOLE: o Fix one problem, cause another o We give up, OpenSource Connections
  • 9. Next Level • Hard: Custom Lucene Scoring – implement a query and scorer to explicitly control matching and scoring OpenSource Connections This is the Nuclear Option!
  • 10. Shameless Plug • How do we know if we’re making progress? OpenSource Connections • Quepid! – our search test driven workbench o http://quepid.com
  • 11. Lucene Lets Review • At some point we wrote a Lucene index to a directory • Boilerplate (open up the index): OpenSource Connections Directory d = new RAMDirectory(); IndexReader ir = DirectoryReader.open(d); IndexSearcher is = new IndexSearcher(ir); Boilerplate setup of: • Directory Lucene’s handle to the FS • IndexReader – Access to Lucene’s data structures • IndexSearcher – use index searcher to perform search
  • 12. Lucene Lets Review • Queries: • Queries That Combine Queries OpenSource Connections Make a Query and Search! • TermQuery: basic term search for a field Term termToFind = new Term("tag", "space"); TermQuery spaceQ = new TermQuery(termToFind); termToFind = new Term("tag", "star-trek"); TermQuery starTrekQ = new TermQuery(termToFind); BooleanQuery bq = new BooleanQuery(); BooleanClause bClause = new BooleanClause(spaceQ, Occur.MUST); BooleanClause bClause2 = new BooleanClause(starTrekQ, Occur.SHOULD); bq.add(bClause); bq.add(bClause2);
  • 13. Lucene Lets Review • Query responsible for specifying search behavior • Both: o Matching – what documents to include in the results o Scoring – how relevant is a result to the query by assigning a score OpenSource Connections
  • 14. Lucene Queries, 30,000 ft view OpenSource Connections LuceneQuery IndexReader Find next Match IndexSearcher Aka, “not really accurate, but what to tell your boss to not confuse them” Next Match Plz Here ya go Score That Plz Calc. score Score of last doc
  • 15. First Stop CustomScoreQuery • Wrap a query but override its score OpenSource Connections CustomScoreQuery LuceneQuery Find next Match Calc. score CustomScoreProvider Rescore doc New Score Next Match Plz Here ya go Score That Plz Score of last doc Result: - Matching Behavior unchanged - Scoring completely overriden A chance to reorder results of a Lucene Query by tweaking scoring
  • 16. How to use? • Use a normal Lucene query for matching Term t = new Term("tag", "star-trek"); TermQuery tq = new TermQuery(t); • Create & Use a CustomQueryScorer for scoring that wraps the Lucene query CountingQuery ct = new CountingQuery(tq); OpenSource Connections
  • 17. Implementation • Extend CustomScoreQuery, provide a CustomScoreProvider OpenSource Connections protected CustomScoreProvider getCustomScoreProvider( AtomicReaderContext context) throws IOException { return new CountingQueryScoreProvider("tag", context); } (boilerplate omitted)
  • 18. Implementation • CustomScoreProvider rescores each doc with IndexReader & docId OpenSource Connections // Give all docs a score of 1.0 public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException { return (float)(1.0f); // New Score }
  • 19. Implementation • Example: Sort by number of terms in a field OpenSource Connections // Rescores by counting the number of terms in the field public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException { IndexReader r = context.reader(); Terms tv = r.getTermVector(doc, _field); TermsEnum termsEnum = null; termsEnum = tv.iterator(termsEnum); int numTerms = 0; while((termsEnum.next()) != null) { numTerms++; } return (float)(numTerms); // New Score }
  • 20. CustomScoreQuery, Takeaway • SIMPLE! o Relatively few gotchas or bells & whistles (we will see lots of gotchas) • Limited o No tight control on what matches • If this satisfies your requirements: You should get off the train here OpenSource Connections
  • 21. Lucene Circle Back • I care about overriding scoring o CustomScoreQuery • I need to control custom scoring and matching o Custom Lucene Queries! OpenSource Connections
  • 22. Example – Backwards Query • Search for terms backwards! o Instead of banana, lets create a query that finds ananab matches and scores the document (5.0) o But lets also match forward terms (banana), but with a lower score (1.0) • Disclaimer: its probably possible to do this with easier means! https://github.com/o19s/lucene-query-example/ OpenSource Connections
  • 23. Lucene Queries, 30,000 ft view OpenSource Connections LuceneQuery IndexReader Find next Match IndexSearcher Aka, “not really accurate, but what to tell your boss to not confuse them” Next Match Plz Here ya go Score That Plz Calc. score Score of last doc
  • 24. Anatomy of Lucene Query OpenSource Connections LuceneQuery Weight Scorer A Tale Of Three Classes: • Queries Create Weights: • Query-level stats for this search • Think “IDF” when you hear weights • Weights Create Scorers: • Heavy Lifting, reports matches and returns a score Weight & Scorer are inner classes of Query Next Match Plz Here ya go Score That Plz Score of last doc Find next Match Calc. score
  • 25. Backwards Query Outline OpenSource Connections class BacwkardsQuery { class BackwardsScorer { // matching & scoring functionality } class BackwardsWeight { // query normalization and other “global” stats public Scorer scorer(AtomicReaderContext context, …) } public Weight createWeight(IndexSearcher) }
  • 26. How are these used? OpenSource Connections Query q = new BackwardsQuery(); idxSearcher.search(q); This Setup Happens: When you do: Weight w = q.createWeight(idxSearcher); normalize(w); foreach IndexReader idxReader: Scorer s = w.scorer(idxReader); Important to know how Lucene is calling your code
  • 27. Weight OpenSource Connections Weight w = q.createWeight(idxSearcher); normalize(w); What should we do with our weight? IndexSearcher Level Stats - Notice we pass the IndexSearcher when we create the weight - Weight tracks IndexSearcher level statistics used for scoring Query Normalization - Weight also participates in query normalization Remember – its your Weight! Weight can be a no-op and just create searchers
  • 28. Weight & Query Normalization OpenSource Connections Query Normalization – an optional little ritual to take your Weight instance through: float v = weight.getValueForNormalization(); float norm = getSimilarity().queryNorm(v); weight.normalize(norm, 1.0f); What I think my weight is Normalize that weight against global statistics Pass back the normalized stats
  • 29. Weight & Query Normalization • For TermQuery: o The result of all this ceremony is the IDF (inverse document frequency of the term). • This code is fairly abstract o All three steps are pluggable, and can be totally ignored OpenSource Connections float v = weight.getValueForNormalization(); float norm = getSimilarity().queryNorm(v); weight.normalize(norm, 1.0f);
  • 30. BackwardsWeight • Custom Weight that completely ignores query normalization: OpenSource Connections @Override public float getValueForNormalization() throws IOException { return 0.0f; } @Override public void normalize(float norm, float topLevelBoost) { // no-op }
  • 31. Weights make Scorers! • Scorers Have Two Jobs: o Match! – iterator interface over matching results o Score! – score the current match OpenSource Connections @Override public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder, boolean topScorer, Bits acceptDocs) throws IOException { return new BackwardsScorer(...); }
  • 32. Scorer as an iterator Inherits the following from DocsEnum: • nextDoc() o Next match • advance(int docId) – o Seek to the specified docId • docID() o Id of the current document we’re on Oh and, from Scorer • score() o Score the current docID() OpenSource Connections DocsEnum Scorer DocIdSetIterator
  • 33. In other words… • Remember THIS? OpenSource Connections LuceneQuery IndexReader Find next Match IndexSearcher Next Match Plz Here ya go Score That Plz Calc. score Score of curr doc LuceneQueryLuceneScorer …Actually… nextDoc() score() Scorer == Engine of the Query
  • 34. What would nextDoc look like • Remember search is an inverted index o Much like a book index o Fields -> Terms -> Documents! OpenSource Connections IndexReader == our handle to inverted index: • Much like an index. Given term, return list of doc ids • TermsEnum: • Enumeration of terms (actual logical index of terms) • DocsEnum • Enum. of corresponding docIDs (like list of pages next to term)
  • 35. What would nextDoc look like? OpenSource Connections IndexReader Find next Match Calc. score LuceneScorerfinal TermsEnum termsEnum = reader.terms(term.field()).iterator(null); termsEnum.seekExact(term.bytes(), state); • TermsEnum to lookup info for a Term: DocsEnum docs = termsEnum.docs(acceptDocs, null); • Each term has a DocsEnum that lists the docs that contain this term:
  • 36. What would nextDoc look like? OpenSource Connections IndexReader Find next Match Calc. score LuceneScorer @Override public int nextDoc() throws IOException { return docs.nextDoc(); } • Wrapping this enum, now I can return matches for this term! • You’ve just implemented TermQuery!
  • 37. BackwardsScorer nextDoc • Later, when creating a Scorer. Get a handle to DocsEnum for our backwards term: OpenSource Connections public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder, boolean topScorer, Bits acceptDocs) throws IOException { Term bwdsTerm = BackwardsQuery.this.backwardsTerm; TermsEnum bwdsTerms = context.reader().terms(bwdsTerm.field()).iterator(null); bwdsTerms.seekExact(bwdsTerm.bytes()); DocsEnum bwdsDocs = bwdsTerms.docs(acceptDocs, null); • Recall our Query has a Backwards Term (ananab): public BackwardsQuery(String field, String term) { backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString()); ... } Terrifying and verbose Lucene speak for: 1. Seek to term in field via TermsEnum 2. Give me a DocsEnum of matching docs
  • 38. BackwardsScorer nextDoc • Our scorer has bwdDocs and fwdDocs, our nextDoc just walks both: OpenSource Connections @Override public int nextDoc() throws IOException { int currDocId = docID(); // increment one or both if (currDocId == backwardsScorer.docID()) { backwardsScorer.nextDoc(); } if (currDocId == forwardsScorer.docID()) { forwardsScorer.nextDoc(); } return docID(); }
  • 39. Scorer for scores! • Score is easy! Implement score, do whatever you want! OpenSource Connections IndexReader Find next Match Calc. score LuceneScorer@Override public float score() throws IOException { return 1.0f; }
  • 40. We call docID() in nextDoc() BackwardsScorer Score • Recall, match a backwards term (ananab)score = 5.0, fwd term (banana) score = 1.0 • We hook into docID, update score based on current posn OpenSource Connections @Override public int docID() { int backwordsDocId = backwardsScorer.docID(); int forwardsDocId = forwardsScorer.docID(); if (backwordsDocId <= forwardsDocId && backwordsDocId != NO_MORE_DOCS) { currScore = BACKWARDS_SCORE; return backwordsDocId; } else if (forwardsDocId != NO_MORE_DOCS) { currScore = FORWARDS_SCORE; return forwardsDocId; } return NO_MORE_DOCS; } Currently positioned on a bwds doc, set currScore to 5.0 Currently positioned on a fwd doc, set currScore to 1.0
  • 41. BackwardsScorer Score • For completeness sake, here’s our score: OpenSource Connections @Override public float score() throws IOException { return currScore; } IndexReader Find next Match Calc. score LuceneScorer
  • 42. So many gotchas! • Ultimate POWER! But You will have weird bugs: o Do all of your searches return the results of your first query? • In Query Implement hashCode and equals o Weird/Random Test Failures • Test using LuceneTestCase to ferret out common Lucene bugs o Randomized testing w/ different codecs etc o IndexReader methods have a certain ritual and very specific rules, (enums must be primed, etc) OpenSource Connections
  • 43. Extras • Query rewrite method o Optional, recognize you are a complex query, turn yourself into a simpler one • BooleanQuery with 1 clause -> return just one clause • Weight has optional explain o Useful for debugging in Solr o Pretty straight-forward API OpenSource Connections
  • 44. Conclusions! • These are nuclear options! o You can achieve SO MUCH before you get here (at much less complexity) o There’s certainly a way to do what you’ve seen without this level of control • Fun way to learn about Lucene! OpenSource Connections
  • 45. QUESTIONS? OpenSource Connections Feel free to contact me dturnbull@opensourceconnections.com @softwaredoug