SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Downloaden Sie, um offline zu lesen
Recent Additions to Lucene’s Arsenal
Shai Erera, Researcher, IBM
Adrien Grand, ElasticSearch
Who We Are
•

Shai Erera
–
–
–
–

•

Working at IBM – Information Retrieval Research
Lucene/Solr committer and PMC member
http://shaierera.blogspot.com
shaie@apache.org

Adrien Grand
–
–
–

@jpountz
Lucene/Solr committer and PMC member
Software engineer at Elasticsearch
The Replicator
Load Balancing

Load
Balancer
Failover
Index Backup
Replicator

Replication
Client

The Replicator

Backup

Replication
Client

Primary

Backup
http://shaierera.blogspot.com/2013/05/the-replicator.html
Replication Components
•

Replicator
–
–
–

•

Revision
–
–

•

Describes a list of files and metadata
Responsible to ensure the files are available as long as clients replicate it

ReplicationClient
–
–
–

•

Mediates between the client and server
Manages the published Revisions
Implementation for replication over HTTP

Performs the replication operation on the replica side
Copies delta files and invokes ReplicationHandler upon successful copy
Always replicates latest revision

ReplicationHandler
–

Acts on the copied files
Index Replication
•

IndexRevision
–
–

•

IndexReplicationHandler
–
–
–

•

Obtains a snapshot on the last commit through SnapshotDeletionPolicy
Released when revision is released by Replicator
Copies the files to the index directory and fsync them
Aborts (rollback) on any error
Upon successful completion, invokes a callback (e.g.
SearcherManager.maybeRefresh())

Similar extensions for faceted index replication
–
–

IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy
indexes
IndexAndTaxonomyReplicationHandler: copies the files to the respective
directories, keeping both in sync
Sample Code
// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));
// Client-side: replicate a Revision
Replicator replicator; // either LocalReplicator or HttpReplicator
// refresh SearcherManager after index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
public Boolean call() throws Exception {
// index was updated, refresh manager
searcherManager.maybeRefresh();
}
}
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);
client.updateNow(); // invoke client manually
// -- OR -client.startUpdateThread(30000); // check for updates every 30 seconds
Future Work
•

Resume
–
–

•

Parallel Replication
–

•

Session level: don’t copy files that were already successfully copied
File level: don’t copy file parts that were already successfully copied
Copy revision files in parallel

Other replication strategies
–

Peer-to-peer
Index Sorting
How to trade index speed for search speed
Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data = create a new segment
Segments get eventually merged together
Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Id

1

3

10

4

7

20

42

11

9

8

15

18

30

31

99

5

12

Price

9

0

7

8

2

2

1

8

10

3

4

4

6

10

1

1

13

Id

12

9

31

1

4

11

10

30

15

18

8

7

20

42

99

5

3

Price

13

10

10

9

8

8

7

6

4

4

3

2

2

1

1

1

0
Anatomy of a Lucene index
ordinal of a doc in a segment = doc id
used in the inverted index to refer to docs

shoe

1, 3, 5, 8, 11, 13, 15

doc id

0

1

2

3

4

5

7

8

9

10 11 12 13 14 15 16

Id

1

3

10

4

7

20 42 11

9

8

15 18 30 31 99

5

12

Price

9

0

7

8

2

2

10

3

4

1

13

6
1

8

4

6

10

1
Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Id

1

3

10

4

7

20

42

11

9

8

15

18

30

31

99

5

12

Price

9

0

7

8

2

2

1

8

10

3

4

4

6

10

1

1

13

()

(3)

Create an empty
priority queue

(3,4)

(4,20)

(4,9)

Automatic overflow of the
priority queue to remove the
least one

(4,9)

(9,31) (9,31)

Top hits
Early termination
Let’s do the same on a sorted index

Id

12

9

31

1

4

11

10

30

15

18

8

7

20

42

99

5

3

Price

13

10

10

9

8

8

7

6

4

4

3

2

2

1

1

1

0

()

(9)

(9,31) (9,31)

Priority queue never
changes after this
document

(9,31)

(9,31)

(9,31) (9,31)
Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ranks
– not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
– total number of matches
– faceting
Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635
Many more...

Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank
Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable
Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!
Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content
Offline Sorting
// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = DirectoryReader.open(in);
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
AtomicReader sortedReader = SortingAtomicReader.wrap(
SlowCompositeReaderWrapper.wrap(reader), sorter);
// copy the content of the sorted reader to the new dir
IndexWriter writer = new IndexWriter(out, iwConf);
writer.addIndexes(sortedReader);
writer.close();
reader.close();
Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis
But segments are immutable
– must be sorted before starting writing them
Online sorting?
2 sources of segments
– flush
– merge
flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory
merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments
not a bad trade-off
– flushed segments are usually small & fast to collect
Online sorting?

Merged segments can easily take 99+%
of the size of the index

Merged segments

Flushed segments
- NRT reopens
- RAM buffer size limit hit

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Online Sorting
IndexWriterConfig iwConf = new IndexWriterConfig(...);
// original MergePolicy finds the segments to merge
MergePolicy origMP = iwConf.getMergePolicy();
// SortingMergePolicy wraps the segments with a sorted view
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);
// setup IndexWriter to use SortingMergePolicy
iwConf.setMergePolicy(sortingMP);
IndexWriter writer = new IndexWriter(dir, iwConf);
// index as usual
Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!
Online sorting
– no early termination on flushed segments
– early termination on merged segments
– if N matches have been collected
– or if current match is less than the top of the PQ
Early Termination
class MyCollector extends Collector {
@Override
public void setNextReader(AtomicReaderContext context) throws IOException {
readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
collected = 0;
}
@Override
public void collect(int doc) throws IOException {
if (readerIsSorted &&
(++collected >= maxDocsToCollect || curVal <= pq.top()) {
// Special exception that tells IndexSearcher to terminate
// collection of the current segment
throw new CollectionTerminatedException();
} else {
// collect hit
}
}
}
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
Erik Hatcher
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
Erik Hatcher
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 

Was ist angesagt? (20)

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component plugin
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr 4
Solr 4Solr 4
Solr 4
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis Tricks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 

Ähnlich wie Recent Additions to Lucene Arsenal

SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced Session
Splunk
 
Hands on training on DbFit Part-II
Hands on training on DbFit Part-IIHands on training on DbFit Part-II
Hands on training on DbFit Part-II
Babul Mirdha
 

Ähnlich wie Recent Additions to Lucene Arsenal (20)

Did you mean 'Galene'?
Did you mean 'Galene'?Did you mean 'Galene'?
Did you mean 'Galene'?
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced Session
 
Micro frontend: The microservices puzzle extended to frontend
Micro frontend: The microservices puzzle  extended to frontendMicro frontend: The microservices puzzle  extended to frontend
Micro frontend: The microservices puzzle extended to frontend
 
Hands on training on DbFit Part-II
Hands on training on DbFit Part-IIHands on training on DbFit Part-II
Hands on training on DbFit Part-II
 
High Performance Mysql
High Performance MysqlHigh Performance Mysql
High Performance Mysql
 
Andrzej bialecki lr-2013-dublin
Andrzej bialecki lr-2013-dublinAndrzej bialecki lr-2013-dublin
Andrzej bialecki lr-2013-dublin
 
Professionalizing the Front-end
Professionalizing the Front-endProfessionalizing the Front-end
Professionalizing the Front-end
 
Redshift Chartio Event Presentation
Redshift Chartio Event PresentationRedshift Chartio Event Presentation
Redshift Chartio Event Presentation
 
SVN Tool Information : Best Practices
SVN Tool Information  : Best PracticesSVN Tool Information  : Best Practices
SVN Tool Information : Best Practices
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
SFDC Deployments
SFDC DeploymentsSFDC Deployments
SFDC Deployments
 
ZooKeeper (and other things)
ZooKeeper (and other things)ZooKeeper (and other things)
ZooKeeper (and other things)
 
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle DatabaseOracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database
 
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
Oracle AHF Insights 23c: Deeper Diagnostic Insights for your Oracle Database ...
 
01 oracle architecture
01 oracle architecture01 oracle architecture
01 oracle architecture
 
System analysis
System analysisSystem analysis
System analysis
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Sql optimize
Sql optimizeSql optimize
Sql optimize
 
Day 7 - Make it Fast
Day 7 - Make it FastDay 7 - Make it Fast
Day 7 - Make it Fast
 

Mehr von lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution
 

Mehr von lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucene
 
10 keys to Solr's Future
10 keys to Solr's Future10 keys to Solr's Future
10 keys to Solr's Future
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Recent Additions to Lucene Arsenal

  • 1.
  • 2. Recent Additions to Lucene’s Arsenal Shai Erera, Researcher, IBM Adrien Grand, ElasticSearch
  • 3. Who We Are • Shai Erera – – – – • Working at IBM – Information Retrieval Research Lucene/Solr committer and PMC member http://shaierera.blogspot.com shaie@apache.org Adrien Grand – – – @jpountz Lucene/Solr committer and PMC member Software engineer at Elasticsearch
  • 9. Replication Components • Replicator – – – • Revision – – • Describes a list of files and metadata Responsible to ensure the files are available as long as clients replicate it ReplicationClient – – – • Mediates between the client and server Manages the published Revisions Implementation for replication over HTTP Performs the replication operation on the replica side Copies delta files and invokes ReplicationHandler upon successful copy Always replicates latest revision ReplicationHandler – Acts on the copied files
  • 10. Index Replication • IndexRevision – – • IndexReplicationHandler – – – • Obtains a snapshot on the last commit through SnapshotDeletionPolicy Released when revision is released by Replicator Copies the files to the index directory and fsync them Aborts (rollback) on any error Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh()) Similar extensions for faceted index replication – – IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync
  • 11. Sample Code // Server-side: publish a new Revision Replicator replicator = new LocalReplicator(); replicator.publish(new IndexRevision(indexWriter)); // Client-side: replicate a Revision Replicator replicator; // either LocalReplicator or HttpReplicator // refresh SearcherManager after index is updated Callable<Boolean> callback = new Callable<Boolean>() { public Boolean call() throws Exception { // index was updated, refresh manager searcherManager.maybeRefresh(); } } ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback); SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir); ReplicationClient client = new ReplicationClient(replicator, handler, factory); client.updateNow(); // invoke client manually // -- OR -client.startUpdateThread(30000); // check for updates every 30 seconds
  • 12. Future Work • Resume – – • Parallel Replication – • Session level: don’t copy files that were already successfully copied File level: don’t copy file parts that were already successfully copied Copy revision files in parallel Other replication strategies – Peer-to-peer
  • 13. Index Sorting How to trade index speed for search speed
  • 14. Anatomy of a Lucene index Index = collection of immutable segments Segments store documents sequentially on disk Add data = create a new segment Segments get eventually merged together Order of segments / documents in segments doesn’t matter – the following segments are equivalent Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13 Id 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3 Price 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
  • 15. Anatomy of a Lucene index ordinal of a doc in a segment = doc id used in the inverted index to refer to docs shoe 1, 3, 5, 8, 11, 13, 15 doc id 0 1 2 3 4 5 7 8 9 10 11 12 13 14 15 16 Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 10 3 4 1 13 6 1 8 4 6 10 1
  • 16. Top hits Get top N=2 results: – Create a priority queue of size N – Accumulate matching docs Id 1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12 Price 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13 () (3) Create an empty priority queue (3,4) (4,20) (4,9) Automatic overflow of the priority queue to remove the least one (4,9) (9,31) (9,31) Top hits
  • 17. Early termination Let’s do the same on a sorted index Id 12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3 Price 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0 () (9) (9,31) (9,31) Priority queue never changes after this document (9,31) (9,31) (9,31) (9,31)
  • 18. Early termination Pros – makes finding the top hits much faster – file-system cache-friendly Cons – only works for static ranks – not if the sort order depends on the query – requires the index to be sorted – doesn’t work for tasks that require visiting every doc: – total number of matches – faceting
  • 19. Static ranks Not uncommon! Graph-based ranks – Google’s PageRank Facebook social search / Unicorn – https://www.facebook.com/publications/219621248185635 Many more... Doesn’t need to be the exact sort order – heuristics when score is only a function of the static rank
  • 20. Offline sorting A live index can’t be kept sorted – would require inserting docs between existing docs! – segments are immutable Offline sorting to the rescue: – index as usual – sort into a new index – search! Pros/cons – super fast to search, the whole index is fully sorted – but only works for static content
  • 21. Offline Sorting // open a reader on the unsorted index and create a sorted (but slow) view DirectoryReader reader = DirectoryReader.open(in); boolean ascending = false; Sorter sorter = new NumericDocValuesSorter("price", ascending); AtomicReader sortedReader = SortingAtomicReader.wrap( SlowCompositeReaderWrapper.wrap(reader), sorter); // copy the content of the sorted reader to the new dir IndexWriter writer = new IndexWriter(out, iwConf); writer.addIndexes(sortedReader); writer.close(); reader.close();
  • 22. Online sorting? Sort segments independently – wouldn’t require inserting data into existing segments – collection could still be early-terminated on a per-segment basis But segments are immutable – must be sorted before starting writing them
  • 23. Online sorting? 2 sources of segments – flush – merge flushed segments can’t be sorted – Lucene writes stored fields to disk on the fly – could be buffered but this would require a lot of memory merged segments can be sorted – create a sorted view over the segments to merge – pass this view to SegmentMerger instead of the original segments not a bad trade-off – flushed segments are usually small & fast to collect
  • 24. Online sorting? Merged segments can easily take 99+% of the size of the index Merged segments Flushed segments - NRT reopens - RAM buffer size limit hit http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
  • 25. Online Sorting IndexWriterConfig iwConf = new IndexWriterConfig(...); // original MergePolicy finds the segments to merge MergePolicy origMP = iwConf.getMergePolicy(); // SortingMergePolicy wraps the segments with a sorted view boolean ascending = false; Sorter sorter = new NumericDocValuesSorter("price", ascending); MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter); // setup IndexWriter to use SortingMergePolicy iwConf.setMergePolicy(sortingMP); IndexWriter writer = new IndexWriter(dir, iwConf); // index as usual
  • 26. Early termination Collect top N matches Offline sorting – index sorted globally – early terminate after N matches have been collected – no priority queue needed! Online sorting – no early termination on flushed segments – early termination on merged segments – if N matches have been collected – or if current match is less than the top of the PQ
  • 27. Early Termination class MyCollector extends Collector { @Override public void setNextReader(AtomicReaderContext context) throws IOException { readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter); collected = 0; } @Override public void collect(int doc) throws IOException { if (readerIsSorted && (++collected >= maxDocsToCollect || curVal <= pq.top()) { // Special exception that tells IndexSearcher to terminate // collection of the current segment throw new CollectionTerminatedException(); } else { // collect hit } } }