SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Downloaden Sie, um offline zu lesen
Improved Search with Lucene 4.0
            Robert Muir, Lucid Imagination
    robert.muir@lucidimagination.com, 10/19/2011
Introductions
 What: Examine improvements in 4.0
 Who am I?
  • Lucene Committer & PMC Member
  • Lucid Imagination Employee
 Many big changes coming!
 Examine three general areas:
  • Indexing Improvements
  • Search Improvements
  • Performance Improvements
 Future improvements

                     2
Indexing Improvements


 Codecs
  • Flexible index format
 Index DocValues
  • Efficient per-document lookup values
 Miscellaneous Improvements
  • Binary terms
  • Additional index statistics



                       3
Codecs: Introduction


 Problem: “one size fits all” index format
 How to integrate future improvements?
   • Example: more efficient index compression
   • But need to support old format, too.
 How to optimize for your data?
   • Without digging into the guts of indexer!
 Idea: format should be pluggable



                          4
Codecs: Example use case
 Accelerate near-realtime reopen
 Speed up delete by term on ID field
   • “Primary key lookup”
 Idea: optimize ID field for this use case




                            5
Codecs: available formats
   Standard: Lucene 4.0 index format
   Pulsing: inline low frequency terms' postings
   Memory: loads field entirely into RAM
   Appending: supports append-only filesystem
   SimpleText: plaintext (not for production!!!!!)
   PreFlex: supports Lucene 3.x index format




                          6
Codecs: Configuration
Lucene: use CodecProvider class
Solr: specify in schema.xml




                          7
Codecs: Configuration




            8
Index DocValues: Introduction


 Problem: lookup a per-document value
  • RAM-resident for things like scoring factors (e.g. pagerank)
  • Disk-resident for things like document versioning
 Existing workarounds in Lucene have limitations
  • Scoring limited to one byte norm
  • FieldCache requires uninversion to build
  • Stored fields are slow, geared towards summary results
 Idea: add performant, flexible per-document lookups.




                                 9
Index DocValues: Limitations


 New feature, recently added to Lucene

✔ External scoring factors
✔ Memory-resident and disk-based lookup
✗ Docvalues sort-by-term
✗ Internal scoring factors (die norms, die)
✗ Integration with Solr



                         10
Index DocValues: Example




             11
Binary Terms


 In Lucene 4.0, index terms are byte[]
  • Do not need to be unicode text
 Example: Localized Sort and Range
  • Previous versions of Lucene: special encoder
  • Lucene 4.0: 50% space savings




                        12
Localized Sort/Range Example
Lucene: use CollationAnalyzer class
Solr: specify in schema.xml




                          13
Additional Index Statistics


 New statistics in Lucene 4.0:
  •   totalTermFreq(term): number of occurrences
  •   sumTotalTermFreq(field): number of tokens
  •   sumDocFreq(field): number of postings
  •   docCount(field): number of docs with value
 Available from Lucene APIs
 Available from function queries
 Supports additional scoring algorithms...



                            14
Search Improvements


 Improved Scoring API
  • Additional Scoring algorithms
 Spellchecking improvements
 Miscellaneous
  • Query parsing improvements
  • Deep paging support




                      15
Scoring: Introduction


 Problem: “baked-in” vector space model
 How to integrate additional algorithms?
  • Example: Language Models
  • Before 4.0: write custom Queries
  • Before 4.0: track certain statistics yourself
 How to customize for your data?
  • Without digging into the guts of postings lists!
 Idea: separate “matching” from “scoring”



                             16
Scoring: additional algorithms


      BM25
      Language Models
      Divergence from Randomness
      Information-based Models




                    17
Scoring: Configuration
Lucene: use Similarity class
Solr: specify in schema.xml




                          18
Spellchecking Improvements


 New DirectSpellChecker
  • No additional index needed
 Better Suggestions
  • Levenshtein vs. n-gram
 Exposes more configuration options
  • Tune to your collection




                       19
Spellchecking: Configuration
Lucene: use DirectSpellChecker class
Solr: specify in solrconfig.xml




                          20
Queryparsing Improvements

 Regular expression queries
  • /mycompany.(com|org|net)/
 Specify number of edits for fuzzy
  • foobar~2
 Wildcard escaping
  • crazy?*
 Range query syntax improvements
  • Mix inclusive/exclusive bounds
  • Support open-ended syntax in Lucene


                          21
Deep-paging support

 Problem: users who page deep into results :)
 Normal paging in lucene:
  • Page 1: search(“foo”, 20) ← look at 1-20
  • Page 2: search(“foo”, 40) ← look at 21-40
 Deep paging in lucene:
  • Page 1: searchAfter(null, “foo”, 20)
  • Page 2: searchAfter(lastResult, “foo”, 20)




                           22
Performance Improvements



 Concurrent Flushing
 Fast Fuzzy Query
 Improved RAM Efficiency




                   23
Concurrent Flushing




http://people.apache.org/~mikemccand/lucenebench/
Fast Fuzzy Query




In Lucene 3 this thing is < 1 QPS!
Improved RAM Efficiency
 Lucene 4.0 uses much less memory
  • Terms Index, Suggester, Synonyms : finite-state
  • Fieldcache: packed integers / utf-8
 Example with Wikipedia collection:
  “Memory footprint reduction from 389M to 90M
  after some off-the wall sorting and faceting”
Future Improvements
 Block Index Compression
  • PFOR-delta, Simple8b, …
 Positional iterators from Scorers
  • Offsets in postings lists (fast highlighting)
  • Proximity Scoring
 Structured/Section Scoring (e.g. BM25F)
 Improved flexibility through Codec
  • stored fields, term vectors, …
 Faster filtered search


                           27
Conclusion
 Lucene 4.0 will have many improvements,
  architectural, and API changes.
 A few of these were introduced here:
  • Ability to customize the index format
  • Additional scoring algorithms
  • Performance Improvements
 More are currently under development.
 For more information, look at CHANGES.txt in
  subversion, JIRA, ...


                          28
More Information

 Lucene Website
  • http://lucene.apache.org
 Users List
  • java-user@lucene.apache.org
 Lucene Revolution presentations
  •   http://www.lucidimagination.com/devzone/events/conferences/revolution/2010
  •   http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

 Contact Info
  •   robert.muir@lucidimagination.com
  •   http://www.lucidimagination.com/blog/author/robert-muir



                                         29

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Adrien Grand
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsLucidworks
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterLucidworks
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 

Was ist angesagt? (20)

Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, Twitter
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 

Ähnlich wie Improved Search with Lucene 4.0 - Robert Muir

Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmaplucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road maplucenerevolution
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Pharo 10 and beyond
 Pharo 10 and beyond Pharo 10 and beyond
Pharo 10 and beyondESUG
 
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull lucenerevolution
 
VA Smalltalk Update
VA Smalltalk UpdateVA Smalltalk Update
VA Smalltalk UpdateESUG
 
AD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension LibraryAD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension Librarypaidi_ed
 
Budapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User GroupBudapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User GroupMarc Schwering
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationSonia Haiduc
 
MongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewMongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewNorberto Leite
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced editionAlexander Tokarev
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
SplunkLive London 2014 Developer Presentation
SplunkLive London 2014  Developer PresentationSplunkLive London 2014  Developer Presentation
SplunkLive London 2014 Developer PresentationDamien Dallimore
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2MongoDB
 
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JSCross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JSThomas Daly
 

Ähnlich wie Improved Search with Lucene 4.0 - Robert Muir (20)

Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Pharo 10 and beyond
 Pharo 10 and beyond Pharo 10 and beyond
Pharo 10 and beyond
 
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
 
VA Smalltalk Update
VA Smalltalk UpdateVA Smalltalk Update
VA Smalltalk Update
 
AD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension LibraryAD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension Library
 
Budapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User GroupBudapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User Group
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept Location
 
MongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewMongoDB 3.2 Feature Preview
MongoDB 3.2 Feature Preview
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced edition
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
SplunkLive London 2014 Developer Presentation
SplunkLive London 2014  Developer PresentationSplunkLive London 2014  Developer Presentation
SplunkLive London 2014 Developer Presentation
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2
 
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JSCross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
 

Mehr von lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...lucenerevolution
 

Mehr von lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 

Kürzlich hochgeladen

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Kürzlich hochgeladen (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Improved Search with Lucene 4.0 - Robert Muir

  • 1. Improved Search with Lucene 4.0 Robert Muir, Lucid Imagination robert.muir@lucidimagination.com, 10/19/2011
  • 2. Introductions  What: Examine improvements in 4.0  Who am I? • Lucene Committer & PMC Member • Lucid Imagination Employee  Many big changes coming!  Examine three general areas: • Indexing Improvements • Search Improvements • Performance Improvements  Future improvements 2
  • 3. Indexing Improvements  Codecs • Flexible index format  Index DocValues • Efficient per-document lookup values  Miscellaneous Improvements • Binary terms • Additional index statistics 3
  • 4. Codecs: Introduction  Problem: “one size fits all” index format  How to integrate future improvements? • Example: more efficient index compression • But need to support old format, too.  How to optimize for your data? • Without digging into the guts of indexer!  Idea: format should be pluggable 4
  • 5. Codecs: Example use case  Accelerate near-realtime reopen  Speed up delete by term on ID field • “Primary key lookup”  Idea: optimize ID field for this use case 5
  • 6. Codecs: available formats  Standard: Lucene 4.0 index format  Pulsing: inline low frequency terms' postings  Memory: loads field entirely into RAM  Appending: supports append-only filesystem  SimpleText: plaintext (not for production!!!!!)  PreFlex: supports Lucene 3.x index format 6
  • 7. Codecs: Configuration Lucene: use CodecProvider class Solr: specify in schema.xml 7
  • 9. Index DocValues: Introduction  Problem: lookup a per-document value • RAM-resident for things like scoring factors (e.g. pagerank) • Disk-resident for things like document versioning  Existing workarounds in Lucene have limitations • Scoring limited to one byte norm • FieldCache requires uninversion to build • Stored fields are slow, geared towards summary results  Idea: add performant, flexible per-document lookups. 9
  • 10. Index DocValues: Limitations  New feature, recently added to Lucene ✔ External scoring factors ✔ Memory-resident and disk-based lookup ✗ Docvalues sort-by-term ✗ Internal scoring factors (die norms, die) ✗ Integration with Solr 10
  • 12. Binary Terms  In Lucene 4.0, index terms are byte[] • Do not need to be unicode text  Example: Localized Sort and Range • Previous versions of Lucene: special encoder • Lucene 4.0: 50% space savings 12
  • 13. Localized Sort/Range Example Lucene: use CollationAnalyzer class Solr: specify in schema.xml 13
  • 14. Additional Index Statistics  New statistics in Lucene 4.0: • totalTermFreq(term): number of occurrences • sumTotalTermFreq(field): number of tokens • sumDocFreq(field): number of postings • docCount(field): number of docs with value  Available from Lucene APIs  Available from function queries  Supports additional scoring algorithms... 14
  • 15. Search Improvements  Improved Scoring API • Additional Scoring algorithms  Spellchecking improvements  Miscellaneous • Query parsing improvements • Deep paging support 15
  • 16. Scoring: Introduction  Problem: “baked-in” vector space model  How to integrate additional algorithms? • Example: Language Models • Before 4.0: write custom Queries • Before 4.0: track certain statistics yourself  How to customize for your data? • Without digging into the guts of postings lists!  Idea: separate “matching” from “scoring” 16
  • 17. Scoring: additional algorithms  BM25  Language Models  Divergence from Randomness  Information-based Models 17
  • 18. Scoring: Configuration Lucene: use Similarity class Solr: specify in schema.xml 18
  • 19. Spellchecking Improvements  New DirectSpellChecker • No additional index needed  Better Suggestions • Levenshtein vs. n-gram  Exposes more configuration options • Tune to your collection 19
  • 20. Spellchecking: Configuration Lucene: use DirectSpellChecker class Solr: specify in solrconfig.xml 20
  • 21. Queryparsing Improvements  Regular expression queries • /mycompany.(com|org|net)/  Specify number of edits for fuzzy • foobar~2  Wildcard escaping • crazy?*  Range query syntax improvements • Mix inclusive/exclusive bounds • Support open-ended syntax in Lucene 21
  • 22. Deep-paging support  Problem: users who page deep into results :)  Normal paging in lucene: • Page 1: search(“foo”, 20) ← look at 1-20 • Page 2: search(“foo”, 40) ← look at 21-40  Deep paging in lucene: • Page 1: searchAfter(null, “foo”, 20) • Page 2: searchAfter(lastResult, “foo”, 20) 22
  • 23. Performance Improvements  Concurrent Flushing  Fast Fuzzy Query  Improved RAM Efficiency 23
  • 25. Fast Fuzzy Query In Lucene 3 this thing is < 1 QPS!
  • 26. Improved RAM Efficiency  Lucene 4.0 uses much less memory • Terms Index, Suggester, Synonyms : finite-state • Fieldcache: packed integers / utf-8  Example with Wikipedia collection: “Memory footprint reduction from 389M to 90M after some off-the wall sorting and faceting”
  • 27. Future Improvements  Block Index Compression • PFOR-delta, Simple8b, …  Positional iterators from Scorers • Offsets in postings lists (fast highlighting) • Proximity Scoring  Structured/Section Scoring (e.g. BM25F)  Improved flexibility through Codec • stored fields, term vectors, …  Faster filtered search 27
  • 28. Conclusion  Lucene 4.0 will have many improvements, architectural, and API changes.  A few of these were introduced here: • Ability to customize the index format • Additional scoring algorithms • Performance Improvements  More are currently under development.  For more information, look at CHANGES.txt in subversion, JIRA, ... 28
  • 29. More Information  Lucene Website • http://lucene.apache.org  Users List • java-user@lucene.apache.org  Lucene Revolution presentations • http://www.lucidimagination.com/devzone/events/conferences/revolution/2010 • http://www.lucidimagination.com/devzone/events/conferences/revolution/2011  Contact Info • robert.muir@lucidimagination.com • http://www.lucidimagination.com/blog/author/robert-muir 29