Finite-State Queries in Lucene
             Robert Muir
          rmuir@apache.org
Agenda
• Introduction to Lucene
• Improving inexact matching:
  – Background
  – Regular Expression, Wildcard, Fuzzy Queries
• Additional use cases:
  – Language support: expansion versus
    stemming
  – Improved spellchecking
• Other ongoing developments in Lucene
Introduction to Lucene
• Open Source Search Engine Library
  – Not just Java, ported to other languages too.
  – Commercial support via several companies
• Just the library
  – Embed for your own uses, e.g. Eclipse
  – For a search server, see Solr
  – For web search + crawler, see Nutch
• Website: http://lucene.apache.org
Inverted Index
• Like a TreeMap<Terms, Documents>
• Before 2.9, queries only operated on one
  “subtree”.
  – For example, Regex and Fuzzy exhaustively
    evaluate all terms, unless you give them a
    “constant prefix”.
• TermQuery is just a special case, looks at
  one leaf node.
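A minimal Python sketch of the sorted-map view described on this slide. It is not Lucene's API: the toy postings, the function names, and the use of a sorted list with `bisect` in place of a `TreeMap` are all assumptions made for illustration.

```python
from bisect import bisect_left

# Toy inverted index: term -> sorted list of document ids (made-up data).
postings = {
    "foo": [1, 4],
    "foobar": [2],
    "football": [3, 4],
    "ftp": [5],
    "http": [1, 5],
}
terms = sorted(postings)  # the sorted term dictionary


def term_query(term):
    """Look at exactly one entry (one "leaf") of the dictionary."""
    return postings.get(term, [])


def prefix_query(prefix):
    """Visit only the contiguous run of terms sharing the prefix (one "subtree")."""
    docs = set()
    i = bisect_left(terms, prefix)  # jump to the first term >= prefix
    while i < len(terms) and terms[i].startswith(prefix):
        docs.update(postings[terms[i]])
        i += 1
    return sorted(docs)


def exhaustive_query(predicate):
    """Without a constant prefix, every term in the dictionary is tested."""
    docs, examined = set(), 0
    for term in terms:
        examined += 1
        if predicate(term):
            docs.update(postings[term])
    return sorted(docs), examined
```

Here `prefix_query("foo")` touches three adjacent terms, while `exhaustive_query(lambda t: t.endswith("bar"))` has to test all five.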
Lucene 2.9: Fast Numeric Ranges
• Indexes at different levels of precision.
• Enumerates multiple subtrees.
  – But typically this is a small number: e.g. 15
• Query APIs improved to support this.
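[Figure: trie of indexed terms at three levels of precision.
 Level 1: 4, 5, 6. Level 2: 42, 44, 52, 63, 64.
 Level 3: 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644.]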
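The idea of indexing at several precisions and enumerating a few subtrees can be sketched in Python with the three-digit values shown on this slide. The decimal prefixes, the `split_range` name, and the recursive walk are illustrative assumptions, not how Lucene encodes numeric terms internally.

```python
values = [421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]
DIGITS = 3

# Index every value at all three precisions: "4", "42", "421", ...
indexed_terms = {str(v)[:n] for v in values for n in range(1, DIGITS + 1)}


def split_range(lo, hi):
    """Cover [lo, hi] with as few indexed prefix terms as possible."""
    matched = []

    def visit(prefix):
        if prefix and prefix not in indexed_terms:
            return  # no indexed value lives below this node
        first = int(prefix.ljust(DIGITS, "0"))  # smallest value under prefix
        last = int(prefix.ljust(DIGITS, "9"))   # largest value under prefix
        if last < lo or first > hi:
            return  # subtree lies entirely outside the range
        if prefix and lo <= first and last <= hi:
            matched.append(prefix)  # whole subtree matches: one term suffices
            return
        for digit in "0123456789":  # partial overlap: descend one level
            visit(prefix + digit)

    visit("")
    return matched


print(split_range(423, 642))
```

For the range 423 to 642 this visits six terms (`423`, `44`, `5`, `63`, `641`, `642`) instead of the eleven full-precision values that fall inside the range.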
Automaton Queries
• Only explore subtrees that can lead to an
  accept state of some finite state machine.
• AutomatonQuery traverses the term
  dictionary and the state machine in parallel
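The parallel traversal can be sketched in Python as a walk over a trie of terms that follows a DFA at the same time and abandons any branch with no matching transition. The trie layout, the hand-built DFA for the pattern `b[ae]t`, and the function names are assumptions for illustration; Lucene's term dictionary is not a plain in-memory trie.

```python
def build_trie(terms):
    """Nested dicts keyed by character; the key "" marks the end of a term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def intersect(trie, transitions, start, accept):
    """Walk trie and DFA together; return (matching terms, trie nodes visited)."""
    matches, visited = [], 0
    stack = [(trie, start, "")]
    while stack:
        node, state, prefix = stack.pop()
        visited += 1
        if "" in node and state in accept:
            matches.append(prefix)
        for ch, child in node.items():
            if ch == "":
                continue
            nxt = transitions.get(state, {}).get(ch)
            if nxt is not None:  # a missing transition prunes the whole subtree
                stack.append((child, nxt, prefix + ch))
    return sorted(matches), visited


# Hand-built DFA for b[ae]t: state 0 -b-> 1 -a|e-> 2 -t-> 3 (accepting).
transitions = {0: {"b": 1}, 1: {"a": 2, "e": 2}, 2: {"t": 3}}
trie = build_trie(["apple", "bat", "bath", "bet", "bit", "cat"])
matches, visited = intersect(trie, transitions, start=0, accept={3})
print(matches, visited)
```

The subtrees under `a`, `c`, `bi`, and `bath` are never entered, which is the pruning the slide describes.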
Another way to think of it
• Index as a state machine that recognizes Terms
  and transduces matching Documents.
• AutomatonQuery represents a user’s search
  need as a FSM.
• The intersection of the two emits search results.
Query API improvements
• Automata might need to do many seeks
  around the term dictionary.
  – Depends on what is in term dictionary
  – Depends on state machine structure
• MultiTermQuery API further improved
  – Easier and more efficient to skip around.
  – Explicitly supports seeking.
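Why seeking matters can be illustrated with a small Python sketch that leapfrogs between a sorted term dictionary and the strings a query accepts. For simplicity it assumes the query accepts a finite, sorted list of strings; a real automaton would compute the next candidate on the fly. The names and data are hypothetical.

```python
from bisect import bisect_left


def leapfrog(terms, accepted):
    """Intersect a sorted term dictionary with a sorted finite language.

    Returns (matches, number of seeks into the term dictionary).
    """
    matches, seeks = [], 0
    t = a = 0
    while a < len(accepted):
        t = bisect_left(terms, accepted[a], t)  # seek to first term >= candidate
        seeks += 1
        if t == len(terms):
            break  # dictionary exhausted
        if terms[t] == accepted[a]:
            matches.append(terms[t])
            a += 1
        else:
            # The dictionary is already past this candidate, so let the
            # query side skip ahead to its first string >= the current term.
            a = bisect_left(accepted, terms[t], a)
    return matches, seeks


terms = sorted([
    "ftp://bar.org", "ftp://foo.com", "gopher://foo.com",
    "http://foo.com", "http://foo.org", "https://foo.com",
])
print(leapfrog(terms, ["ftp://foo.com", "http://foo.com"]))
```

With two accepted strings, the dictionary is consulted twice regardless of how many terms it holds; how many seeks a real query needs depends on both the dictionary contents and the structure of the state machine.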
Regex, Wildcard, Fuzzy
• Without constant prefix, exhaustive
  – Regex: (http|ftp)://foo.com
  – Wildcard: ?oo?ar
  – Fuzzy: foobar~
• Re-implemented as automata queries
  – Just parsers that produce a DFA
  – Improved performance and scalability
  – (http|ftp)://foo.com examines 2 terms.
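To show what "just parsers that produce a DFA" can mean, here is a small Python sketch that compiles a wildcard pattern made of literals and `?` into a chain-shaped DFA. It deliberately leaves out `*`, and the representation (a list of transition dicts with an `ANY` label) is an assumption of this sketch rather than Lucene's automaton format.

```python
ANY = object()  # label for a transition taken on any single character


def wildcard_to_dfa(pattern):
    """Compile literals and '?' into a DFA.

    State i means "i characters matched"; state len(pattern) accepts.
    """
    transitions = []
    for ch in pattern:
        label = ANY if ch == "?" else ch
        transitions.append({label: len(transitions) + 1})
    return transitions, len(pattern)


def accepts(dfa, term):
    """Run the DFA over a term; any missing transition rejects."""
    transitions, accept = dfa
    state = 0
    for ch in term:
        if state == accept:
            return False  # term is longer than the pattern allows
        edges = transitions[state]
        nxt = edges.get(ch, edges.get(ANY))
        if nxt is None:
            return False
        state = nxt
    return state == accept


dfa = wildcard_to_dfa("?oo?ar")
print([t for t in ["foobar", "zoocar", "foobars", "fxobar"] if accepts(dfa, t)])
```

Once the pattern is a DFA, the same term-dictionary traversal can serve wildcard, regex, and fuzzy queries alike; only the parser differs.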
Additional/Advanced Use Cases
Stemming
• Stemmers work at index and query time
  – walked, walking -> walk
  – Can increase retrieval effectiveness
• Some problems
  – Mistakes: international -> intern
  – Must determine language of documents
  – Multilingual cases can get messy
  – Tuning is difficult: must re-index
  – Unfriendly: wildcards on stemmed terms…
Expansion instead
• Don’t remove data at index time
  – Expand the query instead.
  – Single field now works well for all queries:
    exact match, wildcard, expanded, etc.
• Simplifies search configuration
  – Tuning relevance is easier, no re-indexing.
  – No need to worry about language ID for docs.
  – Multilingual case is much simpler.
Automata expansion
• Natural fit for morphology
• Use set operators on automata
  – Minus to subtract exact match case
  – Union to search multiple languages
• Efficient operation
  – Doesn’t explode for languages with complex
    morphology
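The minus and union operations can be sketched in Python with the textbook product construction on two DFAs. The word lists stand in for the output of a morphological expansion and are hand-written assumptions, as are all function names; `None` marks a dead component inside a product state.

```python
def from_words(words):
    """Trie-shaped DFA for a finite word set; each state is a prefix string."""
    trans = {}
    for w in words:
        for i in range(len(w)):
            trans.setdefault(w[:i], {})[w[i]] = w[:i + 1]
    return trans, "", set(words)


def product(a, b, keep):
    """Product construction; keep(in_a, in_b) decides which state pairs accept."""
    (ta, sa, fa), (tb, sb, fb) = a, b
    start = (sa, sb)
    trans, accept, todo, seen = {}, set(), [start], {start}
    while todo:
        state = todo.pop()
        pa, pb = state
        if keep(pa in fa, pb in fb):
            accept.add(state)
        for ch in set(ta.get(pa, {})) | set(tb.get(pb, {})):
            nxt = (ta.get(pa, {}).get(ch), tb.get(pb, {}).get(ch))
            trans.setdefault(state, {})[ch] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return trans, start, accept


def accepts(dfa, text):
    trans, state, accept = dfa
    for ch in text:
        state = trans.get(state, {}).get(ch)
        if state is None:
            return False
    return state in accept


english = from_words({"walk", "walks", "walked", "walking"})  # expansion of "walk"
german = from_words({"gehen", "geht", "ging"})                # hypothetical second language
exact = from_words({"walk"})

variants_only = product(english, exact, lambda x, y: x and not y)  # minus
both_languages = product(english, german, lambda x, y: x or y)     # union
print(accepts(variants_only, "walked"), accepts(variants_only, "walk"))
```

Because the result is again a single automaton, it can be run against the term dictionary once, rather than issuing one query per expanded form.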
Experimental results
• 125k docs English test collection
   • Results are for TD queries
• Inverted the “S-Stemmer”
   • 6 declarative rewrite rules to regex
• Competitive with traditional stemming.
               No Stemming   Porter    S-Stem    Automaton S-Stem
      MAP        0.4575      0.5069    0.5029        0.4979
      MRR        0.8070      0.7862    0.7587        0.8220
      # Terms    336,675     280,061   305,710       336,675
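For readers unfamiliar with the two metrics in the table, here is a short Python sketch of their standard definitions: MAP is the mean over queries of average precision, and MRR is the mean over queries of the reciprocal rank of the first relevant result. The rankings and relevance judgments below are made up and unrelated to the experiment reported on the slide.

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranking, relevant):
    """Mean of precision@k taken at each rank k that holds a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean(values):
    return sum(values) / len(values)


# Two toy queries: (ranked result list, set of relevant documents).
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9"], {"d5"}),
]
map_score = mean([average_precision(r, rel) for r, rel in runs])
mrr_score = mean([reciprocal_rank(r, rel) for r, rel in runs])
print(map_score, mrr_score)
```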
TODO:
• Support expansion models in Lucene, too.
• Language-specific resources
  – lucene-hunspell could provide these
• Language-independent tokenization
  – Unicode rules go a long way.
• Scoring that doesn’t need stopwords
  – For now, use CommonGrams!
Spellchecking
• Lucene spellchecker builds a separate
  index to find correction candidates
• Perhaps our fuzzy enumeration is now fast
  enough for small edit distances (e.g. 1,2)
  to just use the index directly.
• Could simplify configurations, especially
  distributed ones.
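Finding correction candidates directly from the term dictionary can be sketched in Python as a trie walk that carries one row of the Levenshtein table per node and prunes a branch once every entry in the row exceeds the edit budget. This is a well-known trie technique used here as a stand-in; it is not the automaton construction Lucene uses, and the term list is invented.

```python
def build_trie(terms):
    """Nested dicts keyed by character; the key "" marks the end of a term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def fuzzy_candidates(trie, word, max_edits):
    """Return (term, distance) pairs within max_edits, closest first."""
    results = []

    def walk(node, prefix, prev_row):
        if "" in node and prev_row[-1] <= max_edits:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch == "":
                continue
            row = [prev_row[0] + 1]  # cost of deleting the whole prefix
            for i in range(1, len(word) + 1):
                cost = 0 if word[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,          # insertion
                               prev_row[i] + 1,         # deletion
                               prev_row[i - 1] + cost)) # match or substitution
            if min(row) <= max_edits:  # otherwise no extension can recover
                walk(child, prefix + ch, row)

    walk(trie, "", list(range(len(word) + 1)))
    return sorted(results, key=lambda r: (r[1], r[0]))


index_terms = build_trie(["lucene", "lucent", "license", "lucid", "nutch", "solr"])
print(fuzzy_candidates(index_terms, "lucen", 2))
```

Since the candidates come straight from the main index, no separate spellcheck index has to be built or kept in sync in this sketch.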
Ongoing developments in Lucene
Community
• Merging Lucene and Solr development
  – Still two separate released “products”!!!
  – Share mailing list and code repository
  – Solr dev code in sync with Lucene dev code
• Benefits to both Lucene and Solr users
  – Lucene features exposed to Solr faster
  – Solr features available to Lucene users
Indexing
• Flexible Indexing
  – Customize the format of the index
  – Decreased RAM usage
  – Faster IndexReader open and faster seeking
• Future
  – Serialize custom attributes to the index
  – More RAM savings
  – Improved index compression
  – Faster Near-Real-Time performance
Text Analysis
• Improved Unicode Support
  – Unicode 4/Supplementary support
  – ICU integration (to support Unicode 5.2)
• Improved Language Support
  – More languages
  – Faster indexing performance
  – Easier packaging and ease-of-use
• Improvements to Analysis API
Relevance and Scoring
• Improved relevance benchmarking
  – Open Relevance Project
  – Quality Benchmarking Package
• Future: More flexible scoring
  – How to support BM25, DFR, more vector
    space?
  – Some packages/patches available, but
     • Additional index statistics are needed to keep them simple
Other
• Improvements to Spatial support
  – See Chris Male’s talk here:
    http://vimeo.com/10204365
  – Progress in both Lucene and Solr
• Steps towards a more modular
  architecture
  – Smaller Lucene core
  – Separate modules with more functionality
Questions?
Backup slides