SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
Natural language search in Solr
             Tommaso Teofili, Sourcesense
   t.teofili@sourcesense.com, October 19th 2011
Agenda
 An approach to natural language search in
  Solr
 Main points
  •   Solr-UIMA integration module
  •   Custom Lucene analyzers for UIMA
  •   OSS NLP algorithms in Lucene/Solr
  •   Orchestrating blocks to build a sample
      system able to understand natural language
      queries
 Results
My Background
 Software engineer at Sourcesense
  • Enterprise search consultant
 Member of the Apache Software Foundation
  •   UIMA
  •   Clerezza
  •   Stanbol
  •   DirectMemory
  •   ...
Google in ‘99
Google today
Google today
The Challenge
 Improved recall/precision
  • ‘articles about science’ (concepts)
  • ‘movies by K. Spacey’ vs ‘movies with K. Spacey’
 Easier experience for non-expert users
  • ‘people working at Google’ - ‘cities near London’
 Horizontal domains (e.g. Google)
 Vertical domains
Hurdles
 understanding documents’ text/user queries
 extract domain-specific/wide entities and
  concepts
 index/search performance
Use Case
   search engine for an online movies magazine
   Solr based
   non technical users
   time / cost
    • Solr 3.x setup : 2 mins
    • NLS setup / tweak : 5 days
 expecting
    • improved recall / precision
    • more time (clicks) on site ($)
Online movies magazine
General approach
 Natural language processing
 Processing documents at indexing time
  • document text analysis
  • write enriched text in (dedicated) fields
  • add custom types / payloads to terms
 Processing queries at searching time
  •   query analysis
  •   higher boosts to entities/concepts
  •   in-sentence search
  •   ...
NLP
 AI discipline
   • Computers understanding and managing
     information written in human language
 analyze text at various levels
 incrementally enrich / give structure
 extract concepts and named entities
Technical detail
 NLP algorithms plugged via Apache UIMA
 Indexing time
  • UpdateProcessor plugin (solr/contrib/uima)
  • Custom tokenizers/filters
 Search time
  • Custom QParserPlugin
Why Apache UIMA?
   OASIS standard for UIM
   TLP since March 2010
   Deploy pipelines of Analysis Engines
   AEs wrap NLP algorithms
   Scaling capabilities
NLP and OSS
 Sentence Split
   • OpenNLP, UIMA Addons, StanfordNLP
 PoS tagging
   • OpenNLP, UIMA Addons, StanfordNLP
 Chunking/Parsing
   • OpenNLP, StanfordNLP
 NER
   • OpenNLP, UIMA Addons, Stanbol, StanfordNLP
 Clustering/Classifying
   • Mahout, OpenNLP, StanfordNLP
 ...
Solr NLS architecture
UIMA Update Processor
Lucene analysis & UIMA
 Type : denote lexical types for tokens
 Payload : a byte array stored at each term
  position
 tokenize / filter tokens covered by a certain
  annotation type
 store UIMA annotations’ features in types /
  payloads
UIMA type-aware tokenizer
Solr NLS QParser
 analyze user query
 extract (and query on) concepts / entities
 use types/PoS in the query for
  • boosting terms
  • synonim expansion
 search within sentences
 faceting / clustering using entities
 identify ‘place queries’ and expand Solr spatial
  queries (for filtering / boosting)
Scaling architecture
Performance
 basic (in memory)
  • slower with NRT indexing
  • search could be significantly impacted
 ReST (SimpleServer)
  • faster
  • need to explictly digest results
 UIMA-AS
  • fast also with NRT indexing
  • fast search
  • scales nicely with lots of data
DisMax vs NLS
Wrap up
   general purpose architecture
   generally improved recall / precision
   NLP algorithms accuracy make the difference
   lots of OSS alternatives
   performances can be kept good
Sources
 Resources
  • http://svn.apache.org/repos/asf/lucene/dev/trunk/
    solr/contrib/uima/
  • https://github.com/tteofili/le11-nls
 Links
  • http://wiki.apache.org/solr/SolrUIMA
  • http://googleblog.blogspot.com/2010/01/helping-
    computers-understand-language.html
Thanks
 http://www.sourcesense.com

 t.teofili@sourcesense.com

 @tteofili

Weitere ähnliche Inhalte

Mehr von lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution
 

Mehr von lucenerevolution (20)

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucene
 
10 keys to Solr's Future
10 keys to Solr's Future10 keys to Solr's Future
10 keys to Solr's Future
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
The Typed Index
The Typed IndexThe Typed Index
The Typed Index
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 

Natural Language Search in Solr - Tommaso Teofili

  • 1. Natural language search in Solr Tommaso Teofili, Sourcesense t.teofili@sourcesense.com, October 19th 2011
  • 2. Agenda  An approach to natural language search in Solr  Main points • Solr-UIMA integration module • Custom Lucene analyzers for UIMA • OSS NLP algorithms in Lucene/Solr • Orchestrating blocks to build a sample system able to understand natural language queries  Results
  • 3. My Background  Software engineer at Sourcesense • Enterprise search consultant  Member of the Apache Software Foundation • UIMA • Clerezza • Stanbol • DirectMemory • ...
  • 7. The Challenge  Improved recall/precision • ‘articles about science’ (concepts) • ‘movies by K. Spacey’ vs ‘movies with K. Spacey’  Easier experience for non-expert users • ‘people working at Google’ - ‘cities near London’  Horizontal domains (e.g. Google)  Vertical domains
  • 8. Hurdles  understanding documents’ text/user queries  extract domain-specific/wide entities and concepts  index/search performance
  • 9. Use Case  search engine for an online movies magazine  Solr based  non technical users  time / cost • Solr 3.x setup : 2 mins • NLS setup / tweak : 5 days  expecting • improved recall / precision • more time (clicks) on site ($)
  • 11. General approach  Natural language processing  Processing documents at indexing time • document text analysis • write enriched text in (dedicated) fields • add custom types / payloads to terms  Processing queries at searching time • query analysis • higher boosts to entities/concepts • in-sentence search • ...
  • 12. NLP  AI discipline • Computers understanding and managing information written in human language  analyze text at various levels  incrementally enrich / give structure  extract concepts and named entities
  • 13. Technical detail  NLP algorithms plugged via Apache UIMA  Indexing time • UpdateProcessor plugin (solr/contrib/uima) • Custom tokenizers/filters  Search time • Custom QParserPlugin
  • 14. Why Apache UIMA?  OASIS standard for UIM  TLP since March 2010  Deploy pipelines of Analysis Engines  AEs wrap NLP algorithms  Scaling capabilities
  • 15. NLP and OSS  Sentence Split • OpenNLP, UIMA Addons, StanfordNLP  PoS tagging • OpenNLP, UIMA Addons, StanfordNLP  Chunking/Parsing • OpenNLP, StanfordNLP  NER • OpenNLP, UIMA Addons, Stanbol, StanfordNLP  Clustering/Classifying • Mahout, OpenNLP, StanfordNLP  ...
  • 18. Lucene analysis & UIMA  Type : denote lexical types for tokens  Payload : a byte array stored at each term position  tokenize / filter tokens covered by a certain annotation type  store UIMA annotations’ features in types / payloads
  • 20. Solr NLS QParser  analyze user query  extract (and query on) concepts / entities  use types/PoS in the query for • boosting terms • synonim expansion  search within sentences  faceting / clustering using entities  identify ‘place queries’ and expand Solr spatial queries (for filtering / boosting)
  • 22. Performance  basic (in memory) • slower with NRT indexing • search could be significantly impacted  ReST (SimpleServer) • faster • need to explictly digest results  UIMA-AS • fast also with NRT indexing • fast search • scales nicely with lots of data
  • 24. Wrap up  general purpose architecture  generally improved recall / precision  NLP algorithms accuracy make the difference  lots of OSS alternatives  performances can be kept good
  • 25. Sources  Resources • http://svn.apache.org/repos/asf/lucene/dev/trunk/ solr/contrib/uima/ • https://github.com/tteofili/le11-nls  Links • http://wiki.apache.org/solr/SolrUIMA • http://googleblog.blogspot.com/2010/01/helping- computers-understand-language.html