See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
This presentation shows how to build a search engine that understands queries written in a way much closer to spoken language than to keyword-based search, using Apache Lucene/Solr and Apache UIMA. A system that can recognize semantics in natural language is very handy for non-expert users, e-learning systems, customer care systems, etc. With such a system it is possible to submit queries like "hotels near Rome" or "people working at Google" without having to manually transform a user-entered natural language query into a Lucene/Solr query.
1. Natural language search in Solr
Tommaso Teofili, Sourcesense
t.teofili@sourcesense.com, October 19th 2011
2. Agenda
An approach to natural language search in Solr
Main points
• Solr-UIMA integration module
• Custom Lucene analyzers for UIMA
• OSS NLP algorithms in Lucene/Solr
• Orchestrating blocks to build a sample system able to understand natural language queries
Results
3. My Background
Software engineer at Sourcesense
• Enterprise search consultant
Member of the Apache Software Foundation
• UIMA
• Clerezza
• Stanbol
• DirectMemory
• ...
7. The Challenge
Improved recall/precision
• ‘articles about science’ (concepts)
• ‘movies by K. Spacey’ vs ‘movies with K. Spacey’
Easier experience for non-expert users
• ‘people working at Google’ - ‘cities near London’
Horizontal domains (e.g. Google)
Vertical domains
9. Use Case
search engine for an online movie magazine
Solr based
non-technical users
time / cost
• Solr 3.x setup : 2 mins
• NLS setup / tweak : 5 days
expecting
• improved recall / precision
• more time (clicks) on site ($)
11. General approach
Natural language processing
Processing documents at indexing time
• document text analysis
• write enriched text in (dedicated) fields
• add custom types / payloads to terms
Processing queries at search time
• query analysis
• higher boosts to entities/concepts
• in-sentence search
• ...
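As a sketch of the query-time side above: entities or concepts extracted from the natural language query can be folded back into the generated Solr query with a higher boost. The field names (`text`, `entity`) and the boost factor below are illustrative assumptions, not the actual output of the module:

```java
import java.util.List;

// Hypothetical sketch: given the raw user query and the entities an NLP
// pipeline extracted from it, build a Solr query string that boosts the
// recognized entities. Field names and boost values are assumptions.
public class BoostedQueryBuilder {
    public static String build(String userQuery, List<String> entities) {
        StringBuilder q = new StringBuilder("text:(" + userQuery + ")");
        for (String e : entities) {
            // recognized entities/concepts get a higher boost
            q.append(" OR entity:\"").append(e).append("\"^2.0");
        }
        return q.toString();
    }
}
```

The point of the sketch is only the shape of the rewrite: the plain text clause stays, and each entity adds a boosted clause on a dedicated field.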
12. NLP
AI discipline
• Computers understanding and managing information written in human language
analyze text at various levels
incrementally enrich / give structure
extract concepts and named entities
13. Technical detail
NLP algorithms plugged via Apache UIMA
Indexing time
• UpdateProcessor plugin (solr/contrib/uima)
• Custom tokenizers/filters
Search time
• Custom QParserPlugin
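The indexing-time side is wired in solrconfig.xml as an update request processor chain; a minimal sketch of the solr/contrib/uima configuration (the analysis engine descriptor path, the analyzed field, and the mapped type/field names are illustrative):

```xml
<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <!-- UIMA Analysis Engine descriptor (illustrative path) -->
      <str name="analysisEngine">/path/to/AggregateAE.xml</str>
      <bool name="ignoreErrors">true</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields"><str>text</str></arr>
      </lst>
      <lst name="fieldMappings">
        <!-- map a UIMA annotation feature to a dedicated Solr field -->
        <lst name="type">
          <str name="name">org.apache.uima.SentenceAnnotation</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">sentence</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Each incoming document's `text` field then passes through the UIMA pipeline, and the resulting annotations are written into enriched fields such as `sentence`.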
14. Why Apache UIMA?
OASIS standard for Unstructured Information Management (UIM)
Apache TLP since March 2010
Deploy pipelines of Analysis Engines
AEs wrap NLP algorithms
Scaling capabilities
18. Lucene analysis & UIMA
Type: denotes a token’s lexical type
Payload: a byte array stored at each term position
tokenize / filter tokens covered by a certain annotation type
store UIMA annotations’ features in types / payloads
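Since a payload is just a byte array per term position, one way to carry a UIMA annotation type along with a token is to encode the type name to bytes at analysis time and decode it back at search time. This encoding scheme is a minimal sketch, not the actual format used by the module:

```java
import java.nio.charset.StandardCharsets;

// Sketch: encode a UIMA annotation type name into a Lucene payload
// (a plain byte array) and decode it back. UTF-8 encoding here is an
// assumption chosen for illustration.
public class PayloadCodec {
    public static byte[] encode(String annotationType) {
        return annotationType.getBytes(StandardCharsets.UTF_8);
    }

    public static String decode(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

In a real analyzer the encoded bytes would be set on the token's PayloadAttribute inside a custom TokenFilter, so that a search-time component can recover the annotation type per position.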
22. Performance
basic (in memory)
• slower with NRT indexing
• search could be significantly impacted
ReST (SimpleServer)
• faster
• need to explicitly digest results
UIMA-AS
• fast also with NRT indexing
• fast search
• scales nicely with lots of data
24. Wrap up
general purpose architecture
generally improved recall / precision
NLP algorithm accuracy makes the difference
lots of OSS alternatives
performance can be kept good