The Once and Future History of Enterprise Search and Open Source
Marc Krellenstein, CTO
marc@lucidimagination.com

Evolving challenges in full text search
  • Finding something in a lot of content (recall, scalability)
      - IBM/STAIRS vs. US gov/Basis, BRS, Dialog, Verity
      - Lycos and Fast, Infoseek, Excite
      - AltaVista
      - Centralized search → distributed search (Fast, Google, Lucene/Solr)
  • Finding just the good stuff (precision)
      - SMART, Autonomy, Google, Lucene/Solr, authority scores, browsing/clustering/faceting, …
  • Finding it fast (performance)
      - Fast, Google, Lucene/Solr
  • Making it easy (simplicity)
      - Google, Lucene/Solr*
  • Deploying good search everywhere (all of the above plus price, flexibility)
      - Lucene/Solr

Google
  • Breakthrough in precision of Internet search
      - Popularity algorithm hides the bad stuff
      - Proved importance of understanding data & users
      - Set expectations for accuracy of enterprise search
  • Set a new standard for search performance
      - Sub-second (or near)
  • Proved value of good adaptive spell-checking
  • Demonstrated the power of distributed search for scale
  • Reinforced the importance of simplicity and a single search box
  • Proved the value of search
      - Search needs to be everywhere

But Google is not like most enterprise search applications
  • Google
      - Most data is bad, many good-enough answers…task is to screen out the bad
      - Many privacy issues among users
      - No security issues
      - Many naïve users with little patience…speed is important
  • Enterprise search
      - Most or all the data may be good, often only one answer to a search need
      - Many security issues
      - Few or no privacy issues between users
      - Naïve and sophisticated users motivated by an organizational purpose
  • The best enterprise search tools will fit enterprise needs

Best practice recall and precision
  • Recall
      - Percent of relevant documents (items) returned
      - 50 good answers in system, 25 returned = 50% recall
  • Precision
      - Percent of documents returned that are relevant
      - 100 returned, 25 are relevant = 25% precision
  • Ideal is 100% recall and 100% precision: return all relevant documents and only those
  • 100% recall is easy: return all documents…but precision is then so low that the relevant ones can’t be found. Precision is harder.
  • Need adequate recall & enough precision for the task
      - That will vary by application (data & users)…
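
The two definitions above reduce to simple ratios. The following is a minimal sketch of the slide's arithmetic; the function names and the 10,000-document corpus used for the "return all documents" case are illustrative assumptions, not figures from the talk.

```python
def recall(relevant_returned: int, relevant_total: int) -> float:
    """Fraction of all relevant documents in the system that were returned."""
    return relevant_returned / relevant_total


def precision(relevant_returned: int, returned_total: int) -> float:
    """Fraction of the returned documents that are relevant."""
    return relevant_returned / returned_total


# The slide's two examples.
slide_recall = recall(relevant_returned=25, relevant_total=50)
slide_precision = precision(relevant_returned=25, returned_total=100)

# "100% recall is easy": return every document in a hypothetical
# 10,000-document corpus that contains 50 relevant items.
all_docs_recall = recall(relevant_returned=50, relevant_total=50)
all_docs_precision = precision(relevant_returned=50, returned_total=10_000)

print(f"slide example: recall={slide_recall:.0%}, precision={slide_precision:.0%}")
print(f"return everything: recall={all_docs_recall:.0%}, precision={all_docs_precision:.1%}")
```

Returning everything reaches full recall, but under these assumed numbers only one result in two hundred is relevant, which is the sense in which the relevant items "can't be found".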

How to get good recall
  • Collect, index and search all the data
      - Check for missing or corrupt data
      - Index everything: removing stop words is not usually needed today
      - Search everything…limit results by category AFTER the search (clustering/faceting)
  • Normalize the data
      - Convert to lower case, strip/handle special characters, stemming, …
  • Use spell-checking, synonyms to match users’ vocabulary with content
      - Adaptive spell-checking, application-specific synonyms
  • Light (or real) natural language processing for abstract concepts
      - ‘Recent documents on Asia’
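
The normalization and synonym steps can be sketched as a small analysis function applied to both documents and queries, so that the two sides meet on the same terms. This is an illustration only: the suffix-stripping "stemmer" is deliberately naive (a real system would use a proper stemmer such as Porter's), and the synonym table is a hypothetical application-specific example.

```python
import re
import unicodedata

# Hypothetical application-specific synonyms: stemmed term -> canonical term.
SYNONYMS = {"laptop": "notebook"}


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def light_stem(token: str) -> str:
    """Naive suffix stripping; keeps at least three characters of stem."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lower-case, strip special characters, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", strip_accents(text).lower())
    stems = [light_stem(token) for token in tokens]
    return [SYNONYMS.get(stem, stem) for stem in stems]


print(normalize("Searching the Café's indexes"))
print(normalize("Laptops"), normalize("notebook"))
```

Because the same function runs at index time and at query time, a query for "search index" shares the terms `search` and `index` with the first example, and "Laptops" and "notebook" normalize to the same term.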

How to get good precision
  • Term frequency (TF): more occurrences of query terms is better
  • Inverse document frequency (IDF): rarer query terms are more important
  • Phrase boost: query terms near each other is better
  • Field boost: where the query term is in the doc matters (e.g., in ‘title’ is better)
  • Length normalization: avoid penalizing short docs
  • Recency: all things being equal, recent is better
  • Authority: items linked to, clicked on or bought by others may be better
  • Implicit and explicit relevance feedback, more-like-this: expand query (queries usually underdetermined…intent??)
  • Clustering/faceting: when above fail or intent is not specific
  • Lots of data…Watson, Google Translate
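
Four of the signals above (TF, IDF, field boost and length normalization) can be combined in a toy scorer. This is a simplified TF-IDF sketch for illustration, not Lucene's actual scoring formula; the field boosts, the square-root damping and the three-document corpus are all assumptions.

```python
import math
from collections import Counter

# Hypothetical field boosts: a match in the title counts double.
FIELD_BOOSTS = {"title": 2.0, "body": 1.0}


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def score(query: str, doc: dict, corpus: list[dict]) -> float:
    """Sum of boost * tf * idf * length_norm over fields and query terms."""
    total = 0.0
    for field, boost in FIELD_BOOSTS.items():
        tokens = tokenize(doc.get(field, ""))
        if not tokens:
            continue
        counts = Counter(tokens)
        # Length normalization: one match in a short field outweighs
        # one match in a long field.
        length_norm = 1.0 / math.sqrt(len(tokens))
        for term in set(tokenize(query)):
            tf = math.sqrt(counts[term])  # more occurrences score higher
            df = sum(1 for d in corpus if term in tokenize(d.get(field, "")))
            idf = 1.0 + math.log(len(corpus) / (df + 1))  # rarer terms score higher
            total += boost * tf * idf * length_norm
    return total


corpus = [
    {"title": "solr faceting", "body": "faceting in solr lets users narrow results"},
    {"title": "release notes", "body": "this release mentions solr once among many other unrelated topics and changes"},
    {"title": "cooking", "body": "a recipe for bread"},
]
ranked = sorted(corpus, key=lambda d: score("solr", d, corpus), reverse=True)
for doc in ranked:
    print(f"{score('solr', doc, corpus):.3f}  {doc['title']}")
```

The document with the query term in its title ranks first, the one that mentions it once in a long body ranks second, and the unrelated document scores zero. Recency, authority and relevance feedback would enter as further multiplicative or additive factors.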

The emergence of open source Lucene/Solr
  • Lucene
      - Built in late 90’s by Doug Cutting…Apache release 2001
      - State of the art Java library for indexing and ranking…many ports since
      - Contributed to open source to keep it going and reusable
      - Wide acceptance by 2005, mostly by technology organizations, products
  • Solr
      - Built in 2005 by Yonik Seeley to meet CNET needs for quicker-to-build applications and faceting…had to be open source…Apache release 2006
      - Lucene over HTTP, schema, cache management, replication, … and faceting
  • Open source as a development model, not a religion
  • 4,000+ sites: Apple, Cisco, EMC, HP, IBM, LinkedIn, MySpace, Netflix, Salesforce, Twitter, Gov, Wikipedia…

Current Lucene/Solr: strengths
  • Best practice segmented index (like Google, Fast)
  • Scalability via SolrCloud distributed search → billions of documents
  • Best practice, flexible ranking (term/field/doc boosts, function queries, custom scoring…)
  • Best overall query performance and complete query capabilities (unlimited Boolean operations, wildcards, find-similar, synonyms, spell-check…)
  • Multilingual, query filters, geo search, memory mapped indexes, near real-time search, advanced proximity operators…
  • Rapid innovation
  • Extensible architecture, complete control (open source)
  • No license fees (open source)
  • CORE TECHNOLOGY AS GOOD AS OR BETTER THAN ANY OTHER…AND OPEN SOURCE
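
Several of these capabilities (field boosts, phrase boosts, query filters, faceting, spell-check) are exposed as request parameters on Solr's HTTP search handler. The sketch below only builds a request URL and does not contact a server. The host, the collection name `docs` and the field names are hypothetical, and `spellcheck=true` has an effect only if a spell-check component is configured on the handler.

```python
from urllib.parse import urlencode

# Hypothetical Solr host and collection.
SOLR_SELECT_URL = "http://localhost:8983/solr/docs/select"

params = [
    ("q", "enterprise search"),   # the user's query
    ("defType", "edismax"),       # Extended DisMax query parser
    ("qf", "title^2 body"),       # field boost: title matches count double
    ("pf", "title body"),         # phrase boost when terms occur together
    ("fq", "language:en"),        # query filter, applied without affecting score
    ("facet", "true"),            # return facet counts with the results
    ("facet.field", "category"),  # hypothetical field to facet on
    ("spellcheck", "true"),       # needs a configured spell-check component
    ("rows", "10"),
]
request_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(request_url)
```

Keeping ranking and filtering choices in request parameters rather than in application code is one reason such settings can be tuned per application without changing the index.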

Open source Lucene/Solr: weaknesses
  • Those typical of open source
      - No formal support
      - Limited access to training, consulting
      - Lack of stringent integrated QA
      - Pace of development and open source environment too complex for some (e.g., what version should I download? What patches? GUI?)
  • Others
      - Lucene/Solr development has tended to focus on core capabilities, so it is missing certain features for enterprise search (e.g., connectors, security, alerts, advanced query operations)

Addressing open source Lucene/Solr weaknesses
  • Lucene/Solr community
      - Apache Lucene/Solr community has a wealth of information on web sites, wikis and mailing lists
      - Community members usually respond quickly to questions
  • Consultants
      - May be especially helpful for systems integration or addressing gaps
  • Commercialization
      - Companies commercializing open source provide commercial support, certified versions, training and consulting…may fill in gaps or address ease of use
      - Examples: Red Hat, MySQL, Lucid Imagination
  • Internal resources: usually in combination with one or more of the above

Product strengths of top commercial competitors
  • Well established players tend to be full-featured
  • Some organizations have focused on a particular application or domain (e.g., ecommerce, publishing, legal, help desk)
  • Some competitors have focused on appliance-like simplicity

Weaknesses of top commercial competitors
  • Usually expensive, especially at scale
  • Platform or portability limitations
  • Limited transparency
  • Limited flexibility, especially for other than intended application or domain
  • Limited customization, especially for appliance-like products
  • Sometimes limited scalability
  • Technical debt and/or lack of rapid innovation
  • Customers are dependent on the company’s continued business success

Current competitive landscape
  • For the last 5 years commercial companies have felt increasing competition from Lucene/Solr because of the combination of its capability and price
      - Very hard to justify multi-million dollar deals given Lucene/Solr
      - Lucene/Solr sometimes wins on performance alone
  • Some competitors have responded with diversification
      - Re-invent themselves as a business intelligence or other kind of company
      - Produce search derivative applications
      - Focus on specific domains
  • Some have been acquired
  • But the need for good, affordable, flexible search remains

The competitive future
  • Basic search has become commoditized and widespread…but
      - Top commercial companies often have one or more key weaknesses
      - Existing search is often mediocre and too expensive or difficult to maintain, grow or customize/enhance
      - Producing best practice search is still hard (and search remains a hard problem…intent, context, NLP…)
  • Market strength and features of competitors will keep competitors going a while…but
      - Very hard to justify high prices, especially for large applications
      - Very hard to justify closed and proprietary technology
  • Lucene/Solr capabilities, performance, control, price and continued rapid innovation (and addressing weaknesses) will likely lead to its dominance

Resources
  • Lucene in Action, Second Edition, by Michael McCandless, Erik Hatcher and Otis Gospodnetic. Manning, 2010.
  • Solr 1.4 Enterprise Search Server, by David Smiley and Eric Pugh. Packt Publishing, 2009.
  • Solr reference guide: http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide
