The Once and Future History of
Enterprise Search and Open Source

        Marc Krellenstein, CTO
       marc@lucidimagination.com
Evolving challenges in full text
             search
!  Finding something in a lot of content (recall, scalability)
     •  IBM/STAIRS vs. US gov/Basis, BRS, Dialog, Verity
     •  Lycos and Fast, Infoseek, Excite → AltaVista
     •  Centralized search → Distributed search (Fast, Google,
        Lucene/Solr)
!  Finding just the good stuff (precision)
     •  SMART, Autonomy, Google, Lucene/Solr, authority scores,
        browsing/clustering/faceting, ...
!  Finding it fast (performance)
     •  Fast, Google, Lucene/Solr
!  Making it easy (simplicity)
     •  Google, Lucene/Solr*
!  Deploying good search everywhere (all of the above plus price,
   flexibility)
     •  Lucene/Solr
Google
!  Breakthrough in precision of Internet search
   •  Popularity algorithm hides the bad stuff
   •  Proved importance of understanding data & users
   •  Set expectations for accuracy of enterprise search
!  Set a new standard for search performance
   •  Sub-second (or near)
!  Proved value of good adaptive spell-checking
!  Demonstrated the power of distributed search for
   scale
!  Reinforced the importance of simplicity and a single
   search box
!  Proved the value of search
   •  Search needs to be everywhere
But Google is not like most
     enterprise search applications
!  Google
   •  Most data is bad, many good-enough answers; the task is to
      screen out the bad
   •  Many privacy issues among users
   •  No security issues
   •  Many naïve users with little patience, so speed is important
!  Enterprise search
   •  Most or all the data may be good, often only one answer to
      a search need
   •  Many security issues
   •  Few or no privacy issues between users
   •  Naïve and sophisticated users motivated by an
      organizational purpose
!  The best enterprise search tools will fit enterprise needs
Best practice recall and precision

!  Recall
   •  Percent of relevant documents (items) returned
   •  50 good answers in system, 25 returned = 50% recall
!  Precision
   •  Percent of documents returned that are relevant
   •  100 returned, 25 are relevant = 25% precision
!  Ideal is 100% recall and 100% precision: return all
   relevant documents and only those
!  100% recall is easy: return all documents. But then precision
   is so low that the good ones can’t be found. Precision is harder
!  Need adequate recall & enough precision for the task
   •  That will vary by application (data & users)...
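
The arithmetic on this slide can be sketched in a few lines of Python. The function names and the 10,000-document collection size are illustrative assumptions, not from the talk; the 50/25/100 figures are the slide's own.

```python
def recall(relevant_returned: int, relevant_in_system: int) -> float:
    """Share of all relevant documents in the system that were returned."""
    if relevant_in_system <= 0:
        raise ValueError("need at least one relevant document")
    return relevant_returned / relevant_in_system


def precision(relevant_returned: int, total_returned: int) -> float:
    """Share of the returned documents that are relevant."""
    if total_returned <= 0:
        raise ValueError("need at least one returned document")
    return relevant_returned / total_returned


# Slide example: 50 good answers in the system, 100 returned, 25 of them relevant.
print(recall(25, 50))       # 0.5  (50% recall)
print(precision(25, 100))   # 0.25 (25% precision)

# "Return all documents" in a hypothetical 10,000-document collection:
# recall is perfect, but only 50 of the 10,000 results are relevant.
print(recall(50, 50))         # 1.0
print(precision(50, 10_000))  # 0.005
```

The last two lines show why 100% recall alone is not useful: the 50 good answers are buried among 9,950 others.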


How to get good recall
!  Collect, index and search all the data
   •  Check for missing or corrupt data
   •  Index everything – stop words not usually needed today
   •  Search everything; limit results by category AFTER the
      search (clustering/faceting)
!  Normalize the data
   •  Convert to lower case, strip/handle special characters,
      stemming, ...
!  Use spell-checking, synonyms to match users’
   vocabulary with content
   •  Adaptive spell-checking, application-specific synonyms
!  Light (or real) natural language processing for abstract
   concepts
   •  ‘Recent documents on Asia’
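
A minimal sketch of the normalization and synonym steps above, assuming the same function is applied to both documents and queries. The suffix stripper is a toy stand-in for a real stemmer, and the synonym table is a hypothetical application-specific example; neither reflects how Lucene/Solr analyzers are implemented.

```python
import re

# Hypothetical application-specific synonyms: map user vocabulary to content vocabulary.
SYNONYMS = {"laptop": "notebook"}


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lower-case, drop special characters, stem, then apply synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [naive_stem(t) for t in tokens]
    return [SYNONYMS.get(s, s) for s in stems]


print(normalize("Searching LAPTOPS!"))   # ['search', 'notebook']
print(normalize("search: notebooks"))    # ['search', 'notebook']
```

Because the query and the content normalize to the same tokens, a search for "Searching LAPTOPS!" can match a document that only says "notebooks", which is the recall gain the slide describes.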
How to get good precision
!  Term frequency (TF) – more occurrences of query terms are
   better
!  Inverse document frequency (IDF) – rarer query terms are
   more important
!  Phrase boost – query terms near each other are better
!  Field boost – where the query term is in doc matters (e.g., in
   ‘title’ better)
!  Length normalization – avoid penalizing short docs
!  Recency – all things being equal, recent is better
!  Authority – items linked to, clicked on or bought by others
   may be better
!  Implicit and explicit relevance feedback, more-like-this –
   expand query (queries are usually underdetermined: intent?)
!  Clustering/faceting – when above fail or intent is not specific
!  Lots of data (e.g., Watson, Google Translate)
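
The first few ranking signals above (TF, IDF, field boost, length normalization) can be combined in a toy scorer like the one below. This is a sketch under stated assumptions, not Lucene's actual scoring formula: the square-root damping, the smoothed IDF, the `title_boost` value and the sample documents are all illustrative choices.

```python
import math
from collections import Counter


def idf(term: str, docs: list[dict]) -> float:
    """Rarer terms get a higher weight (smoothed inverse document frequency)."""
    df = sum(1 for d in docs if term in d["title"] or term in d["body"])
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def score(query_terms: list[str], doc: dict, docs: list[dict],
          title_boost: float = 2.0) -> float:
    """Sum boost * sqrt(tf) * idf / sqrt(field length) over fields and terms."""
    total = 0.0
    for field, boost in (("title", title_boost), ("body", 1.0)):
        tokens = doc[field]
        if not tokens:
            continue
        counts = Counter(tokens)
        length_norm = 1.0 / math.sqrt(len(tokens))  # a match counts more in a short field
        for term in query_terms:
            total += boost * math.sqrt(counts[term]) * idf(term, docs) * length_norm
    return total


# Hypothetical, already-normalized documents (lists of tokens per field).
docs = [
    {"title": ["solr", "faceting"], "body": ["faceting", "in", "solr", "search"]},
    {"title": ["search", "basics"], "body": ["solr", "is", "a", "search", "server"]},
    {"title": ["search", "history"], "body": ["early", "search", "engines"]},
]

ranked = sorted(docs, key=lambda d: score(["solr"], d, docs), reverse=True)
print([d["title"] for d in ranked])
```

For the query "solr", the first document ranks highest because the term appears in its short title as well as its body; the third document never mentions the term and scores zero.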


The emergence of open source
         Lucene/Solr
!  Lucene
   •  Built in the late 90’s by Doug Cutting; Apache release 2001
   •  State of the art Java library for indexing and ranking;
      many ports since
   •  Contributed to open source to keep it going and reusable
   •  Wide acceptance by 2005, mostly by technology
      organizations, products
!  Solr
   •  Built in 2005 by Yonik Seeley to meet CNET needs for
      quicker-to-build applications and faceting; had to be open
      source; Apache release 2006
   •  Lucene over HTTP, schema, cache management,
      replication, ... and faceting
!  Open source as a development model, not a religion
!  4,000+ sites – Apple, Cisco, EMC, HP, IBM, LinkedIn,
   MySpace, Netflix, Salesforce, Twitter, Gov, Wikipedia, ...
Current Lucene/Solr: strengths
!  Best practice segmented index (like Google, Fast)
   •  Scalability via SolrCloud distributed search → billions of
      documents
!  Best practice, flexible ranking (term/field/doc boosts, function
   queries, custom scoring, ...)
!  Best overall query performance and complete query
   capabilities (unlimited Boolean operations, wildcards,
   find-similar, synonyms, spell-check, ...)
!  Multilingual, query filters, geo search, memory mapped
   indexes, near real-time search, advanced proximity
   operators, ...
!  Rapid innovation
!  Extensible architecture, complete control (open source)
!  No license fees (open source)
!  CORE TECHNOLOGY AS GOOD OR BETTER THAN
   ANY OTHER, AND OPEN SOURCE
Open source Lucene/Solr:
             weaknesses
!  Those typical of open source
  •    No formal support
  •    Limited access to training, consulting
  •    Lack of stringent integrated QA
  •    Pace of development and open source
       environment too complex for some (e.g., what
       version should I download? What patches? GUI?)
!  Others
  •  Lucene/Solr development has tended to focus on
     core capabilities, so missing certain features for
     enterprise search (e.g., connectors, security,
     alerts, advanced query operations)
Addressing open source Lucene/
       Solr weaknesses
!  Lucene/Solr Community
   •  Apache Lucene/Solr community has a wealth of
      information on web sites, wikis and mailing lists
   •  Community members usually respond quickly to
      questions
!  Consultants
   •  May be especially helpful for systems integration or
      addressing gaps
!  Commercialization
   •  Companies commercializing open source provide
      commercial support, certified versions, training and
      consulting; may fill in gaps or address ease of use
   •  Examples: Red Hat, MySQL, Lucid Imagination
!  Internal resources – usually in combination with
   one or more of the above
Product strengths of top commercial
            competitors

!  Well established players tend to be full-featured
!  Some organizations have focused on a
   particular application or domain (e.g.,
   ecommerce, publishing, legal, help desk)
!  Some competitors have focused on appliance-
   like simplicity
Weaknesses of top commercial
              competitors
!    Usually expensive, especially at scale
!    Platform or portability limitations
!    Limited transparency
!    Limited flexibility, especially for other than
     intended application or domain
!    Limited customization, especially for appliance-
     like products
!    Sometimes limited scalability
!    Technical debt and/or lack of rapid innovation
!    Customers are dependent on the company’s
     continued business success
Current competitive landscape

!  For the last 5 years commercial companies have felt
   increasing competition from Lucene/Solr because of the
   combination of its capability and price
   •  Very hard to justify multi-million dollar deals given Lucene/
      Solr
   •  Lucene/Solr sometimes wins on performance alone
!  Some competitors have responded with diversification
   •  Re-invent themselves as a business intelligence or other
      kind of company
   •  Produce search derivative applications
   •  Focus on specific domains
!  Some have been acquired
!  But the need for good, affordable, flexible search
   remains
The competitive future
!  Basic search has become commoditized and
   widespread, but
   •  Top commercial companies usually have one or
      more key weaknesses
   •  Existing search is often mediocre and too expensive or
      difficult to maintain, grow or customize/enhance
   •  Producing best practice search is still hard (and search
      remains a hard problem: intent, context, NLP, ...)
!  Market strength and features of competitors will keep
   competitors going a while, but
   •  Very hard to justify high prices, especially for large
      applications
   •  Very hard to justify closed and proprietary technology
!  Lucene/Solr capabilities, performance, control, price
   and continued rapid innovation (and addressing
   weaknesses) will likely lead to its dominance
Resources
!  Lucene in Action, Second Edition, by Michael
   McCandless, Erik Hatcher and Otis
   Gospodnetic. Manning, 2010.
!  Solr 1.4 Enterprise Search Server, by David
   Smiley and Eric Pugh. Packt Publishing, 2009.
!  Solr reference guide:
   http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide




                                          17

More Related Content

Similar to Krellenstein lucene revolution_2011_keynote_once_future_history_enterprise search_+_open_source

Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...Andrei Lopatenko
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointJoanne Klein
 
Opening the Black Box of Software Localization
Opening the Black Box of Software LocalizationOpening the Black Box of Software Localization
Opening the Black Box of Software LocalizationKenneth Farrall
 
Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017Jim Adcock
 
What Lucene and Solr Open Source Search can do for Enterprise Search
What Lucene and Solr Open Source Search can do for Enterprise SearchWhat Lucene and Solr Open Source Search can do for Enterprise Search
What Lucene and Solr Open Source Search can do for Enterprise SearchLucidworks (Archived)
 
KAI, the Information Specialist
KAI, the Information SpecialistKAI, the Information Specialist
KAI, the Information Specialistaik762
 
Success with SharePoint
Success with SharePointSuccess with SharePoint
Success with SharePointStoverEffect
 
Enterprise search Information
Enterprise search Information Enterprise search Information
Enterprise search Information Netwoven Inc.
 
Talking to your organization about Elixir
Talking to your organization about ElixirTalking to your organization about Elixir
Talking to your organization about ElixirBrandon Richey
 
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...Dr. Haxel Consult
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search TrainingCloudera, Inc.
 
Webinar: Personalized Retail Search & Recommendations with Fusion
Webinar: Personalized Retail Search & Recommendations with FusionWebinar: Personalized Retail Search & Recommendations with Fusion
Webinar: Personalized Retail Search & Recommendations with FusionLucidworks
 
Technologies for startup
Technologies for startupTechnologies for startup
Technologies for startupDzung Nguyen
 
Moving to Solr/Lucene Open Source Search
Moving to Solr/Lucene Open Source SearchMoving to Solr/Lucene Open Source Search
Moving to Solr/Lucene Open Source SearchLucidworks (Archived)
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genrobin fay
 
Customer Insights Workshop - Consumer Text Analytics Conference
Customer Insights Workshop - Consumer Text Analytics ConferenceCustomer Insights Workshop - Consumer Text Analytics Conference
Customer Insights Workshop - Consumer Text Analytics ConferenceMekkin Bjarnadottir
 
Turbocharging FME: How to Improve the Performance of Your FME Workspaces
Turbocharging FME: How to Improve the Performance of Your FME WorkspacesTurbocharging FME: How to Improve the Performance of Your FME Workspaces
Turbocharging FME: How to Improve the Performance of Your FME WorkspacesSafe Software
 

Similar to Krellenstein lucene revolution_2011_keynote_once_future_history_enterprise search_+_open_source (20)

Liwp consider opensource2010
Liwp consider opensource2010Liwp consider opensource2010
Liwp consider opensource2010
 
Solr 101
Solr 101Solr 101
Solr 101
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
 
Apache Solr vs Oracle Endeca
Apache Solr vs Oracle EndecaApache Solr vs Oracle Endeca
Apache Solr vs Oracle Endeca
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
 
Opening the Black Box of Software Localization
Opening the Black Box of Software LocalizationOpening the Black Box of Software Localization
Opening the Black Box of Software Localization
 
Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017
 
What Lucene and Solr Open Source Search can do for Enterprise Search
What Lucene and Solr Open Source Search can do for Enterprise SearchWhat Lucene and Solr Open Source Search can do for Enterprise Search
What Lucene and Solr Open Source Search can do for Enterprise Search
 
KAI, the Information Specialist
KAI, the Information SpecialistKAI, the Information Specialist
KAI, the Information Specialist
 
Success with SharePoint
Success with SharePointSuccess with SharePoint
Success with SharePoint
 
Enterprise search Information
Enterprise search Information Enterprise search Information
Enterprise search Information
 
Talking to your organization about Elixir
Talking to your organization about ElixirTalking to your organization about Elixir
Talking to your organization about Elixir
 
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
 
Webinar: Personalized Retail Search & Recommendations with Fusion
Webinar: Personalized Retail Search & Recommendations with FusionWebinar: Personalized Retail Search & Recommendations with Fusion
Webinar: Personalized Retail Search & Recommendations with Fusion
 
Technologies for startup
Technologies for startupTechnologies for startup
Technologies for startup
 
Moving to Solr/Lucene Open Source Search
Moving to Solr/Lucene Open Source SearchMoving to Solr/Lucene Open Source Search
Moving to Solr/Lucene Open Source Search
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
 
Customer Insights Workshop - Consumer Text Analytics Conference
Customer Insights Workshop - Consumer Text Analytics ConferenceCustomer Insights Workshop - Consumer Text Analytics Conference
Customer Insights Workshop - Consumer Text Analytics Conference
 
Turbocharging FME: How to Improve the Performance of Your FME Workspaces
Turbocharging FME: How to Improve the Performance of Your FME WorkspacesTurbocharging FME: How to Improve the Performance of Your FME Workspaces
Turbocharging FME: How to Improve the Performance of Your FME Workspaces
 

More from lucenerevolution

State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 

More from lucenerevolution (20)

State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 

Recently uploaded

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Krellenstein lucene revolution_2011_keynote_once_future_history_enterprise search_+_open_source

  • 1. The Once and Future History of Enterprise Search and Open Source
      Marc Krellenstein, CTO, marc@lucidimagination.com
  • 2. Evolving challenges in full text search
      • Finding something in a lot of content (recall, scalability)
          • IBM/STAIRS vs. US gov/Basis, BRS, Dialog, Verity
          • Lycos and Fast, Infoseek, Excite → AltaVista
          • Centralized search → Distributed search (Fast, Google, Lucene/Solr)
      • Finding just the good stuff (precision)
          • SMART, Autonomy, Google, Lucene/Solr, authority scores, browsing/clustering/faceting, …
      • Finding it fast (performance)
          • Fast, Google, Lucene/Solr
      • Making it easy (simplicity)
          • Google, Lucene/Solr*
      • Deploying good search everywhere (all of the above plus price, flexibility)
          • Lucene/Solr
  • 3. Google
      • Breakthrough in precision of Internet search
          • Popularity algorithm hides the bad stuff
          • Proved importance of understanding data & users
          • Set expectations for accuracy of enterprise search
      • Set a new standard for search performance
          • Sub-second (or near)
      • Proved value of good adaptive spell-checking
      • Demonstrated the power of distributed search for scale
      • Reinforced the importance of simplicity and a single search box
      • Proved the value of search
          • Search needs to be everywhere
  • 4. But Google is not like most enterprise search applications
      • Google
          • Most data is bad, many good enough answers … task is to screen out the bad
          • Many privacy issues among users
          • No security issues
          • Many naïve users with little patience … speed is important
      • Enterprise search
          • Most or all the data may be good, often only one answer to a search need
          • Many security issues
          • Few or no privacy issues between users
          • Naïve and sophisticated users motivated by an organizational purpose
      • The best enterprise search tools will fit enterprise needs
  • 5. Best practice recall and precision
      • Recall
          • Percent of relevant documents (items) returned
          • 50 good answers in system, 25 returned = 50% recall
      • Precision
          • Percent of documents returned that are relevant
          • 100 returned, 25 are relevant = 25% precision
      • Ideal is 100% recall and 100% precision: return all relevant documents and only those
      • 100% recall is easy – return all documents … but precision so low they can’t be found … precision harder
      • Need adequate recall & enough precision for the task
          • That will vary by application (data & users) …
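The arithmetic on slide 5 can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative, and the 10,000-document collection in the last two lines is an assumed figure (it is not on the slide) used to show why "return all documents" gives perfect recall but very low precision.

```python
def recall(relevant_returned: int, relevant_in_system: int) -> float:
    """Fraction of all relevant documents in the system that were returned."""
    if relevant_in_system <= 0:
        raise ValueError("need at least one relevant document in the system")
    return relevant_returned / relevant_in_system


def precision(relevant_returned: int, total_returned: int) -> float:
    """Fraction of the returned documents that are relevant."""
    if total_returned <= 0:
        raise ValueError("need at least one returned document")
    return relevant_returned / total_returned


# Figures from the slide.
print(recall(25, 50))       # 0.5  (50% recall)
print(precision(25, 100))   # 0.25 (25% precision)

# Assumed example: returning every document in a 10,000-document
# collection that holds 50 relevant items.
print(recall(50, 50))         # 1.0
print(precision(50, 10_000))  # 0.005
```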
  • 6. How to get good recall
      • Collect, index and search all the data
          • Check for missing or corrupt data
          • Index everything – stop words not usually needed today
          • Search everything … limit results by category AFTER the search (clustering/faceting)
      • Normalize the data
          • Convert to lower case, strip/handle special characters, stemming, …
      • Use spell-checking, synonyms to match users’ vocabulary with content
          • Adaptive spell-checking, application-specific synonyms
      • Light (or real) natural language processing for abstract concepts
          • ‘Recent documents on Asia’
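The normalization steps on slide 6 (lower-casing, stripping special characters, stemming, synonyms) might look roughly like the sketch below. Everything here is an assumption for illustration: the suffix stripper is a deliberately crude stand-in for a real stemmer, the synonym table is hypothetical, and none of this reflects how Lucene/Solr analyzers are actually implemented. The point is only that applying the same normalization to content and to queries lets differently worded text match.

```python
import re

# Hypothetical application-specific synonyms, mapped to one canonical term.
SYNONYMS = {"laptop": "notebook"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lower-case, drop special characters, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [crude_stem(token) for token in tokens]
    return [SYNONYMS.get(stem, stem) for stem in stems]


# The document text and the user's query normalize to the same terms.
print(normalize("Searching LAPTOPS!"))  # ['search', 'notebook']
print(normalize("search notebook"))     # ['search', 'notebook']
```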
  • 7. How to get good precision
      • Term frequency (TF) – more occurrences of query terms is better
      • Inverse document frequency (IDF) – rarer query terms are more important
      • Phrase boost – query terms near each other is better
      • Field boost – where the query term is in doc matters (e.g., in ‘title’ better)
      • Length normalization – avoid penalizing short docs
      • Recency – all things being equal, recent is better
      • Authority – items linked to, clicked on or bought by others may be better
      • Implicit and explicit relevance feedback, more-like-this – expand query (queries usually underdetermined … intent??)
      • Clustering/faceting – when above fail or intent is not specific
      • Lots of data … Watson, Google Translate
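Four of the signals on slide 7 (term frequency, inverse document frequency, field boost and length normalization) can be combined in a toy scorer like the one below. This is a sketch under stated assumptions, not Lucene's actual scoring formula: the boost weights, the square-root damping of TF, the `log(1 + N/df)` form of IDF and the sample documents are all illustrative choices. It also assumes the scored document is one of `docs`, so the document frequency is never zero for a matching term.

```python
import math
from collections import Counter

# Assumed field weights: a match in the title counts double.
FIELD_BOOST = {"title": 2.0, "body": 1.0}


def tokenize(text):
    return text.lower().split()


def document_frequency(term, docs):
    """Number of documents that contain the term in any scored field."""
    return sum(
        1
        for doc in docs
        if any(term in tokenize(doc.get(field, "")) for field in FIELD_BOOST)
    )


def score(query_terms, doc, docs):
    """Sum of boost * sqrt(TF) * IDF * length norm over fields and terms."""
    total = 0.0
    for field, boost in FIELD_BOOST.items():
        tokens = tokenize(doc.get(field, ""))
        if not tokens:
            continue
        counts = Counter(tokens)
        # Length normalization: a hit in a short field counts for more.
        length_norm = 1.0 / math.sqrt(len(tokens))
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            # Rarer terms (lower document frequency) get a higher weight.
            idf = math.log(1.0 + len(docs) / document_frequency(term, docs))
            total += boost * math.sqrt(tf) * idf * length_norm
    return total


def rank(query, docs):
    terms = tokenize(query)
    return sorted(docs, key=lambda doc: score(terms, doc, docs), reverse=True)


docs = [
    {"title": "release notes",
     "body": "this release improves search speed and fixes faceting bugs in the search handler"},
    {"title": "solr faceting guide",
     "body": "faceting narrows search results by category"},
    {"title": "office party", "body": "bring snacks"},
]

print([doc["title"] for doc in rank("faceting", docs)])
# ['solr faceting guide', 'release notes', 'office party']
```

The title match and the shorter fields put the faceting guide first, the long release note second, and the unrelated document last with a score of zero.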
  • 8. The emergence of open source Lucene/Solr
      • Lucene
          • Built in late 90’s by Doug Cutting … Apache release 2001
          • State of the art Java library for indexing and ranking … many ports since
          • Contributed to open source to keep it going and reusable
          • Wide acceptance by 2005, mostly by technology organizations, products
      • Solr
          • Built in 2005 by Yonik Seeley to meet CNET needs for quicker-to-build applications and faceting … had to be open source … Apache release 2006
          • Lucene over HTTP, schema, cache management, replication, … and faceting
      • Open source as a development model, not a religion
      • 4,000+ sites – Apple, Cisco, EMC, HP, IBM, LinkedIn, MySpace, Netflix, Salesforce, Twitter, Gov, Wikipedia …
  • 9. Current Lucene/Solr: strengths
      • Best practice segmented index (like Google, Fast)
          • Scalability via SolrCloud distributed search → billions of documents
      • Best practice, flexible ranking (term/field/doc boosts, function queries, custom scoring …)
      • Best overall query performance and complete query capabilities (unlimited Boolean operations, wildcards, find-similar, synonyms, spell-check …)
      • Multilingual, query filters, geo search, memory mapped indexes, near real-time search, advanced proximity operators …
      • Rapid innovation
      • Extensible architecture, complete control (open source)
      • No license fees (open source)
      • CORE TECHNOLOGY AS GOOD OR BETTER THAN ANY OTHER … AND OPEN SOURCE
  • 10. Open source Lucene/Solr: weaknesses
      • Those typical of open source
          • No formal support
          • Limited access to training, consulting
          • Lack of stringent integrated QA
          • Pace of development and open source environment too complex for some (e.g., what version should I download? What patches? GUI?)
      • Others
          • Lucene/Solr development has tended to focus on core capabilities, so missing certain features for enterprise search (e.g., connectors, security, alerts, advanced query operations)
  • 11. Addressing open source Lucene/Solr weaknesses
      • Lucene/Solr Community
          • Apache Lucene/Solr community has a wealth of information on web sites, wikis and mailing lists
          • Community members usually respond quickly to questions
      • Consultants
          • May be especially helpful for systems integration or addressing gaps
      • Commercialization
          • Companies commercializing open source provide commercial support, certified versions, training and consulting … may fill in gaps or address ease of use
          • Examples: Red Hat, MySQL, Lucid Imagination
      • Internal resources – usually in combination with one or more of the above
  • 12. Product strengths of top commercial competitors
      • Well established players tend to be full-featured
      • Some organizations have focused on a particular application or domain (e.g., ecommerce, publishing, legal, help desk)
      • Some competitors have focused on appliance-like simplicity
  • 13. Weaknesses of top commercial competitors
      • Usually expensive, especially at scale
      • Platform or portability limitations
      • Limited transparency
      • Limited flexibility, especially for other than intended application or domain
      • Limited customization, especially for appliance-like products
      • Sometimes limited scalability
      • Technical debt and/or lack of rapid innovation
      • Customers are dependent on the company’s continued business success
  • 14. Current competitive landscape
      • For the last 5 years commercial companies have felt increasing competition from Lucene/Solr because of the combination of its capability and price
          • Very hard to justify multi-million dollar deals given Lucene/Solr
          • Lucene/Solr sometimes wins on performance alone
      • Some competitors have responded with diversification
          • Re-invent themselves as a business intelligence or other kind of company
          • Produce search derivative applications
          • Focus on specific domains
      • Some have been acquired
      • But the need for good, affordable, flexible search remains
  • 15. The competitive future
      • Basic search has become commoditized and widespread … but
          • Top commercial companies usually have one or more key weaknesses
          • Existing search is often mediocre and too expensive or difficult to maintain, grow or customize/enhance
          • Producing best practice search is still hard (and search remains a hard problem … intent, context, NLP …)
      • Market strength and features of competitors will keep competitors going a while … but
          • Very hard to justify high prices, especially for large applications
          • Very hard to justify closed and proprietary technology
      • Lucene/Solr capabilities, performance, control, price and continued rapid innovation (and addressing weaknesses) will likely lead to its dominance
  • 16. Resources
      • Lucene in Action, Second Edition, by Michael McCandless, Erik Hatcher and Otis Gospodnetic. Manning, 2010.
      • Solr 1.4 Enterprise Search Server, by David Smiley and Eric Pugh. Packt Publishing, 2009.
      • Solr reference guide: http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide