Open Source Search

Andreas Pesenhofer

max.recall information systems GmbH
Künstlergasse 11/1 • A-1150 Wien • Austria
ICIC, October 2013

max.recall information systems
• max.recall is a software and consulting company enabling
enterprises to capitalize on the hidden value in the rapidly growing
amount of textual data
• Customized Solutions for
– Intelligent data analytics
– Vertical search
• Products and Services
– quantalyze: quantity analytics technology
– smart.coder: open-ended question coding tool for market researchers

• Founded 2010 and located in Vienna, Austria
• Operates worldwide with int’l customers from sectors such as IP,
market research, news and media, IT services

Recall and precision
• Recall
– Percent of relevant documents (items) returned
– 50 good answers in system, 25 returned = 50% recall
• Precision
– Percent of documents returned that are relevant
– 100 returned, 25 are relevant = 25% precision
• Ideal is 100% recall and 100% precision: return all relevant documents and only those
• 100% recall is easy: return all documents, but precision is low, relevant documents can’t be found
• Need adequate recall & enough precision for the task - that will vary by application (data & users)
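The two definitions above can be checked with a small calculation. The following is a minimal sketch that combines the slide's two examples into one query; the document IDs and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 50 relevant documents exist in the system (hypothetical IDs 0..49).
relevant = set(range(50))
# 100 documents are returned: 25 relevant ones plus 75 irrelevant ones.
returned = set(range(25)) | set(range(1000, 1075))

p, r = precision_recall(returned, relevant)
print(p, r)  # 0.25 0.5
```

Returning every document in the collection pushes recall to 1.0 while precision falls toward the share of relevant documents in the whole collection, which is the trade-off named on the slide.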

How to get good recall
• Collect, index and search all the data
– Check for missing or corrupt data

• Index everything
– Search everything … limit results by category AFTER the
search (clustering/faceting)

• Normalize the data
– Convert to lower case, strip/handle special characters,
stemming, ...

• Use spell-checking, synonyms to match users’
vocabulary with content
– Adaptive spell-checking, application-specific synonyms

• Light (or real) natural language processing for abstract
concepts
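The normalization and synonym steps listed above might look roughly like the sketch below. It is an illustration only: the stemmer is deliberately crude (it strips a plural "s" and nothing else), and the `SYNONYMS` table and all function names are assumptions, not part of any tool named in this talk.

```python
import re
import unicodedata

# Hypothetical application-specific synonyms, keyed by normalized term.
SYNONYMS = {"car": {"automobile"}}


def stem(token):
    """Very crude stemmer: strips a plural 's' only. A real system would use a proper stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text):
    """Fold accents, lower-case, drop special characters, then stem each token."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", no_accents.lower())
    return [stem(t) for t in tokens]


def expand(tokens, synonyms=SYNONYMS):
    """Turn each query token into the set of terms that may match it."""
    return [{t} | synonyms.get(t, set()) for t in tokens]


print(normalize("Künstler-Cafés & Search Engines!"))
# ['kunstler', 'cafe', 'search', 'engine']
print(expand(normalize("Cars")))
```

The same `normalize` function has to run at index time and at query time; otherwise the query vocabulary and the index vocabulary drift apart and recall drops.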

How to get good precision
• Term frequency (TF) – more occurrences of query terms are better
• Inverse document frequency (IDF) – rarer query terms are more
important
• Phrase boost – query terms near each other are better
• Field boost – where the query term is in doc matters (e.g., in 'title'
better)
• Length normalization – avoid penalizing short docs
• Recency – all things being equal, recent is better
• Authority – items linked to, clicked on or bought by others may be
better
• Implicit and explicit relevance feedback, more-like-this – expand
query
• Clustering/faceting – intent is not specific
• Lots of data
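To make the first few ranking signals concrete, here is a simplified scoring sketch that combines term frequency, inverse document frequency, a field boost and length normalization. It is loosely modelled on classic TF-IDF ranking and is not the exact formula of Lucene or any other engine; the boost values, the sample documents and the function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical field boosts: a hit in the title counts twice as much as one in the body.
BOOSTS = {"title": 2.0, "body": 1.0}


def idf(term, docs):
    """Rarer terms get a higher weight."""
    df = sum(1 for d in docs if any(term in tokens for tokens in d.values()))
    return 1.0 + math.log(len(docs) / (1 + df))


def score(query, doc, docs):
    """Sum boost * sqrt(tf) * idf * length_norm over all fields and query terms."""
    total = 0.0
    for field, tokens in doc.items():
        if not tokens:
            continue
        counts = Counter(tokens)
        length_norm = 1.0 / math.sqrt(len(tokens))  # dampens the advantage of long fields
        for term in query:
            tf = math.sqrt(counts[term])  # a missing term counts as 0
            total += BOOSTS.get(field, 1.0) * tf * idf(term, docs) * length_norm
    return total


docs = [
    {"title": ["lucene", "search"], "body": ["open", "source", "search", "library"]},
    {"title": ["cooking"], "body": ["search", "for", "recipes"]},
    {"title": ["patent", "data"], "body": ["patent", "analytics"]},
]
query = ["lucene", "search"]
ranking = sorted(range(len(docs)), key=lambda i: score(query, docs[i], docs), reverse=True)
print(ranking)  # [0, 1, 2]
```

Signals such as recency, authority and relevance feedback would enter as further multiplicative or additive factors; they are left out here to keep the sketch short.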

Every minute …

http://www.domo.com/

Growth of patent applications

Big Data Open Source Tools

Apache Lucene
Apache Lucene™ is a high-performance, full-featured text
search engine library written entirely in Java. It is a technology
suitable for nearly any application that requires full-text search,
especially cross-platform.
• Scalable, High-Performance Indexing
– over 150GB/hour on modern hardware
– small RAM requirements - only 1MB heap
– incremental indexing as fast as batch indexing
– index size roughly 20-30% the size of text indexed

Apache Lucene (2)
• Powerful, Accurate and Efficient Search Algorithms
– ranked searching - best results returned first
– many powerful query types: phrase queries, wildcard queries, proximity
queries, range queries and more
– fielded searching (e.g. title, author, contents)
– sorting by any field
– multiple-index searching with merged results
– allows simultaneous update and searching
– flexible faceting, highlighting, joins and result grouping
– fast, memory-efficient and typo-tolerant suggesters
– pluggable ranking models, including the Vector Space Model and Okapi
BM25
– configurable storage engine (codecs)
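Several of the query types listed above can be written directly in the syntax of Lucene's classic query parser. The strings below are illustrative examples collected in a Python dictionary; the field names (`title`, `author`, `year`) are hypothetical and depend on how an index is set up.

```python
# Example query strings in Lucene's classic query parser syntax.
# Field names are assumptions made for this illustration.
LUCENE_QUERY_EXAMPLES = {
    "fielded": 'title:search',
    "phrase": 'title:"open source search"',
    "wildcard": 'author:pesen*',
    "proximity": '"open search"~5',      # both terms within 5 positions of each other
    "range": 'year:[2010 TO 2013]',      # inclusive range
    "boolean": 'title:lucene AND NOT title:solr',
}

for kind, query in LUCENE_QUERY_EXAMPLES.items():
    print(f"{kind:10s} {query}")
```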

Apache Lucene (3)
• Cross-Platform Solution
– Available as Open Source software under the Apache License, which
lets you use Lucene in both commercial and Open Source programs
– 100%-pure Java
– Implementations in other programming languages are available that are
index-compatible

• Apache Lucene 4.5.0 was released on October 5th, 2013.

Apache SOLR
• Apache SOLR is an open source enterprise search platform
from the Apache Lucene project.
• major features:
– full-text search
– hit highlighting
– faceted search
– dynamic clustering
– database integration
– handling of rich documents (e.g., Word, PDF)
– providing distributed search and index replication, Solr is highly
scalable.

• Apache SOLR 4.5.0 was released on October 5th, 2013.
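Solr exposes these features over HTTP, so a query with faceting and highlighting is just a URL with the right parameters. The sketch below only builds such a URL and does not send it; the host, the core name `collection1` and the field names are assumptions, while `q`, `wt`, `rows`, `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def solr_select_url(base_url, query, facet_field, highlight_field, rows=10):
    """Build a Solr select URL with faceting and hit highlighting switched on."""
    params = [
        ("q", query),
        ("wt", "json"),                 # response format
        ("rows", rows),                 # number of hits to return
        ("facet", "true"),
        ("facet.field", facet_field),   # field to compute facet counts on
        ("hl", "true"),
        ("hl.fl", highlight_field),     # field to highlight matches in
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance and field names.
url = solr_select_url(
    "http://localhost:8983/solr/collection1/select",
    'title:"open source"',
    facet_field="author",
    highlight_field="title",
)
print(url)
print(parse_qs(urlsplit(url).query)["q"])  # the query string survives URL encoding
```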

elasticsearch
• elasticsearch is a distributed, RESTful, open source search
server based on Apache Lucene. It is developed by Shay
Banon and is released under the terms of the Apache
License.
• major features:
– fully supports the near real-time search of Apache Lucene
– cluster setup needs no additional software
– features of Lucene are made available through the JSON and Java API
– JSON in / JSON out (and YAML)

• elasticsearch 0.90.5 was released on September 17th,
2013, based on Lucene 4.4.
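"JSON in / JSON out" means that a search is a JSON document sent to an index's `_search` endpoint. The sketch below assembles such a request without sending it; the host, the index name `patents` and the field name `title` are assumptions, and the body uses the query DSL's `match` query.

```python
import json


def build_search_request(host, index, field, text, size=10):
    """Return (url, json_body) for a simple full-text match query."""
    url = "{}/{}/_search".format(host.rstrip("/"), index)
    body = {
        "query": {"match": {field: text}},  # analyzed full-text match on one field
        "size": size,                       # number of hits to return
    }
    return url, json.dumps(body)


# Hypothetical local node, index and field.
url, payload = build_search_request(
    "http://localhost:9200/", "patents", "title", "open source search"
)
print(url)
print(payload)
```

The returned pair can be handed to any HTTP client as the target URL and request body of a POST request.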

All Time Top Committers

Active Contributors

Lines of Code

The Mailing Lists

Interest over time

Case study - quantalyze
patents.quantalyze.com
• Runs in your browser
• Filter and keyword search
• Physical quantity search
• Interval search
• Print view

• Visualization:
– Physical quantity distributions
– Cross-tabulations (e.g. concepts vs. quantity type)
– Different chart types

Case study - StumbleUpon
Create a world-class
customer experience
• A “Stumble” provides real-time
recommendations to 30 million
customers per day
• Intelligent search is key to
providing fast and more
informed recommendations
• Update your searches
immediately with newly posted
content

Develop and scale easily
• Build in intelligent search to scale with millions of users and
interactions
• Take advantage of powerful and flexible APIs for easy data
integration
• Use easy-to-use but powerful solutions for your big data
search and analytics needs

Strengths of open source search
• Best practice segmented index (like Google, Fast)
• Scalability
• Best practice, flexible ranking (term/field/doc boosts, function
queries, custom scoring…)
• Best overall query performance and complete query
capabilities (unlimited Boolean operations, wildcards,
findsimilar, synonyms, spell-check…)
• Multilingual, query filters, geo search, memory mapped
indexes, near real-time search, advanced proximity
operators…
• Rapid innovation
• Extensible architecture, complete control (open source)
• No license fees (open source)

Weaknesses of open source search
• Those typical of open source
– No formal support
– Limited access to training, consulting
– Lack of stringent integrated QA
– Speed of development and open source environment
too complex for some (e.g., what version should I
download? What patches? GUI?)

• Others
– Lucene/Solr/Elasticsearch development has tended to
focus on core capabilities, so certain enterprise search
features are missing (e.g., connectors, security,
alerts, advanced query operations)

Addressing open source weaknesses
• Community
– Community has a wealth of information on web sites, wikis
and mailing lists
– Community members usually respond quickly to questions

• Consultants
– May be especially helpful for systems integration or
addressing gaps

• Commercialization
– Companies commercializing open source provide
commercial support, certified versions, training and
consulting

• Internal resources

Product strengths of commercial
competitors
• Well established players tend to be full-featured
• Some organizations have focused on a
particular application or domain (e.g.,
ecommerce, publishing, legal, help desk)
• Some competitors have focused on appliance

Weaknesses of top commercial
competitors
• Usually expensive, especially at scale
• Platform or portability limitations
• Limited transparency
• Limited flexibility, especially for other than
intended application or domain
• Limited customization, especially for appliance-like
products
• Sometimes limited scalability
• Technical debt and/or lack of rapid innovation
• Customers are dependent on the company’s
continued business success

Competitive landscape
• In recent years, commercial companies have felt
increasing competition from
Lucene/Solr/Elasticsearch because of the
combination of their capability and price
• Some competitors have responded with
diversification
• Some have been acquired
• Need for good, affordable, flexible search
remains

Questions

Credits

28
ICIC, October 2013

More Related Content

What's hot

II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities
ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities  ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities
ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 

What's hot (20)

II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
New Product Introductions - Minesoft
New Product Introductions - MinesoftNew Product Introductions - Minesoft
New Product Introductions - Minesoft
 
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
 
ICIC 2017: Product presentations FIZ Karlsruhe
ICIC 2017: Product presentations FIZ KarlsruheICIC 2017: Product presentations FIZ Karlsruhe
ICIC 2017: Product presentations FIZ Karlsruhe
 
ICIC 2017: New product presentation minesoft
ICIC 2017: New product presentation minesoftICIC 2017: New product presentation minesoft
ICIC 2017: New product presentation minesoft
 
ICIC 2017: New product presentations CAS
ICIC 2017: New product presentations CASICIC 2017: New product presentations CAS
ICIC 2017: New product presentations CAS
 
II-SDV 2016 Aalt van de Kuilen - The Art of Patent Landscaping
II-SDV 2016 Aalt van de Kuilen - The Art of Patent LandscapingII-SDV 2016 Aalt van de Kuilen - The Art of Patent Landscaping
II-SDV 2016 Aalt van de Kuilen - The Art of Patent Landscaping
 
II-SDV 2016 Raphael Ilmer, Quentin Ladetto - Optimization of Patent Landscape...
II-SDV 2016 Raphael Ilmer, Quentin Ladetto - Optimization of Patent Landscape...II-SDV 2016 Raphael Ilmer, Quentin Ladetto - Optimization of Patent Landscape...
II-SDV 2016 Raphael Ilmer, Quentin Ladetto - Optimization of Patent Landscape...
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
II-SDV 2016 Questel Intellixir
II-SDV 2016 Questel IntellixirII-SDV 2016 Questel Intellixir
II-SDV 2016 Questel Intellixir
 
New Product Introductions - FIZ Karlsruhe
New Product Introductions - FIZ KarlsruheNew Product Introductions - FIZ Karlsruhe
New Product Introductions - FIZ Karlsruhe
 
II-SDV 2016 Michael Iarrobino - Improving Text Mining Results with Access to ...
II-SDV 2016 Michael Iarrobino - Improving Text Mining Results with Access to ...II-SDV 2016 Michael Iarrobino - Improving Text Mining Results with Access to ...
II-SDV 2016 Michael Iarrobino - Improving Text Mining Results with Access to ...
 
II-SDV 2016 Linguamatics
II-SDV 2016 LinguamaticsII-SDV 2016 Linguamatics
II-SDV 2016 Linguamatics
 
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
TIB's action for research data managament as a national library's strategy in...
TIB's action for research data managament as a national library's strategy in...TIB's action for research data managament as a national library's strategy in...
TIB's action for research data managament as a national library's strategy in...
 
ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities
ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities  ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities
ICIC 2014 Chemical Patent Curation and Management – New Tools and Capabilities
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 

Viewers also liked

ICIC 2013 New Product Introductions Dolcera
ICIC 2013 New Product Introductions DolceraICIC 2013 New Product Introductions Dolcera
ICIC 2013 New Product Introductions Dolcera
Dr. Haxel Consult
 
ICIC 2013 New Product Introductions BizInt
ICIC 2013 New Product Introductions BizIntICIC 2013 New Product Introductions BizInt
ICIC 2013 New Product Introductions BizInt
Dr. Haxel Consult
 
ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...
ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...
ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...
Dr. Haxel Consult
 
ICIC 2013 New Product Introductions Linguamatics
ICIC 2013 New Product Introductions LinguamaticsICIC 2013 New Product Introductions Linguamatics
ICIC 2013 New Product Introductions Linguamatics
Dr. Haxel Consult
 
ICIC 2013 New Product Introductions GenomeQuest
ICIC 2013 New Product Introductions GenomeQuestICIC 2013 New Product Introductions GenomeQuest
ICIC 2013 New Product Introductions GenomeQuest
Dr. Haxel Consult
 
ICIC 2013 New Product Introductions InfoChem
ICIC 2013 New Product Introductions InfoChemICIC 2013 New Product Introductions InfoChem
ICIC 2013 New Product Introductions InfoChem
Dr. Haxel Consult
 

Viewers also liked (20)

ICIC 2013 Conference Proceedings Krishna Molecular Connections
ICIC 2013 Conference Proceedings Krishna Molecular ConnectionsICIC 2013 Conference Proceedings Krishna Molecular Connections
ICIC 2013 Conference Proceedings Krishna Molecular Connections
 
ICIC 2013 New Product Introductions Dolcera
ICIC 2013 New Product Introductions DolceraICIC 2013 New Product Introductions Dolcera
ICIC 2013 New Product Introductions Dolcera
 
ICIC 2013 New Product Introductions BizInt
ICIC 2013 New Product Introductions BizIntICIC 2013 New Product Introductions BizInt
ICIC 2013 New Product Introductions BizInt
 
ICIC 2014 Tracking of the Mode of Action Landscape in Breast Cancer using Rep...
ICIC 2014 Tracking of the Mode of Action Landscape in Breast Cancer using Rep...ICIC 2014 Tracking of the Mode of Action Landscape in Breast Cancer using Rep...
ICIC 2014 Tracking of the Mode of Action Landscape in Breast Cancer using Rep...
 
New Product Introductions - GenomeQuest Life Sciences
New Product Introductions - GenomeQuest Life SciencesNew Product Introductions - GenomeQuest Life Sciences
New Product Introductions - GenomeQuest Life Sciences
 
ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...
ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...
ICIC 2014 Patent Landscape Analysis as a Tool for Public Policies Adjustment:...
 
ICIC 2014 Panel: Mobile Apps for Patent Searchers
ICIC 2014 Panel: Mobile Apps for Patent SearchersICIC 2014 Panel: Mobile Apps for Patent Searchers
ICIC 2014 Panel: Mobile Apps for Patent Searchers
 
ICIC 2013 Conference Proceedings Sebastian Radestock
ICIC 2013 Conference Proceedings Sebastian RadestockICIC 2013 Conference Proceedings Sebastian Radestock
ICIC 2013 Conference Proceedings Sebastian Radestock
 
ICIC 2013 New Product Introductions Linguamatics
ICIC 2013 New Product Introductions LinguamaticsICIC 2013 New Product Introductions Linguamatics
ICIC 2013 New Product Introductions Linguamatics
 
ICIC 2014 New Product Introduction Questel
ICIC 2014 New Product Introduction QuestelICIC 2014 New Product Introduction Questel
ICIC 2014 New Product Introduction Questel
 
ICIC 2014 New Product Introduction CAS
ICIC 2014 New Product Introduction CASICIC 2014 New Product Introduction CAS
ICIC 2014 New Product Introduction CAS
 
ICIC 2013 New Product Introductions GenomeQuest
ICIC 2013 New Product Introductions GenomeQuestICIC 2013 New Product Introductions GenomeQuest
ICIC 2013 New Product Introductions GenomeQuest
 
ICIC 2014 The Intermediates are becoming extict - radical Change for Info Pr...
ICIC 2014 The Intermediates are becoming extict  - radical Change for Info Pr...ICIC 2014 The Intermediates are becoming extict  - radical Change for Info Pr...
ICIC 2014 The Intermediates are becoming extict - radical Change for Info Pr...
 
ICIC 2014 New Product Introduction Averbis
ICIC 2014 New Product Introduction AverbisICIC 2014 New Product Introduction Averbis
ICIC 2014 New Product Introduction Averbis
 
ICIC 2014 How the European Patent Office Uses Asian Documentation
ICIC 2014 How the European Patent Office Uses Asian Documentation ICIC 2014 How the European Patent Office Uses Asian Documentation
ICIC 2014 How the European Patent Office Uses Asian Documentation
 
Knowledge Manager - Fit for the Future
Knowledge Manager - Fit for the Future Knowledge Manager - Fit for the Future
Knowledge Manager - Fit for the Future
 
ICIC 2014 New Product Introduction Infotrieve
ICIC 2014 New Product Introduction InfotrieveICIC 2014 New Product Introduction Infotrieve
ICIC 2014 New Product Introduction Infotrieve
 
ICIC 2013 New Product Introductions InfoChem
ICIC 2013 New Product Introductions InfoChemICIC 2013 New Product Introductions InfoChem
ICIC 2013 New Product Introductions InfoChem
 
ICIC 2014 New Product Introduction Gridlogisc
ICIC 2014 New Product Introduction GridlogiscICIC 2014 New Product Introduction Gridlogisc
ICIC 2014 New Product Introduction Gridlogisc
 
ICIC 2014 New Product Introduction ProQuest
ICIC 2014 New Product Introduction ProQuestICIC 2014 New Product Introduction ProQuest
ICIC 2014 New Product Introduction ProQuest
 

Similar to ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall

Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
Lucidworks (Archived)
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
Petter Skodvin-Hvammen
 
OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...
OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...
OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...
FINOS
 
An Introduction to ROS-Industrial
An Introduction to ROS-IndustrialAn Introduction to ROS-Industrial
An Introduction to ROS-Industrial
Clay Flannigan
 

Similar to ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall (20)

Apache Solr vs Oracle Endeca
Apache Solr vs Oracle EndecaApache Solr vs Oracle Endeca
Apache Solr vs Oracle Endeca
 
EnterpriseSearch
EnterpriseSearchEnterpriseSearch
EnterpriseSearch
 
The Enterprise Search Market in a Nutshell
The Enterprise Search Market in a NutshellThe Enterprise Search Market in a Nutshell
The Enterprise Search Market in a Nutshell
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
Scalable Search Analytics
Scalable Search AnalyticsScalable Search Analytics
Scalable Search Analytics
 
Nairobi OpenStack Meetup - July 2013
Nairobi OpenStack Meetup - July 2013Nairobi OpenStack Meetup - July 2013
Nairobi OpenStack Meetup - July 2013
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...
OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...
OSSF 2018 - Stefan Just of Codescoop - OSCAR - a new approach to Software Com...
 
Application of Library Management Software: NewGenLib
Application of Library Management Software: NewGenLibApplication of Library Management Software: NewGenLib
Application of Library Management Software: NewGenLib
 
Big Data Technologies.pdf
Big Data Technologies.pdfBig Data Technologies.pdf
Big Data Technologies.pdf
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Rootconf 2017 - State of the Open Source monitoring landscape
Rootconf 2017 - State of the Open Source monitoring landscape Rootconf 2017 - State of the Open Source monitoring landscape
Rootconf 2017 - State of the Open Source monitoring landscape
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
Webinar: Site Search in an Hour with Fusion
Webinar: Site Search in an Hour with FusionWebinar: Site Search in an Hour with Fusion
Webinar: Site Search in an Hour with Fusion
 
An Introduction to ROS-Industrial
An Introduction to ROS-IndustrialAn Introduction to ROS-Industrial
An Introduction to ROS-Industrial
 
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
 
Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library services
 
II-SDV 2016 GRIDLOGICS
II-SDV 2016 GRIDLOGICSII-SDV 2016 GRIDLOGICS
II-SDV 2016 GRIDLOGICS
 

More from Dr. Haxel Consult

AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
Dr. Haxel Consult
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
Dr. Haxel Consult
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
Dr. Haxel Consult
 

More from Dr. Haxel Consult (20)

AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementAI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
 
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
 
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
 
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
 
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
 
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...



ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall

  • 1. Open Source Search
      Andreas Pesenhofer
      max.recall information systems GmbH
      Künstlergasse 11/1 • A-1150 Wien • Austria
      ICIC, October 2013
  • 2. max.recall information systems
      • max.recall is a software and consulting company enabling enterprises to capitalize on the hidden value in the rapidly growing amount of textual data
      • Customized solutions for
          – Intelligent data analytics
          – Vertical search
      • Products and services
          – quantalyze: quantity analytics technology
          – smart.coder: open-ended question coding tool for market researchers
      • Founded in 2010 and located in Vienna, Austria
      • Operates worldwide with international customers from sectors such as IP, market research, news and media, IT services
  • 3. Recall and precision
      • Recall
          – Percent of relevant documents (items) returned
          – 50 good answers in system, 25 returned = 50% recall
      • Precision
          – Percent of documents returned that are relevant
          – 100 returned, 25 are relevant = 25% precision
      • Ideal is 100% recall and 100% precision: return all relevant documents and only those
      • 100% recall is easy: return all documents. Precision is then low, and the relevant documents can’t be found
      • Need adequate recall and enough precision for the task; that will vary by application (data and users)
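
The arithmetic on this slide can be checked with a few lines of Python. This is only a minimal sketch; the function names `recall` and `precision` are illustrative and do not come from the talk.

```python
def recall(relevant_returned: int, relevant_total: int) -> float:
    """Fraction of all relevant documents that the system returned."""
    if relevant_total == 0:
        return 0.0
    return relevant_returned / relevant_total


def precision(relevant_returned: int, returned_total: int) -> float:
    """Fraction of the returned documents that are relevant."""
    if returned_total == 0:
        return 0.0
    return relevant_returned / returned_total


# Numbers from the slide: 50 relevant documents exist, 100 are returned,
# and 25 of the returned documents are relevant.
slide_recall = recall(25, 50)
slide_precision = precision(25, 100)
print(f"recall={slide_recall:.0%} precision={slide_precision:.0%}")
```

Returning every document drives recall to 100% but pushes precision toward the share of relevant documents in the whole collection, which is the trade-off the slide describes.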
  • 4. How to get good recall
      • Collect, index and search all the data
          – Check for missing or corrupt data
      • Index everything
          – Search everything … limit results by category AFTER the search (clustering/faceting)
      • Normalize the data
          – Convert to lower case, strip/handle special characters, stemming, ...
      • Use spell-checking, synonyms to match users’ vocabulary with content
          – Adaptive spell-checking, application-specific synonyms
      • Light (or real) natural language processing for abstract concepts
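
The normalization steps named on this slide (lower-casing, handling special characters, stemming, synonyms) could be sketched as follows. Everything here is an assumption made for illustration: the suffix-stripping rule is deliberately naive and is not the stemmer used by Lucene, and the `SYNONYMS` table is a hypothetical application-specific list.

```python
import re
import unicodedata

# Hypothetical application-specific synonym table (assumption, not from the talk).
SYNONYMS = {"car": "automobile"}


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token: str) -> str:
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Lower-case, strip accents and punctuation, stem, then map synonyms."""
    cleaned = strip_accents(text.lower())
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    stems = [naive_stem(token) for token in tokens]
    return [SYNONYMS.get(stem, stem) for stem in stems]


print(normalize("Indexing Cars!"))
```

Applying the same `normalize` function to both documents and queries is what lets a query for "cars" match content that says "automobile".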
  • 5. How to get good precision
      • Term frequency (TF) – more occurrences of query terms is better
      • Inverse document frequency (IDF) – rarer query terms are more important
      • Phrase boost – query terms near each other are better
      • Field boost – where the query term is in the doc matters (e.g., in 'title' is better)
      • Length normalization – avoid penalizing short docs
      • Recency – all things being equal, recent is better
      • Authority – items linked to, clicked on or bought by others may be better
      • Implicit and explicit relevance feedback, more-like-this – expand query
      • Clustering/faceting – intent is not specific
      • Lots of data
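
A few of these signals (TF, IDF, field boost, length normalization) can be combined in a toy scorer. This is a simplified TF-IDF variant written for illustration, not the scoring formula of Lucene or any other engine; the sample documents, the `FIELD_BOOST` weights and all function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical two-document collection with 'title' and 'body' fields.
DOCS = {
    "d1": {"title": "solar cell",
           "body": "a solar cell converts light into electricity"},
    "d2": {"title": "wind turbine",
           "body": "a turbine converts wind into electricity and the solar market grows"},
}

# Assumed field weights: a match in the title counts twice as much.
FIELD_BOOST = {"title": 2.0, "body": 1.0}


def idf(term: str, docs: dict) -> float:
    """Rarer terms get a higher weight (smoothed inverse document frequency)."""
    df = sum(1 for doc in docs.values()
             if any(term in text.split() for text in doc.values()))
    return math.log((len(docs) + 1) / (df + 1)) + 1.0


def score(query: str, doc: dict, docs: dict) -> float:
    """Sum boosted, length-normalized TF-IDF contributions over all fields."""
    total = 0.0
    for field, text in doc.items():
        tokens = text.split()
        if not tokens:
            continue
        counts = Counter(tokens)
        length_norm = 1.0 / math.sqrt(len(tokens))  # shorter fields score higher
        for term in query.split():
            tf = math.sqrt(counts[term])  # dampened term frequency
            total += FIELD_BOOST.get(field, 1.0) * tf * idf(term, docs) * length_norm
    return total


def rank(query: str, docs: dict) -> list[str]:
    """Return document ids ordered from best to worst score."""
    return sorted(docs, key=lambda doc_id: score(query, docs[doc_id], docs),
                  reverse=True)


print(rank("solar", DOCS))
```

In this sketch "d1" ranks first for the query "solar" because the term appears in its short, boosted title, while "d2" only mentions it once in a longer body.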
  • 7. Growth of patent applications
  • 8. Big Data Open Source Tools
  • 9. Apache Lucene
      Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
      • Scalable, High-Performance Indexing
          – over 150GB/hour on modern hardware
          – small RAM requirements - only 1MB heap
          – incremental indexing as fast as batch indexing
          – index size roughly 20-30% the size of text indexed
  • 10. Apache Lucene (2)
      • Powerful, Accurate and Efficient Search Algorithms
          – ranked searching - best results returned first
          – many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
          – fielded searching (e.g. title, author, contents)
          – sorting by any field
          – multiple-index searching with merged results
          – allows simultaneous update and searching
          – flexible faceting, highlighting, joins and result grouping
          – fast, memory-efficient and typo-tolerant suggesters
          – pluggable ranking models, including the Vector Space Model and Okapi BM25
          – configurable storage engine (codecs)
      • Cross-Platform Solution
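
To give a feel for one of the ranking models named above, here is a minimal pure-Python sketch of Okapi BM25. It does not use Lucene and does not reproduce Lucene's implementation details; the parameter defaults `k1=1.2` and `b=0.75` are commonly used values, and the IDF form with the added 1 inside the logarithm is one common variant, chosen here as an assumption.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Guard against a collection of empty documents.
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1.0

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        doc_score = 0.0
        for term in query.lower().split():
            freq = counts[term]
            if freq == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            length_factor = 1 - b + b * len(tokens) / avg_len
            doc_score += idf * freq * (k1 + 1) / (freq + k1 * length_factor)
        scores.append(doc_score)
    return scores


sample_docs = [
    "patent search engine",
    "search search search engine index library",
    "cooking recipes",
]
print(bm25_scores("patent search", sample_docs))
```

The first sample document scores highest because it matches both query terms, including the rarer one; repeating "search" three times in the second document helps only up to the saturation controlled by `k1`.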
  • 11. Apache Lucene (3)
      • Cross-Platform Solution
          – Available as Open Source software under the Apache License, which allows Lucene to be used in both commercial and Open Source programs
          – 100%-pure Java
          – Implementations in other programming languages are available; the index is compatible
      • Apache Lucene 4.5.0 was released on October 5th, 2013.
  • 12. Apache Solr
      • Apache Solr is an open source enterprise search platform from the Apache Lucene project.
      • Major features:
          – full-text search
          – hit highlighting
          – faceted search
          – dynamic clustering
          – database integration
          – handling of rich documents (e.g., Word, PDF)
          – distributed search and index replication, which make Solr highly scalable
      • Apache Solr 4.5.0 was released on October 5th, 2013.
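
Full-text search, faceting and hit highlighting are typically requested from Solr through HTTP query parameters. The sketch below only builds such a request URL and does not contact a server. The host, the core name `patents` and the field names `title`, `applicant` and `abstract` are hypothetical; `q`, `wt`, `rows`, `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url: str, core: str, user_query: str,
                         facet_field: str, highlight_field: str,
                         rows: int = 10) -> str:
    """Build a select URL asking for results, facet counts and highlights."""
    params = [
        ("q", user_query),            # the full-text query
        ("wt", "json"),               # response format
        ("rows", str(rows)),          # number of hits to return
        ("facet", "true"),            # enable faceting
        ("facet.field", facet_field),
        ("hl", "true"),               # enable hit highlighting
        ("hl.fl", highlight_field),
    ]
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local server, core and fields, for illustration only.
url = build_solr_query_url("http://localhost:8983", "patents",
                           'title:"solar cell"', "applicant", "abstract")
print(url)
```

Keeping the facet on a separate parameter from the query matches the advice on the recall slide: search everything, then narrow results by category after the search.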
  • 13. elasticsearch
      • elasticsearch is a distributed, RESTful, open source search server based on Apache Lucene. It is developed by Shay Banon and is released under the terms of the Apache License.
      • Major features:
          – fully supports the near real-time search of Apache Lucene
          – cluster setup needs no additional software
          – features of Lucene are made available through the JSON and Java API
          – JSON in / JSON out (and YAML)
      • elasticsearch 0.90.5 was released on September 17th, 2013, based on Lucene 4.4.
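
"JSON in / JSON out" means that a search is expressed as a JSON document sent to a REST endpoint. The following sketch only assembles the request path and body and sends nothing over the network. The index name `patents` and the field `title` are hypothetical; the `_search` endpoint and the `match` query are part of the elasticsearch query DSL.

```python
import json


def build_search_request(index: str, field: str, text: str, size: int = 10) -> tuple[str, str]:
    """Return the REST path and the JSON body for a simple match query."""
    path = f"/{index}/_search"
    body = {
        "query": {"match": {field: text}},  # analyzed full-text match on one field
        "size": size,                       # number of hits to return
    }
    return path, json.dumps(body)


# Hypothetical index and field, for illustration only.
path, payload = build_search_request("patents", "title", "solar cell")
print(path)
print(payload)
```

The body would be sent with an HTTP client of your choice; the response is again a JSON document containing the hits.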
  • 14. All Time Top Committers
  • 16. Lines of Code
  • 17. The Mailing Lists
  • 18. Interest over time
  • 19. Case study - quantalyze
      patents.quantalyze.com
      • Runs in your browser
      • Filter and keyword search
      • Physical quantity search
      • Interval search
      • Print view
      • Visualization:
          – Physical quantity distributions
          – Cross-tabulations (e.g. concepts vs. quantity type)
          – Different chart types
  • 20. Case study - StumbleUpon
      Create a world-class customer experience
      • A “Stumble” provides real-time recommendations to 30 million customers per day
      • Intelligent search is key to providing fast and more informed recommendations
      • Update your searches immediately with newly posted content
      Develop and scale easily
      • Build in intelligent search to scale with millions of users and interactions
      • Take advantage of powerful and flexible APIs for easy data integration
      • Use easy-to-use but powerful solutions for your big data search and analytics needs
  • 21. Strengths of open source search
      • Best practice segmented index (like Google, Fast)
      • Scalability
      • Best practice, flexible ranking (term/field/doc boosts, function queries, custom scoring…)
      • Best overall query performance and complete query capabilities (unlimited Boolean operations, wildcards, find similar, synonyms, spell-check…)
      • Multilingual, query filters, geo search, memory mapped indexes, near real-time search, advanced proximity operators…
      • Rapid innovation
      • Extensible architecture, complete control (open source)
      • No license fees (open source)
  • 22. Weaknesses of open source search
      • Those typical of open source
          – No formal support
          – Limited access to training, consulting
          – Lack of stringent integrated QA
          – Speed of development and open source environment too complex for some (e.g., what version should I download? What patches? GUI?)
      • Others
          – Lucene/Solr/Elasticsearch development has tended to focus on core capabilities, so certain features for enterprise search are missing (e.g., connectors, security, alerts, advanced query operations)
  • 23. Addressing open source weaknesses
      • Community
          – Community has a wealth of information on web sites, wikis and mailing lists
          – Community members usually respond quickly to questions
      • Consultants
          – May be especially helpful for systems integration or addressing gaps
      • Commercialization
          – Companies commercializing open source provide commercial support, certified versions, training and consulting
      • Internal resources
  • 24. Product strengths of commercial competitors
      • Well established players tend to be full-featured
      • Some organizations have focused on a particular application or domain (e.g., ecommerce, publishing, legal, help desk)
      • Some competitors have focused on appliances
  • 25. Weaknesses of top commercial competitors
      • Usually expensive, especially at scale
      • Platform or portability limitations
      • Limited transparency
      • Limited flexibility, especially for other than intended application or domain
      • Limited customization, especially for appliance-like products
      • Sometimes limited scalability
      • Technical debt and/or lack of rapid innovation
      • Customers are dependent on the company’s continued business success
  • 26. Competitive landscape
      • In recent years, commercial companies have felt increasing competition from Lucene/Solr/Elasticsearch because of their combination of capability and price
      • Some competitors have responded with diversification
      • Some have been acquired
      • Need for good, affordable, flexible search remains