Full Text Search
David LeBer
Align Software Inc.
What is full text search?
How?

•   Wild card database queries

•   Database implementations

•   Third party search engines

•   Text indexing libraries
Wild Card Queries

SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'
Wild Card Queries



•   Easy
Wild Card Queries


•   Slow

•   Hard to optimize

•   Difficult to rank
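A rough way to picture these drawbacks: a pattern with a leading wildcard typically cannot use an ordinary column index, so the database ends up testing every row. The sketch below mimics that in plain Java; the class name and the in-memory row list are hypothetical stand-ins for a table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of what LIKE '%needle%' amounts to without a usable index.
final class WildcardScan {

    // Visits every row and runs a substring test, so the cost grows with the table.
    // Matches come back in table order; nothing here says which match is "better".
    // Note: String.contains is case-sensitive; SQL LIKE depends on the collation.
    static List<String> like(List<String> rows, String needle) {
        List<String> matches = new ArrayList<>();
        for (String row : rows) {
            if (row.contains(needle)) {
                matches.add(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "Full text search with Lucene",
            "WebObjects deployment notes",
            "Searching text in WebObjects");
        System.out.println(like(rows, "text"));
    }
}
```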
Database Implementations


•   MySQL FULLTEXT index and MATCH queries

•   PostgreSQL tsvector & tsquery
Database Implementations



•   Fairly Easy
Database Implementations

•   Database specific SQL

•   May include additional limitations
    (e.g. MySQL: MyISAM tables only)

•   Functionality defined by the DB engine
Third Party Search Engines



•   Google indexing / searching of your content
Third Party Search Engines


•   Easy

•   Matches user expectations
Third Party Search Engines


•   Content must be available for indexing

•   Loss of control

•   Enhances the Google hegemony
Text Indexing Library



•   Lucene
Text Indexing Library

•   Complete control

•   Database independent

•   Flexible search behaviour

•   Ranked results
Text Indexing Library


•   Adds complexity

•   Additional query language

•   Parallel index
Lucene Overview

•   Open Source - part of the Apache Project

•   Very flexible

•   Wickedly fast

•   Index based
Lucene : Installing


•   Add the Lucene jars to your classpath

•   Use ERIndexing
Lucene : Tasks


•   Indexing

•   Searching
Indexing
What is Indexing?
Indexing : Steps


•   Conversion (to plain text)

•   Analysis (clean and convert the text to tokens)

•   Index (save the tokens to the index)
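The three steps above can be sketched with a toy inverted index in plain Java. This is only an illustration, not how Lucene is implemented: the class, the tag-stripping "conversion", and the tiny stop word list are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; all names are hypothetical and this is not Lucene's implementation.
final class ToyIndexer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // term -> ids of the documents that contain it
    final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Step 1, conversion: reduce the source format to plain text (here: drop tags).
    static String toPlainText(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Step 2, analysis: lower-case, split on non-letters/digits, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3, index: record each token against the document id.
    void add(int docId, String html) {
        for (String token : analyze(toPlainText(html))) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    public static void main(String[] args) {
        ToyIndexer index = new ToyIndexer();
        index.add(1, "<h1>The Art of Search</h1>");
        index.add(2, "<p>Search and indexing</p>");
        System.out.println(index.postings);
    }
}
```

Storing terms rather than raw text is what lets a later lookup go straight from a word to the documents that contain it.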
Indexing : Parts


•   Index - either file or memory based

•   Document - represents a unique object added to the index

•   Field - identifies a chunk of data in the document
Indexing : Classes

•   IndexWriter

•   Directory

•   Analyzer

•   Document

•   Field
Creating an Index

URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
// Open a file-system-backed index directory
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// true = create a new index, overwriting any existing one at that location
IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true,
                                IndexWriter.MaxFieldLength.UNLIMITED);
Indexing : Field Parameters


•   Stored or not

•   Analyzed or not, with and without norms

•   Include position, offset, and term frequency
Indexing : Analyzers

•   SimpleAnalyzer

•   StopAnalyzer

•   StandardAnalyzer

•   ...
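To give a feel for how analyzers differ, here is a plain-Java approximation of what SimpleAnalyzer (break at non-letters, lower-case) and StopAnalyzer (the same, minus stop words) do to a piece of text. It does not use the Lucene classes, and the stop word list is a small illustrative assumption rather than Lucene's default set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximation of two analyzers' behaviour; not the Lucene classes themselves.
final class AnalyzerSketch {
    // Small illustrative stop list (an assumption, not Lucene's actual default set).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in"));

    // Roughly SimpleAnalyzer: break the text at non-letters, then lower-case.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Roughly StopAnalyzer: the same tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Art of Full-Text Search in WebObjects 5";
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

Whichever analyzer is used at indexing time generally needs to be used at search time too, otherwise query terms and indexed terms may not line up.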
Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
                            Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);
Indexing : Fun with indexes



•   Multiple Access
Searching
What is Searching?
Searching : Steps

•   Clean the user input

•   Create a Query

•   Query the Index

•   Return the results
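The four steps above, sketched against a hand-built in-memory index in plain Java. Everything here (class name, AND-only query semantics, title lookup) is a simplifying assumption for illustration; Lucene's own classes do considerably more.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical mini search over an in-memory index; not the Lucene API.
final class ToySearch {
    final Map<String, Set<Integer>> postings = new HashMap<>(); // term -> doc ids
    final Map<Integer, String> titles = new HashMap<>();        // doc id -> stored title

    void add(int id, String title) {
        titles.put(id, title);
        for (String term : clean(title)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
    }

    // Step 1: clean the user input (lower-case, split on anything but letters and digits).
    static List<String> clean(String input) {
        List<String> terms = new ArrayList<>();
        for (String t : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    List<String> search(String userInput) {
        // Step 2: create the query; here simply "every term is required".
        List<String> terms = clean(userInput);
        if (terms.isEmpty()) {
            return new ArrayList<>();
        }
        // Step 3: query the index by intersecting the posting sets.
        Set<Integer> ids = null;
        for (String term : terms) {
            Set<Integer> hits = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (ids == null) {
                ids = new TreeSet<>(hits);
            } else {
                ids.retainAll(hits);
            }
        }
        // Step 4: return the results as stored titles, in doc id order.
        List<String> results = new ArrayList<>();
        for (int id : ids) {
            results.add(titles.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        ToySearch s = new ToySearch();
        s.add(1, "Apple WebObjects");
        s.add(2, "Apple iTunes");
        s.add(3, "WebObjects and Lucene");
        System.out.println(s.search("  WEBOBJECTS apple! "));
    }
}
```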
Searching : Search Classes
•   IndexReader

•   IndexSearcher

•   Query

•   QueryParser

•   TopDocs/ScoreDocs

•   Document
Searching : Query Types
•   TermQuery

•   RangeQuery

•   PrefixQuery

•   BooleanQuery

•   PhraseQuery

•   WildcardQuery

•   FuzzyQuery
Searching : QueryParser
•   'webobjects' - contains an exact match - TermQuery

•   'webobjects apple', 'webobjects OR apple' - an OR Query

•   +webobjects +apple / webobjects AND apple - an AND Query

•   title:webobjects - Contains the term in title field

•   title:webobjects -subject:iTunes / title:webobjects AND NOT
    subject:iTunes

•   (webobjects OR apple) AND iTunes
Searching : QueryParser

•   title:"apple webobjects" - Phrase Query

•   title:"apple webobjects"~5 - slop of 5

•   webobj* - Prefix Query

•   webobjicts~ - Fuzzy Query

•   lastmodified:[1/1/10 TO 1/1/11] - Range Query
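The fuzzy example above (webobjicts~ finding webobjects) rests on the idea of edit distance: how many single-character changes separate two terms. A minimal Levenshtein sketch in plain Java follows; the fuzzyMatch helper and its maxEdits threshold are hypothetical simplifications, not Lucene's actual matching rule.

```java
// Levenshtein edit distance; fuzzyMatch is a hypothetical helper, not part of Lucene.
final class EditDistance {

    // Number of single-character insertions, deletions, or substitutions
    // needed to turn a into b (classic two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Treat two terms as a fuzzy match when they are at most maxEdits apart.
    static boolean fuzzyMatch(String queryTerm, String indexedTerm, int maxEdits) {
        return levenshtein(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("webobjicts", "webobjects"));
        System.out.println(fuzzyMatch("webobjicts", "webobjects", 2));
    }
}
```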
Performing a Search

Query query = ... // assume exists
// true = open the index read-only
IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
// Collect the 10 highest-scoring hits
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_29,
                                          "content", analyzer);
Query query = queryParser.parse(queryString);
Demo
Scoring
“The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
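The idea in the quote is commonly formalized as TF-IDF: term frequency in the document, weighted down by how many documents contain the term. The sketch below is a textbook variant in plain Java, offered only as an illustration; Lucene's real scoring includes further factors (such as boosts and length normalization), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Textbook TF-IDF sketch; an assumption for illustration, not Lucene's scoring formula.
final class TfIdfSketch {

    // tf * idf, where tf = occurrences of the term in this document and
    // idf = ln(totalDocs / docsContainingTerm) + 1, so rarer terms weigh more.
    static double score(String term, List<String> doc, List<List<String>> collection) {
        int tf = Collections.frequency(doc, term);
        int df = 0;
        for (List<String> d : collection) {
            if (d.contains(term)) {
                df++;
            }
        }
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) collection.size() / df) + 1.0;
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "search", "lucene");
        List<String> d2 = Arrays.asList("search", "engine");
        List<String> d3 = Arrays.asList("database", "search");
        List<List<String>> all = Arrays.asList(d1, d2, d3);
        // "lucene" is frequent in d1 and rare overall; "search" appears everywhere.
        System.out.println(score("lucene", d1, all));
        System.out.println(score("search", d1, all));
    }
}
```

In this toy collection "lucene" scores higher than "search" for the first document, because it occurs twice there and in no other document.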
Boost

•   While Indexing

    •   Document

    •   Field

•   While Searching

    •   Query
Luke
Demo
ERIndexing
ERIndexing : Strengths

•   Hides some of the complexity of integrating Lucene with WO

•   Offers lots of utility and helper methods

•   Speaks WebObjects collection classes

•   Simplifies index creation
ERIndexing : Weaknesses


•   Hides some of the complexity of integrating Lucene with WO

•   Not fully baked

•   Auto indexing may be dangerous
Demo
Beyond Lucene


•   Solr

•   Compass

•   ElasticSearch
Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/
