Apache Lucene is a high-performance, full-featured text search engine library written in Java. It provides indexing and searching capabilities over various document formats. The Lucene architecture involves indexing documents, building queries, searching the index, and returning results. Core classes for indexing include IndexWriter, Directory, Analyzer, Document, and Field. Core searching classes are IndexSearcher, Query, QueryParser, TopDocs, and ScoreDoc. A demo was presented to index and search documents using Lucene's core classes.
2. Agenda
What is Apache Lucene?
Focus of Apache Lucene
Lucene Architecture
Core Indexing Classes
Core Searching Classes
Demo
Questions & Answers
3. What is Apache Lucene?
“Apache Lucene is a high-performance, full-featured text search
engine library written entirely in Java.”
Also known as an information retrieval (IR) library.
Lucene is specifically an API, not an application.
Open Source
4. Focus
Indexing Documents
Searching Documents
Note:
You can use Lucene to provide consistent full-text indexing across
both database objects and documents in various formats (Microsoft
Office documents, PDF, HTML, text, emails and so on).
10. Analyzers
An Analyzer tokenizes the input text
Common Analyzers
– WhitespaceAnalyzer: splits tokens on whitespace
– SimpleAnalyzer: splits tokens on non-letters, then lowercases
– StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
– StandardAnalyzer: most sophisticated analyzer; knows about certain token types, lowercases, removes stop words, ...
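The differences between the first three analyzers can be sketched in plain Java. This is only an approximation for illustration, not Lucene's implementation: the class name AnalyzerSketch, its methods, and the abbreviated stop-word list are assumptions made for this example. StandardAnalyzer is left out because its token-type grammar is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximation of three analyzers; NOT Lucene's own code.
class AnalyzerSketch {
    // Hypothetical, abbreviated stop-word list chosen for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    // WhitespaceAnalyzer-like: split on whitespace only; case and punctuation are kept.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // StopAnalyzer-like: same as simple(), then drop the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox in 2009";
        System.out.println(whitespace(text)); // [The, Quick-Brown, fox, in, 2009]
        System.out.println(simple(text));     // [the, quick, brown, fox, in]
        System.out.println(stop(text));       // [quick, brown, fox]
    }
}
```

Note that the same analyzer should normally be used at index time and at query time, otherwise query terms may not match the indexed tokens.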
13. Document & Fields
A Document is the atomic unit of indexing and
searching; it contains Fields.
Fields have a name and a value
– You have to translate raw content into Fields
– Examples: title, author, date, abstract, body, URL, keywords, ...
– Different documents can have different fields
14. Field options
Field.Store
– NO: Don’t store the field value in the index
– YES: Store the field value in the index
Field.Index
– ANALYZED: Tokenize with an Analyzer
– NOT_ANALYZED: Do not tokenize
– NO: Do not index this field
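What the two option groups mean in practice can be shown with a toy model in plain Java. This is a sketch only: FieldOptionsSketch and everything in it are hypothetical names that mirror the options above, not Lucene classes, and the lowercase-and-split step merely stands in for a real Analyzer. Store decides whether the original value can be read back from a hit; Index decides whether, and how, the value can be searched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the Store/Index options; hypothetical names, NOT Lucene classes.
class FieldOptionsSketch {
    enum Store { YES, NO }
    enum Index { ANALYZED, NOT_ANALYZED, NO }

    // Stored values: document id -> (field name -> original value).
    private final List<Map<String, String>> stored = new ArrayList<>();
    // Inverted index: "field:term" -> ids of the documents containing the term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Starts a new, empty document and returns its id.
    int newDocument() {
        stored.add(new HashMap<String, String>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            stored.get(docId).put(name, value);   // value can be read back later
        }
        if (index == Index.ANALYZED) {
            // Stand-in for an Analyzer: lowercase, split on non-letters/non-digits.
            for (String term : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) post(name, term, docId);
            }
        } else if (index == Index.NOT_ANALYZED) {
            post(name, value, docId);             // the whole value is one term
        }
        // Index.NO: nothing is added to the inverted index.
    }

    private void post(String field, String term, int docId) {
        String key = field + ":" + term;
        List<Integer> ids = postings.get(key);
        if (ids == null) {
            ids = new ArrayList<>();
            postings.put(key, ids);
        }
        if (!ids.contains(docId)) ids.add(docId);
    }

    // Exact term lookup in one field; returns the matching document ids.
    List<Integer> search(String field, String term) {
        List<Integer> ids = postings.get(field + ":" + term);
        return ids == null ? new ArrayList<Integer>() : ids;
    }

    // Returns the stored value, or null if the field was not stored.
    String get(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        FieldOptionsSketch idx = new FieldOptionsSketch();
        int doc = idx.newDocument();
        idx.addField(doc, "title", "Lucene in Action", Store.YES, Index.ANALYZED);
        idx.addField(doc, "id", "DOC-42", Store.YES, Index.NOT_ANALYZED);
        idx.addField(doc, "body", "Full text search", Store.NO, Index.ANALYZED);

        System.out.println(idx.search("title", "lucene")); // [0]
        System.out.println(idx.search("id", "DOC-42"));    // [0]
        System.out.println(idx.search("id", "doc"));       // []
        System.out.println(idx.get(doc, "body"));          // null
    }
}
```

A common pattern follows from this: analyze and index large body text without storing it, and store identifiers or paths without analyzing them, so that a hit can be mapped back to the source document.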
15. Searching an Index
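// Lucene 3.x API. directory, analyzer, field_name, WORD_SEARCHED and noOfHits are defined elsewhere.
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.util.Version;

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, field_name, analyzer); // example constant; use the Version matching your release
Query query = parser.parse(WORD_SEARCHED);
TopDocs hits = searcher.search(query, noOfHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;
if (scoreDocs.length > 0) {
    Document doc = searcher.doc(scoreDocs[0].doc); // look at first match, using its document number
    System.out.println("name=" + doc.get("name"));
}
searcher.close();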
20. QueryParser syntax examples
Query expression                   Document matches if…
java                               Contains the term java in the default field
java junit                         Contains the term java or junit or both in the default field
java OR junit                      (the default operator can be changed to AND)
+java +junit                       Contains both java and junit in the default field
java AND junit
title:ant                          Contains the term ant in the title field
title:extreme -subject:sports      Contains extreme in the title and not sports in subject
(agile OR extreme) AND java        Boolean expression matches
title:"junit in action"            Phrase matches in title
title:"junit action"~5             Proximity matches (within 5) in title
java*                              Wildcard matches
java~                              Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]  Range matches
21. TopDocs
Class containing the top N ranked results
that match a given query.
ScoreDoc
A single hit: the document number (doc) and its score.
TopDocs.scoreDocs is the array of ScoreDoc hits for the query.
22. Demo of simple indexing and searching
using Apache Lucene
You will require lucene-core-x.y.jar for this demo.