3. Basics
Search Engine
Open Source
Supports full-text search, sorting, filtering, and many other search features
The core of Lucene consists of:
Inverted Index
Relevance Score
Search Algorithms
Tokenization
4. Index
An index is a collection of documents.
These documents may or may not share a schema.
Fields: A document consists of one or more fields. Each field can be of a different data type.
Each field is represented as a key-value pair.
Terms: When a field is processed by an analyzer, it produces terms.
A term is “the unit of search” in search engines.
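The relationship between documents, fields, and terms can be sketched in a few lines of Python. This is a toy model, not Lucene's API: the `analyze` function and the sample document are assumptions made for illustration, and a real analyzer does much more (stop words, stemming, and so on).

```python
import re

def analyze(text):
    """Toy analyzer: lowercase the text and split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

# A document is a set of fields, each a key-value pair (hypothetical sample data).
doc = {"title": "Lucene in Action", "year": 2010}

# Running a text field through the analyzer yields terms, tagged here with the field name.
terms = [("title", t) for t in analyze(doc["title"])]
print(terms)  # [('title', 'lucene'), ('title', 'in'), ('title', 'action')]
```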
5. Segment
An index is split into smaller sections called segments. Each segment has its own index.
Lucene searches all the segments in sequence.
Data (a document) once written to a segment can never be modified.
However, Lucene can merge multiple segments to optimize performance.
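A minimal sketch of these three properties (immutable segments, sequential search, merging), assuming a toy `Segment` class invented for this example. For brevity it scans documents directly instead of keeping a per-segment index, so it only illustrates the behavior, not Lucene's file format.

```python
class Segment:
    """Toy immutable segment: doc id -> frozen set of terms, fixed at creation."""
    def __init__(self, docs):
        self._docs = {doc_id: frozenset(terms) for doc_id, terms in docs.items()}

    def search(self, term):
        return sorted(d for d, terms in self._docs.items() if term in terms)

    def docs(self):
        return dict(self._docs)

def search_all(segments, term):
    """Search every segment in sequence and concatenate the hits."""
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return hits

def merge(segments):
    """Build one new segment from several; the inputs are left untouched."""
    merged = {}
    for seg in segments:
        merged.update(seg.docs())
    return Segment(merged)

seg1 = Segment({1: ["apache", "lucene"], 2: ["search"]})
seg2 = Segment({3: ["lucene", "index"]})
hits_before = search_all([seg1, seg2], "lucene")
hits_after = search_all([merge([seg1, seg2])], "lucene")
print(hits_before, hits_after)  # [1, 3] [1, 3]
```

Merging changes how many segments a search has to visit, not what it finds.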
6. Inverted Index
An inverted index is an index data structure.
In simple words, it inverts the “document-centric” data structure (document -> terms) into a “term-centric” data structure (term -> documents).
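The inversion can be shown with a small Python sketch. The sample documents and the whitespace tokenizer are assumptions for illustration; Lucene's real index also stores positions, frequencies, and more.

```python
from collections import defaultdict

# Document-centric view: document id -> text (hypothetical sample data).
docs = {
    1: "lucene is a search library",
    2: "search engines use an inverted index",
    3: "lucene builds an inverted index",
}

def build_inverted_index(docs):
    """Term-centric view: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["lucene"]))    # [1, 3]
print(sorted(index["search"]))    # [1, 2]
print(sorted(index["inverted"]))  # [2, 3]
```

Looking up a term is now a single dictionary access instead of a scan over every document.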
7. Lucene: Insert (Indexing)
“Indexing” is the process of inserting documents into Lucene.
Lucene first writes data to an in-memory buffer.
When the buffer reaches a certain size, it is flushed to a “segment”.
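The buffer-then-flush behavior might look like the sketch below. `ToyWriter` and its `max_buffered_docs` threshold are hypothetical names for this example; the threshold here counts documents purely to keep the sketch simple.

```python
class ToyWriter:
    """Toy writer: buffers documents in memory and flushes them as a segment."""
    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs  # assumed flush threshold
        self.buffer = []
        self.segments = []

    def add_document(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        # A flushed segment is stored as a tuple to signal it is no longer modified.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

writer = ToyWriter(max_buffered_docs=2)
for title in ["a", "b", "c"]:
    writer.add_document({"title": title})

print(len(writer.segments), len(writer.buffer))  # 1 1
```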
8. Lucene: Delete
A document is never removed from a segment; it is only marked as deleted in a file so that it cannot be accessed during search.
This can be considered a soft delete.
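A soft delete can be sketched as follows. The `deleted` set stands in for the marker file mentioned above; the class and its method names are assumptions for illustration only.

```python
class Segment:
    """Toy segment whose documents are never removed, only marked as deleted."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc id -> text; not changed after creation
        self.deleted = set()     # stands in for the "deleted" marker file

    def delete(self, doc_id):
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def search(self, term):
        # Documents marked as deleted are skipped at search time.
        return sorted(
            d for d, text in self.docs.items()
            if d not in self.deleted and term in text.split()
        )

seg = Segment({1: "apache lucene", 2: "lucene search"})
seg.delete(1)
print(seg.search("lucene"))  # [2]
print(1 in seg.docs)         # True: the data is still physically present
```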
9. Lucene: Update
A document is never really updated in place.
An update is actually a two-step process:
The “older version” is marked “deleted” in its “original segment”.
The “new version” is “added” to the “current segment”.
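The two steps can be sketched with plain dictionaries. The segment layout (`docs` list plus `deleted` positions) and the `id` key field are assumptions made for this example.

```python
# Each toy segment holds a list of documents and the positions marked as deleted.
original = {"docs": [{"id": "42", "name": "Jack"}], "deleted": set()}
current = {"docs": [], "deleted": set()}

def update(segments, current, key, new_doc):
    # Step 1: mark every older version as deleted in the segment where it lives.
    for seg in segments:
        for pos, doc in enumerate(seg["docs"]):
            if doc["id"] == key:
                seg["deleted"].add(pos)
    # Step 2: add the new version to the current segment.
    current["docs"].append(new_doc)

def live_docs(segments):
    """Return only the documents that are not marked as deleted."""
    return [
        doc
        for seg in segments
        for pos, doc in enumerate(seg["docs"])
        if pos not in seg["deleted"]
    ]

update([original], current, "42", {"id": "42", "name": "Jack Sparrow"})
print(live_docs([original, current]))  # [{'id': '42', 'name': 'Jack Sparrow'}]
```

The old version stays in `original["docs"]`; only the marker changed.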
10. Lucene: Get or Search
Searching or retrieving results from Lucene is a multi-step process:
Query Parser: Creates a query.
Index Searcher: Runs the query against the index.
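The split between parsing and searching can be sketched as two small functions. Both are toy stand-ins with invented names and a hypothetical pre-built index; they handle only a single `field:term` clause.

```python
def parse_query(query_string, default_field="body"):
    """Toy parser: turn 'field:term' (or a bare 'term') into a (field, term) pair."""
    field, sep, term = query_string.partition(":")
    if not sep:
        field, term = default_field, query_string
    return field.strip(), term.strip().lower()

def search(index, query):
    """Toy searcher: look the parsed query up in a (field, term) -> doc ids index."""
    return sorted(index.get(query, set()))

# Hypothetical index keyed by (field, term).
index = {("title", "lucene"): {1, 3}, ("body", "search"): {2}}

query = parse_query("title:Lucene")
hits = search(index, query)
print(query, hits)  # ('title', 'lucene') [1, 3]
```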
11. Near Real Time Search
Lucene provides “near real time” (NRT) search, not true real-time search.
This follows from the way documents are inserted: a new document is first added to the in-memory buffer, and the buffer is then flushed to become a segment.
Until the document reaches a segment, it is “unsearchable”.
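The visibility gap can be sketched with a toy index in which search only looks at flushed segments. `ToyIndex` and its methods are assumptions for illustration, not Lucene classes.

```python
class ToyIndex:
    """Toy index: search sees flushed segments only, never the in-memory buffer."""
    def __init__(self):
        self.buffer = []
        self.segments = []

    def add(self, text):
        self.buffer.append(text)

    def flush(self):
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self, term):
        return [t for seg in self.segments for t in seg if term in t.split()]

idx = ToyIndex()
idx.add("near real time search")
before = idx.search("search")  # the document is still in the buffer
idx.flush()
after = idx.search("search")   # the document has reached a segment
print(before, after)  # [] ['near real time search']
```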
12. Document Scoring
The official documentation says: “Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.”
In simpler terms, this is “Tf-Idf” (Term Frequency-Inverse Document Frequency): the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
Note: Scoring is a detailed topic; I plan to publish a detailed study of it. For reference, the Similarity formula is described in the Lucene documentation.
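The Tf-Idf idea can be illustrated with the textbook formula tf * log(N / df). This is a simplified sketch on made-up documents, not Lucene's actual Similarity formula, which adds further normalization and damping factors.

```python
import math

# Hypothetical three-document collection.
docs = {
    1: "lucene search lucene",
    2: "search engine",
    3: "open source engine",
}

def tf_idf(term, doc_id, docs):
    """Textbook tf-idf: term count in the document times log(N / document frequency)."""
    tokens = docs[doc_id].split()
    tf = tokens.count(term)
    df = sum(1 for text in docs.values() if term in text.split())
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# "lucene" is frequent in doc 1 and rare in the collection, so it outscores "search".
print(tf_idf("lucene", 1, docs) > tf_idf("search", 1, docs))  # True
```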
13. Boosting Score
Lucene lets you apply boosts at various levels, namely:
Document Level Boost (while Indexing)
Field Level Boost (while Indexing)
Query Level Boost (while Searching)
14. Query Boost
Query-time boosts allow one to specify which terms/clauses are “more important”.
Query boosts take effect during searching.
The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding document scores.
E.g., boosting first name over last name by a factor of 2:
(first_name:"Jack")^2 (last_name:"Jack")
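The effect of the boost on ranking can be sketched with a toy scorer in which each matching clause contributes its boost to the score. The scoring function and the sample people are assumptions for illustration; real scores also depend on term statistics.

```python
def score(doc, clauses):
    """Toy scorer: sum the boost of every clause whose field value matches."""
    return sum(boost for field, value, boost in clauses if doc.get(field) == value)

# The example query as (field, value, boost) clauses: first_name is boosted by 2.
clauses = [("first_name", "Jack", 2.0), ("last_name", "Jack", 1.0)]

# Hypothetical documents.
docs = [
    {"first_name": "Samuel", "last_name": "Jack"},
    {"first_name": "Jack", "last_name": "Black"},
]

ranked = sorted(docs, key=lambda d: score(d, clauses), reverse=True)
print([d["first_name"] for d in ranked])  # ['Jack', 'Samuel']
```

Both documents match one clause each, but the match on the boosted first-name clause ranks higher.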