3. How does a search engine work?
Internet search engines are special sites on the web
that are designed to help people find information on
the world wide web.
A search engine operates in the following order:
Web crawling
Indexing
Searching
4. • A search engine uses software called spiders (crawlers), which comb the
internet looking for documents and their web addresses.
5. • The documents and web addresses are collected and sent to the search
engine's indexing software.
6. • The indexing software extracts information from the
documents, storing it in a database.
7. • When you perform a search by entering keywords, the
database is searched for documents that match.
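The three stages above can be sketched on a tiny in-memory "web". This is only an illustration: the PAGES table, the .example addresses, and the function names are all made up for the sketch, and a real crawler would fetch pages over HTTP.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": address -> (page text, outgoing links).
PAGES = {
    "a.example": ("Lucene is a search library", ["b.example"]),
    "b.example": ("Spiders crawl the web", ["a.example", "c.example"]),
    "c.example": ("A search engine builds an index", []),
}

def crawl(start):
    """Follow links from `start`, yielding each reachable (address, text) once."""
    seen, queue = set(), [start]
    while queue:
        address = queue.pop(0)
        if address in seen or address not in PAGES:
            continue
        seen.add(address)
        text, links = PAGES[address]
        yield address, text
        queue.extend(links)

def build_index(documents):
    """Map each lowercase word to the set of addresses that contain it."""
    index = defaultdict(set)
    for address, text in documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(address)
    return index

def search(index, keywords):
    """Return the addresses that contain every keyword."""
    hits = [index.get(word.lower(), set()) for word in keywords.split()]
    return set.intersection(*hits) if hits else set()

index = build_index(crawl("a.example"))
print(sorted(search(index, "search")))
```

Here the crawler collects documents and addresses, the indexer stores words in a lookup table, and the search only consults that table, never the pages themselves.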
8. What is Lucene?
Lucene is an open-source, highly scalable information
retrieval (IR) library.
Information retrieval refers to the process of searching
for documents, information within documents or
metadata about documents.
10. ANALYSIS
Analysis is the conversion of text data into the fundamental
unit of searching, which is called a term.
During analysis, the text data goes through multiple
operations: extracting the words, removing common
words, ignoring punctuation, reducing words to root
form, changing words to lowercase, etc.
Analysis happens just before indexing and query
parsing.
Analysis converts text data into tokens, and these
tokens are added as terms in the Lucene index.
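A rough sketch of the operations listed above, in plain Python. The stop-word list and the suffix-stripping "stemmer" are simplifying assumptions made for this example and are far cruder than what Lucene's analyzers do.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}  # tiny illustrative list

def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, drop punctuation, remove stop words, then stem each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(analyze("The crawler is indexing the documents."))
# ['crawler', 'index', 'document']
```

The same function would have to be applied to query text as well, so that query terms and indexed terms end up in the same form.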
12. Lucene Analyzers
An analyzer in Lucene typically combines a tokenizer with filters such as a stemmer and a stop-words filter.
E.g., analyzing: XY&Z Corporation - xyz@example.com
1) Whitespace Analyzer: Splits tokens at whitespace
[XY&Z] [Corporation] [-] [xyz@example.com]
2) Simple Analyzer: Divides text at non-letter characters and puts text
in lowercase
[xy] [z] [corporation] [xyz] [example] [com]
3) Stop Analyzer: Removes stop words (not useful for searching) and
puts text in lowercase
[xy] [z] [corporation] [xyz] [example] [com]
4) Standard Analyzer: Tokenizes text based on a sophisticated
grammar that recognizes e-mail addresses; acronyms; Chinese,
Japanese, and Korean characters; and alphanumerics. Puts text in lowercase
and removes stop words
[xy&z] [corporation] [xyz@example.com]
13. 5) Metaphone Replacement Analyzer:
It replaces each incoming token with its phonetic code
(metacode).
Two phrases that sound similar yet are spelled completely
differently are tokenized and encoded the same.
E.g., "The quick brown fox jumped over the lazy dogs"
will be encoded as
"[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]"
Now if the user searches for:
"Tha quik brown phox jumpd ovvar tha lazi dogz"
the phrase is encoded into the same codes as above, so an
exact match is found.
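The behaviour of the first three analyzers can be approximated in a few lines of Python. These functions are hypothetical stand-ins, not Lucene classes; the stop-word list is an assumption, and the Standard and Metaphone analyzers are left out because their grammar and phonetic rules are too involved for a short sketch.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}  # illustrative only

def whitespace_analyze(text):
    """Split tokens at whitespace and leave them otherwise untouched."""
    return text.split()

def simple_analyze(text):
    """Lowercase, then split at every non-letter character (ASCII letters only here)."""
    return re.findall(r"[a-z]+", text.lower())

def stop_analyze(text):
    """Same as simple_analyze, with stop words removed afterwards."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]

text = "XY&Z Corporation - xyz@example.com"
print(whitespace_analyze(text))  # ['XY&Z', 'Corporation', '-', 'xyz@example.com']
print(simple_analyze(text))      # ['xy', 'z', 'corporation', 'xyz', 'example', 'com']
print(stop_analyze(text))        # same as above: the sample contains no stop words
```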
14. INDEXING
The process of converting text data into a format that
facilitates rapid searching.
Simple analogy: the index at the back of a book.
To be indexed, data should be available in plain text
format.
16. Directory :
The Directory class represents the location of a Lucene index. It’s
an abstract class that allows its subclasses to store the index as
they see fit.
IndexWriter :
The class that either creates or maintains an index. Its constructor
accepts a Boolean that determines whether a new index is
created or whether an existing index is opened.
It provides methods to add, delete, or update documents in the
index.
IndexWriter creates a lock file for the directory to prevent index
corruption by simultaneous index updates.
17. Fields :
The class that actually holds the textual content to be
indexed.
The Field class encapsulates a field name and its value.
Lucene provides options to specify if a field needs to
be indexed or analyzed and if its value needs to be
stored.
18. Document :
A Document represents a collection of fields. You can think
of it as a virtual document—a chunk of data, such as a web
page, an email message, or a text file—that you want to
make retrievable at a later time.
Analyzers :
They are responsible for preprocessing the text data
and converting it into tokens stored in the index.
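How these pieces fit together can be shown with a toy model in Python. None of this is Lucene's API: Field, Document, and ToyIndexWriter are hypothetical classes written for this sketch, a plain dict stands in for the Directory, and the analyzer is just a function from text to tokens.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str
    value: str
    stored: bool = True    # keep the original value so it can be retrieved
    indexed: bool = True   # analyze the value and make it searchable

@dataclass
class Document:
    fields: list = field(default_factory=list)

    def add(self, f):
        self.fields.append(f)

    def get(self, name):
        """Return the stored value of the first field called `name`, if any."""
        for f in self.fields:
            if f.name == name and f.stored:
                return f.value
        return None

class ToyIndexWriter:
    def __init__(self, directory, analyzer, create=True):
        self.directory = directory      # a dict standing in for a Directory
        self.analyzer = analyzer
        if create:                      # start a new index instead of opening one
            directory.clear()
        directory.setdefault("docs", [])
        directory.setdefault("postings", {})

    def add_document(self, doc):
        docs, postings = self.directory["docs"], self.directory["postings"]
        doc_id = len(docs)
        docs.append(doc)
        for f in doc.fields:
            if not f.indexed:
                continue
            for token in self.analyzer(f.value):
                ids = postings.setdefault((f.name, token), [])
                if not ids or ids[-1] != doc_id:
                    ids.append(doc_id)
        return doc_id

directory = {}
writer = ToyIndexWriter(directory, analyzer=lambda text: text.lower().split())
doc = Document()
doc.add(Field("title", "Lucene in brief"))
doc.add(Field("body", "Lucene is a search library", stored=False))
doc_id = writer.add_document(doc)
print(directory["postings"][("body", "search")])
```

The body field is searchable but not stored, so it contributes terms to the index while doc.get("body") returns None.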
20. Lucene Indexes
Every Lucene index consists of one or more segments.
Each segment is a standalone index itself, holding a subset
of all indexed documents.
At search time, each segment is visited separately and the
results are combined together.
Each segment, in turn, consists of multiple files, of the
form _X.<ext>.
There is one special file, often referred to as “the segments
file”, and named segments_<N> that references all live
segments.
The value <N>, called “the generation”, is an integer that
increases by one every time a change is committed to the
index.
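The relationship between segments, commits, and the generation can be modelled roughly as follows. ToyIndex is a hypothetical class for illustration only; in particular, writing the generation as a plain decimal number in the segments file name is an assumption of this sketch.

```python
class ToyIndex:
    def __init__(self):
        self.segments = []    # committed segments; each maps term -> set of doc ids
        self.pending = {}     # buffered changes, not yet visible to searches
        self.generation = 0

    def add(self, doc_id, terms):
        for term in terms:
            self.pending.setdefault(term, set()).add(doc_id)

    def commit(self):
        """Flush buffered documents as a new segment and bump the generation."""
        if self.pending:
            self.segments.append(self.pending)
            self.pending = {}
            self.generation += 1
        return f"segments_{self.generation}"

    def search(self, term):
        """Visit every segment separately and combine the results."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term, set())
        return hits

toy = ToyIndex()
toy.add(0, ["lucene", "index"])
first = toy.commit()       # 'segments_1'
toy.add(1, ["lucene"])
second = toy.commit()      # 'segments_2'
print(first, second, sorted(toy.search("lucene")))
```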
22. A Lucene index can have many separate segments.
Lucene must search each segment separately and then
combine the results.
This is a performance issue, so
the index needs to be optimized:
optimize()
optimize(int maxNumSegments)
optimize(boolean doWait)
optimize(int maxNumSegments, boolean doWait)
Optimizing trades a large one-time cost for faster searching.
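The idea behind optimizing, merging many small segments into a few larger ones, can be sketched like this. merge_segments is a hypothetical helper, and "always merge the two smallest segments" is a strategy chosen for this example, not Lucene's actual merge policy.

```python
def merge_segments(segments, max_num_segments=1):
    """Merge the smallest segments until at most `max_num_segments` remain.

    Each segment maps term -> set of doc ids. The input is not modified.
    """
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    merged = [{term: set(ids) for term, ids in seg.items()} for seg in segments]
    while len(merged) > max_num_segments:
        # Order by number of postings so the two smallest segments come first.
        merged.sort(key=lambda seg: sum(len(ids) for ids in seg.values()))
        smallest, next_smallest = merged[0], merged[1]
        for term, ids in smallest.items():
            next_smallest.setdefault(term, set()).update(ids)
        merged = merged[1:]
    return merged

segments = [
    {"lucene": {0}},
    {"lucene": {1}, "index": {1}},
    {"search": {2}},
]
print(merge_segments(segments))
```

After merging, a search visits one segment instead of three; the cost is the one-time work of rewriting all postings.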
23. Fascinating Lucene: Inverted Index
Lucene stores the input in a data structure known as an inverted index.
• What makes this structure inverted is that it uses tokens extracted from input
documents as lookup keys instead of treating documents as the central entities.
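A minimal inverted index in Python, assuming whitespace tokenization and lowercase terms (both simplifications for this sketch). Each token maps to the documents and positions where it occurs.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to a list of (doc id, position) pairs."""
    inverted = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, token in enumerate(text.lower().split()):
            inverted[token].append((doc_id, position))
    return dict(inverted)

docs = [
    "lucene builds an index",
    "an inverted index maps tokens to documents",
]
inverted = build_inverted_index(docs)
print(inverted["index"])  # [(0, 3), (1, 2)]
```

Answering "which documents contain index?" is now a single dictionary lookup rather than a scan over every document.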
24. Searching in Lucene
Searching is the process of looking for words in the
index and finding the documents that contain those
words.
25. Core Searching classes
Searcher :
Searcher is an abstract base class that has various
overloaded search methods.
The search methods return an ordered collection of
documents ranked by computed scores.
Lucene calculates a score for each of the documents that
match a given query.
Term :
Term is the most fundamental unit for searching. It's
composed of two elements: the text of the word and the
name of the field in which the text occurs. Term objects are
also involved in indexing, but they are created by Lucene
internals.
26. ScoreDoc :
A simple pointer to a document contained in the
search results. It encapsulates the position of a
document in the index and the score computed by
Lucene.
TopDocs :
• Encapsulates the total number of search results and an
array of ScoreDoc.
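The shape of a search result can be imitated with two small tuples. The names mirror the classes above, but this is a toy in Python: the search function is hypothetical, and scoring by raw term count is an assumption of the sketch, not Lucene's scoring formula.

```python
from collections import namedtuple

ScoreDoc = namedtuple("ScoreDoc", ["doc", "score"])            # doc = position in the index
TopDocs = namedtuple("TopDocs", ["total_hits", "score_docs"])

def search(docs, field, text, n=10):
    """Score each document by how often `text` occurs in `field`; keep the top n."""
    matches = []
    for doc_id, doc in enumerate(docs):
        score = doc.get(field, "").lower().split().count(text.lower())
        if score > 0:
            matches.append(ScoreDoc(doc_id, float(score)))
    matches.sort(key=lambda sd: (-sd.score, sd.doc))  # best score first
    return TopDocs(total_hits=len(matches), score_docs=matches[:n])

docs = [
    {"body": "lucene index"},
    {"body": "lucene lucene search"},
    {"body": "other text"},
]
top = search(docs, "body", "lucene", n=1)
print(top.total_hits, top.score_docs)
```

Note that total_hits counts every match (2 here) even though only the best n ScoreDoc entries are returned.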
27. Querying Lucene Indexes
Query is an abstract base class for queries.
Queries are used as strategies to look up the index and
return the matching documents.
Some of the queries are:
1) Term Query:
The most elementary way to search an index is for a specific term.
A term is the smallest indexed piece, consisting of a field name
and a text value.
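A term query amounts to one lookup keyed by the (field, text) pair. The sketch below uses hypothetical helpers and whitespace tokenization; it is not Lucene's TermQuery class.

```python
from collections import namedtuple

Term = namedtuple("Term", ["field", "text"])

def build_postings(docs):
    """Map each Term(field, text) to the set of doc ids containing it."""
    postings = {}
    for doc_id, doc in enumerate(docs):
        for field_name, value in doc.items():
            for word in value.lower().split():
                postings.setdefault(Term(field_name, word), set()).add(doc_id)
    return postings

def term_query(postings, term):
    """Return the ids of documents that contain exactly this term."""
    return sorted(postings.get(term, set()))

docs = [
    {"title": "lucene basics", "body": "search with lucene"},
    {"title": "search engines", "body": "crawling and indexing"},
]
postings = build_postings(docs)
print(term_query(postings, Term("title", "search")))  # [1]
print(term_query(postings, Term("body", "search")))   # [0]
```

The same word in a different field is a different term, which is why the two lookups return different documents.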
28. 2) Wildcard Query: Wildcard queries let you query for terms with missing pieces.
Two standard wildcard characters are used:
* for zero or more characters
For example, to search for test, tests or tester, you can use the search: test*
? for exactly one character
For example, to search for "text" or "test" you can use the search: te?t
3) Range Query: Range queries match all the documents whose field
value(s) are between the lower and upper bounds specified by the range query. They can be
inclusive or exclusive:
Inclusive range queries are denoted by square brackets ([ ]).
Exclusive range queries are denoted by curly brackets ({ }).
E.g., date:[20020101 TO 20030101]
This will find documents whose date fields have values between 20020101 and
20030101, inclusive.
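Both query types on this slide can be mimicked over a plain list of terms. The helper names are made up for the sketch; the wildcard is translated to a regular expression, and the range check compares terms as strings, which works for fixed-width values such as the dates above.

```python
import re

def wildcard_to_regex(pattern):
    """Translate * (zero or more characters) and ? (exactly one) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def wildcard_match(pattern, terms):
    regex = wildcard_to_regex(pattern)
    return [t for t in terms if regex.fullmatch(t)]

def in_range(value, lower, upper, inclusive=True):
    """String comparison: [lower TO upper] if inclusive, {lower TO upper} otherwise."""
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper

print(wildcard_match("test*", ["test", "tests", "tester", "text"]))
print(wildcard_match("te?t", ["text", "test", "tet", "tests"]))
print(in_range("20020101", "20020101", "20030101"))                   # True
print(in_range("20020101", "20020101", "20030101", inclusive=False))  # False
```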
29. 4) Fuzzy Query: Lucene supports fuzzy searches based on the Levenshtein distance,
or edit distance, algorithm.
To do a fuzzy search, use the tilde symbol, ~, at the end of a single-word term.
FuzzyQuery matches terms "close" to a specified base term: you specify an
allowed maximum edit distance, and any terms within that edit distance from the
base term (and, then, the docs containing those terms) are matched.
E.g., to search for a term similar in spelling to "roam", use the fuzzy search: roam~
5) Boolean Query: Boolean operators allow terms to be combined through logic
operators.
Lucene supports AND, OR and NOT as Boolean operators.
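The two query types on this slide can be sketched in Python. The edit distance below is the textbook Levenshtein algorithm, and the Boolean operators are modelled as set operations on posting sets; the function names and sample postings are assumptions for illustration, not Lucene code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def fuzzy_match(base, terms, max_edits=2):
    """Keep the terms within `max_edits` of the base term."""
    return [t for t in terms if levenshtein(base, t) <= max_edits]

# Boolean operators over posting sets (term -> ids of documents containing it).
postings = {"lucene": {0, 1, 2}, "search": {1, 2}, "java": {2}}
both = postings["lucene"] & postings["search"]    # lucene AND search
either = postings["search"] | postings["java"]    # search OR java
without = postings["lucene"] - postings["java"]   # lucene NOT java

print(fuzzy_match("roam", ["foam", "roams", "room", "lucene"], max_edits=1))
print(both, either, without)
```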
30. 7) Boosting Query: Boosting allows you to control the relevance (which
terms/clauses are "more important") of a document by boosting its terms.
The higher the boost factor, the more relevant the term will be, and therefore the
higher the corresponding document scores.
To boost a term, use the caret symbol, ^, with a boost factor (a number) at the end
of the term you are searching for.
E.g., if you are searching for IIT (BHU) Varanasi and you want the term
"Varanasi" to be more relevant, boost it using the ^ symbol along with the boost factor
next to the term.
Query Syntax: IIT (BHU) Varanasi^4
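The effect of a boost factor can be illustrated with a deliberately simple score: term count multiplied by the boost. This is an assumption made for the sketch, since Lucene's real scoring combines more factors, and the parentheses of the example query are dropped here to keep the parsing trivial.

```python
def parse_boosts(query):
    """Split a query like 'IIT BHU Varanasi^4' into {term: boost}; default boost is 1.0."""
    boosts = {}
    for token in query.split():
        term, _, factor = token.partition("^")
        boosts[term.lower()] = float(factor) if factor else 1.0
    return boosts

def score(doc_text, boosts):
    """Sum of (occurrences of term) * (boost of term) over all query terms."""
    words = doc_text.lower().split()
    return sum(words.count(term) * boost for term, boost in boosts.items())

boosts = parse_boosts("IIT BHU Varanasi^4")
print(boosts)
print(score("IIT Delhi campus", boosts))       # 1.0
print(score("Varanasi travel guide", boosts))  # 4.0
```

With the boost, a document that mentions only Varanasi outranks one that mentions only IIT; without it, the two would tie.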
35. Applications of Lucene
Searchable email
Online documentation search
Version control and content management
Content search
... and the list goes on.