Search Engines
How does a search engine work?
• Internet search engines are websites designed to help people find information on the World Wide Web.
• A search engine operates in the following order:
  • Web crawling
  • Indexing
  • Searching
• A search engine uses software called spiders (crawlers), which comb the internet looking for documents and their web addresses.
• The documents and web addresses are collected and sent to the search engine's indexing software.
• The indexing software extracts information from the documents and stores it in a database.
• When you perform a search by entering keywords, the database is searched for matching documents.
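The three stages can be sketched in a few lines of Python. This is only a toy model: the in-memory WEB dictionary, its URLs, and the function names are made up for illustration, and a real crawler would fetch pages over HTTP.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://a.example": ("lucene is a search library", ["http://b.example"]),
    "http://b.example": ("search engines crawl the web",
                         ["http://a.example", "http://c.example"]),
    "http://c.example": ("an index makes search fast", []),
}

def crawl(start_url):
    """Stage 1: follow links breadth-first and collect (url, text) pairs."""
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages.append((url, text))
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Stage 2: map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages:
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, keywords):
    """Stage 3: return the URLs that contain every keyword."""
    hits = [index.get(word.lower(), set()) for word in keywords.split()]
    return sorted(set.intersection(*hits)) if hits else []

index = build_index(crawl("http://a.example"))
print(search(index, "search index"))
```

Only the page at http://c.example contains both "search" and "index", so it is the only result of the final query.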
What is Lucene?
• Lucene is an open-source, highly scalable information retrieval (IR) library.
• Information retrieval is the process of searching for documents, for information within documents, or for metadata about documents.
Overview of how Lucene works
ANALYSIS
• Analysis converts text data into the fundamental unit of searching, which is called a term.
• During analysis, the text goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, changing words to lowercase, etc.
• Analysis happens just before indexing and query parsing.
• Analysis converts text data into tokens, and these tokens are added as terms to the Lucene index.
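The operations listed above can be chained as in the following sketch. It is not Lucene code: the stop-word list and the suffix-stripping rules are assumptions chosen to keep the example short, and a real stemmer is far more careful.

```python
import re

# Assumed sample stop-word list, not Lucene's default set.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def stem(word):
    """Very naive suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    # Extract words, ignore punctuation, and lowercase in one step.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Remove common words.
    words = [w for w in words if w not in STOP_WORDS]
    # Reduce each remaining word to a root form.
    return [stem(w) for w in words]

print(analyze("The Dogs jumped, barking in the yard."))
# prints ['dog', 'jump', 'bark', 'yard']
```

The same analyze function would be applied to documents at indexing time and to the query text at search time, so that both sides produce comparable terms.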
[Figure: HTML, PDF, MS Word, and XML documents each pass through an "Extract text" step; the extracted text then goes to Analysis and Index, which are performed by Lucene.]
Lucene Analyzers
An analyzer in Lucene is a tokenizer + stemmer + stop-words filter.
For example, analyzing the text "XY&Z Corporation - xyz@example.com":
1) Whitespace Analyzer: splits tokens at whitespace.
[XY&Z] [Corporation] [-] [xyz@example.com]
2) Simple Analyzer: divides text at non-letter characters and lowercases it.
[xy] [z] [corporation] [xyz] [example] [com]
3) Stop Analyzer: removes stop words (common words that are not useful for searching) and lowercases the text. The sample text contains no stop words, so the output matches the Simple Analyzer.
[xy] [z] [corporation] [xyz] [example] [com]
4) Standard Analyzer: tokenizes text based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese, Japanese, and Korean characters, and alphanumerics. It lowercases the text and removes stop words.
[xy&z] [corporation] [xyz@example] [com]
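The first three analyzers are simple enough to imitate in plain Python. The functions below are illustrative stand-ins, not Lucene classes, and the stop-word list is an assumed sample; the Standard Analyzer's grammar is too involved for a short sketch and is left out.

```python
import re

# Assumed sample stop-word list, not Lucene's default set.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def whitespace_analyzer(text):
    """Split at whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Split at non-letter characters (ASCII letters only here) and lowercase."""
    return re.findall(r"[a-z]+", text.lower())

def stop_analyzer(text):
    """Same as simple_analyzer, then drop stop words."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

text = "XY&Z Corporation - xyz@example.com"
for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer):
    print(analyzer.__name__, analyzer(text))
```

On the sample text the simple and stop variants agree; a sentence such as "The quick brown fox" shows the difference, because the stop variant drops "the".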
5) Metaphone Replacement Analyzer:
• It replaces each incoming token with a phonetic code (its metaphone).
• Two phrases that sound similar but are spelled completely differently are tokenized and encoded the same way.
For example, "The quick brown fox jumped over the lazy dogs" is encoded as
"[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]"
If the user now searches for
"Tha quik brown phox jumpd ovvar tha lazi dogz"
the query is encoded into the same codes as above, so an exact match is found.
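Metaphone has many rules, so the sketch below swaps in the much simpler Soundex algorithm to illustrate the same idea of indexing a sound code instead of the spelling. It is not the Metaphone encoding shown above and produces different codes. Soundex also always keeps the first letter, so unlike Metaphone it would not match "phox" with "fox".

```python
# Soundex digit for each consonant group.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = CODES.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]  # pad or truncate to 4 characters

for a, b in (("quick", "quik"), ("lazy", "lazi"), ("Robert", "Rupert")):
    print(a, soundex(a), b, soundex(b))
```

An analyzer built on this idea would store soundex(token) in the index and apply the same function to query terms, so misspelled but similar-sounding words land on the same term.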
INDEXING
• Indexing is the process of converting text data into a format that facilitates rapid searching.
• A simple analogy is the index at the end of a book.
• Data must be available in a plain-text format before it can be indexed.
Core Indexing Classes
[Figure: Document (Field 1, Field 2, Field 3, Field 4), Analyzer, IndexWriter, Directory]
Directory:
• The Directory class represents the location of a Lucene index. It's an abstract class that allows its subclasses to store the index as they see fit.
IndexWriter:
• A class that either creates or maintains an index. Its constructor accepts a Boolean that determines whether a new index is created or an existing index is opened.
• It provides methods to add, delete, or update documents in the index.
• IndexWriter creates a lock file for the directory to prevent index corruption by simultaneous index updates.
Field:
• The class that actually holds the textual content to be indexed.
• The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field is indexed, whether it is analyzed, and whether its value is stored.
Document:
• A Document represents a collection of fields. You can think of it as a virtual document (a chunk of data, such as a web page, an email message, or a text file) that you want to make retrievable at a later time.
Analyzer:
• Analyzers are responsible for preprocessing the text data and converting it into the tokens stored in the index.
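How these classes cooperate can be shown with a toy model in Python. The class names mirror the slides, but everything below is a hypothetical in-memory imitation, not Lucene's API: the real classes are Java, and their constructors and options differ between versions.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True   # should the value be searchable?
    stored: bool = True    # should the original value be kept?

@dataclass
class Document:
    fields: list = field(default_factory=list)

    def add(self, f):
        self.fields.append(f)

class Directory:
    """Stands in for the index location; here just in-memory containers."""
    def __init__(self):
        self.postings = {}   # (field name, term) -> set of doc ids
        self.stored = []     # doc id -> {field name: original value}

def simple_analyzer(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class IndexWriter:
    def __init__(self, directory, analyzer, create):
        if create:           # start a new index instead of opening the old one
            directory.postings.clear()
            directory.stored.clear()
        self.directory, self.analyzer = directory, analyzer

    def add_document(self, doc):
        doc_id = len(self.directory.stored)
        self.directory.stored.append(
            {f.name: f.value for f in doc.fields if f.stored})
        for f in doc.fields:
            if f.indexed:
                for term in self.analyzer(f.value):
                    key = (f.name, term)
                    self.directory.postings.setdefault(key, set()).add(doc_id)
        return doc_id

directory = Directory()
writer = IndexWriter(directory, simple_analyzer, create=True)
doc = Document()
doc.add(Field("title", "Lucene in Action"))
doc.add(Field("path", "/books/lia.pdf", indexed=False))
doc_id = writer.add_document(doc)
```

The "path" field is stored but not indexed, so it can be shown in results without being searchable, while "title" is both.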
Lucene Indexes
• Every Lucene index consists of one or more segments.
• Each segment is a standalone index in itself, holding a subset of all indexed documents.
• At search time, each segment is visited separately and the results are combined.
• Each segment, in turn, consists of multiple files of the form _X.<ext>.
• There is one special file, often referred to as "the segments file" and named segments_<N>, that references all live segments.
• The value <N>, called "the generation", is an integer that increases by one every time a change is committed to the index.
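The segment and generation mechanics can be modeled roughly as follows. This is a simplified sketch with hypothetical names (Index, pending, segments_file): each commit flushes pending documents into a new segment, and a search visits every segment and merges the hits.

```python
class Index:
    def __init__(self):
        self.segments = []    # committed segments: term -> set of doc ids
        self.pending = {}     # documents added since the last commit
        self.generation = 0   # the <N> in segments_<N>
        self.next_doc_id = 0

    def add_document(self, text):
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        for term in text.lower().split():
            self.pending.setdefault(term, set()).add(doc_id)
        return doc_id

    def commit(self):
        """Flush pending documents into a new segment; bump the generation."""
        if self.pending:
            self.segments.append(self.pending)
            self.pending = {}
            self.generation += 1

    def segments_file(self):
        return f"segments_{self.generation}"

    def search(self, term):
        """Visit each segment separately, then combine the results."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)

index = Index()
index.add_document("apache lucene")
index.commit()
index.add_document("lucene segments")
index.add_document("other text")
index.commit()
print(index.segments_file(), index.search("lucene"))
```

After the two commits the model holds two segments, and the query for "lucene" has to consult both of them.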
Index Structure
_0.fnm
_0.fdt
_0.fdx
_0.frq
_0.tis
_0.tii
_0.prx
_0.nrm
_0_1.del
_1.fnm
_1.fdt
_1.fdx
[…]
segments_3
• A Lucene index has many separate segments.
• Lucene must search each segment separately and then combine the results.
• Searching many segments is a performance issue.
• The index needs to be optimized:
  • optimize()
  • optimize(int maxNumSegments)
  • optimize(boolean doWait)
  • optimize(int maxNumSegments, boolean doWait)
• The trade-off is a large one-time cost in exchange for faster searching.
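What optimizing achieves can be sketched as repeated merging of segments. The code is a toy model, not Lucene's merge logic: the "merge the two smallest segments first" policy and the function names are assumptions made for the example.

```python
def merge(a, b):
    """Merge two segments (term -> set of doc ids) into a new one."""
    merged = {term: set(ids) for term, ids in a.items()}
    for term, ids in b.items():
        merged.setdefault(term, set()).update(ids)
    return merged

def optimize(segments, max_num_segments=1):
    """Merge the two smallest segments until at most max_num_segments remain."""
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    segments = list(segments)
    while len(segments) > max_num_segments:
        segments.sort(key=len)  # fewest terms first
        segments[0:2] = [merge(segments[0], segments[1])]
    return segments

segments = [
    {"lucene": {0}},
    {"lucene": {1}, "index": {1}},
    {"search": {2}},
]
print(len(optimize(segments, 2)), len(optimize(segments)))
```

The merge does all of its copying up front, which is the one-time cost; afterwards a search has fewer segments to visit.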
Fascinating Lucene: Inverted Index
Lucene stores the input in a data structure known as an inverted index.
• What makes this structure inverted is that it uses tokens extracted from input documents as lookup keys instead of treating documents as the central entities.
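A minimal inverted index can be built with a dictionary keyed by token. The sample documents and function names below are made up for illustration; Lucene's on-disk format is far more compact than this.

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}

def build_inverted_index(docs):
    """Map each token to a list of (doc id, position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, token in enumerate(docs[doc_id].lower().split()):
            index[token].append((doc_id, position))
    return index

def docs_containing(index, token):
    """Look a token up directly, without scanning any document text."""
    return sorted({doc_id for doc_id, _ in index.get(token, [])})

index = build_inverted_index(docs)
print(docs_containing(index, "quick"))  # prints [1, 3]
```

The lookup cost depends on the number of postings for the token, not on the total size of the document collection.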
Searching in Lucene
• Searching is the process of looking up words in the index and finding the documents that contain those words.
Core Searching Classes
Searcher:
• Searcher is an abstract base class that has various overloaded search methods.
• The search method returns an ordered collection of documents ranked by computed scores.
• Lucene calculates a score for each document that matches a given query.
Term:
• Term is the most fundamental unit for searching. It is composed of two elements: the text of the word and the name of the field in which the text occurs. Term objects are also involved in indexing, but there they are created by Lucene internals.
ScoreDoc:
• A simple pointer to a document contained in the search results. It encapsulates the position of a document in the index and the score computed by Lucene.
TopDocs:
• Encapsulates the total number of search results and an array of ScoreDoc.
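The relationship between a search call, ScoreDoc, and TopDocs can be imitated as below. The classes are Python stand-ins with assumed attribute names, and the score is a bare term count, which is much cruder than Lucene's actual scoring.

```python
from dataclasses import dataclass

@dataclass
class ScoreDoc:
    doc: int       # position of the document in the index
    score: float   # score computed for this document

@dataclass
class TopDocs:
    total_hits: int    # number of matching documents overall
    score_docs: list   # the best n matches as ScoreDoc objects

DOCS = [
    "lucene is a search library",
    "search search and more search",
    "nothing relevant here",
]

def search(term, n):
    """Score every matching document and return the top n."""
    scored = []
    for doc_id, text in enumerate(DOCS):
        tf = text.lower().split().count(term.lower())
        if tf:
            scored.append(ScoreDoc(doc_id, float(tf)))
    scored.sort(key=lambda sd: sd.score, reverse=True)
    return TopDocs(total_hits=len(scored), score_docs=scored[:n])

top = search("search", 1)
print(top.total_hits, top.score_docs)
```

Two documents match, but only the best one is returned in score_docs because n is 1; total_hits still reports both.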
Querying Lucene Indexes
• Query is an abstract base class for queries.
• Queries are used as strategies to look up the index and return the matching documents.
Some of the queries are:
1) Term Query:
• The most elementary way to search an index is for a specific term.
• A term is the smallest indexed piece, consisting of a field name and a text value.
2) Wildcard Query: Wildcard queries let you query for terms with missing pieces.
• Two standard wildcard characters are used:
  • * for zero or more characters
    For example, to search for test, tests, or tester, you can use the search: test*
  • ? for exactly one character
    For example, to search for "text" or "test", you can use the search: te?t
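One way to picture wildcard matching is to translate the pattern into a regular expression and test it against every indexed term. The sketch assumes that * matches zero or more characters and that ? matches exactly one character, as in Lucene's WildcardQuery; the function names and the term list are made up.

```python
import re

def wildcard_to_regex(pattern):
    """Translate * and ? into regex syntax; escape everything else."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def wildcard_match(pattern, terms):
    regex = wildcard_to_regex(pattern)
    return [t for t in terms if regex.fullmatch(t)]

terms = ["test", "tests", "tester", "text", "tet", "toast"]
print(wildcard_match("test*", terms))  # prints ['test', 'tests', 'tester']
print(wildcard_match("te?t", terms))   # prints ['test', 'text']
```

Scanning every term is fine for a sketch; a real index narrows the candidates first, which is why patterns starting with a wildcard tend to be expensive.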
3) Range Query: Range queries match all documents whose field values lie between the lower and upper bounds specified by the query. They can be inclusive or exclusive:
Inclusive range queries are denoted by square brackets ([ ]).
Exclusive range queries are denoted by curly brackets ({ }).
For example: date:[20020101 TO 20030101]
This finds documents whose date fields have values between 20020101 and 20030101, inclusive.
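The difference between inclusive and exclusive bounds comes down to which comparison operator is used. The sketch below compares field values as strings, which works for fixed-width dates like the ones above; the document ids and values are invented for the example.

```python
def in_range(value, lower, upper, include_lower=True, include_upper=True):
    """Check a value against the bounds, inclusive or exclusive per side."""
    above = value >= lower if include_lower else value > lower
    below = value <= upper if include_upper else value < upper
    return above and below

# Hypothetical date field values, stored as yyyymmdd strings.
dates = {
    "doc1": "20011231",
    "doc2": "20020101",
    "doc3": "20020615",
    "doc4": "20030101",
}

def range_query(field_values, lower, upper, inclusive=True):
    return sorted(doc for doc, value in field_values.items()
                  if in_range(value, lower, upper, inclusive, inclusive))

print(range_query(dates, "20020101", "20030101"))                   # [ ... ]
print(range_query(dates, "20020101", "20030101", inclusive=False))  # { ... }
```

The inclusive query returns doc2, doc3, and doc4, while the exclusive one returns only doc3, because doc2 and doc4 sit exactly on the bounds.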
4) Fuzzy Query: Lucene supports fuzzy searches based on the Levenshtein distance, or edit distance, algorithm.
To do a fuzzy search, use the tilde symbol (~) at the end of a single-word term.
• FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance of the base term (and then the documents containing those terms) are matched.
• For example, to search for a term similar in spelling to "roam", use the fuzzy search: roam~
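The edit distance behind fuzzy matching can be computed with the classic dynamic-programming recurrence. This is a plain-Python sketch of the Levenshtein distance with a made-up term list and a hypothetical max_edits parameter; Lucene itself uses a more efficient strategy than comparing against every term.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_match(base, terms, max_edits=1):
    """Return the terms within max_edits of the base term."""
    return [t for t in terms if edit_distance(base, t) <= max_edits]

terms = ["roam", "foam", "roams", "road", "room", "rome", "dream"]
print(fuzzy_match("roam", terms))
# prints ['roam', 'foam', 'roams', 'road', 'room']
```

"rome" and "dream" are each two edits away from "roam", so they only match once max_edits is raised to 2.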
5) Boolean Query: Boolean operators allow terms to be combined through logic operators.
Lucene supports AND, OR, and NOT as Boolean operators.
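On top of an inverted index, the three operators map naturally onto set operations over the documents that contain each term. The postings and helper names below are invented for illustration.

```python
# Hypothetical postings: term -> ids of the documents containing it.
postings = {
    "lucene": {1, 2, 4},
    "solr": {2, 3},
    "index": {1, 2, 3},
}
ALL_DOCS = {1, 2, 3, 4}

def docs_for(term):
    return postings.get(term, set())

def and_(a, b):
    return a & b          # documents matching both clauses

def or_(a, b):
    return a | b          # documents matching either clause

def not_(a):
    return ALL_DOCS - a   # documents not matching the clause

print(sorted(and_(docs_for("lucene"), docs_for("index"))))       # lucene AND index
print(sorted(or_(docs_for("lucene"), docs_for("solr"))))         # lucene OR solr
print(sorted(and_(docs_for("lucene"), not_(docs_for("solr")))))  # lucene AND NOT solr
```

The three queries yield documents [1, 2], [1, 2, 3, 4], and [1, 4] respectively.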
6) Boosting Query: Boosting lets you control the relevance of a document by boosting its terms, marking which terms or clauses are "more important".
The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding document scores.
To boost a term, use the caret symbol (^) with a boost factor (a number) at the end of the term you are searching for.
• For example, if you are searching for IIT(BHU) Varanasi and you want the term "Varanasi" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term.
Query syntax: IIT (BHU) Varanasi^4
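The effect of a boost factor on ranking can be seen with a deliberately simple scoring function: each matching term contributes its frequency multiplied by its boost. This formula, the sample documents, and the function names are assumptions for the sketch; Lucene's real scoring has more factors.

```python
def score(doc_text, query):
    """query is a list of (term, boost); score = sum of tf * boost."""
    words = doc_text.lower().split()
    return sum(words.count(term.lower()) * boost for term, boost in query)

def rank(docs, query):
    """Return document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(docs[d], query), reverse=True)

docs = {
    "d1": "iit bhu campus",
    "d2": "varanasi travel guide",
    "d3": "iit bhu varanasi",
}

plain = [("iit", 1.0), ("bhu", 1.0), ("varanasi", 1.0)]
boosted = [("iit", 1.0), ("bhu", 1.0), ("varanasi", 4.0)]  # like Varanasi^4

print(rank(docs, plain))    # prints ['d3', 'd1', 'd2']
print(rank(docs, boosted))  # prints ['d3', 'd2', 'd1']
```

With the boost, the document that mentions only "varanasi" overtakes the one that mentions "iit" and "bhu" but not "varanasi".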
Luke – Lucene Index Toolbox
Applications of Lucene
• Searchable email
• Online documentation search
• Version control and content management
• Content search
• ...and the list goes on.
THANK YOU

Weitere ähnliche Inhalte

Was ist angesagt?

Update on DSpace 7
Update on DSpace 7Update on DSpace 7
Update on DSpace 7Bram Luyten
 
Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020Enterprise Knowledge
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchhypto
 
Elasticsearch From the Bottom Up
Elasticsearch From the Bottom UpElasticsearch From the Bottom Up
Elasticsearch From the Bottom Upfoundsearch
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into ElasticsearchKnoldus Inc.
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataSören Auer
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Web Services
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseAlexandre Rafalovitch
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solrKnoldus Inc.
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Semantic Web - Ontologies
Semantic Web - OntologiesSemantic Web - Ontologies
Semantic Web - OntologiesSerge Linckels
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 

Was ist angesagt? (20)

Update on DSpace 7
Update on DSpace 7Update on DSpace 7
Update on DSpace 7
 
Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020Introduction to Knowledge Graphs: Data Summit 2020
Introduction to Knowledge Graphs: Data Summit 2020
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Inverted index
Inverted indexInverted index
Inverted index
 
Elasticsearch From the Bottom Up
Elasticsearch From the Bottom UpElasticsearch From the Bottom Up
Elasticsearch From the Bottom Up
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
Big Data Analytics Lab File
Big Data Analytics Lab FileBig Data Analytics Lab File
Big Data Analytics Lab File
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solr
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Mongodb
MongodbMongodb
Mongodb
 
Semantic Web - Ontologies
Semantic Web - OntologiesSemantic Web - Ontologies
Semantic Web - Ontologies
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 

Andere mochten auch

Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
ログ解析入門withR InnovationEggNo3
ログ解析入門withR InnovationEggNo3ログ解析入門withR InnovationEggNo3
ログ解析入門withR InnovationEggNo3hiroki84
 
Log解析の超入門
Log解析の超入門Log解析の超入門
Log解析の超入門菊池 佑太
 
Elasticsearch入門 pyfes 201207
Elasticsearch入門 pyfes 201207Elasticsearch入門 pyfes 201207
Elasticsearch入門 pyfes 201207Jun Ohtani
 
검색엔진 오픈 소스 Lucene
검색엔진 오픈 소스 Lucene검색엔진 오픈 소스 Lucene
검색엔진 오픈 소스 LuceneEunGi Hong
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
SolrとElasticsearchの比較
SolrとElasticsearchの比較SolrとElasticsearchの比較
SolrとElasticsearchの比較genta kaneyama
 
ログ解析を支えるNoSQLの技術
ログ解析を支えるNoSQLの技術ログ解析を支えるNoSQLの技術
ログ解析を支えるNoSQLの技術Drecom Co., Ltd.
 
情報システム部がSplunk を使うとどうなるか?
情報システム部がSplunk を使うとどうなるか?情報システム部がSplunk を使うとどうなるか?
情報システム部がSplunk を使うとどうなるか?snicker_jp
 
サービス改善はログデータ分析から
サービス改善はログデータ分析からサービス改善はログデータ分析から
サービス改善はログデータ分析からKenta Suzuki
 
Elasticsearchと機械学習を実際に連携させる
Elasticsearchと機械学習を実際に連携させるElasticsearchと機械学習を実際に連携させる
Elasticsearchと機械学習を実際に連携させるnobu_k
 
ElasticSearch勉強会 第6回
ElasticSearch勉強会 第6回ElasticSearch勉強会 第6回
ElasticSearch勉強会 第6回Naoyuki Yamada
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Elasticsearchを使うときの注意点 公開用スライド
Elasticsearchを使うときの注意点 公開用スライドElasticsearchを使うときの注意点 公開用スライド
Elasticsearchを使うときの注意点 公開用スライド崇介 藤井
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようShinsuke Sugaya
 
冬のLock free祭り safe
冬のLock free祭り safe冬のLock free祭り safe
冬のLock free祭り safeKumazaki Hiroki
 

Andere mochten auch (19)

Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Elasticsearch勉強会
Elasticsearch勉強会Elasticsearch勉強会
Elasticsearch勉強会
 
ログ解析入門withR InnovationEggNo3
ログ解析入門withR InnovationEggNo3ログ解析入門withR InnovationEggNo3
ログ解析入門withR InnovationEggNo3
 
Log解析の超入門
Log解析の超入門Log解析の超入門
Log解析の超入門
 
Elasticsearch入門 pyfes 201207
Elasticsearch入門 pyfes 201207Elasticsearch入門 pyfes 201207
Elasticsearch入門 pyfes 201207
 
검색엔진 오픈 소스 Lucene
검색엔진 오픈 소스 Lucene검색엔진 오픈 소스 Lucene
검색엔진 오픈 소스 Lucene
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
SolrとElasticsearchの比較
SolrとElasticsearchの比較SolrとElasticsearchの比較
SolrとElasticsearchの比較
 
ログ解析を支えるNoSQLの技術
ログ解析を支えるNoSQLの技術ログ解析を支えるNoSQLの技術
ログ解析を支えるNoSQLの技術
 
情報システム部がSplunk を使うとどうなるか?
情報システム部がSplunk を使うとどうなるか?情報システム部がSplunk を使うとどうなるか?
情報システム部がSplunk を使うとどうなるか?
 
サービス改善はログデータ分析から
サービス改善はログデータ分析からサービス改善はログデータ分析から
サービス改善はログデータ分析から
 
Elasticsearchと機械学習を実際に連携させる
Elasticsearchと機械学習を実際に連携させるElasticsearchと機械学習を実際に連携させる
Elasticsearchと機械学習を実際に連携させる
 
ElasticSearch勉強会 第6回
ElasticSearch勉強会 第6回ElasticSearch勉強会 第6回
ElasticSearch勉強会 第6回
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Elasticsearchを使うときの注意点 公開用スライド
Elasticsearchを使うときの注意点 公開用スライドElasticsearchを使うときの注意点 公開用スライド
Elasticsearchを使うときの注意点 公開用スライド
 
SolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみようSolrとElasticsearchを比べてみよう
SolrとElasticsearchを比べてみよう
 
冬のLock free祭り safe
冬のLock free祭り safe冬のLock free祭り safe
冬のLock free祭り safe
 

Ähnlich wie Lucene

Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerIOSR Journals
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. ElasticsearchSelecto
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiRobert Calcavecchia
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrievalmghgk
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slideSovan Misra
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearchJoey Wen
 
Index Structures.pptx
Index Structures.pptxIndex Structures.pptx
Index Structures.pptxMBablu1
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibEl Habib NFAOUI
 
Development of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalDevelopment of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalAmjad Ali
 
Context Based Indexing in Search Engines Using Ontology: Review
Context Based Indexing in Search Engines Using Ontology: ReviewContext Based Indexing in Search Engines Using Ontology: Review
Context Based Indexing in Search Engines Using Ontology: Reviewiosrjce
 
Solr中国6月21日企业搜索
Solr中国6月21日企业搜索Solr中国6月21日企业搜索
Solr中国6月21日企业搜索longkeyy
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebIOSR Journals
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdfHabtamu100
 

Ähnlich wie Lucene (20)

Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 
Lucene
LuceneLucene
Lucene
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. Elasticsearch
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
 
Ibm haifa.mq.final
Ibm haifa.mq.finalIbm haifa.mq.final
Ibm haifa.mq.final
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slide
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
Index Structures.pptx
Index Structures.pptxIndex Structures.pptx
Index Structures.pptx
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
Development of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalDevelopment of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrieval
 
N017249497
N017249497N017249497
N017249497
 
Context Based Indexing in Search Engines Using Ontology: Review
Context Based Indexing in Search Engines Using Ontology: ReviewContext Based Indexing in Search Engines Using Ontology: Review
Context Based Indexing in Search Engines Using Ontology: Review
 
 
Solr中国6月21日企业搜索
Solr中国6月21日企业搜索Solr中国6月21日企业搜索
Solr中国6月21日企业搜索
 
Anatomy of google
Anatomy of googleAnatomy of google
Anatomy of google
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdf
 

Kürzlich hochgeladen

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxdhanalakshmis0310
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxAmita Gupta
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 

Kürzlich hochgeladen (20)

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Magic bus Group work1and 2 (Team 3).pptx

Lucene

  • 3. How does a search engine work? Internet search engines are special sites on the web designed to help people find information on the World Wide Web. A search engine operates in the following order: web crawling, indexing, searching.
  • 4. A search engine uses software called spiders (crawlers), which comb the internet looking for documents and their web addresses.
  • 5. The documents and web addresses are collected and sent to the search engine's indexing software.
  • 6. The indexing software extracts information from the documents and stores it in a database.
  • 7. When you perform a search by entering keywords, the database is searched for matching documents.
  • 8. What is Lucene? Lucene is an open-source, highly scalable information retrieval (IR) library. Information retrieval refers to the process of searching for documents, for information within documents, or for metadata about documents.
  • 10. ANALYSIS: Analysis converts text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, changing words to lowercase, etc. Analysis happens just before indexing and query parsing. Analysis converts text data into tokens, and these tokens are added as terms in the Lucene index.
  • 12. Lucene Analyzers: An analyzer in Lucene combines a tokenizer with filters such as a stemmer and a stop-words filter. Example input: XY&Z Corporation - xyz@example.com. 1) Whitespace Analyzer: splits tokens at whitespace: [XY&Z] [Corporation] [-] [xyz@example.com]. 2) Simple Analyzer: divides text at non-letter characters and puts text in lowercase: [xy] [z] [corporation] [xyz] [example] [com]. 3) Stop Analyzer: removes stop words (words not useful for searching) and puts text in lowercase: [xy] [z] [corporation] [xyz] [example] [com]. 4) Standard Analyzer: tokenizes text based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese, Japanese, and Korean characters, and alphanumerics; it puts text in lowercase and removes stop words: [xy&z] [corporation] [xyz@example.com].
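The first three analyzers can be imitated in a few lines of Python. This is only a rough sketch of their behavior, not Lucene's implementation: the function names and the `STOP_WORDS` set are assumptions made for illustration, and letters are limited to ASCII for simplicity.

```python
import re

# Illustrative stop-word list (an assumption; Lucene ships its own list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Split at non-letter characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]

def stop_analyzer(text):
    """Like simple_analyzer, but also drop stop words."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]

text = "XY&Z Corporation - xyz@example.com"
print(whitespace_analyzer(text))
print(simple_analyzer(text))
print(stop_analyzer(text))
```

The sample input contains no stop words, which is why the Simple and Stop analyzers produce the same tokens for it.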
  • 13. 5) Metaphone Replacement Analyzer: It replaces each incoming token with its metaphone code. Two phrases that sound similar yet are spelled completely differently are tokenized and encoded the same. E.g., "The quick brown fox jumped over the lazy dogs" is encoded as "[0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]". If a user then searches for "Tha quik brown phox jumpd ovvar tha lazi dogz", the query is encoded into the same codes as above, so an exact match is found.
  • 14. INDEXING: The process of converting text data into a format that facilitates rapid searching. A simple analogy is the index of a book. To be indexed, data should be available in plain text format.
  • 15. Core Indexing Classes: Document (Field 1, Field 2, Field 3, Field 4), Analyzer, IndexWriter, Directory
  • 16. Directory: The Directory class represents the location of a Lucene index. It is an abstract class that allows its subclasses to store the index as they see fit. IndexWriter: A class that either creates or maintains an index. Its constructor accepts a Boolean that determines whether a new index is created or an existing index is opened. It provides methods to add, delete, or update documents in the index. IndexWriter creates a lock file for the directory to prevent index corruption by simultaneous index updates.
  • 17. Field: The class that actually holds the textual content to be indexed. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field needs to be indexed or analyzed and whether its value needs to be stored.
  • 18. Document: A Document represents a collection of fields. You can think of it as a virtual document (a chunk of data, such as a web page, an email message, or a text file) that you want to make retrievable at a later time. Analyzer: Analyzers are responsible for preprocessing the text data and converting it into the tokens stored in the index.
  • 20. Lucene Indexes: Every Lucene index consists of one or more segments. Each segment is a standalone index itself, holding a subset of all indexed documents. At search time, each segment is visited separately and the results are combined. Each segment, in turn, consists of multiple files of the form _X.<ext>. There is one special file, often referred to as "the segments file" and named segments_<N>, that references all live segments. The value <N>, called "the generation", is an integer that increases by one every time a change is committed to the index.
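The idea of visiting each segment separately and combining the results can be sketched as follows. This is a toy model, not Lucene's on-disk format: each segment is represented here as a plain dictionary, the function names are hypothetical, and the scores are made-up values used only to show the merge step.

```python
import heapq

def search_segment(segment, term):
    """Return (score, doc_id) hits for one segment; scores are precomputed toy values."""
    return [(score, doc_id) for doc_id, score in segment.get(term, {}).items()]

def search_index(segments, term, top_n=10):
    """Visit every segment separately, then combine the hits into one ranked list."""
    hits = []
    for segment in segments:
        hits.extend(search_segment(segment, term))
    return heapq.nlargest(top_n, hits)

# Two toy segments, each holding a subset of the documents.
segments = [
    {"lucene": {0: 0.4, 1: 0.9}},
    {"lucene": {2: 0.7}, "index": {3: 0.5}},
]
print(search_index(segments, "lucene", top_n=2))
```

The more segments there are, the more per-segment lookups each query needs, which is the cost that merging segments is meant to reduce.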
  • 22. A Lucene index has many separate segments. Lucene must search each segment separately and then combine the results, which is a performance issue, so the index needs to be optimized. Optimization methods: optimize(), optimize(int maxNumSegments), optimize(boolean doWait), optimize(int maxNumSegments, boolean doWait). The tradeoff is a large one-time cost in exchange for faster searching.
  • 23. Fascinating Lucene: Inverted Index. Lucene stores the input in a data structure known as an inverted index. What makes this structure inverted is that it uses tokens extracted from input documents as lookup keys instead of treating documents as the central entities.
  • 24. Searching in Lucene: Searching is the process of looking for words in the index and finding the documents that contain those words.
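A minimal sketch of an inverted index, assuming whitespace tokenization and lowercasing in place of a real analyzer. The function names and sample documents are invented for illustration and do not reflect how Lucene stores its postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, term):
    """Look up a single term; the token is the key, documents are the values."""
    return sorted(index.get(term.lower(), set()))

docs = {
    1: "Lucene is a search library",
    2: "A search engine crawls the web",
    3: "Lucene builds an inverted index",
}
index = build_inverted_index(docs)
print(search(index, "Lucene"))
print(search(index, "search"))
```

A lookup is a single dictionary access on the token, so finding the documents for a term does not require scanning the documents themselves.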
  • 25. Core Searching Classes. Searcher: Searcher is an abstract base class that has various overloaded search methods. The search method returns an ordered collection of documents ranked by computed scores. Lucene calculates a score for each document that matches a given query. Term: Term is the most fundamental unit for searching. It is composed of two elements: the text of the word and the name of the field in which the text occurs. Term objects are also involved in indexing, but they are created by Lucene internals.
  • 26. ScoreDoc: A simple pointer to a document contained in the search results. It encapsulates the position of a document in the index and the score computed by Lucene. TopDocs: Encapsulates the total number of search results and an array of ScoreDoc.
  • 27. Querying Lucene Indexes: Query is an abstract base class for queries. Queries are the strategy used to look up the index and return the matching documents. Some of the queries are: 1) Term Query: The most elementary way to search an index is for a specific term. A term is the smallest indexed piece, consisting of a field name and a text value.
  • 28. 2) Wildcard Query: Wildcard queries let you query for terms with missing pieces. Two standard wildcard characters are used: * for zero or more characters (for example, to search for test, tests, or tester, use the search test*) and ? for exactly one character (for example, to search for "text" or "test", use the search te?t). 3) Range Query: Range queries match all documents whose field values lie between the lower and upper bounds specified by the range query. They can be inclusive or exclusive: inclusive range queries are denoted by square brackets ([ ]) and exclusive range queries by curly brackets ({ }). E.g., date:[20020101 TO 20030101] finds documents whose date fields have values between 20020101 and 20030101, inclusive.
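The matching rules for wildcard and range queries can be tried out with a small sketch. It uses Python's standard `fnmatch` module as a stand-in for Lucene's wildcard matching, where `*` matches any sequence of characters (including none) and `?` matches exactly one character. Unlike Lucene, `fnmatch` also treats `[...]` as a special pattern. The helper names and sample terms are assumptions for illustration.

```python
from fnmatch import fnmatchcase

def wildcard_match(terms, pattern):
    """Return the terms that match a pattern containing * and ? wildcards."""
    return [term for term in terms if fnmatchcase(term, pattern)]

def range_match(values, lower, upper, inclusive=True):
    """Return values between the bounds; string comparison works for yyyymmdd dates."""
    if inclusive:
        return [v for v in values if lower <= v <= upper]
    return [v for v in values if lower < v < upper]

terms = ["test", "tests", "tester", "text", "tet"]
dates = ["20011231", "20020101", "20020615", "20030101"]

print(wildcard_match(terms, "test*"))
print(wildcard_match(terms, "te?t"))
print(range_match(dates, "20020101", "20030101"))                   # like [a TO b]
print(range_match(dates, "20020101", "20030101", inclusive=False))  # like {a TO b}
```

Dates written as yyyymmdd sort the same way as strings and as dates, which is why the example query can compare them as plain terms.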
  • 29. 4) Fuzzy Query: Lucene supports fuzzy searches based on the Levenshtein distance, or edit distance, algorithm. To do a fuzzy search, use the tilde symbol, "~", at the end of a single-word term. FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance of the base term (and, then, the documents containing those terms) are matched. E.g., to search for a term similar in spelling to "roam", use the fuzzy search roam~. 5) Boolean Query: Boolean operators allow terms to be combined through logic operators. Lucene supports AND, OR, and NOT as Boolean operators.
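The edit distance behind fuzzy matching can be sketched with the classic dynamic-programming algorithm. This is a plain illustration of Levenshtein distance, not Lucene's FuzzyQuery implementation; `fuzzy_match`, its `max_edits` parameter, and the sample terms are assumptions for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_match(terms, base, max_edits=1):
    """Return the terms within max_edits of the base term."""
    return [term for term in terms if edit_distance(term, base) <= max_edits]

terms = ["foam", "roams", "rome", "lucene"]
print(fuzzy_match(terms, "roam"))
```

With a maximum of one edit, "foam" (one substitution) and "roams" (one insertion) match "roam", while "rome" needs two edits and is rejected.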
  • 30. 7) Boosting Query: Boosting lets you control the relevance of a document by boosting its terms, that is, by marking which terms or clauses are more important. The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding documents score. To boost a term, use the caret symbol, "^", with a boost factor (a number) at the end of the term you are searching. E.g., if you are searching for IIT (BHU) Varanasi and you want the term "Varanasi" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. Query syntax: IIT (BHU) Varanasi^4
  • 32. Luke – Lucene Index Toolbox
  • 35. Applications of Lucene: searchable email, online documentation search, version control and content management, content search, and the list goes on.