Ms. T.Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
Open Source Search Engine Framework
When adding search to a website, one can choose between a commercial search engine and an open
source one. For most websites, a commercial search engine is not a feasible alternative because of
the fees required and because such products focus on large-scale sites. Open source search engines,
on the other hand, can offer the same functionality as commercial ones (some are capable of managing
large amounts of data), with the benefits of the open source philosophy: no cost, actively maintained
software, and the possibility of customizing the code to satisfy specific needs.
Many open source alternatives are available today, and each has different characteristics that must
be taken into consideration when deciding which one to install on a website. These search engines
can be classified according to the programming language in which they are implemented, how they
store the index (inverted file, database, or other file structure), their searching capabilities
(Boolean operators, fuzzy search, use of stemming, etc.), their way of ranking, the types of files
they can index (HTML, PDF, plain text, etc.), and whether they support on-line indexing and/or
incremental indexes.
Example:
Several open source search engines are available, including:
Nutch, Lucene, ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE,
ManagingGigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Omega,
OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS,
WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair.
Nutch: (A Flexible and Scalable Open-Source Web Search Engine)
Nutch is an open-source Web search engine that can be used at global, local, and even personal scale.
Its initial design goal was to enable a transparent alternative for global Web search in the public
interest; one of its signature features is the ability to “explain” its result rankings. Recent work has
emphasized how it can also be used for intranets; by local communities with richer data models, such
as Creative Commons metadata-enabled search for licensed content; and on a personal scale to index a
user's files, email, and web-surfing history.
Architecture:
Nutch has a highly modular architecture that uses plug-in APIs for media-type parsing, HTML
analysis, data retrieval protocols, and queries. The core has four major components.
a) Searcher:
Given a query, the searcher must quickly find a small relevant subset of a corpus of documents and
then present them. Finding a large relevant subset is normally done with an inverted index of the
corpus; ranking within that set then produces the most relevant documents, which must be summarized
for display.
b) Indexer:
Creates the inverted index from which the searcher extracts results. It uses Lucene for storing indexes.
c) Database:
Stores the document contents for indexing and later summarization by the searcher, along with
information such as the link structure of the document space and the time each document was last fetched.
d) Fetcher:
Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written
entirely from scratch.
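
The relationship between the indexer and the searcher can be illustrated with a minimal sketch. It is not Nutch or Lucene code: the function names (tokenize, build_index, search), the whitespace tokenizer, and the ranking by summed term frequency are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Indexer: map each term to a posting dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query, top_k=3):
    """Searcher: gather candidates from the postings, then rank them.

    Ranking here is the summed term frequency of the query terms,
    an assumption made for illustration only.
    """
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


docs = {
    "d1": "open source search engine",
    "d2": "commercial search engine with search fees",
    "d3": "web crawler",
}
index = build_index(docs)
print(search(index, "search engine"))  # ['d2', 'd1']
```

The inverted index narrows the corpus to the documents that contain at least one query term; only that subset is scored and sorted.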
Crawling:
An intranet or niche search engine might only take a single machine a few hours to crawl, while a
whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists
of generating a fetch list from the webdb, fetching those pages, parsing them for links, and then
updating the webdb. Nutch uses a uniform refresh policy: all pages are refetched at the same interval
(30 days, by default).
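
The crawling cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the webdb is modelled as a dictionary from URL to the day of the last fetch, FAKE_WEB is a hypothetical in-memory stand-in for real HTTP fetching, and all function names are invented for the example.

```python
import re

# Hypothetical in-memory "web": URL -> HTML. A real fetcher would use HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a>',
    "http://b.example/": '<a href="http://a.example/">a</a> '
                         '<a href="http://c.example/">c</a>',
    "http://c.example/": "no links here",
}

REFETCH_INTERVAL_DAYS = 30  # uniform refresh interval mentioned in the text


def generate_fetch_list(webdb, today):
    """Select URLs never fetched, or last fetched at least 30 days ago."""
    return sorted(
        url for url, last in webdb.items()
        if last is None or today - last >= REFETCH_INTERVAL_DAYS
    )


def parse_links(html):
    """Extract href targets with a simple regular expression."""
    return re.findall(r'href="([^"]+)"', html)


def crawl_cycle(webdb, today):
    """One cycle: generate fetch list, fetch, parse for links, update webdb."""
    fetch_list = generate_fetch_list(webdb, today)
    for url in fetch_list:
        html = FAKE_WEB.get(url, "")
        webdb[url] = today                 # record the fetch time
        for link in parse_links(html):
            webdb.setdefault(link, None)   # newly discovered, not yet fetched
    return fetch_list


webdb = {"http://a.example/": None}        # seed URL
for day in range(3):
    print(day, crawl_cycle(webdb, today=day))
```

Each cycle fetches only what the previous cycle discovered, so the crawl frontier grows one link depth per cycle.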
Indexing Text:
Lucene meets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of
Lucene’s multi-field case-folding keyword and phrase search in URLs, anchor text, and document
text.
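
As a rough illustration of multi-field, case-folded phrase search, the sketch below checks a query phrase against the URL, anchor text, and document text of one page. It does not use the Lucene API; the field names, the tokenizer, and the helper names are assumptions.

```python
import re


def fold(text):
    """Case-fold and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """True if the token list contains the phrase as consecutive tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def matching_fields(doc, query):
    """Return the names of the fields whose folded text contains the phrase."""
    phrase = fold(query)
    if not phrase:
        return []
    return [name for name, text in doc.items()
            if contains_phrase(fold(text), phrase)]


page = {
    "url": "http://example.org/Open-Source",
    "anchor": "Open Source Search",
    "content": "The source code is open to everyone.",
}
print(matching_fields(page, "open source"))  # ['url', 'anchor']
```

The document text contains both words but not as a phrase, so only the URL and anchor fields match.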
Indexing Hypertext:
Lucene provides an inverted-file full-text index, which suffices for indexing text but not the
additional tasks required by a web search engine. In addition to this, Nutch implements a link
database to provide efficient access to the Web's link graph, and a page database that stores crawled
pages for indexing, summarizing, and serving to users, as well as supporting other functions such as
crawling and link analysis.
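
A link database can be thought of as the inverse of the outlinks extracted during parsing. The following sketch, with a hypothetical build_link_db function and made-up URLs, shows one way to derive inlinks (with their anchor text) from outlinks; it is not the Nutch data structure.

```python
from collections import defaultdict


def build_link_db(outlinks):
    """Invert {source: [(target, anchor)]} into {target: [(source, anchor)]}."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


outlinks = {
    "http://a.example/": [("http://c.example/", "search engine")],
    "http://b.example/": [("http://c.example/", "Nutch"),
                          ("http://a.example/", "home")],
}
link_db = build_link_db(outlinks)
print(link_db["http://c.example/"])
```

With inlinks available per page, both anchor-text indexing and link analysis can look up "who links here" without scanning every crawled page.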
Removing Duplicates:
The nutch dedup command eliminates duplicate documents from a set of Lucene indices for Nutch
segments, so it inherently requires access to all the segments at once. It's a batch-mode process that
has to be run before running searches to prevent the search from returning duplicate documents.
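
The text does not say how duplicates are recognized. The sketch below assumes that two documents are duplicates when their content has the same MD5 hash, and keeps the first one seen; the function name and sample data are hypothetical.

```python
import hashlib


def dedup(documents):
    """Keep the first document seen for each distinct content hash.

    Assumption: duplicates are detected by an MD5 hash of the content.
    """
    seen = set()
    unique = []
    for url, content in documents:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, content))
    return unique


documents = [
    ("http://a.example/", "same page"),
    ("http://mirror.example/", "same page"),
    ("http://b.example/", "another page"),
]
print([url for url, _ in dedup(documents)])
```

Because the set of seen hashes must cover every document, the process needs all segments at once, which matches the batch-mode nature described above.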
Link Analysis:
Nutch includes a link analysis algorithm similar to PageRank. Distributed link analysis is a bulk
synchronous parallel process. At the beginning of each phase, the list of URLs whose scores must be
updated is divided up into many chunks; in the middle, many processes produce score-edit files by
finding all the links into pages in their particular chunk. At the end, an updating phase reads the
score-edit files one at a time, merging their results into new scores for the pages in the web database.
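
The three steps of a phase (divide into chunks, produce score edits, merge) can be sketched as below. This is a single-process illustration, not Nutch's algorithm: the PageRank-style update formula, the damping value of 0.85, the chunk size, and all names are assumptions, and each chunk is processed in turn rather than by a separate process.

```python
DAMPING = 0.85  # assumed value; the text does not give one


def chunks(items, size):
    """Divide the URL list into chunks of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_edits_for_chunk(chunk, inlinks, outdegree, scores):
    """Worker step: compute new scores for one chunk from the links into it."""
    n = len(scores)
    edits = {}
    for page in chunk:
        incoming = sum(scores[src] / outdegree[src]
                       for src in inlinks.get(page, []))
        edits[page] = (1 - DAMPING) / n + DAMPING * incoming
    return edits


def run_phase(outlinks, scores, chunk_size=2):
    """One phase: split URLs, produce score edits, then merge them."""
    inlinks = {}
    for src, targets in outlinks.items():
        for target in targets:
            inlinks.setdefault(target, []).append(src)
    outdegree = {page: len(targets) for page, targets in outlinks.items()}

    # All workers read the same old scores (the synchronous part).
    edit_files = [score_edits_for_chunk(c, inlinks, outdegree, scores)
                  for c in chunks(sorted(scores), chunk_size)]

    # Updating phase: read the edit files one at a time and merge.
    new_scores = dict(scores)
    for edits in edit_files:
        new_scores.update(edits)
    return new_scores


outlinks = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = {page: 1 / len(outlinks) for page in outlinks}
for _ in range(50):
    scores = run_phase(outlinks, scores)
print(scores)
```

Because every worker reads only the scores from the previous phase, the chunks can be processed independently and merged afterwards.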
Searching:
Nutch's search user interface runs as a JavaServer Page (JSP) that parses the user's textual query and
invokes the search method of a NutchBean. If Nutch is running on a single server, this translates the
user's query into a Lucene query and gets a list of hits from Lucene, which the JSP then renders into
HTML. If Nutch is distributed across several servers, the NutchBean's search method instead
remotely invokes the search methods of other NutchBeans on other machines, which can be
configured either to perform the search locally as described above or to farm pieces of the work out
to yet other servers.
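
The delegation pattern described above can be sketched in a few lines. The classes below are hypothetical stand-ins, not the NutchBean API: a local searcher answers from its own hit table, and a distributed searcher forwards the query to its backends (which may themselves be distributed) and merges the results.

```python
import heapq


class LocalSearcher:
    """Hypothetical node that searches its own index."""

    def __init__(self, hits):
        self.hits = hits  # {query: [(score, doc_id), ...]}

    def search(self, query, top_k):
        return heapq.nlargest(top_k, self.hits.get(query, []))


class DistributedSearcher:
    """Forwards the query to every backend and merges the partial hit lists."""

    def __init__(self, backends):
        self.backends = backends

    def search(self, query, top_k):
        merged = []
        for backend in self.backends:
            merged.extend(backend.search(query, top_k))
        return heapq.nlargest(top_k, merged)


node1 = LocalSearcher({"nutch": [(0.9, "d1"), (0.2, "d2")]})
node2 = LocalSearcher({"nutch": [(0.5, "d3")]})
node3 = LocalSearcher({"nutch": [(0.7, "d4")]})

# The front end delegates to node1 and to another distributed searcher.
front = DistributedSearcher([node1, DistributedSearcher([node2, node3])])
print(front.search("nutch", top_k=2))  # [(0.9, 'd1'), (0.7, 'd4')]
```

Since both classes expose the same search method, a backend can be a local node or another layer of delegation without the caller knowing the difference.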
Summarizing:
Summaries on a results page are designed to avoid click-throughs. By providing as much relevant
information as possible in a small amount of text, they help users improve precision. Nutch's
summarizer works by retokenizing a text string containing the entire original document, extracting a
minimal set of excerpts containing five words of context on each side of each hit in the document,
then deciding which excerpts to include in the final summary. It orders the excerpts with the best ones
first, preferring longer excerpts over shorter ones, and excerpts with more hits above excerpts with
fewer, and then it truncates the total summary to a maximum of twenty words.
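
The summarizing steps can be sketched as follows. The five words of context and the twenty-word limit come from the text; everything else is an assumption made for illustration, including the whitespace tokenizer, the merging of overlapping windows, and the choice to sort by excerpt length first and hit count second.

```python
CONTEXT = 5      # words of context on each side of a hit (from the text)
MAX_WORDS = 20   # maximum summary length in words (from the text)


def summarize(document, query):
    """Build a query-biased summary from excerpts around each hit."""
    words = document.split()
    terms = {t.lower() for t in query.split()}
    hits = [i for i, w in enumerate(words)
            if w.lower().strip(".,;:!?") in terms]

    # Build a window around each hit; merge windows that touch or overlap
    # so that the excerpts form a minimal set. Each span is [start, end, hits].
    spans = []
    for i in hits:
        start, end = max(0, i - CONTEXT), min(len(words), i + CONTEXT + 1)
        if spans and start <= spans[-1][1]:
            spans[-1][1] = end
            spans[-1][2] += 1
        else:
            spans.append([start, end, 1])

    # Assumed ordering: longer excerpts first, then excerpts with more hits.
    spans.sort(key=lambda s: (-(s[1] - s[0]), -s[2]))

    summary = []
    for start, end, _ in spans:
        summary.extend(words[start:end])
    return " ".join(summary[:MAX_WORDS])


words = ["w%d" % i for i in range(40)]
words[10] = "nutch"
print(summarize(" ".join(words), "Nutch"))
```

A production summarizer would also mark the boundaries between excerpts and highlight the hits; those details are omitted here.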
