Supervised By
Dr. Mohamed A. El-Rashidy Eng. Ahmed Ghozia
Dept. of Computer Science & Engineering
Faculty of Electronic Engineering,
Menoufiya University.
 The main purpose of this project is to build our own
search engine, one that meets our needs as a nation.
 The project also adds customized features to the
search engine, chiefly a time-based search engine
that is meant to deal with local and international news.
 Question: What is a search engine?
 How does a web search engine work?
 Web crawler, indexing, ranking
 Lucene, Nutch, Solr
 Who uses Solr?
 Setting up Nutch for web crawling
 Setting up Solr for search
 Running Nutch in Eclipse for development
 Experiments
 Answer: Software that
 builds an index on text
 answers queries using that index
 A search engine offers
 Scalability
 Relevance ranking
 Integration of different data sources (email,
web pages, files, databases, ...)
 A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Ranking
 A web crawler is a program or automated script that browses the
World Wide Web
 It is used to create a copy of all the visited pages for later
processing by a search engine
 It starts with a list of URLs to visit, called the seeds
 URLs are visited recursively according to a set of policies:
 A selection policy (which pages to download)
 A re-visit policy (when to check pages for changes)
 A politeness policy (how to avoid overloading websites)
 A parallelization policy (how to coordinate distributed crawlers)
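
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Nutch is implemented: `crawl`, `fetch_links`, `max_depth`, `max_links_per_page`, and `politeness_delay` are hypothetical names, and the link graph is an in-memory dictionary standing in for real HTTP fetches.

```python
import time
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_depth: int = 3,
          max_links_per_page: int = 5,
          politeness_delay: float = 0.0) -> List[str]:
    """Breadth-first crawl starting from the seeds; returns URLs in visit order."""
    visited: List[str] = []
    seen = set()
    frontier = deque()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append((url, 1))  # seeds are at depth 1

    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        time.sleep(politeness_delay)  # politeness: pause before each fetch
        added = 0
        for link in fetch_links(url):
            # selection policy (simplified): skip known URLs, cap links per page
            if link in seen or added >= max_links_per_page:
                continue
            seen.add(link)
            frontier.append((link, depth + 1))
            added += 1
    return visited


# Hypothetical in-memory "web" used instead of real network access.
WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"], "e": []}
order = crawl(["a"], lambda u: WEB.get(u, []), max_depth=3)
print(order)
```

A re-visit policy and a parallelization policy are left out of this sketch; a real crawler would also respect robots.txt and per-host rate limits.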
 Indexing covers how data is collected, parsed,
and stored to facilitate fast and accurate search query
evaluation.
 The process involves the following steps:
 Data collection
 Data traversal
 Indexing
 Indexing process:
 Convert the document
 Extract text and metadata
 Normalize the text (stop-word removal, stemming)
 Write the (inverted) index
 Example:
 Document 1: "Apache Lucene at Jazoon"
 Document 2: "Jazoon conference"
 Index:
 apache -> 1
 conference -> 2
 jazoon -> 1, 2
 lucene -> 1
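
The example index above can be reproduced with a short Python sketch. It is a minimal illustration, not Lucene's implementation: the `STOP_WORDS` set, `tokenize`, and `build_inverted_index` are hypothetical names, and stemming is omitted for brevity.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not Lucene's default list).
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, "->", ", ".join(map(str, ids)))
```

The word "at" does not appear in the index because it is removed as a stop word during normalization.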
 Ranking: the web search engine responds to a query that a user
enters to satisfy his or her information needs, ordering
the matching documents by relevance
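
One common way to order results by relevance is TF-IDF scoring, sketched below. This is a simplified illustration chosen for this example, not the scoring formula used by Lucene or Solr; `rank` and `tokenize` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]  # term frequency in this document
            if tf == 0:
                continue
            # document frequency: number of documents containing the term
            df = sum(1 for c in doc_terms.values() if term in c)
            score += tf * math.log(1 + n_docs / df)  # rarer terms weigh more
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
results = rank("lucene jazoon", docs)
print([doc_id for doc_id, _ in results])
```

Document 1 matches both query terms and so ranks above document 2, which matches only "jazoon".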
 Lucene is a high-performance, scalable information retrieval
(IR) library
 It lets you add search capabilities to your
applications
 It is a free, open source project implemented in Java
 With Lucene, you can index and search email
messages, mailing-list archives, instant messenger
chats, your wiki pages, and more
 Nutch is web search engine software
 Open source web crawler
 Coded entirely in the Java programming language
 Advantages
 Scalability
 Crawler politeness
 Crawler management
 Quality
 Solr is an open source enterprise search platform based on
the Apache Lucene project
 Powerful full-text search, hit highlighting, faceted
search
 Database integration and rich document (e.g.,
Word, PDF) handling
 Download a binary package (apache-nutch-bin.zip)
 cd apache-nutch-1.X/
 bin/nutch crawl urls -dir crawl -depth 3 -topN 5
 Now you should be able to see the following directories
created:
 crawl/crawldb
 crawl/linkdb
 crawl/segments
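
The `urls` argument in the crawl command names a directory of seed lists, so that directory has to exist before the crawl starts. A minimal Python sketch for creating it follows; the helper `write_seed_file`, the file name `seed.txt`, and the example URL are assumptions made for illustration.

```python
from pathlib import Path


def write_seed_file(urls, directory="urls", filename="seed.txt"):
    """Write one seed URL per line into directory/filename and return the path."""
    seed_dir = Path(directory)
    seed_dir.mkdir(parents=True, exist_ok=True)
    path = seed_dir / filename
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical seed; replace with the sites you want to crawl.
    print(write_seed_file(["http://example.com/"]))
```

Run it from the Nutch directory so that the `urls` folder sits next to `bin/nutch`.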
 If you already have a Solr core set up and wish to index
to it, use
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
 The Solr setup steps that follow show how to set up your Solr
instance and index your crawl data.
 Download the binary file (apache-Solr-bin.zip)
 cd ${APACHE_SOLR_HOME}/example
 java -jar start.jar
 After you start Solr, you should be able to access
the admin console at the following link:
http://localhost:8983/solr/admin/
 Integrate Solr with Nutch:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
 Restart Solr with the command "java -jar start.jar"
under ${APACHE_SOLR_HOME}/example
 Run the Solr index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
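
Once the crawl data is indexed, it can be queried over HTTP. The sketch below builds a request for Solr's `select` handler and reads the JSON response. It assumes the single-core example setup used above, where the handler is reachable at `/solr/select` (newer multi-core installs put a core name in the path). `build_select_url` and `search` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return base_url.rstrip("/") + "/select?" + params


def search(base_url, query, rows=10):
    """Send the query to a running Solr instance and return the matching docs."""
    with urlopen(build_select_url(base_url, query, rows)) as resp:
        return json.load(resp)["response"]["docs"]


if __name__ == "__main__":
    # Requires Solr to be running locally with the Nutch schema installed.
    for doc in search("http://127.0.0.1:8983/solr/", "news"):
        print(doc.get("url"), doc.get("title"))
```

The `url` and `title` field names are assumed to come from the Nutch schema.xml copied in the integration step; adjust them if your schema differs.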
 Crawling the Egyptian Universities
 Crawling the Arabic news websites
 Crawling the Arabic news websites
Mustafa Mohammed Ahmed Elkhiat
Email: melkhiat@gmail.com
A customized web search engine [autosaved]
