1. Search Engine & Web Crawling
   Presented By: Vinay Arora, Assistant Professor, CSED, Thapar University, Patiala (Punjab)
2. Contents
   • What is a search engine?
   • Examples of and the need for a search engine
   • How a search engine works
   • Web crawler
   • Web crawling
     ▫ Factors affecting web crawling: robots.txt; sitemap.xml; manual submission of websites into the database of a specific search engine; amendment in the <a> tag with the href attribute
   • Areas related to web crawling
     ▫ Indexing
     ▫ Searching algorithms
     ▫ Data mining and analysis
   • Web crawler as an add-on
     ▫ Downloading a whole website (offline dump); demo tool: HTTrack
   • Examples of web crawlers
     ▫ Open source
3. What is a search engine?
   • A search engine is a searchable database that collects information on web pages from the Internet.
   • It indexes that information and stores the result in a huge database where it can be searched quickly.
   • The search engine provides an interface to search the database.
   • When you enter a keyword, the search engine looks through billions of web pages to help you find the ones you are looking for.
4. Examples of search engines
5. Need for a search engine
   • Variety: An Internet search can generate a variety of sources of information. Results from online encyclopedias, news stories, university studies, discussion boards, and even personal blogs can come up in a basic Internet search. This variety lets anyone searching for information choose the types of sources they would like to use, or draw on several sources to gain a greater understanding of a subject.
   • Organization: Internet search engines help to organize the Internet and individual websites. They turn the vast amount of information that can sometimes be scattered across various places, even on the same web page, into an organized list that is easier to use.
   • Precision: Search engines can provide refined, more precise results. Being able to search more precisely lets you cut down the amount of information your search generates.
6. Searching for the keyword “thapar university” on Google
7. How does a search engine work?
   A search engine has three parts:
   • Spider: A robot program (also called a spider or bot) designed to track down web pages. It follows the links those pages contain and adds the information to the search engine's database. Example: Googlebot (Google's robot program).
   • Index: A database containing a copy of each web page gathered by the spider.
   • Search engine software: Technology that enables users to query the index and returns results in a ranked order.
8. How does a search engine work? (contd.)
9. Web crawler
   • A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
   • Other names: crawler, spider, robot (or bot), web agent, wanderer, worm.
   • Examples: Googlebot, MSNBot, etc.
10. Sequential crawler
   • This is a sequential crawler.
   • Seeds can be any list of starting URLs.
   • The order of page visits is determined by the frontier data structure.
   • The stop criterion can be anything, such as a limit on the number of pages fetched (see the sketch below).
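A minimal sketch of such a sequential crawler, assuming a FIFO frontier (breadth-first visit order) and a page-count stop criterion; the seed URL and the limit are placeholders, not values from the slides:

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)                 # URLs yet to be visited
    visited = set()
    while frontier and len(visited) < max_pages:   # stop criterion
        url = frontier.popleft()            # FIFO frontier => breadth-first
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                        # skip unreachable pages
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:           # enqueue extracted links
            frontier.append(urljoin(url, link))
    return visited

print(crawl(["https://example.com/"], max_pages=3))
```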
11. Architecture of a crawler
12. Architecture of a crawler (contd.)
   • URL frontier: Holds the URLs yet to be fetched in the current crawl. At first a seed set is stored in the URL frontier, and the crawler begins by taking a URL from the seed set.
   • DNS: Domain name service resolution; looks up the IP address for a domain name.
   • Fetch: Generally uses the HTTP protocol to fetch the URL.
   • Parse: The page is parsed; text (images, videos, etc.) and links are extracted. (A sketch of these stages follows below.)
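These stages map directly onto Python's standard library; a hedged sketch (example.com is a placeholder host):

```python
import socket
from urllib.request import urlopen

# DNS: resolve the domain name to an IP address.
print(socket.gethostbyname("example.com"))

# Fetch: retrieve the page over HTTP(S).
page = urlopen("https://example.com/").read()

# Parse: hand the HTML to a parser (e.g. the LinkExtractor sketched
# earlier) that extracts text and outgoing links for the URL frontier.
```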
13. Architecture of a crawler (contd.)
   • Content seen?: Tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page (sketched below).
   • URL filter: Decides whether an extracted URL should be excluded from the frontier (e.g. because of robots.txt); the URL should also be normalized.
   • Duplicate URL elimination: The URL is checked so that duplicates are not added to the frontier.
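A hedged sketch of the content-seen test using an exact-content hash as the fingerprint; production crawlers often use near-duplicate fingerprints such as shingling or simhash, which the slides do not specify:

```python
import hashlib

seen_fingerprints = set()

def content_seen(html: str) -> bool:
    """Return True if a page with identical content was already crawled."""
    fp = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

print(content_seen("<html>hello</html>"))  # False - first sighting
print(content_seen("<html>hello</html>"))  # True  - duplicate content
```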
14. Web crawling and the factors affecting it
   • Crawling (spidering): Finding and downloading web pages automatically.
   • Factors include anything that diverts or restricts the crawler as it performs the crawl:
     ▫ robots.txt
     ▫ sitemap.xml
     ▫ manual submission of websites into the database of a specific search engine
     ▫ amendment in the <a> tag with the href attribute
15. robots.txt
   • The robots exclusion standard, also known as the robots exclusion protocol or robots.txt protocol, is a standard used by websites to communicate with web crawlers and other web robots.
   • The standard specifies the instruction format used to tell a robot which areas of the website should not be processed or scanned (an example follows below).
   • Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.
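An illustrative robots.txt checked with Python's standard urllib.robotparser; the rules and the bot name are invented for the example:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("Googlebot", "/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "/public/page.html"))   # True
print(rp.can_fetch("BadBot", "/public/page.html"))      # False
```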
16. robots.txt (contd.)
17. sitemap.xml
   • The Sitemaps protocol allows a webmaster to inform search engines about the URLs on a website that are available for crawling.
   • A sitemap is an XML file that lists the URLs of a site (an example follows below).
   • It lets webmasters include additional information about each URL: when it was last updated, how often it changes, and how important it is relative to the other URLs on the site.
   • This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.
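An illustrative sitemap.xml in the Sitemaps protocol format; the URL and the values are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2015-01-01</lastmod>   <!-- when the URL was last updated -->
    <changefreq>weekly</changefreq> <!-- how often it changes -->
    <priority>0.8</priority>        <!-- importance relative to other URLs -->
  </url>
</urlset>
```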
18. sitemap.xml (contd.)
19. Manual submission of websites into the database of a specific search engine
20. Amendment in the <a> tag with the href attribute
   • The <a> tag defines a hyperlink, which is used to link from one page to another.
   • A normal link, which crawlers will follow:
     <a href="http://www.w3schools.com">Visit W3Schools.com!</a>
   • The same link marked rel="nofollow", which tells search-engine crawlers not to follow it:
     <a rel="nofollow" href="http://www.w3schools.com">Visit W3Schools.com!</a>
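A hedged sketch of how a crawler might skip rel="nofollow" links while extracting hyperlinks, using only the Python standard library:

```python
from html.parser import HTMLParser

class NofollowAwareExtractor(HTMLParser):
    """Collects only the links a polite crawler should follow."""
    def __init__(self):
        super().__init__()
        self.followable = []
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        rels = (d.get("rel") or "").lower().split()
        if d.get("href") and "nofollow" not in rels:
            self.followable.append(d["href"])

p = NofollowAwareExtractor()
p.feed('<a href="http://www.w3schools.com">Visit W3Schools.com!</a>'
       '<a rel="nofollow" href="http://spam.example/">Skip me</a>')
print(p.followable)  # ['http://www.w3schools.com']
```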
21. Areas related to web crawling: indexing
   • Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
   • The purpose of storing an index is to optimize the speed and performance of finding relevant documents for a search query.
   • Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power.
22. Areas related to web crawling: indexing (contd.)
   • Search engine architectures vary in the way indexing is performed and in the methods of index storage, to meet the various design factors.
   • Index data structures (an inverted-index sketch follows below):
     ▫ Suffix tree
     ▫ Inverted index
     ▫ Citation index
     ▫ N-gram index
     ▫ Document-term matrix
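A minimal inverted-index sketch: each term maps to the set of documents that contain it, so a query only touches the relevant postings instead of scanning the whole corpus (the two documents are made up):

```python
from collections import defaultdict

docs = {
    1: "search engine indexing collects parses and stores data",
    2: "a web crawler browses the web in an automated manner",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)       # posting: term -> documents

# Conjunctive query: documents containing both "web" and "crawler".
print(index["web"] & index["crawler"])  # {2}
```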
23. Areas related to web crawling: searching algorithms
   • String matching algorithms:
     ▫ Brute-force algorithm
     ▫ Rabin-Karp algorithm (sketched below)
     ▫ Knuth-Morris-Pratt algorithm
     ▫ Boyer-Moore algorithm
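A hedged sketch of the Rabin-Karp algorithm from the list above: a rolling hash of the current text window is compared with the pattern's hash, and characters are compared only on a hash match. The base and modulus are common illustrative choices:

```python
def rabin_karp(text: str, pattern: str, base=256, mod=1_000_000_007):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the window's first char
    p_hash = t_hash = 0
    for i in range(m):                    # hash the pattern and first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)                # verify to rule out hash collisions
        if i < n - m:                     # roll the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

print(rabin_karp("thapar university patiala", "uni"))  # [7]
```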
24. Areas related to web crawling: data mining and analysis
   • Graph mining:
     ▫ Apriori-based approach
     ▫ Pattern-growth approach
     ▫ Pattern-growth-based frequent substructure mining
25. Web crawler as an add-on
   • Downloading a whole website (offline dump) with HTTrack (example invocation below).
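A typical HTTrack command-line invocation, assuming the httrack package is installed; the URL, output directory, and filter are placeholders:

```
httrack "https://example.com/" -O "./example-mirror" "+*.example.com/*" -v
```

Here -O sets the output path for the offline dump, the "+" pattern limits the crawl to the site's own domain, and -v prints progress verbosely.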
26. HTTrack (contd.)
27. HTTrack (contd.)
28. HTTrack (contd.)
29. Examples of web crawlers: open source
30. crawler4j, an open-source Java web crawler that provides a simple interface for crawling the web
31. Application of crawling concepts
32. SEO: Search Engine Optimization
