2. Agenda
• Foreword: Web Crawler
– HTML Parser
• Practice
– Feed Crawler
• Prototype demo
• Conclusion
3. HTML Parser
• HTML found on the Web is usually dirty,
ill-formed, and unsuitable for further
processing.
• First clean up the mess and bring
order to tags, attributes and
ordinary text.
4. Well-known Parsers
• Access the information using
standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML
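The parsers above are Java libraries; the payoff they share is that once dirty HTML has been cleaned into well-formed XML, any standard XML interface can query it. A minimal sketch of that idea using Python's stdlib `xml.etree.ElementTree` in place of the DOM/SAX interfaces those libraries expose (the page content here is made up):

```python
# Sketch: once a cleaner has produced well-formed XML, standard XML
# tooling applies. ElementTree stands in for a DOM/SAX interface.
import xml.etree.ElementTree as ET

cleaned = """<html><body>
  <h1>Title</h1>
  <p>First <a href="http://example.com">link</a></p>
</body></html>"""

root = ET.fromstring(cleaned)
# XPath-style queries work because the markup is now well-formed.
links = [a.get("href") for a in root.iter("a")]
heading = root.find(".//h1").text
```

The same queries would fail on the raw page, since `ET.fromstring` rejects ill-formed input — which is exactly the gap the cleaners fill.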
5. Parser inner structure
• HTML scanner
– Pre-processing action
• Tag balancer
– Reorders individual elements
– Produces well-formed XML
• Extraction
• Transformation
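The scanner + tag balancer stages above can be sketched with the stdlib: `html.parser.HTMLParser` plays the scanner (it tolerates dirty input), and a small stack keeps track of open tags so unclosed elements are closed in the right order, yielding well-formed XML. This is an illustration of the technique, not any particular library's implementation:

```python
# Minimal tag balancer: scan tokens, track open tags on a stack,
# auto-close whatever the page left open, emit well-formed XML.
from html.parser import HTMLParser
from xml.sax.saxutils import escape

class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []      # emitted XML fragments
        self.stack = []    # currently open tags

    def handle_starttag(self, tag, attrs):
        attr_str = ""
        for k, v in attrs:
            attr_str += ' %s="%s"' % (k, escape(v or "", {'"': "&quot;"}))
        self.out.append("<%s%s>" % (tag, attr_str))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close any tags still open inside this one (the reordering step).
        while self.stack and self.stack[-1] != tag:
            self.out.append("</%s>" % self.stack.pop())
        if self.stack:
            self.stack.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))

    def balance(self, html):
        self.feed(html)
        self.close()
        while self.stack:  # close anything still open at end of input
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)

balanced = TagBalancer().balance("<p>dirty <b>html</p>")
# the unclosed <b> is closed before </p>
```

Feeding it `<p>dirty <b>html</p>` yields `<p>dirty <b>html</b></p>`, which a strict XML parser will now accept.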
7. Extraction
• Text extraction
– e.g. as input to text search engine
databases
• Link extraction
– for crawling through web pages or harvesting
email addresses
• Screen scraping
– for programmatic data input from web pages
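Text extraction and link extraction can both be done in one pass over the token stream. A stdlib sketch (the page snippet is invented): collect the visible text for a search index and every `href` for the crawl frontier.

```python
# One-pass text + link extraction with the stdlib HTML scanner.
from html.parser import HTMLParser

class Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)   # crawl frontier / harvesting

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())  # search-engine input

page = '<h1>News</h1><p>Read <a href="/story1">this</a> story.</p>'
ex = Extractor()
ex.feed(page)
# ex.text  -> ['News', 'Read', 'this', 'story.']
# ex.links -> ['/story1']
```

Screen scraping follows the same pattern, only with tag- and attribute-specific rules for the fields being harvested.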
8. Extraction
• Resource extraction
– collecting images or sound
• A browser front end
– the preliminary stage of page display
• Link checking
– ensuring links are valid
• Site monitoring
– checking for page differences beyond
simplistic diffs
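"Beyond simplistic diffs" can mean comparing pages by their extracted text rather than their raw bytes, so markup-only edits (reformatting, rotating ads) do not register as changes. A small sketch of that idea, with made-up page snapshots:

```python
# Site monitoring sketch: fingerprint the extracted, whitespace-normalized
# text, so only content changes alter the fingerprint.
import hashlib
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def content_fingerprint(html):
    p = TextOnly()
    p.feed(html)
    text = " ".join(" ".join(p.parts).split())  # normalize whitespace
    return hashlib.sha256(text.encode()).hexdigest()

old = "<html><body><p>Hello world</p></body></html>"
new = "<html><body><div class='v2'><p>Hello   world</p></div></body></html>"
# same fingerprint: only the markup changed, not the content
```

A raw diff of `old` and `new` would flag a change; the fingerprints agree because the visible text is identical.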
9. Transformation
• URL rewriting
– modifying some or all links on a page
• Site capture
– moving content from the web to local disk
• Censorship
– removing offending words and phrases from
pages
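URL rewriting and site capture go together: when a page is moved to local disk, its relative links must be rewritten against the original page URL. A sketch under assumed inputs (the base URL and page snippet are hypothetical; a regex is enough here, a real tool would rewrite the parsed tree):

```python
# URL-rewriting sketch: absolutize every href so a captured copy
# still resolves its links.
import re
from urllib.parse import urljoin

BASE = "http://example.com/blog/"   # hypothetical URL of the captured page

def rewrite_links(html, base=BASE):
    return re.sub(r'href="([^"]*)"',
                  lambda m: f'href="{urljoin(base, m.group(1))}"',
                  html)

page = '<a href="post1.html">post</a> <a href="/about">about</a>'
rewritten = rewrite_links(page)
# -> '<a href="http://example.com/blog/post1.html">post</a>
#    '<a href="http://example.com/about">about</a>'
```

Censorship is the same shape of transformation, with a substitution over offending words instead of over `href` values.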
10. Transformation
• HTML cleanup
– correcting erroneous pages
• Ad removal
– excising URLs referencing advertising
• Conversion to XML
– moving existing web pages to XML
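Once a page has been converted to XML, ad removal becomes a tree transformation: drop any element whose link points at a known advertising host. A sketch with an assumed blocklist and invented page:

```python
# Ad-removal sketch on an already well-formed page: delete elements whose
# href/src resolves to a blocklisted advertising host.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

AD_HOSTS = {"ads.example.net", "tracker.example.org"}  # assumed blocklist

def strip_ads(xml_page):
    root = ET.fromstring(xml_page)
    for parent in root.iter():
        for child in list(parent):          # snapshot before removing
            href = child.get("href") or child.get("src") or ""
            if urlparse(href).hostname in AD_HOSTS:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

page = ('<body><p>story</p>'
        '<a href="http://ads.example.net/banner">buy!</a></body>')
cleaned = strip_ads(page)
# -> '<body><p>story</p></body>'
```

HTML cleanup (the previous bullet) is what makes this kind of tree-level transformation possible at all: `ET.fromstring` only accepts well-formed input.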
11. Practice
• Feed Crawler
– HTML
• Bloglines, Feedage
– XML
• RssMountain
– JSON
• Google AJAX Feed API
• Prototype
– Demo
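For the XML branch of the feed crawler, a stdlib sketch of pulling item titles and links out of an RSS 2.0 document (the feed content below is made up, not from RssMountain):

```python
# RSS 2.0 parsing sketch: extract (title, link) pairs from feed items.
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
items = [(i.findtext("title"), i.findtext("link"))
         for i in root.iter("item")]
# items -> [('First post', 'http://example.com/1'),
#           ('Second post', 'http://example.com/2')]
```

The HTML sources (Bloglines, Feedage) need the cleanup-and-extract pipeline from the earlier slides first, while the JSON case (Google AJAX Feed API) arrives already structured.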
12. Conclusion
• Page search, image search, news
search, blog search, feed search ...
• Fault tolerance in text processing
• Text mining on the Web
• Q&A