2. Agenda
• Foreword: Web Crawler
– HTML Parser
• Practice
– Feed Crawler
• Prototype demo
• Conclusion
3. HTML Parser
• HTML found on the Web is usually dirty,
ill-formed, and unsuitable for further
processing.
• First clean up the mess and bring
order to tags, attributes and
ordinary text.
4. Well-known Parsers
• Access the information using
standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML
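The parsers above are Java libraries; the payoff they share is that once dirty HTML has been cleaned into well-formed XML, any standard XML interface can query it. A minimal sketch of that idea using Python's stdlib `xml.etree.ElementTree` in place of the DOM/SAX interfaces those libraries expose (the page content here is made up):

```python
# Sketch: once a cleaner has produced well-formed XML, standard XML
# tooling applies. ElementTree stands in for a DOM/SAX interface.
import xml.etree.ElementTree as ET

cleaned = """<html><body>
  <h1>Title</h1>
  <p>First <a href="http://example.com">link</a></p>
</body></html>"""

root = ET.fromstring(cleaned)
# XPath-style queries work because the markup is now well-formed.
links = [a.get("href") for a in root.iter("a")]
heading = root.find(".//h1").text
```

The same queries would fail on the raw page, since `ET.fromstring` rejects ill-formed input — which is exactly the gap the cleaners fill.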
5. Parser inner structure
• HTML scanner
– Pre-processing action
• Tag balancer
– Reorders individual elements
– Produces well-formed XML
• Extraction
• Transformation
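The scanner + tag balancer stages above can be sketched with the stdlib: `html.parser.HTMLParser` plays the scanner (it tolerates dirty input), and a small stack keeps track of open tags so unclosed elements are closed in the right order, yielding well-formed XML. This is an illustration of the technique, not any particular library's implementation:

```python
# Minimal tag balancer: scan tokens, track open tags on a stack,
# auto-close whatever the page left open, emit well-formed XML.
from html.parser import HTMLParser
from xml.sax.saxutils import escape

class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []      # emitted XML fragments
        self.stack = []    # currently open tags

    def handle_starttag(self, tag, attrs):
        attr_str = ""
        for k, v in attrs:
            attr_str += ' %s="%s"' % (k, escape(v or "", {'"': "&quot;"}))
        self.out.append("<%s%s>" % (tag, attr_str))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close any tags still open inside this one (the reordering step).
        while self.stack and self.stack[-1] != tag:
            self.out.append("</%s>" % self.stack.pop())
        if self.stack:
            self.stack.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))

    def balance(self, html):
        self.feed(html)
        self.close()
        while self.stack:  # close anything still open at end of input
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)

balanced = TagBalancer().balance("<p>dirty <b>html</p>")
# the unclosed <b> is closed before </p>
```

Feeding it `<p>dirty <b>html</p>` yields `<p>dirty <b>html</b></p>`, which a strict XML parser will now accept.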
7. Extraction
• Text extraction
– e.g. as input to text search engine
databases
• Link extraction
– for crawling through web pages or harvesting
email addresses
• Screen scraping
– for programmatic data input from web pages
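Text extraction and link extraction can both be done in one pass over the token stream. A stdlib sketch (the page snippet is invented): collect the visible text for a search index and every `href` for the crawl frontier.

```python
# One-pass text + link extraction with the stdlib HTML scanner.
from html.parser import HTMLParser

class Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)   # crawl frontier / harvesting

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())  # search-engine input

page = '<h1>News</h1><p>Read <a href="/story1">this</a> story.</p>'
ex = Extractor()
ex.feed(page)
# ex.text  -> ['News', 'Read', 'this', 'story.']
# ex.links -> ['/story1']
```

Screen scraping follows the same pattern, only with tag- and attribute-specific rules for the fields being harvested.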
8. Extraction
• Resource extraction
– collecting images or sound
• A browser front end
– the preliminary stage of page display
• Link checking
– ensuring links are valid
• Site monitoring
– checking for page differences beyond
simplistic diffs
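"Beyond simplistic diffs" can mean comparing pages by their extracted text rather than their raw bytes, so markup-only edits (reformatting, rotating ads) do not register as changes. A small sketch of that idea, with made-up page snapshots:

```python
# Site monitoring sketch: fingerprint the extracted, whitespace-normalized
# text, so only content changes alter the fingerprint.
import hashlib
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def content_fingerprint(html):
    p = TextOnly()
    p.feed(html)
    text = " ".join(" ".join(p.parts).split())  # normalize whitespace
    return hashlib.sha256(text.encode()).hexdigest()

old = "<html><body><p>Hello world</p></body></html>"
new = "<html><body><div class='v2'><p>Hello   world</p></div></body></html>"
# same fingerprint: only the markup changed, not the content
```

A raw diff of `old` and `new` would flag a change; the fingerprints agree because the visible text is identical.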
9. Transformation
• URL rewriting
– modifying some or all links on a page
• Site capture
– moving content from the web to local disk
• Censorship
– removing offending words and phrases from
pages
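URL rewriting and site capture go together: when a page is moved to local disk, its relative links must be rewritten against the original page URL. A sketch under assumed inputs (the base URL and page snippet are hypothetical; a regex is enough here, a real tool would rewrite the parsed tree):

```python
# URL-rewriting sketch: absolutize every href so a captured copy
# still resolves its links.
import re
from urllib.parse import urljoin

BASE = "http://example.com/blog/"   # hypothetical URL of the captured page

def rewrite_links(html, base=BASE):
    return re.sub(r'href="([^"]*)"',
                  lambda m: f'href="{urljoin(base, m.group(1))}"',
                  html)

page = '<a href="post1.html">post</a> <a href="/about">about</a>'
rewritten = rewrite_links(page)
# -> '<a href="http://example.com/blog/post1.html">post</a>
#    '<a href="http://example.com/about">about</a>'
```

Censorship is the same shape of transformation, with a substitution over offending words instead of over `href` values.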
10. Transformation
• HTML cleanup
– correcting erroneous pages
• Ad removal
– excising URLs referencing advertising
• Conversion to XML
– moving existing web pages to XML
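Once a page has been converted to XML, ad removal becomes a tree transformation: drop any element whose link points at a known advertising host. A sketch with an assumed blocklist and invented page:

```python
# Ad-removal sketch on an already well-formed page: delete elements whose
# href/src resolves to a blocklisted advertising host.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

AD_HOSTS = {"ads.example.net", "tracker.example.org"}  # assumed blocklist

def strip_ads(xml_page):
    root = ET.fromstring(xml_page)
    for parent in root.iter():
        for child in list(parent):          # snapshot before removing
            href = child.get("href") or child.get("src") or ""
            if urlparse(href).hostname in AD_HOSTS:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

page = ('<body><p>story</p>'
        '<a href="http://ads.example.net/banner">buy!</a></body>')
cleaned = strip_ads(page)
# -> '<body><p>story</p></body>'
```

HTML cleanup (the previous bullet) is what makes this kind of tree-level transformation possible at all: `ET.fromstring` only accepts well-formed input.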
11. Practice
• Feed Crawler
– HTML
• Bloglines, Feedage
– XML
• RssMountain
– JSON
• Google AJAX Feed API
• Prototype
– Demo
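For the XML branch of the feed crawler, a stdlib sketch of pulling item titles and links out of an RSS 2.0 document (the feed content below is made up, not from RssMountain):

```python
# RSS 2.0 parsing sketch: extract (title, link) pairs from feed items.
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
items = [(i.findtext("title"), i.findtext("link"))
         for i in root.iter("item")]
# items -> [('First post', 'http://example.com/1'),
#           ('Second post', 'http://example.com/2')]
```

The HTML sources (Bloglines, Feedage) need the cleanup-and-extract pipeline from the earlier slides first, while the JSON case (Google AJAX Feed API) arrives already structured.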
12. Conclusion
• Page search, image search, news
search, blog search, feed search ...
• Fault tolerance in text processing
• Text mining on the Web
• Q&A