SlideShare ist ein Scribd-Unternehmen logo
1 von 12
Web crawler
Allan@eSobi.com
Agenda
• Forward of Web Crawler
– HTML Parser

• Practice
– Feed Crawler

• Prototype demo
• Conclusion
HTML Parser
• HTML found on Web is usually dirty,
ill-formed and unsuitable for further
processing.
• First clean up the mess and bring the
order to tags, attributes and
ordinary text.
Well-known Parser
• Access the information using
standard XML interfaces.
• HtmlCleaner
• HtmlParser
• Nekohtml
Parser inner structure
• HTML scanner

– Pre-processing action

• Tag balancer

– Reorders individual elements
– Produces well-formed XML

• Extraction
• Transformation
Example
Extraction
• Text extraction
– for use as input for text search engine
databases for example

• Link extraction
– for crawling through web pages or harvesting
email addresses

• Screen scraping
– for programmatic data input from web pages
Extraction

• Resource extraction

– collecting images or sound

• A browser front end

– the preliminary stage of page display

• Link checking

– ensuring links are valid

• Site monitoring

– checking for page differences beyond
simplistic diffs
Transformation
• URL rewriting
– modifying some or all links on a page

• Site capture
– moving content from the web to local disk

• Censorship
– removing offending words and phrases from
pages
Transformation
• HTML cleanup
– correcting erroneous pages

• AD removal
– excising URLs referencing advertising

• Conversion to XML
– moving existing web pages to XML
Practice
• Feed Crawler
– HTML

• Bloglines, Feedage

– XML

• RssMountain

– JSON

• Google AJAX Feed API

• Prototype
– Demo
Conclusion
• Page search, image search, news
search, blog search, feed search ...
• Fault toleration of text processing
• Text mining in web
• Q&A

Weitere ähnliche Inhalte

Andere mochten auch

Andere mochten auch (11)

Week10 Web Presentation
Week10 Web PresentationWeek10 Web Presentation
Week10 Web Presentation
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applications
 
search engines
search enginessearch engines
search engines
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 

Ähnlich wie Web Crawler

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlervinay arora
 
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
CNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the ApplicationCNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the ApplicationSam Bowne
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producingkurtgessler
 
CNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the ApplicationCNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the ApplicationSam Bowne
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptxDEEPAK948083
 
Search Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and AnalyticsSearch Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and AnalyticsBill Hartzer
 
SEO for developers (session 1)
SEO for developers (session 1)SEO for developers (session 1)
SEO for developers (session 1)RankAbove
 
On page optimization
On page optimizationOn page optimization
On page optimizationshieepoo
 
Search engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATGSearch engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATGVignesh sitaraman
 
How To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly WebsiteHow To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly WebsiteMatt Siltala
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptxScrbifPt
 

Ähnlich wie Web Crawler (20)

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
CNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the ApplicationCNIT 129S: Ch 4: Mapping the Application
CNIT 129S: Ch 4: Mapping the Application
 
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
 
CNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the ApplicationCNIT 129S Ch 4: Mapping the Application
CNIT 129S Ch 4: Mapping the Application
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptx
 
Webcrawler
WebcrawlerWebcrawler
Webcrawler
 
Search Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and AnalyticsSearch Engine Optimization, SEO Audits, and Analytics
Search Engine Optimization, SEO Audits, and Analytics
 
Search
SearchSearch
Search
 
On-Page SEO
On-Page  SEO On-Page  SEO
On-Page SEO
 
Webcrawler
WebcrawlerWebcrawler
Webcrawler
 
Module-5 Ppt.pptx
Module-5 Ppt.pptxModule-5 Ppt.pptx
Module-5 Ppt.pptx
 
SEO for developers (session 1)
SEO for developers (session 1)SEO for developers (session 1)
SEO for developers (session 1)
 
On page optimization
On page optimizationOn page optimization
On page optimization
 
Search engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATGSearch engine optimization (seo) from Endeca & ATG
Search engine optimization (seo) from Endeca & ATG
 
How To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly WebsiteHow To Construct A Search Engine Friendly Website
How To Construct A Search Engine Friendly Website
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
 

Mehr von Allan Huang

Concurrency in Java
Concurrency in  JavaConcurrency in  Java
Concurrency in JavaAllan Huang
 
Build, logging, and unit test tools
Build, logging, and unit test toolsBuild, logging, and unit test tools
Build, logging, and unit test toolsAllan Huang
 
Java JSON Parser Comparison
Java JSON Parser ComparisonJava JSON Parser Comparison
Java JSON Parser ComparisonAllan Huang
 
Netty 4-based RPC System Development
Netty 4-based RPC System DevelopmentNetty 4-based RPC System Development
Netty 4-based RPC System DevelopmentAllan Huang
 
eSobi Website Multilayered Architecture
eSobi Website Multilayered ArchitectureeSobi Website Multilayered Architecture
eSobi Website Multilayered ArchitectureAllan Huang
 
Java New Evolution
Java New EvolutionJava New Evolution
Java New EvolutionAllan Huang
 
Tomcat New Evolution
Tomcat New EvolutionTomcat New Evolution
Tomcat New EvolutionAllan Huang
 
JQuery New Evolution
JQuery New EvolutionJQuery New Evolution
JQuery New EvolutionAllan Huang
 
Responsive Web Design
Responsive Web DesignResponsive Web Design
Responsive Web DesignAllan Huang
 
Boilerpipe Integration And Improvement
Boilerpipe Integration And ImprovementBoilerpipe Integration And Improvement
Boilerpipe Integration And ImprovementAllan Huang
 
Build Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGapBuild Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGapAllan Huang
 
HTML5 Multithreading
HTML5 MultithreadingHTML5 Multithreading
HTML5 MultithreadingAllan Huang
 
HTML5 Offline Web Application
HTML5 Offline Web ApplicationHTML5 Offline Web Application
HTML5 Offline Web ApplicationAllan Huang
 
HTML5 Data Storage
HTML5 Data StorageHTML5 Data Storage
HTML5 Data StorageAllan Huang
 
Java Script Patterns
Java Script PatternsJava Script Patterns
Java Script PatternsAllan Huang
 
Weighted feed recommand
Weighted feed recommandWeighted feed recommand
Weighted feed recommandAllan Huang
 
eSobi Site Initiation
eSobi Site InitiationeSobi Site Initiation
eSobi Site InitiationAllan Huang
 
Architecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EEArchitecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EEAllan Huang
 

Mehr von Allan Huang (20)

Concurrency in Java
Concurrency in  JavaConcurrency in  Java
Concurrency in Java
 
Build, logging, and unit test tools
Build, logging, and unit test toolsBuild, logging, and unit test tools
Build, logging, and unit test tools
 
Drools
DroolsDrools
Drools
 
Java JSON Parser Comparison
Java JSON Parser ComparisonJava JSON Parser Comparison
Java JSON Parser Comparison
 
Netty 4-based RPC System Development
Netty 4-based RPC System DevelopmentNetty 4-based RPC System Development
Netty 4-based RPC System Development
 
eSobi Website Multilayered Architecture
eSobi Website Multilayered ArchitectureeSobi Website Multilayered Architecture
eSobi Website Multilayered Architecture
 
Java New Evolution
Java New EvolutionJava New Evolution
Java New Evolution
 
Tomcat New Evolution
Tomcat New EvolutionTomcat New Evolution
Tomcat New Evolution
 
JQuery New Evolution
JQuery New EvolutionJQuery New Evolution
JQuery New Evolution
 
Responsive Web Design
Responsive Web DesignResponsive Web Design
Responsive Web Design
 
Boilerpipe Integration And Improvement
Boilerpipe Integration And ImprovementBoilerpipe Integration And Improvement
Boilerpipe Integration And Improvement
 
YQL Case Study
YQL Case StudyYQL Case Study
YQL Case Study
 
Build Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGapBuild Cross-Platform Mobile Application with PhoneGap
Build Cross-Platform Mobile Application with PhoneGap
 
HTML5 Multithreading
HTML5 MultithreadingHTML5 Multithreading
HTML5 Multithreading
 
HTML5 Offline Web Application
HTML5 Offline Web ApplicationHTML5 Offline Web Application
HTML5 Offline Web Application
 
HTML5 Data Storage
HTML5 Data StorageHTML5 Data Storage
HTML5 Data Storage
 
Java Script Patterns
Java Script PatternsJava Script Patterns
Java Script Patterns
 
Weighted feed recommand
Weighted feed recommandWeighted feed recommand
Weighted feed recommand
 
eSobi Site Initiation
eSobi Site InitiationeSobi Site Initiation
eSobi Site Initiation
 
Architecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EEArchitecture of eSobi club based on J2EE
Architecture of eSobi club based on J2EE
 

Kürzlich hochgeladen

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Kürzlich hochgeladen (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Web Crawler

  • 2. Agenda • Forward of Web Crawler – HTML Parser • Practice – Feed Crawler • Prototype demo • Conclusion
  • 3. HTML Parser • HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. • First clean up the mess and bring the order to tags, attributes and ordinary text.
  • 4. Well-known Parser • Access the information using standard XML interfaces. • HtmlCleaner • HtmlParser • Nekohtml
  • 5. Parser inner structure • HTML scanner – Pre-processing action • Tag balancer – Reorders individual elements – Produces well-formed XML • Extraction • Transformation
  • 7. Extraction • Text extraction – for use as input for text search engine databases for example • Link extraction – for crawling through web pages or harvesting email addresses • Screen scraping – for programmatic data input from web pages
  • 8. Extraction • Resource extraction – collecting images or sound • A browser front end – the preliminary stage of page display • Link checking – ensuring links are valid • Site monitoring – checking for page differences beyond simplistic diffs
  • 9. Transformation • URL rewriting – modifying some or all links on a page • Site capture – moving content from the web to local disk • Censorship – removing offending words and phrases from pages
  • 10. Transformation • HTML cleanup – correcting erroneous pages • AD removal – excising URLs referencing advertising • Conversion to XML – moving existing web pages to XML
  • 11. Practice • Feed Crawler – HTML • Bloglines, Feedage – XML • RssMountain – JSON • Google AJAX Feed API • Prototype – Demo
  • 12. Conclusion • Page search, image search, news search, blog search, feed search ... • Fault toleration of text processing • Text mining in web • Q&A