The document discusses practical web scraping in Perl with the Web::Scraper module. It first scrapes the current UTC time from a website using a regular expression, then refactors the example to use Web::Scraper for a more robust and maintainable approach. Key advantages of Web::Scraper are that it selects elements with CSS selectors and XPath, which is less fragile than matching raw markup, and that it handles HTML entities and character encodings properly.
9. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
30. <span class="message">I &hearts; Vienna</span>
    > perl -MHTML::Entities -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities($1)'
    I ♥ Vienna
31. <span class="message"> ウィーンが大好き! </span>
    > perl -MHTML::Entities -MEncode -e '$c =~ m@<span class="message">(.*?)</span>@ and print decode_entities(decode_utf8($1))'
    Wide character in print at -e line 1.
    ウィーンが大好き!
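The "Wide character in print" warning above appears because decode_utf8 turns the fetched bytes into a Perl character string, and print then has no encoding layer to turn the characters back into bytes. The following is a minimal sketch of that round trip using only the core Encode module. The byte string is a hypothetical stand-in for content returned by an HTTP client, not the actual page.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw UTF-8 bytes standing in for fetched content (hypothetical sample).
# These twelve bytes spell the four characters "ウィーン".
my $bytes = "\xE3\x82\xA6\xE3\x82\xA3\xE3\x83\xBC\xE3\x83\xB3";

# Decode bytes into characters so that length() and regexes count characters.
my $chars = decode_utf8($bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encode explicitly on the way out. Printing $chars directly to a handle
# without an encoding layer is what triggers "Wide character in print".
my $out = encode_utf8($chars);
print $out, "\n";
```

An alternative with the same effect is to set an output layer once with binmode(STDOUT, ':encoding(UTF-8)') and print the character string directly.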
36. XPath
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
40. CSS Selectors
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath "strong#ctu";
    print $tree->findnodes($xpath)->shift->as_text;
    # Monday, August 27, 2007 at 12:49:46
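To see what selector_to_xpath does with a selector, it can help to print the translation. The sketch below feeds a few selectors through the same function used on the slide. Only strong#ctu comes from the slides; the other two selectors are illustrative assumptions. The exact XPath strings may differ between module versions, so no expected output is shown.

```perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

# A few CSS selectors to translate. Only the first appears in the slides;
# the other two are hypothetical examples.
my @selectors = ('strong#ctu', 'span.message', 'td > strong');

# Map each selector to the XPath expression the module generates for it.
my %xpath_for = map { $_ => selector_to_xpath($_) } @selectors;

printf "%-14s => %s\n", $_, $xpath_for{$_} for @selectors;
```

The resulting strings can be passed to findnodes in the same way as a hand-written XPath expression.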
41. Complete Script
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    # Fetch the page and stop on any HTTP error.
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    # Decode the body using the encoding detected from the response.
    my $content = decode $res->encoding, $res->content;

    # Parse the HTML and select the element with a CSS selector.
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;
43. Example (before)
    <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
    <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

    > perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
    Monday, August 27, 2007 at 12:49:46
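The regular expression in the one-liner above depends on the exact spelling of the markup. The sketch below runs that same pattern against a few variants of the tag that a browser would treat as equivalent. The variants are hypothetical, made up to illustrate the fragility; they are not taken from the real page.

```perl
use strict;
use warnings;

# The pattern from the one-liner.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Hypothetical variants of the same element.
my %variant = (
    original      => q{<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>},
    single_quotes => q{<strong id='ctu'>Monday, August 27, 2007 at 12:49:46</strong>},
    extra_attr    => q{<strong class="time" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>},
    uppercase_tag => q{<STRONG ID="ctu">Monday, August 27, 2007 at 12:49:46</STRONG>},
);

# Record which variants the pattern still matches.
my %matched;
for my $name (sort keys %variant) {
    $matched{$name} = $variant{$name} =~ $re ? 1 : 0;
    printf "%-14s %s\n", $name, $matched{$name} ? "matched" : "missed";
}
```

Only the original spelling matches. A parser-based selector such as strong#ctu does not depend on quoting style, attribute order, or tag case.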
44. Example (after)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    # Fetch the page and stop on any HTTP error.
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    # Decode the body using the encoding detected from the response.
    my $content = decode $res->encoding, $res->content;

    # Parse the HTML and select the element with a CSS selector.
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;
48. Example (before)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Response::Encoding;
    use HTML::TreeBuilder::XPath;
    use HTML::Selector::XPath qw(selector_to_xpath);

    # Fetch the page and stop on any HTTP error.
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.timeanddate.com/worldclock/");
    if ($res->is_error) {
        die "HTTP GET error: ", $res->status_line;
    }

    # Decode the body using the encoding detected from the response.
    my $content = decode $res->encoding, $res->content;

    # Parse the HTML and select the element with a CSS selector.
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my $xpath = selector_to_xpath("strong#ctu");
    my $node  = $tree->findnodes($xpath)->shift;
    print $node->as_text;
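For comparison with the script above, here is a minimal sketch of how the same extraction might look with Web::Scraper. It is not the author's version. The inline HTML is the sample markup from the earlier slides, used so the sketch runs without network access, and the result key utc_time is an arbitrary name chosen for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Inline sample markup from the slides, so the sketch runs offline.
my $html = <<'HTML';
<html><body><table><tr>
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:
<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br /></td>
</tr></table></body></html>
HTML

# Declare what to extract: the text of the element matching strong#ctu,
# stored under the arbitrarily named key "utc_time".
my $scraper = scraper {
    process 'strong#ctu', utc_time => 'TEXT';
};

# scrape() accepts an HTML string here; it returns a hash reference.
my $result = $scraper->scrape($html);
print $result->{utc_time}, "\n";
```

To scrape the live page instead, scrape can be given a URI object such as URI->new("http://www.timeanddate.com/worldclock/"), in which case Web::Scraper performs the fetch and decoding itself.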