Web::Scraper

Practical Web Scraping with Web::Scraper
Tatsuhiko Miyagawa [email_address]
Six Apart, Ltd. / Shibuya Perl Mongers
YAPC::Europe 2007 Vienna

abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk 
Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal


Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.
http://en.wikipedia.org/wiki/Screen_scraping


<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46


I &hearts; Vienna
> perl -e '$c =~ m@(.*?)@ and print $1'
I &hearts; Vienna

I &hearts; Vienna
> perl -MHTML::Entities -e '$c =~ m@(.*?)@ and print decode_entities($1)'
I ♥ Vienna

ウィーンが大好き！
> perl -MHTML::Entities -MEncode -e '$c =~ m@(.*?)@ and print decode_entities(decode_utf8($1))'
Wide character in print at -e line 1.
ウィーンが大好き！
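The "Wide character in print" warning above appears because decoded characters are printed to a handle that has no output encoding. A minimal sketch of one common way to handle this, assuming UTF-8 input bytes and a UTF-8 terminal (the sample string is made up for illustration): decode at the input boundary, work with characters, and encode again at the output boundary.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Entities qw(decode_entities);

# Hypothetical input: UTF-8 bytes as they would arrive from the network,
# here "I &hearts; " followed by four katakana characters.
my $bytes = "I &hearts; \xe3\x82\xa6\xe3\x82\xa3\xe3\x83\xbc\xe3\x83\xb3";

# Step 1: bytes -> characters. Step 2: entity references -> characters.
my $text = decode_entities(decode_utf8($bytes));

# Step 3: declare the output encoding so print can turn characters back
# into bytes (assumes the terminal expects UTF-8).
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

With the output layer in place, print no longer has to guess how to serialize characters above U+00FF, so the warning goes away.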

XPath

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
print $tree->findnodes('//strong[@id="ctu"]')->shift->as_text;
# Monday, August 27, 2007 at 12:49:46


CSS Selectors

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath "strong#ctu";
print $tree->findnodes($xpath)->shift->as_text;
# Monday, August 27, 2007 at 12:49:46

Complete Script

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use LWP::UserAgent;
use HTTP::Response::Encoding;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

my $ua = LWP::UserAgent->new;
my $res = $ua->get("http://www.timeanddate.com/worldclock/");
if ($res->is_error) {
    die "HTTP GET error: ", $res->status_line;
}

my $content = decode $res->encoding, $res->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my $xpath = selector_to_xpath("strong#ctu");
my $node = $tree->findnodes($xpath)->shift;
print $node->as_text;

Example (before)

<td>Current UTC (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46

Example (after): the complete script shown above under "Complete Script".

Example (before): the complete script shown above under "Complete Script".

Example (after)
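A minimal sketch of how the same extraction could be written with Web::Scraper, assuming the strong#ctu element from the earlier fragment. It scrapes an inline HTML string (a hand-made copy of that fragment, not the live page) so that it runs offline; passing a URI object to scrape would fetch the page over HTTP instead.

```perl
use strict;
use warnings;
use Web::Scraper;

# Inline stand-in for the page; the real markup may differ.
my $html = '<table><tr><td>Current UTC (or GMT/Zulu)-time used: '
         . '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>'
         . '</td></tr></table>';

my $clock = scraper {
    # CSS selector on the left, stash key and what to extract on the right.
    process 'strong#ctu', 'time' => 'TEXT';
    # Return just that one value instead of the whole hash.
    result 'time';
};

my $time = $clock->scrape($html);
print "$time\n";
```

Fetching, decoding, tree building, and selector translation are handled inside scrape, which is why the script shrinks to the selector and the field name.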

Basics

process


<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
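A sketch of how process might be applied to the list above, assuming the usual Web::Scraper conventions: a plain key stores the first match, a key ending in [] collects every match, '@href' reads an attribute, 'TEXT' reads the text content, and a nested scraper produces one hash per matched node. The key names (first_link, urls, names, sites, link, name) are made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $sites = scraper {
    # Plain key: only the first matching node is stored.
    process 'ul.sites > li > a', 'first_link' => '@href';

    # Keys ending in []: every match is pushed onto an array reference.
    process 'ul.sites > li > a', 'urls[]' => '@href', 'names[]' => 'TEXT';

    # Nested scraper: each <li> becomes a hash with its own fields.
    process 'ul.sites > li', 'sites[]' => scraper {
        process 'a', 'link' => '@href', 'name' => 'TEXT';
    };
};

my $res = $sites->scrape($html);
print "$_\n" for @{ $res->{names} };
print "$_->{name}: $_->{link}\n" for @{ $res->{sites} };
```

The parallel arrays are convenient for a single field; the nested form keeps each link and its label together, which matters once there are several fields per item.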


result

my $s = scraper {
    process …;
    process …;
    result 'foo', 'bar';
};
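To make the effect of result concrete, here is a small sketch comparing three scrapers over the same made-up page: one without result, one naming a single key, and one naming two keys. The HTML and the key names (title, heading, venue) are hypothetical.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical page used only for this comparison.
my $html = '<html><head><title>Conference</title></head>'
         . '<body><h1>YAPC::Europe 2007</h1><p class="venue">Vienna</p></body></html>';

# No result: scrape returns a hash reference with every stored key.
my $all = scraper {
    process 'title',   'title'   => 'TEXT';
    process 'h1',      'heading' => 'TEXT';
    process 'p.venue', 'venue'   => 'TEXT';
};

# One key: scrape returns that value directly.
my $one = scraper {
    process 'h1', 'heading' => 'TEXT';
    result 'heading';
};

# Several keys: scrape returns a hash reference limited to those keys.
my $pair = scraper {
    process 'title',   'title'   => 'TEXT';
    process 'h1',      'heading' => 'TEXT';
    process 'p.venue', 'venue'   => 'TEXT';
    result 'title', 'heading';
};

my $everything = $all->scrape($html);
my $heading    = $one->scrape($html);
my $subset     = $pair->scrape($html);

print join(', ', sort keys %$everything), "\n";
print "$heading\n";
print join(', ', sort keys %$subset), "\n";
```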

Thumbnail URLs on Flickr set

<a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
<img alt="Cal Henderson" class="photo fn" height="24" id="profile-image" src="http://assets0.twitter.com/…/mini/buddyicon.gif" width="24" /></a>
…

Twitter Friends

Twitter Friends (complex)
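A sketch of how the friend markup shown earlier might be scraped, first as parallel arrays and then as one hash per friend. This is an illustration under assumptions rather than the code from the talk: the wrapping div and the icon URL are hypothetical placeholders, and the key names are made up. The richer form uses a code reference, which receives the matched HTML::Element.

```perl
use strict;
use warnings;
use Web::Scraper;

# Markup modelled on the fragment above; the icon URL is a placeholder.
my $html = <<'HTML';
<div>
<a href="http://twitter.com/iamcal" class="url" rel="contact" title="Cal Henderson">
<img alt="Cal Henderson" class="photo fn" height="24" id="profile-image"
     src="http://example.com/buddyicon.gif" width="24" /></a>
</div>
HTML

my $friends = scraper {
    # Simple form: parallel arrays of names and profile URLs.
    process 'a[rel~="contact"]', 'names[]' => '@title', 'urls[]' => '@href';

    # Richer form: build one hash per friend from the matched <a> element.
    process 'a[rel~="contact"]', 'friends[]' => sub {
        my $link = shift;
        my $img  = $link->look_down(_tag => 'img');
        return {
            name => $link->attr('title'),
            url  => $link->attr('href'),
            icon => $img ? $img->attr('src') : undef,
        };
    };
};

my $res = $friends->scrape($html);
printf "%s <%s>\n", $_->{name}, $_->{url} for @{ $res->{friends} };
```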

