SlideShare ist ein Scribd-Unternehmen logo
1 von 79
Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa   [email_address] Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna
[object Object]
[object Object]
abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
 
http://code.sixapart.com/
 
[object Object],[object Object]
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus,  screen scrapers  were reborn in the web era to  extract machine-friendly data from HTML  and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
[object Object],[object Object]
 
 
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
 
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br />
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
[object Object]
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
<span class=&quot;message&quot;>I &hearts; Vienna</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Vienna
<span class=&quot;message&quot;>I &hearts; Vienna</span> > perl  –MHTML::Entities  –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print  decode_entities ($1)' I  ♥  Vienna
<span class=&quot;message&quot;> ウィーンが大好き! </span> > perl –MHTML::Entities  –MEncode  –e  '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. ウィーンが大好き!
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
[object Object],[object Object],[object Object]
CSS Selectors ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath =  selector_to_xpath  &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object],[object Object],[object Object]
Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
Example (after) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Basics ,[object Object],[object Object],[object Object],[object Object],[object Object]
process ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot; http://vienna.openguides.org/ &quot;>OpenGuides</a></li> <li><a href=&quot; http://vienna.yapceurope.org/ &quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;> OpenGuides </a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;> YAPC::Europe </a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
result ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],my $s = scraper { process …; process …; result 'foo', 'bar'; };
[object Object]
 
Thumbnail URLs on Flickr set ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
 
<span class=&quot;vcard&quot;> <a href=&quot;http://twitter.com/iamcal&quot; class=&quot;url&quot; rel=&quot;contact&quot; title=&quot;Cal Henderson&quot;> <img alt=&quot;Cal Henderson&quot; class=&quot;photo fn&quot; height=&quot;24&quot;  id=&quot;profile-image&quot; src=&quot;http://assets0.twitter.com/…/mini/buddyicon.gif&quot; width=&quot;24&quot; /></a> </span> <span class=&quot;vcard&quot;> … </span>
Twitter Friends ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Twitter Friends (complex) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]

Weitere ähnliche Inhalte

Was ist angesagt?

LCA2014 - Introduction to Go
LCA2014 - Introduction to GoLCA2014 - Introduction to Go
LCA2014 - Introduction to Godreamwidth
 
Ruby HTTP clients comparison
Ruby HTTP clients comparisonRuby HTTP clients comparison
Ruby HTTP clients comparisonHiroshi Nakamura
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchRafał Kuć
 
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018Thijs Feryn
 
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018Codemotion
 
Introduction to performance tuning perl web applications
Introduction to performance tuning perl web applicationsIntroduction to performance tuning perl web applications
Introduction to performance tuning perl web applicationsPerrin Harkins
 
A reviravolta do desenvolvimento web
A reviravolta do desenvolvimento webA reviravolta do desenvolvimento web
A reviravolta do desenvolvimento webWallace Reis
 
Socket programming with php
Socket programming with phpSocket programming with php
Socket programming with phpElizabeth Smith
 
Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Thijs Feryn
 
Preparing your web services for Android and your Android app for web services...
Preparing your web services for Android and your Android app for web services...Preparing your web services for Android and your Android app for web services...
Preparing your web services for Android and your Android app for web services...Droidcon Eastern Europe
 
Android webservices
Android webservicesAndroid webservices
Android webservicesKrazy Koder
 
Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Thijs Feryn
 
Perl: Hate it for the Right Reasons
Perl: Hate it for the Right ReasonsPerl: Hate it for the Right Reasons
Perl: Hate it for the Right ReasonsMatt Follett
 
Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.Workhorse Computing
 

Was ist angesagt? (20)

LCA2014 - Introduction to Go
LCA2014 - Introduction to GoLCA2014 - Introduction to Go
LCA2014 - Introduction to Go
 
Ruby HTTP clients comparison
Ruby HTTP clients comparisonRuby HTTP clients comparison
Ruby HTTP clients comparison
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
 
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
 
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
 
Introduction to performance tuning perl web applications
Introduction to performance tuning perl web applicationsIntroduction to performance tuning perl web applications
Introduction to performance tuning perl web applications
 
Triple Blitz Strike
Triple Blitz StrikeTriple Blitz Strike
Triple Blitz Strike
 
AJAX Transport Layer
AJAX Transport LayerAJAX Transport Layer
AJAX Transport Layer
 
A reviravolta do desenvolvimento web
A reviravolta do desenvolvimento webA reviravolta do desenvolvimento web
A reviravolta do desenvolvimento web
 
Socket programming with php
Socket programming with phpSocket programming with php
Socket programming with php
 
Lies, Damn Lies, and Benchmarks
Lies, Damn Lies, and BenchmarksLies, Damn Lies, and Benchmarks
Lies, Damn Lies, and Benchmarks
 
Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018
 
B03-GenomeContent-Intermine
B03-GenomeContent-IntermineB03-GenomeContent-Intermine
B03-GenomeContent-Intermine
 
Preparing your web services for Android and your Android app for web services...
Preparing your web services for Android and your Android app for web services...Preparing your web services for Android and your Android app for web services...
Preparing your web services for Android and your Android app for web services...
 
Android webservices
Android webservicesAndroid webservices
Android webservices
 
Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018
 
Perl: Hate it for the Right Reasons
Perl: Hate it for the Right ReasonsPerl: Hate it for the Right Reasons
Perl: Hate it for the Right Reasons
 
Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.
 
On Centralizing Logs
On Centralizing LogsOn Centralizing Logs
On Centralizing Logs
 
Analyse Yourself
Analyse YourselfAnalyse Yourself
Analyse Yourself
 

Andere mochten auch

Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingMichelle Minkoff
 
Web Scraping and Its Business Benefits
Web Scraping and Its Business BenefitsWeb Scraping and Its Business Benefits
Web Scraping and Its Business BenefitsEminenture
 
When RSS Fails: Web Scraping with HTTP
When RSS Fails: Web Scraping with HTTPWhen RSS Fails: Web Scraping with HTTP
When RSS Fails: Web Scraping with HTTPMatthew Turland
 
Java Web Scraping
Java Web ScrapingJava Web Scraping
Java Web ScrapingSumant Raja
 
Marina Grigorian - Portfolio
Marina Grigorian - PortfolioMarina Grigorian - Portfolio
Marina Grigorian - PortfolioMarina Grigorian
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documentsTommy Tavenner
 
Pivotingskyscrapers
PivotingskyscrapersPivotingskyscrapers
Pivotingskyscrapersbengermo1950
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
Bucket Wheel Excavator meets D8r dozer
Bucket Wheel Excavator meets D8r dozerBucket Wheel Excavator meets D8r dozer
Bucket Wheel Excavator meets D8r dozerArun Kumar
 
Birth of skyscrapers
Birth of skyscrapersBirth of skyscrapers
Birth of skyscrapersAbhiniti Garg
 
Scraper ripper-grader-dozer
Scraper ripper-grader-dozerScraper ripper-grader-dozer
Scraper ripper-grader-dozerSATYANARAYANA I
 

Andere mochten auch (20)

Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without Programming
 
Scraper
ScraperScraper
Scraper
 
Web Scraping and Its Business Benefits
Web Scraping and Its Business BenefitsWeb Scraping and Its Business Benefits
Web Scraping and Its Business Benefits
 
Relevance Assessment Tool
Relevance Assessment ToolRelevance Assessment Tool
Relevance Assessment Tool
 
When RSS Fails: Web Scraping with HTTP
When RSS Fails: Web Scraping with HTTPWhen RSS Fails: Web Scraping with HTTP
When RSS Fails: Web Scraping with HTTP
 
Whereismy Dozer
Whereismy DozerWhereismy Dozer
Whereismy Dozer
 
Java Web Scraping
Java Web ScrapingJava Web Scraping
Java Web Scraping
 
Marina Grigorian - Portfolio
Marina Grigorian - PortfolioMarina Grigorian - Portfolio
Marina Grigorian - Portfolio
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
 
Pivotingskyscrapers
PivotingskyscrapersPivotingskyscrapers
Pivotingskyscrapers
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
Scraper
ScraperScraper
Scraper
 
Bucket Wheel Excavator meets D8r dozer
Bucket Wheel Excavator meets D8r dozerBucket Wheel Excavator meets D8r dozer
Bucket Wheel Excavator meets D8r dozer
 
Skyscraper
SkyscraperSkyscraper
Skyscraper
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Skyscrapers
SkyscrapersSkyscrapers
Skyscrapers
 
Birth of skyscrapers
Birth of skyscrapersBirth of skyscrapers
Birth of skyscrapers
 
Scraper ripper-grader-dozer
Scraper ripper-grader-dozerScraper ripper-grader-dozer
Scraper ripper-grader-dozer
 
Using Rss
Using RssUsing Rss
Using Rss
 

Ähnlich wie Web::Scraper

非同期処理の通知処理 with Tatsumaki
非同期処理の通知処理 with Tatsumaki非同期処理の通知処理 with Tatsumaki
非同期処理の通知処理 with Tatsumakikeroyonn
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate WorksGoro Fuji
 
Teflon - Anti Stick for the browser attack surface
Teflon - Anti Stick for the browser attack surfaceTeflon - Anti Stick for the browser attack surface
Teflon - Anti Stick for the browser attack surfaceSaumil Shah
 
Introduction To Lamp
Introduction To LampIntroduction To Lamp
Introduction To LampAmzad Hossain
 
Implementing Comet using PHP
Implementing Comet using PHPImplementing Comet using PHP
Implementing Comet using PHPKing Foo
 
루비가 얼랭에 빠진 날
루비가 얼랭에 빠진 날루비가 얼랭에 빠진 날
루비가 얼랭에 빠진 날Sukjoon Kim
 
Jade & Javascript templating
Jade & Javascript templatingJade & Javascript templating
Jade & Javascript templatingwearefractal
 
Accelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET DevelopersAccelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET DevelopersTodd Anglin
 
Ultra fast web development with sinatra
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatraSérgio Santos
 
Ajax to the Moon
Ajax to the MoonAjax to the Moon
Ajax to the Moondavejohnson
 
PHP Presentation
PHP PresentationPHP Presentation
PHP PresentationNikhil Jain
 
Searching the Now
Searching the NowSearching the Now
Searching the Nowlucasjosh
 
Node js presentation
Node js presentationNode js presentation
Node js presentationmartincabrera
 
Even Faster Web Sites at jQuery Conference '09
Even Faster Web Sites at jQuery Conference '09Even Faster Web Sites at jQuery Conference '09
Even Faster Web Sites at jQuery Conference '09Steve Souders
 

Ähnlich wie Web::Scraper (20)

Web::Scraper for SF.pm LT
Web::Scraper for SF.pm LTWeb::Scraper for SF.pm LT
Web::Scraper for SF.pm LT
 
非同期処理の通知処理 with Tatsumaki
非同期処理の通知処理 with Tatsumaki非同期処理の通知処理 with Tatsumaki
非同期処理の通知処理 with Tatsumaki
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
Teflon - Anti Stick for the browser attack surface
Teflon - Anti Stick for the browser attack surfaceTeflon - Anti Stick for the browser attack surface
Teflon - Anti Stick for the browser attack surface
 
WordPress APIs
WordPress APIsWordPress APIs
WordPress APIs
 
Introduction To Lamp
Introduction To LampIntroduction To Lamp
Introduction To Lamp
 
Implementing Comet using PHP
Implementing Comet using PHPImplementing Comet using PHP
Implementing Comet using PHP
 
루비가 얼랭에 빠진 날
루비가 얼랭에 빠진 날루비가 얼랭에 빠진 날
루비가 얼랭에 빠진 날
 
Jade & Javascript templating
Jade & Javascript templatingJade & Javascript templating
Jade & Javascript templating
 
Accelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET DevelopersAccelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
Accelerated Adoption: HTML5 and CSS3 for ASP.NET Developers
 
Ultra fast web development with sinatra
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatra
 
Ajax to the Moon
Ajax to the MoonAjax to the Moon
Ajax to the Moon
 
&lt;img src="xss.com">
&lt;img src="xss.com">&lt;img src="xss.com">
&lt;img src="xss.com">
 
Fav
FavFav
Fav
 
Ajax ons2
Ajax ons2Ajax ons2
Ajax ons2
 
PHP Presentation
PHP PresentationPHP Presentation
PHP Presentation
 
Searching the Now
Searching the NowSearching the Now
Searching the Now
 
Node js presentation
Node js presentationNode js presentation
Node js presentation
 
Even Faster Web Sites at jQuery Conference '09
Even Faster Web Sites at jQuery Conference '09Even Faster Web Sites at jQuery Conference '09
Even Faster Web Sites at jQuery Conference '09
 
Writing Pluggable Software
Writing Pluggable SoftwareWriting Pluggable Software
Writing Pluggable Software
 

Mehr von Tatsuhiko Miyagawa

Carton CPAN dependency manager
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency managerTatsuhiko Miyagawa
 
Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Tatsuhiko Miyagawa
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversTatsuhiko Miyagawa
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Asynchronous programming with AnyEvent
Asynchronous programming with AnyEventAsynchronous programming with AnyEvent
Asynchronous programming with AnyEventTatsuhiko Miyagawa
 
Building a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Tatsuhiko Miyagawa
 
20 modules i haven't yet talked about
20 modules i haven't yet talked about20 modules i haven't yet talked about
20 modules i haven't yet talked aboutTatsuhiko Miyagawa
 

Mehr von Tatsuhiko Miyagawa (20)

Carton CPAN dependency manager
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency manager
 
Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011
 
Plack at OSCON 2010
Plack at OSCON 2010Plack at OSCON 2010
Plack at OSCON 2010
 
cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010
 
Plack at YAPC::NA 2010
Plack at YAPC::NA 2010Plack at YAPC::NA 2010
Plack at YAPC::NA 2010
 
PSGI/Plack OSDC.TW
PSGI/Plack OSDC.TWPSGI/Plack OSDC.TW
PSGI/Plack OSDC.TW
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and servers
 
Plack - LPW 2009
Plack - LPW 2009Plack - LPW 2009
Plack - LPW 2009
 
Tatsumaki
TatsumakiTatsumaki
Tatsumaki
 
Intro to PSGI and Plack
Intro to PSGI and PlackIntro to PSGI and Plack
Intro to PSGI and Plack
 
CPAN Realtime feed
CPAN Realtime feedCPAN Realtime feed
CPAN Realtime feed
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Asynchronous programming with AnyEvent
Asynchronous programming with AnyEventAsynchronous programming with AnyEvent
Asynchronous programming with AnyEvent
 
Building a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Remedie OSDC.TW
Remedie OSDC.TWRemedie OSDC.TW
Remedie OSDC.TW
 
Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008
 
20 modules i haven't yet talked about
20 modules i haven't yet talked about20 modules i haven't yet talked about
20 modules i haven't yet talked about
 
XML::Liberal
XML::LiberalXML::Liberal
XML::Liberal
 
Test::Base
Test::BaseTest::Base
Test::Base
 
Hacking Vox and Plagger
Hacking Vox and PlaggerHacking Vox and Plagger
Hacking Vox and Plagger
 

Kürzlich hochgeladen

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Web::Scraper

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers YAPC::Europe 2007 Vienna
  • 2.
  • 3.
  • 4. abbreviation Acme::Module::Authors Acme::Sneeze Acme::Sneeze::JP Apache::ACEProxy Apache::AntiSpam Apache::Clickable Apache::CustomKeywords Apache::DefaultCharset Apache::GuessCharset Apache::JavaScript::DocumentWrite Apache::No404Proxy Apache::Profiler Apache::Session::CacheAny Apache::Session::Generate::ModUniqueId Apache::Session::Generate::ModUsertrack Apache::Session::PHP Apache::Session::Serialize::YAML Apache::Singleton Apache::StickyQuery Archive::Any::Create Attribute::Profiled Attribute::Protected Attribute::Unimplemented Bundle::Sledge capitalization Catalyst::Plugin::JSONRPC Catalyst::View::Jemplate Catalyst::View::JSON CGI::Untaint::email Class::DBI::AbstractSearch Class::DBI::Extension Class::DBI::Pager Class::DBI::Replication Class::DBI::SQLite Class::DBI::View Class::Trigger Convert::Base32 Convert::DUDE Convert::RACE Date::Japanese::Era Date::Range::Birth Device::KeyStroke::Mobile Dunce::time Email::Find Email::Valid::Loose Encode::JavaScript::UCS Encode::JP::Mobile Encode::Punycode File::Find::Rule::Digest Geo::Coder::Google HTML::Entities::ImodePictogram HTML::RelExtor HTML::ResolveLink HTML::XSSLint HTTP::MobileAgent HTTP::ProxyPAC HTTP::Server::Simple::Authen IDNA::Punycode Inline::Basic Inline::TT JSON::Syck Kwiki::Emoticon Kwiki::Export Kwiki::Footnote Kwiki::OpenSearch Kwiki::OpenSearch::Service Kwiki::TypeKey Kwiki::URLBL Log::Dispatch::Config Log::Dispatch::DBI Mac::Macbinary Mail::Address::MobileJp Mail::ListDetector::Detector::Fml MSIE::MenuExt Net::DAAP::Server::AAC Net::IDN::Nameprep Net::IPAddr::Find Net::YahooMessenger NetAddr::IP::Find PHP::Session plagger Plagger POE::Component::Client::AirTunes POE::Component::YahooMessenger Template::Plugin::Clickable Template::Plugin::Comma Template::Plugin::FillInForm Template::Plugin::HTML::Template Template::Plugin::JavaScript Template::Plugin::MobileAgent Template::Plugin::Shuffle Template::Provider::Encoding Term::Encoding Term::TtyRec Text::Emoticon Text::Emoticon::GoogleTalk Text::Emoticon::MSN Text::Emoticon::Yahoo Text::MessageFormat Time::Duration::ja Time::Duration::Parse Web::Scrape WebService::Bloglines WebService::ChangesXml WebService::Google::Suggest WWW::Baseball::NPB WWW::Blog::Metadata::MobileLinkDiscovery WWW::Blog::Metadata::OpenID WWW::Blog::Metadata::OpenSearch WWW::Cache::Google WWW::OpenSearch XML::Atom XML::Atom::Lifeblog XML::Atom::Stream XML::Liberal
  • 5.  
  • 7.  
  • 8.
  • 9. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 10. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 11.
  • 12.  
  • 13.  
  • 14.
  • 15.
  • 16.
  • 17.  
  • 18. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 19. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 20.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29. <span class=&quot;message&quot;>I &hearts; Vienna</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Vienna
  • 30. <span class=&quot;message&quot;>I &hearts; Vienna</span> > perl –MHTML::Entities –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities ($1)' I ♥ Vienna
  • 31. <span class=&quot;message&quot;> ウィーンが大好き! </span> > perl –MHTML::Entities –MEncode –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. ウィーンが大好き!
  • 32.
  • 33.
  • 34.
  • 35.
  • 36. XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 37.
  • 38.
  • 39.
  • 40. CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 41. Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 42.
  • 43. Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 44. Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 45.
  • 46.
  • 47.
  • 48. Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55. <ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.  
  • 64.
  • 65.  
  • 66. <span class=&quot;vcard&quot;> <a href=&quot;http://twitter.com/iamcal&quot; class=&quot;url&quot; rel=&quot;contact&quot; title=&quot;Cal Henderson&quot;> <img alt=&quot;Cal Henderson&quot; class=&quot;photo fn&quot; height=&quot;24&quot; id=&quot;profile-image&quot; src=&quot;http://assets0.twitter.com/…/mini/buddyicon.gif&quot; width=&quot;24&quot; /></a> </span> <span class=&quot;vcard&quot;> … </span>
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.