SlideShare ist ein Scribd-Unternehmen logo
1 von 81
Downloaden Sie, um offline zu lesen
Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa   [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8
[object Object],[object Object]
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
Web pages  are built using text-based mark-up languages ( HTML  and  XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus,  screen scrapers  were reborn in the web era to  extract machine-friendly data from HTML  and other markup.  http://en.wikipedia.org/wiki/Screen_scraping
[object Object],[object Object]
 
 
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
 
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br />
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
[object Object]
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
WWW::Mixi 0.50
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
<span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Shibuya
<span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl  –MHTML::Entities  –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print  decode_entities ($1)' I  ♥  Shibuya
<span class=&quot;message&quot;>Perl が大好き! </span> > perl –MHTML::Entities  –MEncode  –e  '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@  and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. Perl が大好き!
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
[object Object],[object Object],[object Object]
CSS Selectors ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath =  selector_to_xpath  &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object],[object Object],[object Object]
Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used:  <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong>  <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua  = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree  = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node  = $tree->findnodes($xpath)->shift; print $node->as_text;
Example (after) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Basics ,[object Object],[object Object],[object Object],[object Object],[object Object]
process ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot; http://vienna.openguides.org/ &quot;>OpenGuides</a></li> <li><a href=&quot; http://vienna.yapceurope.org/ &quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;> OpenGuides </a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;> YAPC::Europe </a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],<ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
result ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],my $s = scraper { process …; process …; result 'foo', 'bar'; };
[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object],<span class=&quot;entry-date&quot;>October 1st, 2007 17:13:31 +0900</span> process &quot;.entry-date&quot;, date => 'TEXT :rfc822 ';
[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object],[object Object]
[object Object]
[object Object],[object Object],[object Object]

Weitere ähnliche Inhalte

Was ist angesagt?

Ultra fast web development with sinatra
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatraSérgio Santos
 
Javascript - The Stack and Beyond
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and BeyondAll Things Open
 
Stop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyStop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyAll Things Open
 
Browser Extensions for Web Hackers
Browser Extensions for Web HackersBrowser Extensions for Web Hackers
Browser Extensions for Web HackersMark Wubben
 
今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?Sho Yoshida
 
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App EngineRotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App Enginegeehwan
 
Eve - REST API for Humans™
Eve - REST API for Humans™Eve - REST API for Humans™
Eve - REST API for Humans™Nicola Iarocci
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0IBM Cloud Data Services
 
HPPG - high performance photo gallery
HPPG - high performance photo galleryHPPG - high performance photo gallery
HPPG - high performance photo galleryRemigijus Kiminas
 
Android webservices
Android webservicesAndroid webservices
Android webservicesKrazy Koder
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES SystemsChris Birchall
 
Non-Relational Databases
Non-Relational DatabasesNon-Relational Databases
Non-Relational Databaseskchodorow
 
yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909Yusuke Wada
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Odoo
 

Was ist angesagt? (20)

Ultra fast web development with sinatra
Ultra fast web development with sinatraUltra fast web development with sinatra
Ultra fast web development with sinatra
 
CouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: MangoCouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: Mango
 
Javascript - The Stack and Beyond
Javascript - The Stack and BeyondJavascript - The Stack and Beyond
Javascript - The Stack and Beyond
 
Stop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case StudyStop Worrying & Love the SQL - A Case Study
Stop Worrying & Love the SQL - A Case Study
 
Browser Extensions for Web Hackers
Browser Extensions for Web HackersBrowser Extensions for Web Hackers
Browser Extensions for Web Hackers
 
今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?
 
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App EngineRotzy - Building an iPhone Photo Sharing App on Google App Engine
Rotzy - Building an iPhone Photo Sharing App on Google App Engine
 
Eve - REST API for Humans™
Eve - REST API for Humans™Eve - REST API for Humans™
Eve - REST API for Humans™
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
HPPG - high performance photo gallery
HPPG - high performance photo galleryHPPG - high performance photo gallery
HPPG - high performance photo gallery
 
Android webservices
Android webservicesAndroid webservices
Android webservices
 
Debugging and Testing ES Systems
Debugging and Testing ES SystemsDebugging and Testing ES Systems
Debugging and Testing ES Systems
 
Nodejs meetup-12-2-2015
Nodejs meetup-12-2-2015Nodejs meetup-12-2-2015
Nodejs meetup-12-2-2015
 
Non-Relational Databases
Non-Relational DatabasesNon-Relational Databases
Non-Relational Databases
 
Analyse Yourself
Analyse YourselfAnalyse Yourself
Analyse Yourself
 
CouchDB Day NYC 2017: JSON Documents
CouchDB Day NYC 2017: JSON DocumentsCouchDB Day NYC 2017: JSON Documents
CouchDB Day NYC 2017: JSON Documents
 
yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909yusukebe in Yokohama.pm 090909
yusukebe in Yokohama.pm 090909
 
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
BigDataCloud meetup - July 8th - Cost effective big-data processing using Ama...
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...
 

Ähnlich wie Web Scraper Shibuya.pm tech talk #8

Ähnlich wie Web Scraper Shibuya.pm tech talk #8 (20)

Web::Scraper
Web::ScraperWeb::Scraper
Web::Scraper
 
Introduction To Lamp
Introduction To LampIntroduction To Lamp
Introduction To Lamp
 
A brief history of the web
A brief history of the webA brief history of the web
A brief history of the web
 
PHP Presentation
PHP PresentationPHP Presentation
PHP Presentation
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
 
Php 3 1
Php 3 1Php 3 1
Php 3 1
 
Html5 Overview
Html5 OverviewHtml5 Overview
Html5 Overview
 
Forum Presentation
Forum PresentationForum Presentation
Forum Presentation
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perl
 
Ubi comp27nov04
Ubi comp27nov04Ubi comp27nov04
Ubi comp27nov04
 
introduction to web technology
introduction to web technologyintroduction to web technology
introduction to web technology
 
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
10 Things You're Not Doing [IBM Lotus Notes Domino Application Development]
 
PHP Presentation
PHP PresentationPHP Presentation
PHP Presentation
 
RESTful design
RESTful designRESTful design
RESTful design
 
WordPress APIs
WordPress APIsWordPress APIs
WordPress APIs
 
Mojolicious on Steroids
Mojolicious on SteroidsMojolicious on Steroids
Mojolicious on Steroids
 
Lecture1 B Frames&Forms
Lecture1 B  Frames&FormsLecture1 B  Frames&Forms
Lecture1 B Frames&Forms
 
SlideShare Instant
SlideShare InstantSlideShare Instant
SlideShare Instant
 
SlideShare Instant
SlideShare InstantSlideShare Instant
SlideShare Instant
 

Mehr von Tatsuhiko Miyagawa

Carton CPAN dependency manager
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency managerTatsuhiko Miyagawa
 
Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Tatsuhiko Miyagawa
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversTatsuhiko Miyagawa
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Asynchronous programming with AnyEvent
Asynchronous programming with AnyEventAsynchronous programming with AnyEvent
Asynchronous programming with AnyEventTatsuhiko Miyagawa
 
Building a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Tatsuhiko Miyagawa
 
20 modules i haven't yet talked about
20 modules i haven't yet talked about20 modules i haven't yet talked about
20 modules i haven't yet talked aboutTatsuhiko Miyagawa
 

Mehr von Tatsuhiko Miyagawa (20)

Carton CPAN dependency manager
Carton CPAN dependency managerCarton CPAN dependency manager
Carton CPAN dependency manager
 
Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011Deploying Plack Web Applications: OSCON 2011
Deploying Plack Web Applications: OSCON 2011
 
Plack at OSCON 2010
Plack at OSCON 2010Plack at OSCON 2010
Plack at OSCON 2010
 
cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010cpanminus at YAPC::NA 2010
cpanminus at YAPC::NA 2010
 
Plack at YAPC::NA 2010
Plack at YAPC::NA 2010Plack at YAPC::NA 2010
Plack at YAPC::NA 2010
 
PSGI/Plack OSDC.TW
PSGI/Plack OSDC.TWPSGI/Plack OSDC.TW
PSGI/Plack OSDC.TW
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and servers
 
Plack - LPW 2009
Plack - LPW 2009Plack - LPW 2009
Plack - LPW 2009
 
Tatsumaki
TatsumakiTatsumaki
Tatsumaki
 
Intro to PSGI and Plack
Intro to PSGI and PlackIntro to PSGI and Plack
Intro to PSGI and Plack
 
CPAN Realtime feed
CPAN Realtime feedCPAN Realtime feed
CPAN Realtime feed
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Asynchronous programming with AnyEvent
Asynchronous programming with AnyEventAsynchronous programming with AnyEvent
Asynchronous programming with AnyEvent
 
Building a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQueryBuilding a desktop app with HTTP::Engine, SQLite and jQuery
Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Remedie OSDC.TW
Remedie OSDC.TWRemedie OSDC.TW
Remedie OSDC.TW
 
Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008Why Open Matters It Pro Challenge 2008
Why Open Matters It Pro Challenge 2008
 
20 modules i haven't yet talked about
20 modules i haven't yet talked about20 modules i haven't yet talked about
20 modules i haven't yet talked about
 
XML::Liberal
XML::LiberalXML::Liberal
XML::Liberal
 
Test::Base
Test::BaseTest::Base
Test::Base
 
Hacking Vox and Plagger
Hacking Vox and PlaggerHacking Vox and Plagger
Hacking Vox and Plagger
 

Kürzlich hochgeladen

Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 

Kürzlich hochgeladen (20)

Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 

Web Scraper Shibuya.pm tech talk #8

  • 1. Practical Web Scraping with Web::Scraper Tatsuhiko Miyagawa [email_address] Six Apart, Ltd. / Shibuya Perl Mongers Shibuya.pm Tech Talks #8
  • 2.
  • 3. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 4. Web pages are built using text-based mark-up languages ( HTML and XHTML ), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human consumption, and frequently mix content with presentation. Thus, screen scrapers were reborn in the web era to extract machine-friendly data from HTML and other markup. http://en.wikipedia.org/wiki/Screen_scraping
  • 5.
  • 6.  
  • 7.  
  • 8.
  • 9.
  • 10.
  • 11.  
  • 12. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br />
  • 13. <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 14.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. <span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print $1' I &hearts; Shibuya
  • 24. <span class=&quot;message&quot;>I &hearts; Shibuya</span> > perl –MHTML::Entities –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities ($1)' I ♥ Shibuya
  • 25. <span class=&quot;message&quot;>Perl が大好き! </span> > perl –MHTML::Entities –MEncode –e '$c =~ m@<span class=&quot;message&quot;>(.*?)</span>@ and print decode_entities( decode_utf8 ($1))' Wide character in print at –e line 1. Perl が大好き!
  • 26.
  • 27.
  • 28.
  • 29.
  • 30. XPath <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); print $tree->findnodes ('//strong[@id=&quot;ctu&quot;]') ->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 31.
  • 32.
  • 33.
  • 34. CSS Selectors <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath &quot;strong#ctu&quot;; print $tree->findnodes($xpath)->shift->as_text; # Monday, August 27, 2007 at 12:49:46
  • 35. Complete Script #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 36.
  • 37. Exmaple (before) <td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id=&quot;ctu&quot;>Monday, August 27, 2007 at 12:49:46</strong> <br /> > perl -MLWP::Simple -le '$c = get(&quot;http://timeanddate.com/worldclock/&quot;); $c =~ m@<strong id=&quot;ctu&quot;>(.*?)</strong>@ and print $1' Monday, August 27, 2007 at 12:49:46
  • 38. Example (after) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 39.
  • 40.
  • 41.
  • 42. Example (before) #!/usr/bin/perl use strict; use warnings; use Encode; use LWP::UserAgent; use HTTP::Response::Encoding; use HTML::TreeBuilder::XPath; use HTML::Selector::XPath qw(selector_to_xpath); my $ua = LWP::UserAgent->new; my $res = $ua->get(&quot;http://www.timeanddate.com/worldclock/&quot;); if ($res->is_error) { die &quot;HTTP GET error: &quot;, $res->status_line; } my $content = decode $res->encoding, $res->content; my $tree = HTML::TreeBuilder::XPath->new_from_content($content); my $xpath = selector_to_xpath(&quot;strong#ctu&quot;); my $node = $tree->findnodes($xpath)->shift; print $node->as_text;
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49. <ul class=&quot;sites&quot;> <li><a href=&quot;http://vienna.openguides.org/&quot;>OpenGuides</a></li> <li><a href=&quot;http://vienna.yapceurope.org/&quot;>YAPC::Europe</a></li> </ul>
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.