SlideShare ist ein Scribd-Unternehmen logo
1 von 50
Web Scraping with  Matthew Turland Acadiana Open Source Group April 30, 2009
What Is It?
Normal Web Browsing
Difference #1: Immediate Audience
Difference #2: Consumption Method
Why Is It Useful?
Data Without Web Services
Integration Testing
Crawlers
With plain text, we give ourselves the  ability to manipulate knowledge, both  manually and programmatically, using  virtually every tool at our disposal. 3.14 The Power of Plain Text, The Pragmatic Programmer
Disadvantages
Potential Lack of Stability
Reverse Engineering Required
More Requests
No Nice Neat Data Package
Step #1: Retrieval
Speaking the Language
The Web We Weave GET / HTTP/1.1 User-Agent: ... HTTP/1.1 200 OK Content-Type: ...
GET   /index.php?foo=bar   HTTP/1.1 <a href= &quot;/index.php?foo=bar&quot; > Index </a> <form method= &quot;post&quot;  action= &quot;/index.php&quot; > <input name= &quot;foo&quot;  value= &quot;bar&quot;  /> </form> POST  /index.php   HTTP/1.1 foo = bar Browsing -> Requests
HTTP/1.1 200 OK Content-Type : image/gif Content-Length:  8558 Responses -> Rendered Elements <img src= &quot;/intl/en_ALL/images/logo.gif&quot;  /> GET   /intl/en_ALL/images/logo.gif   HTTP/1.1 Host:  google.com
Not As Easy As It Looks
Redirections
Referer [sic]
Cookies
User Agent Sniffing
robots.txt
Caching
HTTP Authentication
PHP: Glue for the Web
HTTP Client Libraries PEAR::HTTP_Client pecl_http Zend_Http_Client Streams ,  cURL
Simple Streams Example $uri  =  'http://www.example.com/some/resource' ; $get  = file_get_contents( $uri ); $context  = stream_context_create( array ( 'http'  =>  array ( 'method'  =>  'POST' , 'header'  =>  'Content-Type: '  . 'application/x-www-form-urlencoded' , 'content'  => http_build_query( array ( 'var1'  =>  'value1' , 'var2'  =>  'value2' )) ) ) ); $post  = file_get_contents( $uri , false,  $context );
pecl_http Example $http  = new HttpRequest( $uri ); $http ->enableCookies(); $http ->setMethod(HTTP_METH_POST); $http ->addPostFields( array ( 'var1'  =>  'value1' )); $http ->setOptions( 'useragent'  =>  'PHP '  .  phpversion (), 'referer'  =>  'http://example.com/some/referer' )); $response  =  $http -> send (); $headers  =  $response ->getHeaders(); $body  =  $response ->getBody();
pecl_http Request Pooling $pool  = new HttpRequestPool; foreach  ( $urls  as  $url ) { $request  = new HttpRequest( $url , HTTP_METH_GET); $pool ->attach( $request ); } $pool -> send (); foreach  ( $pool  as  $request ) { echo   $request ->getUrl(), PHP_EOL; echo   $request ->getResponseBody(), PHP_EOL; }
HTTP Resources ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Step #2:Analysis
Tidy Extension $config  =  array ( 'output-xhtml'  => true); $tidy  = tidy_parse_string( $markupString ,  $config ); $tidy  = tidy_parse_file( $markupFilePath ,  $config ); $output  = tidy_get_output( $tidy );
DOM Extension $doc  = new DOMDocument; $doc ->loadHTML( $htmlString ); $doc ->loadHTMLFile( $htmlFilePath ); $listItems  =  $doc ->getElementsByTagName( 'li' ); $xpath  = new DOMXPath( $doc ); $listItems  =  $xpath ->query( '//ul/li' ); foreach  ( $listItems  as  $listItem ) { echo   $listItem ->nodeValue, PHP_EOL; }
SimpleXML Extension $sxe  = new SimpleXMLElement( $markupString ); $sxe  = new SimpleXMLElement( $filePath , null, true); echo   $sxe ->body->ul->li[0], PHP_EOL; $children  =  $sxe ->body->ul->li; $children  =  $sxe ->body->ul->children(); foreach  ( $children  as  $li ) { echo   $li , PHP_EOL; } echo   $sxe ->body->ul[ 'id' ]; $attributes  =  $sxe ->body->ul->attributes(); foreach  ( $attributes  as  $name  =>  $value ) { echo   $name ,  '=' ,  $value , PHP_EOL; }
XMLReader Extension $doc  = XMLReader::xml( $xmlString ); $doc  = XMLReader::open( $filePath ); while  ( $doc -> read ()) { if  ( $doc ->nodeType == XMLReader::ELEMENT) { var_dump ( $doc ->localName); var_dump ( $doc ->hasValue); var_dump ( $doc ->value); var_dump ( $doc ->hasAttributes); var_dump ( $doc ->getAttribute( 'id' )); } }
CSS Selector Libraries ,[object Object],[object Object],[object Object],$doc1  = phpQuery::newDocumentFile( $markupFilePath ); $doc2  = phpQuery::newDocument( $markupString ); $listItems  = pq( 'ul > li' );  // uses $doc2 $listItems  = pq( 'ul > li' ,  $doc1 );
PCRE Extension
Best Practices
Approximate Human Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
Account for Unavailability
Aim for Parallelism
Validate Data
Test, Test, Test!
Questions

Weitere ähnliche Inhalte

Was ist angesagt?

YAPC::Asia 2010 Twitter解析サービス
YAPC::Asia 2010 Twitter解析サービスYAPC::Asia 2010 Twitter解析サービス
YAPC::Asia 2010 Twitter解析サービス
Yusuke Wada
 
Twib in Yokoahma.pm 2010/3/5
Twib in Yokoahma.pm 2010/3/5Twib in Yokoahma.pm 2010/3/5
Twib in Yokoahma.pm 2010/3/5
Yusuke Wada
 
Modware next generation with pub module
Modware next generation with pub moduleModware next generation with pub module
Modware next generation with pub module
cybersiddhu
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applications
elliando dias
 
Programming For Designers V3
Programming For Designers V3Programming For Designers V3
Programming For Designers V3
sqoo
 

Was ist angesagt? (20)

Tax management-system
Tax management-systemTax management-system
Tax management-system
 
YAPC::Asia 2010 Twitter解析サービス
YAPC::Asia 2010 Twitter解析サービスYAPC::Asia 2010 Twitter解析サービス
YAPC::Asia 2010 Twitter解析サービス
 
Introduction to the Pods JSON API
Introduction to the Pods JSON APIIntroduction to the Pods JSON API
Introduction to the Pods JSON API
 
Twib in Yokoahma.pm 2010/3/5
Twib in Yokoahma.pm 2010/3/5Twib in Yokoahma.pm 2010/3/5
Twib in Yokoahma.pm 2010/3/5
 
Pemrograman Web 8 - MySQL
Pemrograman Web 8 - MySQLPemrograman Web 8 - MySQL
Pemrograman Web 8 - MySQL
 
Wsomdp
WsomdpWsomdp
Wsomdp
 
Php Rss
Php RssPhp Rss
Php Rss
 
Modware next generation with pub module
Modware next generation with pub moduleModware next generation with pub module
Modware next generation with pub module
 
Not Really PHP by the book
Not Really PHP by the bookNot Really PHP by the book
Not Really PHP by the book
 
TDC2015 Porto Alegre - Automate everything with Phing !
TDC2015 Porto Alegre - Automate everything with Phing !TDC2015 Porto Alegre - Automate everything with Phing !
TDC2015 Porto Alegre - Automate everything with Phing !
 
jQuery - Doing it right
jQuery - Doing it rightjQuery - Doing it right
jQuery - Doing it right
 
Add loop shortcode
Add loop shortcodeAdd loop shortcode
Add loop shortcode
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applications
 
So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019
 
Laravel the right way
Laravel   the right wayLaravel   the right way
Laravel the right way
 
Prepared Statement 올바르게 사용하기
Prepared Statement 올바르게 사용하기Prepared Statement 올바르게 사용하기
Prepared Statement 올바르게 사용하기
 
Current state-of-php
Current state-of-phpCurrent state-of-php
Current state-of-php
 
YAP / Open Mail Overview
YAP / Open Mail OverviewYAP / Open Mail Overview
YAP / Open Mail Overview
 
Programming For Designers V3
Programming For Designers V3Programming For Designers V3
Programming For Designers V3
 
Let's write secure Drupal code! - DrupalCamp Oslo, 2018
Let's write secure Drupal code! - DrupalCamp Oslo, 2018Let's write secure Drupal code! - DrupalCamp Oslo, 2018
Let's write secure Drupal code! - DrupalCamp Oslo, 2018
 

Ähnlich wie Web Scraping with PHP

Php Basic Security
Php Basic SecurityPhp Basic Security
Php Basic Security
mussawir20
 
Cakefest 2010: API Development
Cakefest 2010: API DevelopmentCakefest 2010: API Development
Cakefest 2010: API Development
Andrew Curioso
 
Intro to #memtech PHP 2011-12-05
Intro to #memtech PHP   2011-12-05Intro to #memtech PHP   2011-12-05
Intro to #memtech PHP 2011-12-05
Jeremy Kendall
 
Testing persistence in PHP with DbUnit
Testing persistence in PHP with DbUnitTesting persistence in PHP with DbUnit
Testing persistence in PHP with DbUnit
Peter Wilcsinszky
 
Php Security3895
Php Security3895Php Security3895
Php Security3895
Aung Khant
 

Ähnlich wie Web Scraping with PHP (20)

Web Scraping with PHP
Web Scraping with PHPWeb Scraping with PHP
Web Scraping with PHP
 
PHP
PHP PHP
PHP
 
Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)
Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)
Introduction to CodeIgniter (RefreshAugusta, 20 May 2009)
 
The Basics Of Page Creation
The Basics Of Page CreationThe Basics Of Page Creation
The Basics Of Page Creation
 
Php Basic Security
Php Basic SecurityPhp Basic Security
Php Basic Security
 
Introduction To Lamp
Introduction To LampIntroduction To Lamp
Introduction To Lamp
 
JQuery Basics
JQuery BasicsJQuery Basics
JQuery Basics
 
Php
PhpPhp
Php
 
Php 3 1
Php 3 1Php 3 1
Php 3 1
 
Cakefest 2010: API Development
Cakefest 2010: API DevelopmentCakefest 2010: API Development
Cakefest 2010: API Development
 
SlideShare Instant
SlideShare InstantSlideShare Instant
SlideShare Instant
 
SlideShare Instant
SlideShare InstantSlideShare Instant
SlideShare Instant
 
Php security3895
Php security3895Php security3895
Php security3895
 
PHP Security
PHP SecurityPHP Security
PHP Security
 
Mojolicious on Steroids
Mojolicious on SteroidsMojolicious on Steroids
Mojolicious on Steroids
 
Intro to #memtech PHP 2011-12-05
Intro to #memtech PHP   2011-12-05Intro to #memtech PHP   2011-12-05
Intro to #memtech PHP 2011-12-05
 
Testing persistence in PHP with DbUnit
Testing persistence in PHP with DbUnitTesting persistence in PHP with DbUnit
Testing persistence in PHP with DbUnit
 
HTML::FormHandler
HTML::FormHandlerHTML::FormHandler
HTML::FormHandler
 
Further Php
Further PhpFurther Php
Further Php
 
Php Security3895
Php Security3895Php Security3895
Php Security3895
 

Mehr von Matthew Turland

Open Source Networking with Vyatta
Open Source Networking with VyattaOpen Source Networking with Vyatta
Open Source Networking with Vyatta
Matthew Turland
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management Systems
Matthew Turland
 
PHP Basics for Designers
PHP Basics for DesignersPHP Basics for Designers
PHP Basics for Designers
Matthew Turland
 
Creating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandCreating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew Turland
Matthew Turland
 
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleThe OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
Matthew Turland
 
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierUtilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Matthew Turland
 
The Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellThe Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan Farnell
Matthew Turland
 

Mehr von Matthew Turland (14)

New SPL Features in PHP 5.3
New SPL Features in PHP 5.3New SPL Features in PHP 5.3
New SPL Features in PHP 5.3
 
New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)
 
Sinatra
SinatraSinatra
Sinatra
 
Open Source Networking with Vyatta
Open Source Networking with VyattaOpen Source Networking with Vyatta
Open Source Networking with Vyatta
 
When RSS Fails: Web Scraping with HTTP
When RSS Fails: Web Scraping with HTTPWhen RSS Fails: Web Scraping with HTTP
When RSS Fails: Web Scraping with HTTP
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management Systems
 
PHP Basics for Designers
PHP Basics for DesignersPHP Basics for Designers
PHP Basics for Designers
 
Web Scraping with PHP
Web Scraping with PHPWeb Scraping with PHP
Web Scraping with PHP
 
Creating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandCreating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew Turland
 
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleThe OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
 
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierUtilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
 
The Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellThe Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan Farnell
 
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestPDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
 
Getting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandGetting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew Turland
 

Kürzlich hochgeladen

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Web Scraping with PHP