XQuery: Querying the World (formerly known as Web Scraping) Dennis Knochenwefel <dennis.knochenwefel@28msec.com>
Evolution Web Scraping
PHP (2007)
    $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
    $raw = file_get_contents($url);
    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));
    $start = strpos($content,'<table cellpadding="2" class="standard_table"');
    $end = strpos($content,'</table>',$start) + 8;
    $table = substr($content,$start,$end-$start);
    preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
    foreach ($rows[0] as $row){
        if ((strpos($row,'<th')===false)){
            preg_match_all("|<td(.*)</td>|U",$row,$cells);
            $number = strip_tags($cells[0][0]);
            $name = strip_tags($cells[0][1]);
            $position = strip_tags($cells[0][2]);
            echo "{$position} - {$name} - Number {$number} <br>";
        }
    }
source: http://www.bradino.com/php/screen-scraping/
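For comparison, here is a minimal sketch of the same row extraction done with an HTML parser instead of string offsets and regular expressions. It is written in Python rather than PHP or XQuery, and it is not the speaker's code. The sample markup, the player names and the helper names (TableRowParser, parse_roster) are assumptions made up for illustration; only the table attributes and the number/name/position column order are taken from the PHP snippet above.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows hold only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_roster(html):
    """Return one dict per data row, in the column order the PHP code assumes."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [
        {"number": r[0], "name": r[1], "position": r[2]}
        for r in parser.rows
        if len(r) >= 3
    ]


# Hypothetical sample markup; the real page layout is not shown in the slides.
SAMPLE = """
<table cellpadding="2" class="standard_table">
  <tr><th>No</th><th>Name</th><th>Pos</th></tr>
  <tr><td>17</td><td><a href="#">Example Player</a></td><td>QB</td></tr>
  <tr><td>85</td><td>Another Player</td><td>TE</td></tr>
</table>
"""

if __name__ == "__main__":
    for player in parse_roster(SAMPLE):
        print(f"{player['position']} - {player['name']} - Number {player['number']}")
```

Because the parser tracks tags rather than character positions, nested markup inside a cell (the link around the name) does not need a separate strip_tags step.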
PHP (June 2011)
    $url="http://www.rtu.ac.in/results/reformat.php";
    $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_POST,1);
    curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    $content=curl_exec($ch);
    curl_close($ch);
    $totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
    $page=new DOMDocument();
    $xpath=new DOMXPath($page);
    $page->loadHTML($content);
    $page->saveHTML();  // this shows the page contents
    $total=$xpath->query($totalPath);
    echo $total->length;    //shows 0
    echo $total->item(0)->nodeValue;   //shows nothing
source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
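One common reason a path like this matches nothing is that paths copied from a browser inspector contain a tbody step the browser inserted itself, while the markup the server sends has none. Whether that was the cause in the question cited above is an assumption, not something the slides state. The Python sketch below illustrates the effect on a tiny, well-formed stand-in page (the PAGE content and the cell_text helper are invented for illustration) and shows checking for "no match" before reading a value. Real-world HTML would need a tolerant parser rather than xml.etree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the result page; note there is no <tbody>.
PAGE = """<html><body>
<table>
  <tr><td>Subject</td><td>Marks</td></tr>
  <tr><td>Maths</td><td>71</td></tr>
</table>
</body></html>"""


def cell_text(markup, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    root = ET.fromstring(markup)   # root is the <html> element
    node = root.find(path)         # path is relative to <html>
    return None if node is None else node.text


if __name__ == "__main__":
    # Path as a browser inspector might report it, including the tbody step.
    print(cell_text(PAGE, "body/table[1]/tbody/tr[2]/td[2]"))
    # Path that follows the markup as it was actually served.
    print(cell_text(PAGE, "body/table[1]/tr[2]/td[2]"))
```

Returning None for an empty match makes the failure explicit, instead of dereferencing a missing first item as the PHP snippet does.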
XQuery
Real World Example
Awesome site, awesome data, no API
Deal with sessions
Need to emulate setting options
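The two requirements above, keeping a session and emulating the option form, can be sketched as follows. This is a hedged Python illustration using only the standard library, not the XQuery shown in the talk. The URL, the field names ("country", "year") and the helper names are hypothetical; nothing here is taken from the site the talk scraped.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_form_request(url, fields):
    """Build the POST request a browser would send for an HTML form."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical form target and option names, for illustration only.
OPTIONS_URL = "http://example.com/options"
request = build_form_request(OPTIONS_URL, {"country": "DE", "year": "2011"})

if __name__ == "__main__":
    opener, jar = make_session()
    print(request.get_method(), request.full_url, request.data)
    # With network access, the same opener would carry the session cookie
    # from the form submission over to the follow-up download:
    #   opener.open(request)
    #   opener.open("http://example.com/export.csv")
```

The point of the shared opener is that the options chosen through the form live in the server-side session, so the later download has to present the same cookie.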
Different Notions: Publisher <=> Consumer
App (consumer) expects: JSON? XML?
Website (publisher) offers: CSV! HTML! XLS! Zip!
App expects: a stateless REST API? Website uses: Session!
App expects: customize with URL params. Website uses: HTML forms.
XQuery sits between App and Website and handles CSV, HTML, XLS, Zip, sessions, and HTML forms.
Summary
XQuery web data processing: session handling, forms.
A browser can do it? XQuery can do it!
Result: http://www.unemployment.by/country

London XQuery Meetup: Querying the World (Web Scraping)
