XQuery: Querying the World (formerly known as Web Scraping)
Dennis Knochenwefel <dennis.knochenwefel@28msec.com>
Evolution of Web Scraping
PHP (2007)

    $url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
    $raw = file_get_contents($url);
    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));
    $start = strpos($content,'<table cellpadding="2" class="standard_table"');
    $end = strpos($content,'</table>',$start) + 8;
    $table = substr($content,$start,$end-$start);
    preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
    foreach ($rows[0] as $row){
        if ((strpos($row,'<th')===false)){
            preg_match_all("|<td(.*)</td>|U",$row,$cells);
            $number = strip_tags($cells[0][0]);
            $name = strip_tags($cells[0][1]);
            $position = strip_tags($cells[0][2]);
            echo "{$position} - {$name} - Number {$number} <br>";
        }
    }

    source: http://www.bradino.com/php/screen-scraping/
PHP (June 2011)

    $url="http://www.rtu.ac.in/results/reformat.php";
    $post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_POST,1);
    curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    $content=curl_exec($ch);
    curl_close($ch);
    $totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
    $page=new DOMDocument();
    $xpath=new DOMXPath($page);
    $page->loadHTML($content);
    $page->saveHTML();  // this shows the page contents
    $total=$xpath->query($totalPath);
    echo $total->length;    //shows 0
    echo $total->item(0)->nodeValue;   //shows nothing

    source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page
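The slide does not say why the June 2011 snippet prints 0. Two common causes are worth checking, and both are assumptions here rather than a diagnosis taken from the source. First, the DOMXPath object is created before loadHTML() is called, which may leave it bound to the empty pre-load document. Second, the path contains tbody, an element that browsers insert into their DOM but that libxml does not add when parsing the raw markup. The sketch below uses a hypothetical inline page in place of the real response and compares a browser-style path with one that follows the markup as delivered.

```php
<?php
// Hypothetical stand-in for the fetched page; the real markup is not shown in the slide.
$content = '<html><body><table>'
         . '<tr><td>Total</td><td>412</td></tr>'
         . '</table></body></html>';

libxml_use_internal_errors(true);  // keep parser warnings about sloppy HTML out of the output

$page = new DOMDocument();
$page->loadHTML($content);         // parse first ...
$xpath = new DOMXPath($page);      // ... then bind the XPath object to the parsed document

// Browser-style path: the markup has no <tbody>, and libxml does not invent one.
$withTbody = $xpath->query('/html/body/table/tbody/tr/td[2]');

// Path that follows the markup as it was delivered.
$withoutTbody = $xpath->query('/html/body/table/tr/td[2]');

echo $withTbody->length, "\n";
echo $withoutTbody->item(0)->nodeValue, "\n";
```

Paths copied from a browser inspector describe the browser's repaired DOM, not the bytes on the wire, so they tend to need this kind of adjustment before they work in a server-side parser.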
XQuery
Real World Example
Awesome site, awesome data, no API
Deal with sessions
Need to emulate setting options
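To make the two requirements above concrete, here is a minimal sketch in the PHP idiom of the earlier slides. It is not the speaker's XQuery solution. The form markup, the field names, and the helper names form_defaults and post_in_session are all hypothetical. The first helper collects the values a browser would submit for a form (hidden and text inputs plus the chosen option of each select; checkboxes, radio buttons and submit buttons are ignored). The second helper reuses one cURL handle with its cookie engine switched on, so that every request belongs to the same session.

```php
<?php
// Hypothetical form markup; real field names depend on the target site.
$html = '<form action="/report" method="post">'
      . '<input type="hidden" name="token" value="abc123">'
      . '<select name="year"><option value="2010" selected>2010</option>'
      . '<option value="2011">2011</option></select>'
      . '<select name="country"><option value="all">All</option>'
      . '<option value="de">Germany</option></select>'
      . '</form>';

/** Collect the name/value pairs a browser would submit for the first form on the page. */
function form_defaults(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $fields = [];
    $inputs = '(//form)[1]//input[@name and (@type="hidden" or @type="text")]';
    foreach ($xpath->query($inputs) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    foreach ($xpath->query('(//form)[1]//select[@name]') as $select) {
        // A browser submits the selected option, or the first one if none is marked.
        $option = $xpath->query('option[@selected]', $select)->item(0)
               ?? $xpath->query('option', $select)->item(0);
        if ($option !== null) {
            $fields[$select->getAttribute('name')] = $option->getAttribute('value');
        }
    }
    return $fields;
}

/** POST a form body on a shared handle; cookies set by earlier responses are sent again. */
function post_in_session($ch, string $url, string $body)
{
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_COOKIEFILE     => '',    // empty string: in-memory cookie engine, no file
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    ]);
    return curl_exec($ch);
}

$fields = form_defaults($html);
$fields['year'] = '2011';              // emulate choosing an option in the browser
$body = http_build_query($fields);

echo $body, "\n";
```

In a real run, the page holding the form would first be fetched on a handle from curl_init(), and the same handle would then be passed to post_in_session() together with the form's action URL, so that the session cookie from the first response accompanies the POST.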
Different Notions: Publisher <=> Consumer
[Diagram, built up over several slides: App <=> Website]
App (consumer) side: Stateless REST API? JSON? XML? Customize with URL params
Website (publisher) side: Session! CSV! HTML! XLS! Zip! HTML forms
Final build: XQuery sits between App and Website and takes over the sessions, the HTML forms, and the CSV, HTML, XLS and Zip formats.
Summary
Session handling! Forms! XQuery: Web Data Processing
A browser can do it? XQuery can do it!
Result: http://www.unemployment.by/country

London XQuery Meetup: Querying the World (Web Scraping)


Editor's notes

  1. http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page