scraping, scripting and hacking your way to API-less data
[AKA: if you don’t have data feeds, we’ll get it anyway]

(photo: http://www.flickr.com/photos/juan23/82888194/)
overview

•   “getting data out”
•   non-exhaustive (and rapid!)
•   slightly random
•   live examples (hopefully)
•   mainly non-technical(ish)
•   mainly non-illegal. I think.
anything goes

•   have no fear!
•   feel no remorse!
•   be shameless!
•   long live the open data revolution!
you

• half newbie, half “done some”
me

• not really a developer
• ...but I code enough ASP (stop giggling) to do what I want to do
• slides will be at slideshare.net/dmje
• www.electronicmuseum.org.uk
• mike.ellis@eduserv.org.uk
we <3 data

• we want programmatic access...
• ...but sites are often lacking
• ...and APIs are usually a pipe dream

http://www.ucas.com/instit/i/h60.html
http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
scraping

• copy & paste, without having to copy & paste...
• an inexact but really rather beautiful science




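' classic ASP (VBScript): fetch a page server-side; url holds the address to scrape
Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")

Call xmlhttp.Open("GET", url, False)   ' False = synchronous, wait for the reply
Call xmlhttp.send

' the whole response body as text (usually HTML), ready for munging
ReturnedXML = xmlhttp.responseText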
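
For anyone not on ASP, roughly the same fetch can be sketched with Python's standard library. This is an illustration only, not the code behind the talk: the `fetch` name, the User-Agent string and the timeout are all assumptions.

```python
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, much like xmlhttp.responseText."""
    # Some sites refuse requests that carry no User-Agent header.
    request = Request(url, headers={"User-Agent": "slide-demo-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in a real page address.
    print(fetch("data:text/plain,hello%20scraper"))
```

Everything after this point (delimiters, regex, tag stripping) works on the string that a call like this returns.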
scraping (cont)

• frowned on by purists...
• but really rather powerful
• http://hoard.it
extraction #1: Y!Pipes

•   find your data on page
•   view source
•   determine the delimiters
•   put it into Pipes
•   extract the output




                               originating page | output
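
Pipes does the cutting for you, but the idea underneath is simply "grab whatever sits between two markers". A minimal Python sketch of that idea, assuming a made-up page fragment and made-up delimiters (neither comes from the example page):

```python
def between(html: str, start: str, end: str) -> list[str]:
    """Return every chunk of `html` found between `start` and `end`."""
    chunks = []
    position = 0
    while True:
        begin = html.find(start, position)
        if begin == -1:
            break
        begin += len(start)
        finish = html.find(end, begin)
        if finish == -1:
            break
        chunks.append(html[begin:finish].strip())
        position = finish + len(end)
    return chunks


# Hypothetical page fragment; real delimiters come from "view source".
page = '<li class="course">Archaeology</li><li class="course">Zoology</li>'
courses = between(page, '<li class="course">', "</li>")
print(courses)
```

It is inexact in the way the earlier slide admits: if the site changes its markup, the delimiters stop matching and the list comes back empty.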
extraction #2: Google Docs

• create a new google spreadsheet
• find the URL of the data you want
• identify how it is encapsulated (list/table)
• use the importHTML() function (others exist for feeds, XML, data, etc: importFeed, importXML, importData)
• dump out data as CSV, XML, RSS, etc




                           originating page | output
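
If a spreadsheet isn't to hand, the "pull table N out of a page and dump it as CSV" step can be approximated in a script. The sketch below is a rough stand-in for what importHTML() does with a table, not a reimplementation of it: the `TableGrabber` and `table_to_csv` names and the sample page are assumptions, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the rows of every <table> as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []      # each table is a list of rows
        self._row = None      # cells of the row being read, if any
        self._cell = None     # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html: str, index: int = 1) -> str:
    """Return table number `index` (counted from 1) as CSV text."""
    grabber = TableGrabber()
    grabber.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(grabber.tables[index - 1])
    return out.getvalue()


# Hypothetical page content, standing in for a fetched page.
page = """<table>
<tr><th>Course</th><th>Code</th></tr>
<tr><td>Archaeology</td><td>V400</td></tr>
</table>"""
print(table_to_csv(page))
```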
extraction #3: dapper.net

• go to dapper.net/open
• identify several of the URLs with the same “shapes” that you want to scrape
• use the dapper dashboard to identify content areas
• build the “dapp”
• pass in URLs of pages you want to extract data from
• extract results from the output (xml, flash, csv, etc)




                          originating page | output
extraction #4: YQL

•   view source on the page you want to grab
•   go to http://developer.yahoo.com/yql/console/
•   get your XPath hat on and build a query
•   grab the data from a RESTful query




http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27




                                   originating page | output
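
The long URL on this slide is just a readable query with every awkward character percent-encoded. A small Python sketch that rebuilds it from its parts; the `yql_console_url` name is an assumption, and the code only builds the address, it does not call the service:

```python
from urllib.parse import quote

CONSOLE = "http://developer.yahoo.com/yql/console/"


def yql_console_url(page_url: str, xpath: str) -> str:
    """Build the console URL for a 'select * from html' query."""
    query = 'select * from html where url="{}" and xpath=\'{}\''.format(
        page_url, xpath
    )
    # Percent-encode everything except "*", matching the slide's URL.
    return CONSOLE + "?q=" + quote(query, safe="*")


url = yql_console_url(
    "http://openlibrary.org/search?q=keri+hulme",
    '//a[@class="result"]',
)
print(url)
```

Decoded, the query reads: select * from html where url="http://openlibrary.org/search?q=keri+hulme" and xpath='//a[@class="result"]'.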
extraction #5: httrack

• grab a copy of httrack (or similar) from http://www.httrack.com/
• point it at the bit of the site you want, make sure the filters are correct, and push go...
• you now have a local copy of the site, to munge as you see fit
extraction #6: hacked search

• get an API key from Yahoo!
• use it to search within a domain
• write a simple script to pick out each page in the results and download it
• hack that mumma
• (variation on a theme: build a simple spider...)
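
The "simple spider" variation boils down to one repeated step: read a page, collect its links, keep the ones on the same site, then go and fetch those. A hedged Python sketch of that single step; the class and function names and the example.org page are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(page_url: str, html: str) -> list[str]:
    """Return absolute links on the page that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page; a spider would fetch each returned link and repeat.
html = (
    '<a href="/about">About</a> '
    '<a href="http://other.example/">Elsewhere</a> '
    '<a href="/about#team">Team</a>'
)
found = same_site_links("http://www.example.org/index.html", html)
print(found)
```

A real spider also needs a list of pages already visited and a polite pause between requests.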
now you’ve got your data..

• once you’ve got your data, you usually need to munge it...
munging #1: regex!

• I’m terrible at regex
• ([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)
• but it’s incredibly powerful...




                                            output
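
The pattern on this slide looks like a UK postcode matcher (that reading is an assumption; the slide doesn't say). Here is a Python sketch of pointing it at a block of scraped text. The word boundaries (`\b`) are an addition so that it doesn't fire in the middle of longer strings, and the sample sentence is made up:

```python
import re

# The slide's pattern joined onto one line, with \b added at both ends.
POSTCODE = re.compile(
    r"\b([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}"
    r"[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)\b"
)


def find_postcodes(text: str) -> list[str]:
    """Return every substring of `text` that the pattern matches."""
    return [match.group(0) for match in POSTCODE.finditer(text)]


sample = "Offices at SW1A 1AA and M1 1AE; ring before visiting."
postcodes = find_postcodes(sample)
print(postcodes)
```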
munging #2: find/replace

• use whatever scripting language you work best with
• (even Word...)
• the most common needs: replacing double spaces, weird characters and paragraph marks
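
Those three replacements fit in one small function. A Python sketch, assuming a short and far from complete table of "weird" characters (the table and the `clean` name are illustrative):

```python
import re

# Characters that often arrive mangled from Word or copied web pages.
WEIRD_CHARACTERS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-",                   # en dash
    "\u00a0": " ",                   # non-breaking space
}


def clean(text: str) -> str:
    """Straighten odd characters, flatten line breaks, collapse spaces."""
    for weird, plain in WEIRD_CHARACTERS.items():
        text = text.replace(weird, plain)
    text = re.sub(r"\r\n?|\n", " ", text)   # paragraph marks become spaces
    text = re.sub(r" {2,}", " ", text)      # runs of spaces collapse to one
    return text.strip()


cleaned = clean("\u201cHello\u201d  world\r\nnext\u00a0line ")
print(cleaned)
```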
munging #3: mail merge!

• for rapid builds of HTML, JavaScript or XML
• have a source document (often extracted or munged from other sites) in Excel
• use filters to grab just the data you need
• build the merge in Word, using the “directory” option
• copy and paste the result out
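
The same trick works without Word: one template, one row of data per output line. A Python sketch, assuming the spreadsheet has been saved as CSV; the column names, the template and the data are hypothetical:

```python
import csv
import io
from html import escape
from string import Template

# The "merge document": $placeholders match the column headings.
ROW = Template('<li><a href="$url">$title</a></li>')


def merge(csv_text: str, template: Template = ROW) -> str:
    """Fill the template once per CSV row and join the results."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        # escape() keeps stray &, < and > in the data from breaking the HTML.
        template.substitute({key: escape(value) for key, value in row.items()})
        for row in rows
    )


# Hypothetical data, standing in for a sheet exported from Excel as CSV.
data = "title,url\nFish & chips,http://example.org/1\nTea,http://example.org/2\n"
merged = merge(data)
print(merged)
```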
munging #4: html removal

• have a function handy that you can pass a block of HTML to
• better still, a script where you can define which particular tags to remove and which to leave in place
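
One possible shape for such a function, sketched in Python with the standard library parser. The `strip_tags` name and its `keep` parameter are assumptions; text always survives, so the contents of script or style blocks would come through as plain text too.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Drop every tag except those named in `keep`; text always survives."""

    def __init__(self, keep=()):
        # convert_charrefs=False leaves entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html: str, keep=()) -> str:
    """Return `html` with all tags removed except those listed in `keep`."""
    stripper = TagStripper(keep)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


block = '<div class="x"><p>Fish &amp; <b>chips</b></p></div>'
print(strip_tags(block))
print(strip_tags(block, keep=("b",)))
```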
munging #5: html tidy

• grab a copy of HTML Tidy from http://tidy.sourceforge.net/
• tidy is available as a downloadable .exe, or as a component that you can pass data to in your code
processing #1: Open Calais

• a service from Reuters for analysing blocks of text for semantic “meaning”
• get an API key from Open Calais
• send data via a POST to the REST service
• retrieve results from the RDF
• OR...just paste your text into http://sws.clearforest.com/calaisviewer/




                                             output
processing #2: Yahoo! TE

• a web service for grabbing tags/terms from blocks of text
• sign up for a Yahoo! API key
• pass your block of text using POST
• grab the results...




                                          output
processing #3: geo!

• go to http://developer.yahoo.com/geo !
the ugly sisters

• Access
• Excel (!)
the last resorts

• FOI (frankie!)
• OCR (me)
the very last resort..

• re-type it...
• (or use Amazon Mechanical Turk)
...any more?
