SlideShare a Scribd company logo
1 of 28
Web data acquisition with R


        Scott Chamberlain
        October 28, 2011
Why would you even need to do this?

  Why not just get data through a
            browser?
Some use cases
• Reason 1: It just takes too dam* long to
  manually search/get data on a web interface

• Reason 2: Workflow integration

• Reason 3: Your work is reproducible and
  transparent if done from R instead of clicking
  buttons on the web
A few general methods of getting web
           data through R
•   Read file – ideal if available
•   HTML
•   XML
•   JSON
•   APIs that serve up XML/JSON
Practice…read.csv (or xls, txt, etc.)



Get URL for file…see screenshot
url <- “http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1”

mycsv <- read.csv(url)

mycsv
‘Scraping’ web data

• Why? When there is no API
  – Can either scrape XML or HTML or JSON
  – XML and JSON are easier formats to deal with
    from R
Scraping E.g. 1: XML
http://www.fishbase.org/summary/speciessummary.php?id=2
Scraping E.g. 1: XML
The summary XML page behind the rendered page…
Scraping E.g. 1: XML
We can process the XML ourselves using a bunch of lines of code…
Scraping E.g. 1: XML
…OR just use a package someone already created - rfishbase



                                         And you get this nice plot
Practice…XML and JSON formats
     data from the USA National Phenology Network
install.packages(c(“RCurl”,”XML”,”RJSONIO”)) # if not installed already
require(RCurl); require(XML); require(RJSONIO)

XML Format
xmlurl <- 'http://www-dev.usanpn.org/npn_portal/observations/
    getObservationsForSpeciesIndividualAtLocation.xml?
    year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3'
xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
xmlTreeParse(xmlout)[[1]][[1]]

JSON Format
jsonurl <- 'http://www-dev.usanpn.org/npn_portal/observations/
    getObservationsForSpeciesIndividualAtLocation.json?
    year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3'
jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
fromJSON(jsonout)
Scraping E.g. 2: HTML
 All this code can produce something like…
Scraping E.g. 2: HTML
          …this
Practice…scraping HTML
install.packages(c("XML","RCurl")) # if not already installed
require(XML); require(RCurl)

# Lets look at the raw html first
rawhtml <- getURLContent('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawhtml

# Scrape data from the website
rawPMI <- readHTMLTable('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752')
rawPMI
PMI <- data.frame(rawPMI[[1]])
names(PMI)[1] <- 'Year'
APIs (application programmatic interface)

• Many data sources have API’s – largely for
  talking to other web interfaces
  – we can use their API from R
• Consists of a set of methods to search,
  retrieve, or submit data to, a data
  source/repository
• One can write R code to interface with an API
  – Keep in mind some API’s require authentication
    keys
API Documentation
• API docs for the Integrated Taxonomic
  Information Service (ITIS):
http://www.itis.gov/ws_description.html




                  http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
Example: Simple call to API
rOpenSci suite of R packages
• There are many packages on CRAN for specific
  data sources on the web – search on CRAN to
  find these
• rOpenSci is developing a lot of packages for as
  many open source data sources as possible
  – Please use and give feedback…
Data                                    Literature/metadata




       http://ropensci.org/ , code at GitHub
Three examples of packages that
      interact with an API
API E.g. 1: Search literature: rplos
You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
API E.g. 2: Get taxonomic information
    for your study species: taxize
      A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
API E.g. 3: Get some data: dryad
     A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
Calling external programs from
               R
Why even think about doing this?
• Again, workflow integration

• It’s just easier to call X program from R if you
  have are going to run many analyses with said
  program
Eg. 1: Phylometa
…using the files in the dropbox
Also, get Phylometa here:
http://lajeunesse.myweb.usf.edu/publications.html
• On a Mac: doesn’t work on mac because it’s
  .exe
   – But system() often can work to run external programs
• On Windows:
   system(paste('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt'), intern=T)
   NOTE: intern = T, returns the output to the R console


   Should give you something like this   
Resources
• rOpenSci (development of R packages for all
  open source data and literature)
• CRAN packages (search for a data source)
• Tutorials/websites:
  – http://www.programmingr.com/content/webscraping-using-readlines-
    and-rcurl

• Non-R based, but cool:
  http://ecologicaldata.org/

More Related Content

What's hot

SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)
andyseaborne
 

What's hot (20)

Cool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearchCool bonsai cool - an introduction to ElasticSearch
Cool bonsai cool - an introduction to ElasticSearch
 
Live DBpedia querying with high availability
Live DBpedia querying with high availabilityLive DBpedia querying with high availability
Live DBpedia querying with high availability
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
The Future is Federated
The Future is FederatedThe Future is Federated
The Future is Federated
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and Kibana
 
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern FragmentsInitial Usage Analysis of DBpedia's Triple Pattern Fragments
Initial Usage Analysis of DBpedia's Triple Pattern Fragments
 
Elasticsearch speed is key
Elasticsearch speed is keyElasticsearch speed is key
Elasticsearch speed is key
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
4 sw architectures and sparql
4 sw architectures and sparql4 sw architectures and sparql
4 sw architectures and sparql
 
Rest - Representational State Transfer (EMC BRDC Internal Tech talk)
Rest - Representational State Transfer (EMC BRDC Internal Tech talk)Rest - Representational State Transfer (EMC BRDC Internal Tech talk)
Rest - Representational State Transfer (EMC BRDC Internal Tech talk)
 
The Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked LascauxThe Digital Cavemen of Linked Lascaux
The Digital Cavemen of Linked Lascaux
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Using server logs to your advantage
Using server logs to your advantageUsing server logs to your advantage
Using server logs to your advantage
 
CrossRef Technical Information for Libraries
CrossRef Technical Information for LibrariesCrossRef Technical Information for Libraries
CrossRef Technical Information for Libraries
 
Use Cases for Elastic Search Percolator
Use Cases for Elastic Search PercolatorUse Cases for Elastic Search Percolator
Use Cases for Elastic Search Percolator
 
Your Data, Your Search, ElasticSearch (EURUKO 2011)
Your Data, Your Search, ElasticSearch (EURUKO 2011)Your Data, Your Search, ElasticSearch (EURUKO 2011)
Your Data, Your Search, ElasticSearch (EURUKO 2011)
 
SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)
 

Viewers also liked

Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
Fabio Benedetti
 
Practical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and MethodsPractical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and Methods
Zhipeng Liang
 
20130618 presentation big data in financial services English
20130618 presentation big data in financial services English20130618 presentation big data in financial services English
20130618 presentation big data in financial services English
Pascal Spelier
 
Simple Log Analysis and Trending
Simple Log Analysis and TrendingSimple Log Analysis and Trending
Simple Log Analysis and Trending
Mike Brittain
 

Viewers also liked (20)

R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlines
 
Introduction to the Web API
Introduction to the Web APIIntroduction to the Web API
Introduction to the Web API
 
Marketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesMarketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success Rates
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
Tutorial of Sentiment Analysis
Tutorial of Sentiment AnalysisTutorial of Sentiment Analysis
Tutorial of Sentiment Analysis
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Google Analytics Data Mining with R
Google Analytics Data Mining with RGoogle Analytics Data Mining with R
Google Analytics Data Mining with R
 
Data mining with Google analytics
Data mining with Google analyticsData mining with Google analytics
Data mining with Google analytics
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Building powerful dashboards with r shiny
Building powerful dashboards with r shinyBuilding powerful dashboards with r shiny
Building powerful dashboards with r shiny
 
Move your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeMove your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R code
 
Practical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and MethodsPractical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and Methods
 
20130618 presentation big data in financial services English
20130618 presentation big data in financial services English20130618 presentation big data in financial services English
20130618 presentation big data in financial services English
 
Webinar: Maximize Keyword Profits & Conversions with Data Science
Webinar: Maximize Keyword Profits & Conversions with Data ScienceWebinar: Maximize Keyword Profits & Conversions with Data Science
Webinar: Maximize Keyword Profits & Conversions with Data Science
 
An ad words ad performance analysis by r
An ad words ad performance analysis by rAn ad words ad performance analysis by r
An ad words ad performance analysis by r
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with R
 
Simple Log Analysis and Trending
Simple Log Analysis and TrendingSimple Log Analysis and Trending
Simple Log Analysis and Trending
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 

Similar to Web data from R

Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Sourcesense
 
EXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkEXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp framework
Florent Georges
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
Ajay Ohri
 
PHP Performance: Principles and tools
PHP Performance: Principles and toolsPHP Performance: Principles and tools
PHP Performance: Principles and tools
10n Software, LLC
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
Erik Hatcher
 
CrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef Workshops
Crossref
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
Juan Sequeda
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
Michael Hackstein
 

Similar to Web data from R (20)

Mashups MAX 360|MAX 2008 Unconference
Mashups MAX 360|MAX 2008 UnconferenceMashups MAX 360|MAX 2008 Unconference
Mashups MAX 360|MAX 2008 Unconference
 
Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...
Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...
Jones "Working with Scholarly APIs: A NISO Training Series, Session One: Foun...
 
Lightweight web frameworks
Lightweight web frameworksLightweight web frameworks
Lightweight web frameworks
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
20100707 e z_rmll_gig_v1
20100707 e z_rmll_gig_v120100707 e z_rmll_gig_v1
20100707 e z_rmll_gig_v1
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
 
Red5 - PHUG Workshops
Red5 - PHUG WorkshopsRed5 - PHUG Workshops
Red5 - PHUG Workshops
 
EXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkEXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp framework
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
PHP Performance: Principles and tools
PHP Performance: Principles and toolsPHP Performance: Principles and tools
PHP Performance: Principles and tools
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
CrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef Workshops
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
 
The Semantic Web Client Library - Consuming Linked Data in Your Applications
The Semantic Web Client Library - Consuming Linked Data in Your ApplicationsThe Semantic Web Client Library - Consuming Linked Data in Your Applications
The Semantic Web Client Library - Consuming Linked Data in Your Applications
 
Programming the Semantic Web
Programming the Semantic WebProgramming the Semantic Web
Programming the Semantic Web
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
 
Semantic Web
Semantic WebSemantic Web
Semantic Web
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
 

More from schamber (6)

Poster
PosterPoster
Poster
 
Poster
PosterPoster
Poster
 
Chamberlain PhD Thesis
Chamberlain PhD ThesisChamberlain PhD Thesis
Chamberlain PhD Thesis
 
Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in R
 
regex-presentation_ed_goodwin
regex-presentation_ed_goodwinregex-presentation_ed_goodwin
regex-presentation_ed_goodwin
 
R Introduction
R IntroductionR Introduction
R Introduction
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Web data from R

  • 1. Web data acquisition with R Scott Chamberlain October 28, 2011
  • 2. Why would you even need to do this? Why not just get data through a browser?
  • 3. Some use cases • Reason 1: It just takes too dam* long to manually search/get data on a web interface • Reason 2: Workflow integration • Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
  • 4. A few general methods of getting web data through R
  • 5. Read file – ideal if available • HTML • XML • JSON • APIs that serve up XML/JSON
  • 6. Practice…read.csv (or xls, txt, etc.) Get URL for file…see screenshot url <- “http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1” mycsv <- read.csv(url) mycsv
  • 7. ‘Scraping’ web data • Why? When there is no API – Can either scrape XML or HTML or JSON – XML and JSON are easier formats to deal with from R
  • 8. Scraping E.g. 1: XML http://www.fishbase.org/summary/speciessummary.php?id=2
  • 9. Scraping E.g. 1: XML The summary XML page behind the rendered page…
  • 10. Scraping E.g. 1: XML We can process the XML ourselves using a bunch of lines of code…
  • 11. Scraping E.g. 1: XML …OR just use a package someone already created - rfishbase And you get this nice plot
  • 12. Practice…XML and JSON formats data from the USA National Phenology Network install.packages(c(“RCurl”,”XML”,”RJSONIO”)) # if not installed already require(RCurl); require(XML); require(RJSONIO) XML Format xmlurl <- 'http://www-dev.usanpn.org/npn_portal/observations/ getObservationsForSpeciesIndividualAtLocation.xml? year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3' xmlout <- getURLContent(xmlurl, curl = getCurlHandle()) xmlTreeParse(xmlout)[[1]][[1]] JSON Format jsonurl <- 'http://www-dev.usanpn.org/npn_portal/observations/ getObservationsForSpeciesIndividualAtLocation.json? year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3' jsonout <- getURLContent(jsonurl, curl = getCurlHandle()) fromJSON(jsonout)
  • 13. Scraping E.g. 2: HTML All this code can produce something like…
  • 14. Scraping E.g. 2: HTML …this
  • 15. Practice…scraping HTML install.packages(c("XML","RCurl")) # if not already installed require(XML); require(RCurl) # Lets look at the raw html first rawhtml <- getURLContent('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752') rawhtml # Scrape data from the website rawPMI <- readHTMLTable('http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752') rawPMI PMI <- data.frame(rawPMI[[1]]) names(PMI)[1] <- 'Year'
  • 16. APIs (application programmatic interface) • Many data sources have API’s – largely for talking to other web interfaces – we can use their API from R • Consists of a set of methods to search, retrieve, or submit data to, a data source/repository • One can write R code to interface with an API – Keep in mind some API’s require authentication keys
  • 17. API Documentation • API docs for the Integrated Taxonomic Information Service (ITIS): http://www.itis.gov/ws_description.html http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
  • 19. rOpenSci suite of R packages • There are many packages on CRAN for specific data sources on the web – search on CRAN to find these • rOpenSci is developing a lot of packages for as many open source data sources as possible – Please use and give feedback…
  • 20. Data Literature/metadata http://ropensci.org/ , code at GitHub
  • 21. Three examples of packages that interact with an API
  • 22. API E.g. 1: Search literature: rplos You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
  • 23. API E.g. 2: Get taxonomic information for your study species: taxize A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
  • 24. API E.g. 3: Get some data: dryad A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
  • 26. Why even think about doing this? • Again, workflow integration • It’s just easier to call X program from R if you have are going to run many analyses with said program
  • 27. Eg. 1: Phylometa …using the files in the dropbox Also, get Phylometa here: http://lajeunesse.myweb.usf.edu/publications.html • On a Mac: doesn’t work on mac because it’s .exe – But system() often can work to run external programs • On Windows: system(paste('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt'), intern=T) NOTE: intern = T, returns the output to the R console Should give you something like this 
  • 28. Resources • rOpenSci (development of R packages for all open source data and literature) • CRAN packages (search for a data source) • Tutorials/websites: – http://www.programmingr.com/content/webscraping-using-readlines- and-rcurl • Non-R based, but cool: http://ecologicaldata.org/