SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
Using R for Scraping Data
             Ryan Elmore
    National Renewable Energy Lab

        rtelmore@gmail.com
          Twitter: rtelmore

            June 13, 2012
              useR! 2012
A Baseball Challenge

Question: Has the minimum number of pitches
per (full) inning (6 pitches) has ever been
attained?
Answer: I don’t know; scrape the boxscores at
baseball-reference.com.


http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
A Baseball Challenge

Question: Has the minimum number of pitches
per (full) inning (6 pitches) has ever been
attained?
Answer: I don’t know; scrape the boxscores at
baseball-reference.com.


http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
The Boxscore




This column seems useful!
The Boxscore




This column seems useful!
Dissecting the URL
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml




   Just step through
    all of the teams:     YearMonthDay         Game ID
    COL, BOS, etc.
How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Everyday from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Everyday from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
R Code
for (team in teams){
  for (year in years){
    out.string <- paste(Sys.time(), "--", team, year, sep = " ")
    print(out.string)
    for (month in months){
      for (day in days){
        for (i in 0:1){
          full.url <- paste(paste(base.url, team, date.url,
             sep="/"), i, ".shtml", sep="")
          table.stats <- readHTMLTable(full.url)
          ## Process the list of data.frames returned by
          ## the call to readHTMLTable
        }
      }
    }
  }
}
R Code
for (team in teams){
  for (year in years){
    out.string <- paste(Sys.time(), "--", team, year, sep = " ")
    print(out.string)
    for (month in months){
      for (day in days){
        for (i in 0:1){
          full.url <- paste(paste(base.url, team, date.url,
             sep="/"), i, ".shtml", sep="")
          table.stats <- readHTMLTable(full.url)
          ## Process the list of data.frames returned by
          ## the call to readHTMLTable
        }
      }
    }
  }
}
Tools

•   base: paste, strsplit, unlist, lapply
•   XML: readHTMLTable, htmlTreeParse,
    getNodeSet, xmlValue, xmlSApply
•   httr, stringr, and other Hadley things
•   useful, but not necessary: regex, xpath,
    XML, etc.
Tools

•   base: paste, strsplit, unlist, lapply
•   XML: readHTMLTable, htmlTreeParse,
    getNodeSet, xmlValue, xmlSApply
•   httr, stringr, and other Hadley things
•   useful, but not necessary: regex, xpath,
    XML, etc.
Conclusions/Discussion

• There is a lot of data available on the web!
• You can access this data from a browser;
  however, you can access A LOT more data
  if you let your computer do the work.
• R and its libraries provide a great platform
  for scraping data and data mining.
• Download data and see where you go.
Conclusions/Discussion

• There is a lot of data available on the web!
• You can access this data from a browser;
  however, you can access A LOT more data
  if you let your computer do the work.
• R and its libraries provide a great platform
  for scraping data and data mining.
• Download data and see where you go.
Was That Minimum Attained?

• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• The skills are transferrable to other
  websites with data.

Weitere ähnliche Inhalte

Was ist angesagt?

Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...తేజ దండిభట్ల
 
SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)andyseaborne
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQLOlaf Hartig
 
WebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaWebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaKatrien Verbert
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache SolrBiogeeks
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioOpen Knowledge Belgium
 
Building social network with Neo4j and Python
Building social network with Neo4j and PythonBuilding social network with Neo4j and Python
Building social network with Neo4j and PythonAndrii Soldatenko
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma APIKyle Banerjee
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and SemanticsTatiana Al-Chueyr
 
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Kai Chan
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Storesandyseaborne
 
A Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsA Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsMaribel Acosta Deibe
 
SWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfs
SWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfsSWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfs
SWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfsMariano Rodriguez-Muro
 

Was ist angesagt? (20)

Reproducible research
Reproducible researchReproducible research
Reproducible research
 
Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
 
Linked Data Fragments
Linked Data FragmentsLinked Data Fragments
Linked Data Fragments
 
SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)SPARQL 1.1 Update (2013-03-05)
SPARQL 1.1 Update (2013-03-05)
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Querying Linked Data with SPARQL
Querying Linked Data with SPARQLQuerying Linked Data with SPARQL
Querying Linked Data with SPARQL
 
WebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPediaWebTech Tutorial Querying DBPedia
WebTech Tutorial Querying DBPedia
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
4 sw architectures and sparql
4 sw architectures and sparql4 sw architectures and sparql
4 sw architectures and sparql
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
 
Unit 3
Unit 3Unit 3
Unit 3
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
 
Building social network with Neo4j and Python
Building social network with Neo4j and PythonBuilding social network with Neo4j and Python
Building social network with Neo4j and Python
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and Semantics
 
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
 
A Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsA Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia Mappings
 
SWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfs
SWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfsSWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfs
SWT Lecture Session 6 - RDFS semantics, inference techniques, sesame rdfs
 

Ähnlich wie Using R for Scraping Baseball Data from Baseball-Reference

Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrBrooke Ganz
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsAjay Ohri
 
Is your excel production code?
Is your excel production code?Is your excel production code?
Is your excel production code?ProCogia
 
Build Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to OmegaBuild Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to OmegaRavi Mynampaty
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1Charles Givre
 
PostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databasePostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databaseBarry Jones
 
Five Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig KerstiensFive Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig KerstiensCitus Data
 
Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...
Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...
Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...Citus Data
 
What Your Database Query is Really Doing
What Your Database Query is Really DoingWhat Your Database Query is Really Doing
What Your Database Query is Really DoingDave Stokes
 
Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data setsCreditas
 
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Web Services
 
Amazon Athena (March 2017)
Amazon Athena (March 2017)Amazon Athena (March 2017)
Amazon Athena (March 2017)Julien SIMON
 
Software Design in Practice (with Java examples)
Software Design in Practice (with Java examples)Software Design in Practice (with Java examples)
Software Design in Practice (with Java examples)Ganesh Samarthyam
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptxShree Shree
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008eComm2008
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internetdrgath
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"sandinmyjoints
 
Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014PyData
 

Ähnlich wie Using R for Scraping Baseball Data from Baseball-Reference (20)

Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache Solr
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
 
e_lumley.pdf
e_lumley.pdfe_lumley.pdf
e_lumley.pdf
 
Is your excel production code?
Is your excel production code?Is your excel production code?
Is your excel production code?
 
Build Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to OmegaBuild Your Own World Class Directory Search From Alpha to Omega
Build Your Own World Class Directory Search From Alpha to Omega
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
PostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databasePostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty database
 
Five Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig KerstiensFive Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
Five Data Models for Sharding | Nordic PGDay 2018 | Craig Kerstiens
 
Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...
Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...
Five data models for sharding and which is right | PGConf.ASIA 2018 | Craig K...
 
What Your Database Query is Really Doing
What Your Database Query is Really DoingWhat Your Database Query is Really Doing
What Your Database Query is Really Doing
 
Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data sets
 
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
 
Amazon Athena (March 2017)
Amazon Athena (March 2017)Amazon Athena (March 2017)
Amazon Athena (March 2017)
 
Software Design in Practice (with Java examples)
Software Design in Practice (with Java examples)Software Design in Practice (with Java examples)
Software Design in Practice (with Java examples)
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internet
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
 
Phpconf2008 Sphinx En
Phpconf2008 Sphinx EnPhpconf2008 Sphinx En
Phpconf2008 Sphinx En
 
Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014Querying your database in natural language by Daniel Moisset PyData SV 2014
Querying your database in natural language by Daniel Moisset PyData SV 2014
 

Kürzlich hochgeladen

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Kürzlich hochgeladen (20)

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Using R for Scraping Baseball Data from Baseball-Reference

  • 1. Using R for Scraping Data Ryan Elmore National Renewable Energy Lab rtelmore@gmail.com Twitter: rtelmore June 13, 2012 useR! 2012
  • 2. A Baseball Challenge Question: Has the minimum number of pitches per (full) inning (6 pitches) has ever been attained? Answer: I don’t know; scrape the boxscores at baseball-reference.com. http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
  • 3. A Baseball Challenge Question: Has the minimum number of pitches per (full) inning (6 pitches) has ever been attained? Answer: I don’t know; scrape the boxscores at baseball-reference.com. http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
  • 4. The Boxscore This column seems useful!
  • 5. The Boxscore This column seems useful!
  • 6. Dissecting the URL http://www.baseball-reference.com/boxes/COL/COL201104010.shtml Just step through all of the teams: YearMonthDay Game ID COL, BOS, etc.
  • 7. How Do We Proceed? The most systematic way that I could find was to break it down like this: • 30 Teams • 2005 - 2010 • Everyday from Apr 1 through Oct 31 • This is a little more than 78K URLs! • My program took about 3 hrs 25 min.
  • 8. How Do We Proceed? The most systematic way that I could find was to break it down like this: • 30 Teams • 2005 - 2010 • Everyday from Apr 1 through Oct 31 • This is a little more than 78K URLs! • My program took about 3 hrs 25 min.
  • 9. R Code for (team in teams){ for (year in years){ out.string <- paste(Sys.time(), "--", team, year, sep = " ") print(out.string) for (month in months){ for (day in days){ for (i in 0:1){ full.url <- paste(paste(base.url, team, date.url, sep="/"), i, ".shtml", sep="") table.stats <- readHTMLTable(full.url) ## Process the list of data.frames returned by ## the call to readHTMLTable } } } } }
  • 10. R Code for (team in teams){ for (year in years){ out.string <- paste(Sys.time(), "--", team, year, sep = " ") print(out.string) for (month in months){ for (day in days){ for (i in 0:1){ full.url <- paste(paste(base.url, team, date.url, sep="/"), i, ".shtml", sep="") table.stats <- readHTMLTable(full.url) ## Process the list of data.frames returned by ## the call to readHTMLTable } } } } }
  • 11. Tools • base: paste, strsplit, unlist, lapply • XML: readHTMLTable, htmlTreeParse, getNodeSet, xmlValue, xmlSApply • httr, stringr, and other Hadley things • useful, but not necessary: regex, xpath, XML, etc.
  • 12. Tools • base: paste, strsplit, unlist, lapply • XML: readHTMLTable, htmlTreeParse, getNodeSet, xmlValue, xmlSApply • httr, stringr, and other Hadley things • useful, but not necessary: regex, xpath, XML, etc.
  • 13. Conclusions/Discussion • There is a lot of data available on the web! • You can access this data from a browser; however, you can access A LOT more data if you let your computer do the work. • R and its libraries provide a great platform for scraping data and data mining. • Download data and see where you go.
  • 14. Conclusions/Discussion • There is a lot of data available on the web! • You can access this data from a browser; however, you can access A LOT more data if you let your computer do the work. • R and its libraries provide a great platform for scraping data and data mining. • Download data and see where you go.
  • 15. Was That Minimum Attained? • NO! Unless there is an error in my code. • Did we learn something? Of course. • The skills are transferrable to other websites with data.