Using R for Scraping Baseball Data from Baseball-Reference
1. Using R for Scraping Data
Ryan Elmore
National Renewable Energy Lab
rtelmore@gmail.com
Twitter: rtelmore
June 13, 2012
useR! 2012
2. A Baseball Challenge
Question: Has the minimum number of pitches
per (full) inning (6 pitches) ever been
attained?
Answer: I don’t know; scrape the boxscores at
baseball-reference.com.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
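For context, the six-pitch floor comes from each half-inning's three outs being recorded on a single pitch apiece. A quick sanity check of that arithmetic (variable names are mine, not from the talk):

```r
outs.per.half   <- 3  # three outs retire the side
halves          <- 2  # top and bottom of a full inning
pitches.per.out <- 1  # best case: every batter is retired on the first pitch
min.pitches <- outs.per.half * halves * pitches.per.out
min.pitches
## 6
```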
7. How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Every day from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
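The 78K figure follows from looping over days 1 through 31 of every month, valid date or not (requests for nonexistent dates or games simply fail and can be skipped). A sketch of that count, reconstructing the arithmetic:

```r
teams   <- 30                  # MLB teams
seasons <- length(2005:2010)   # 6 seasons
months  <- 7                   # April through October
days    <- 31                  # days 1-31 tried in every month
games   <- 2                   # trailing 0 or 1 covers doubleheaders
n.urls <- teams * seasons * months * days * games
n.urls
## 78120
```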
9. R Code
library(XML)  # provides readHTMLTable

for (team in teams) {
  for (year in years) {
    out.string <- paste(Sys.time(), "--", team, year, sep = " ")
    print(out.string)
    for (month in months) {
      for (day in days) {
        ## Boxscore file names are team + yyyymmdd + game number,
        ## e.g. COL201104010.shtml; i is 0 or 1 for doubleheaders
        date.url <- paste(team, year, month, day, sep = "")
        for (i in 0:1) {
          full.url <- paste(paste(base.url, team, date.url, sep = "/"),
                            i, ".shtml", sep = "")
          table.stats <- readHTMLTable(full.url)
          ## Process the list of data.frames returned by
          ## the call to readHTMLTable
        }
      }
    }
  }
}
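To see what one iteration of the loop produces, here is the paste logic applied to the Opening Day 2011 Rockies game linked earlier (assuming base.url points at the boxes directory):

```r
base.url <- "http://www.baseball-reference.com/boxes"
team <- "COL"
year <- "2011"; month <- "04"; day <- "01"
i <- 0  # first (only) game of the day

## team + yyyymmdd, then the game number and extension
date.url <- paste(team, year, month, day, sep = "")
full.url <- paste(paste(base.url, team, date.url, sep = "/"),
                  i, ".shtml", sep = "")
full.url
## "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
```

This reproduces the example boxscore URL from the opening slide.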
11. Tools
• base: paste, strsplit, unlist, lapply
• XML: readHTMLTable, htmlTreeParse, getNodeSet, xmlValue, xmlSApply
• httr, stringr, and other Hadley things
• useful, but not necessary: regex, XPath, XML, etc.
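As a small illustration of the base tools at work, splitting a boxscore URL back into its team and date components (a sketch, not code from the talk):

```r
url   <- "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
parts <- unlist(strsplit(url, "/"))     # split the path on "/"
file  <- parts[length(parts)]           # "COL201104010.shtml"
team  <- substr(file, 1, 3)             # "COL"
date  <- substr(file, 4, 11)            # "20110401"
```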
13. Conclusions/Discussion
• There is a lot of data available on the web!
• You can access this data from a browser;
however, you can access A LOT more data
if you let your computer do the work.
• R and its libraries provide a great platform
for scraping data and data mining.
• Download data and see where you go.
15. Was That Minimum Attained?
• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• The skills are transferable to other
websites with data.