Using R for Scraping Baseball Data from Baseball-Reference
1. Using R for Scraping Data
Ryan Elmore
National Renewable Energy Lab
rtelmore@gmail.com
Twitter: rtelmore
June 13, 2012
useR! 2012
2. A Baseball Challenge
Question: Has the minimum number of pitches
per (full) inning (6 pitches) ever been
attained?
Answer: I don’t know; scrape the boxscores at
baseball-reference.com.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
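For context, the six-pitch floor comes from each half-inning's three outs being recorded on a single pitch apiece. A quick sanity check of that arithmetic (variable names are mine, not from the talk):

```r
outs.per.half   <- 3  # three outs retire the side
halves          <- 2  # top and bottom of a full inning
pitches.per.out <- 1  # best case: every batter is retired on the first pitch
min.pitches <- outs.per.half * halves * pitches.per.out
min.pitches
## 6
```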
7. How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Every day from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
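The 78K figure follows from looping over days 1 through 31 of every month, valid date or not (requests for nonexistent dates or games simply fail and can be skipped). A sketch of that count, reconstructing the arithmetic:

```r
teams   <- 30                  # MLB teams
seasons <- length(2005:2010)   # 6 seasons
months  <- 7                   # April through October
days    <- 31                  # days 1-31 tried in every month
games   <- 2                   # trailing 0 or 1 covers doubleheaders
n.urls <- teams * seasons * months * days * games
n.urls
## 78120
```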
9. R Code
library(XML)  # provides readHTMLTable

for (team in teams) {
  for (year in years) {
    out.string <- paste(Sys.time(), "--", team, year, sep = " ")
    print(out.string)
    for (month in months) {
      for (day in days) {
        ## Boxscore file names are team + yyyymmdd + game number,
        ## e.g. COL201104010.shtml; i is 0 or 1 for doubleheaders
        date.url <- paste(team, year, month, day, sep = "")
        for (i in 0:1) {
          full.url <- paste(paste(base.url, team, date.url, sep = "/"),
                            i, ".shtml", sep = "")
          table.stats <- readHTMLTable(full.url)
          ## Process the list of data.frames returned by
          ## the call to readHTMLTable
        }
      }
    }
  }
}
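To see what one iteration of the loop produces, here is the paste logic applied to the Opening Day 2011 Rockies game linked earlier (assuming base.url points at the boxes directory):

```r
base.url <- "http://www.baseball-reference.com/boxes"
team <- "COL"
year <- "2011"; month <- "04"; day <- "01"
i <- 0  # first (only) game of the day

## team + yyyymmdd, then the game number and extension
date.url <- paste(team, year, month, day, sep = "")
full.url <- paste(paste(base.url, team, date.url, sep = "/"),
                  i, ".shtml", sep = "")
full.url
## "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
```

This reproduces the example boxscore URL from the opening slide.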
11. Tools
• base: paste, strsplit, unlist, lapply
• XML: readHTMLTable, htmlTreeParse, getNodeSet, xmlValue, xmlSApply
• httr, stringr, and other Hadley things
• useful, but not necessary: regex, XPath, XML, etc.
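As a small illustration of the base tools at work, splitting a boxscore URL back into its team and date components (a sketch, not code from the talk):

```r
url   <- "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
parts <- unlist(strsplit(url, "/"))     # split the path on "/"
file  <- parts[length(parts)]           # "COL201104010.shtml"
team  <- substr(file, 1, 3)             # "COL"
date  <- substr(file, 4, 11)            # "20110401"
```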
13. Conclusions/Discussion
• There is a lot of data available on the web!
• You can access this data from a browser;
however, you can access A LOT more data
if you let your computer do the work.
• R and its libraries provide a great platform
for scraping data and data mining.
• Download data and see where you go.
15. Was That Minimum Attained?
• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• The skills are transferable to other
websites with data.