Presented to Chicago R User Group, Jan 29 2015
Good data analysis is reproducible. If someone else can’t independently replicate your results from your data, the consequences can be severe. With R, a major challenge for reproducibility is the ever-changing package ecosystem: it's all too easy to develop an R script using packages, only to find collaborators will download later versions of those packages when they attempt to reproduce your results, and outcome can be unpredictable!
In this talk I'll introduce the Reproducible R Toolkit, and the "checkpoint" package, included with Revolution R Open, and describe some best practices for writing reliable, reproducible R code with packages.
2. But first: Breaking News!
• Microsoft will acquire Revolution Analytics
• Revolution R and OSS projects will continue
• Continued commitment to R community
• More at:
blog.revolutionanalytics.com/2015/01/revolution-acquired.html
2
3. Revolution R Open is:
• Downstream open source R distribution
– GPLv2 license
• Compatible with all R-related software
– Packages, Rstudio, …
• Multi-threaded for performance
• Focus on reproducibility
• Binaries available for Windows, Mac OS X,
Ubuntu, Red Hat and OpenSUSE
• Download RRO free from
mran.revolutionanalytics.com
• Paid support available from Revolution
Analytics with Revolution R Plus
3
4. 4
MRAN: The Managed R Archive Network
• Download Revolution R
Open
• Learn about R and RRO
• Explore Packages
– and dependencies
• Explore Task Views
• R tips and applications
• Daily CRAN snapshots
mran.revolutionanalytics.com
5. What is Reproducibility?
“The goal of reproducible research is to tie
specific instructions to data analysis and
experimental data so that scholarship can be
recreated, better understood and verified.”
CRAN Task View on Reproducible Research (Kuhn)
• Method + Environment
-> Results
• A process for:
– Sharing the method
– Describing the environment
– Recreating the results
5 xkcd.com/242/
6. Why care about reproducibility?
Academic / Research
• Verify results
• Advance Research
Business
• Production code
• Reliability
• Reusability
• Regulation
6
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
7. Observations
• R versions are pretty manageable
– Major versions just once a year
– Patches rarely introduce incompatible changes
• Good solutions for literate programming
– Rstudio / knitr / Rmarkdown
• OS/Hardware not the major cause of
problems
• The big problem is with packages
– CRAN is in a state of continual flux
7
9. Reproducible R Toolkit
• Static CRAN mirror in RRO
– CRAN packages fixed with each Revolution R Open update
• Daily CRAN snapshots
– Storing every package version since September 2014
– Hosted at mran.revolutionanalytics.com/snapshot
• Write and share scripts synced to a specific
snapshot date
– checkpoint package installed with RRO
9
projects.revolutionanalytics.com/rrt/
10. 10
Using checkpoint
• Add 2 lines to the top of your script
library(checkpoint)
checkpoint("2015-01-28")
• Err, that’s it.
Or, whichever date you want
11. Checkpoint tips for script authors
• Work within a project
– Dedicated folder with scripts, data and output
– eg /Users/david/R/weather
• Create a master .R script file beginning with
library(checkpoint)
checkpoint("DATE")
– package versions used will be as of this date
• Don’t use install.packages directly
– Use library() and checkpoint does the rest
– You can have different package versions installed for different
projects at the same time!
• Works only with CRAN packages
– working on options for github, etc.
11
13. Package dependency explosion
• R script file using 6 most popular packages
13
Any updated package = potential reproducibility error!
http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html
14. Sharing projects with checkpoint
• Just share your script or project folder!
• Recipient only needs:
– compatible R version
– checkpoint package (installed with RRO)
– Internet connection to MRAN (at least first time)
• Checkpoint takes care of:
– Installing CRAN packages
• Binaries (ease of installation)
• Correct versions (reproducibility)
• Dependencies (ease of installation)
– Eliminating conflicts with other installed packages
14
15. The checkpoint magic
The checkpoint() call does all this:
• Scans project for required packages
• Installs required packages and dependencies
– Packages installed specific to project
– Versions specific to checkpoint date
• Installed in ~/.checkpoint/DATE
• Skips packages if already installed (2nd run through)
• Reconfigures package search path
– Points only to project-specific library
15
16. MRAN checkpoint server
Checkpoint uses MRAN’s downstream CRAN mirror with daily
snapshots.
16
CRAN
RRDaily
snapshots
checkpoint
package
library(checkpoint)
checkpoint("2015-01-28")
CRAN mirror
cran.revolutionanalytics.com
(Windows binaries virus-scanned)
checkpoint
server
Midnight
UTC
mran.revolutionanalytics.com/snapshot/
17. checkpoint server - implementation
Checkpoint uses MRAN’s downstream CRAN mirror with daily
snapshots.
• rsync to mirror CRAN daily
– Only downloads changed packages
• zfs to store incremental snapshots
– Storage only required for new packages
• Organizes snapshots into a labelled hierarchy
– mran.revolutionanalytics.com/snapshot/YYYY-MM-DD
• MRAN hosted by high-performance cloud provider
– Provisioned for availability and latency
17
https://github.com/RevolutionAnalytics/checkpoint-server
18. CRAN mirrors and RRO
• Revolution R Open ships with a fixed default
CRAN mirror
– Currently, 1 Dec 2014 snapshot (v 8.0.1)
• All users of same version get same CRAN
package versions by default
– regardless when “install.packages” is run
• Use checkpoint to access newer package
versions
18
19. Future work
• Checks on R versions
checkpoint("2015-02-17", R="3.1.2")
• Support for other repos
– BioConductor, GitHub, user
• Institution-level package duplication
– CRAN “behind the firewall” – miniCRAN package
• Suggestions welcome!
github.com/RevolutionAnalytics/checkpoint
19
20. Package Problem: The Author
http://xkcd.com/970/20
Time to update
my package on
CRAN!
>> Dependent
packages that
now fail to build:
67
>> Resubmit
your package
and try again
Crap.
21. Package Problem : The Update
http://xkcd.com/664/21
3 days later…
Woot! A new version of R
is out! I have 10 minutes
now, time to download
and install!
… package not found …
… can’t install package…
… error …
22. What about packrat?
• Packrat is flexible and powerful
– Supports non-CRAN packages (e.g. github)
– Allows mix-and-matching package versions
– Requires shipping all package source
– Requires recipients to build packages from source
• Checkpoint is simple
– Reproducibility from one script
– Simple for recipients to reproduce results
– Only allows use of CRAN packages versions that have been
tested together
– Relies on availability of MRAN
22
rstudio.github.io/packrat/
23. Get started with checkpoint
• Install RRO: mran.revolutionanalytics.com
– checkpoint comes pre-installed
• Or, install “checkpoint” from CRAN
– Latest version available on GitHub
library(devtools)
install_github("RevolutionAnalytics/checkpoint")
• Contributions welcome!
projects.revolutionanalytics.com/rrt/
23
24. Check out our other OSS projects
• DeployR Open
– simple, secure R integration for application developers
• Rhadoop
– connect R to Hadoop and run R on Hadoop nodes
• ParallelR:
– R packages for parallel and distributed programming
projects.revolutionanalytics.com
24