SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Simple Reproducibility
with the checkpoint package
David Smith
useR! 2015, July 1 2015
R Community Lead
Revolution Analytics, a Microsoft company
@revodavid
2
An R Reproducibility Problem
Adapted from http://xkcd.com/234/ CC BY-NC 2.5
Package dependency explosion
 R script file using 6 most popular packages
3
Any updated package = potential reproducibility error!
http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html
4
Using the checkpoint package
 Install checkpoint package from CRAN
– Or use Revolution R Open
 Add 2 lines to the top of your script
library(checkpoint)
checkpoint("2015-01-28")
 Err, that’s it.
 Optionally, check the R version as well
library(checkpoint)
checkpoint("2015-01-28", R.version="3.1.3")
Use package versions as of this date
Demo
Weather Map
Checkpoint tips for script authors
 Work within a project
– Dedicated folder with scripts, data and output
– eg /Users/david/R/weather
 Create a master .R script file beginning with
library(checkpoint)
checkpoint("DATE")
– package versions used will be as of this date
 Don’t use install.packages directly
– Use library() and checkpoint does the rest
– You can have different package versions installed for different projects at the
same time!
6
Sharing projects with checkpoint
 Just share your script or project folder!
 Recipient only needs:
– compatible R version
– checkpoint package (installed with RRO)
– Internet connection to MRAN (at least first time)
 Checkpoint takes care of:
– Installing CRAN packages
• Binaries (ease of installation)
• Correct versions (reproducibility)
• Dependencies (ease of installation)
– Eliminating conflicts with other installed packages
7
The checkpoint magic
The checkpoint() call does all this:
 Scans project for required packages
 Installs required packages and dependencies
– Packages installed specific to project
– Versions specific to checkpoint date
• Installed in ~/.checkpoint/DATE
• Skips packages if already installed (2nd run through)
 Reconfigures package search path
– Points only to project-specific library
8
MRAN checkpoint server
Checkpoint uses MRAN’s downstream CRAN mirror
with daily snapshots.
9
CRAN
RRDaily
snapshots
checkpoint
package
library(checkpoint)
checkpoint("2015-01-28")
CRAN mirror
cran.revolutionanalytics.com
(Windows binaries virus-scanned)
checkpoint
server
Midnight
UTC
mran.revolutionanalytics.com/snapshot/
checkpoint server - implementation
Checkpoint uses MRAN’s downstream CRAN mirror with daily
snapshots.
 rsync to mirror CRAN daily
– Only downloads changed packages
 zfs to store incremental snapshots
– Storage only required for new packages
 Organizes snapshots into a labelled hierarchy
– mran.revolutionanalytics.com/snapshot/YYYY-MM-DD
 MRAN hosted by high-performance cloud provider
– Provisioned for availability and latency
10
https://github.com/RevolutionAnalytics/checkpoint-server
11
Using non-CRAN packages Reproducibly
 Today, checkpoint only manages packages from CRAN
 GitHub: use install_github with a specific checkin hash
install_github("ramnathv/rblocks",
ref="a85e748390c17c752cc0ba961120d1e784fb1956")
 BioConductor: use packages from a specific BioConductor release
– Not as easy as it seems!
 Private packages / behind the firewall
– use miniCRAN to create a local, static repository
Comparison with packrat
 Packrat is flexible and powerful
– Supports non-CRAN packages (e.g. github)
– Allows mix-and-matching package versions
– Requires shipping all package source
– Requires recipients to build packages from source
 Checkpoint is simple
– Reproducibility from one script
– Simple for recipients to reproduce results
– Only allows use of CRAN packages versions that have been tested
together
– Requires Web access (and availability of MRAN)
12
rstudio.github.io/packrat/
13
Revolution R Open includes checkpoint
 Enhanced Open Source R distribution
 100% compatible with all R-related software
 Multi-threaded for performance
 Built-in reproducibility
 Open source (GPLv2 license)
 Available for Windows, Mac OS X, Ubuntu,
Red Hat and OpenSUSE
 Free download at
mran.revolutionanalytics.com
14
MRAN
The Managed R Archive Network
 Download Revolution R Open
 Learn about R and RRO
 Explore R Packages
 Explore Task Views
 R tips and applications
 Daily CRAN snapshots
mran.revolutionanalytics.com
15
Why use checkpoint?
 Write and share code R whose results can be reproduced, even if new
(and possibly incompatible) package versions are released later.
 Share R scripts with others that will automatically install the appropriate
package versions (no need to manually install CRAN packages).
 Write R scripts that use older versions of packages, or packages that are
no longer available on CRAN.
 Install packages (or package versions) visible only to a specific project,
without affecting other R projects or R users on the same system.
 Manage multiple projects that use different package versions.
Thank you
Contribute:
github.com/RevolutionAnalytics/checkpoint
Download Revolution R Open
mran.revolutionanalytics.com/download
www.revolutionanalytics.com
Twitter: @RevolutionR
David Smith
R Community Lead
Revolution Analytics
@revodavid
davidsmi@microsoft.com
17
Multi-threaded performance
 Intel MKL replaces standard
BLAS/LAPACK algorithms
(Windows/Linux)
 Pipelined operations
– Optimized for Intel, works for all archs
 High-performance algorithms
 Sequential  Parallel
– Uses as many threads as there are
available cores
– Control with:
setMKLthreads(<value>)
 No need to change any R code
 Included in RRO binary distribution
More at Revolutions blog

Weitere ähnliche Inhalte

Was ist angesagt?

12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
Revolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
Revolution Analytics
 

Was ist angesagt? (20)

R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Revolution R - 100% R and More
Revolution R - 100% R and MoreRevolution R - 100% R and More
Revolution R - 100% R and More
 
Big data analytics on teradata with revolution r enterprise bill jacobs
Big data analytics on teradata with revolution r enterprise   bill jacobsBig data analytics on teradata with revolution r enterprise   bill jacobs
Big data analytics on teradata with revolution r enterprise bill jacobs
 
R reproducibility
R reproducibilityR reproducibility
R reproducibility
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with R
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Rdf saturator
Rdf saturatorRdf saturator
Rdf saturator
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 

Andere mochten auch

Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
Revolution Analytics
 

Andere mochten auch (12)

The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
Distributed Computing Patterns in R
Distributed Computing Patterns in RDistributed Computing Patterns in R
Distributed Computing Patterns in R
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solution
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence Applications
 
Deploying R in BI and Real time Applications
Deploying R in BI and Real time ApplicationsDeploying R in BI and Real time Applications
Deploying R in BI and Real time Applications
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R server and spark
R server and sparkR server and spark
R server and spark
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 

Ähnlich wie Simple Reproducibility with the checkpoint package

Apache maven, a software project management tool
Apache maven, a software project management toolApache maven, a software project management tool
Apache maven, a software project management tool
Renato Primavera
 

Ähnlich wie Simple Reproducibility with the checkpoint package (20)

Reproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROReproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RRO
 
[JOI] TOTVS Developers Joinville - Java #1
[JOI] TOTVS Developers Joinville - Java #1[JOI] TOTVS Developers Joinville - Java #1
[JOI] TOTVS Developers Joinville - Java #1
 
Exploring Maven SVN GIT
Exploring Maven SVN GITExploring Maven SVN GIT
Exploring Maven SVN GIT
 
Apache maven, a software project management tool
Apache maven, a software project management toolApache maven, a software project management tool
Apache maven, a software project management tool
 
Native Hadoop with prebuilt spark
Native Hadoop with prebuilt sparkNative Hadoop with prebuilt spark
Native Hadoop with prebuilt spark
 
Agile Software Development & Tools
Agile Software Development & ToolsAgile Software Development & Tools
Agile Software Development & Tools
 
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
 
OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)
 
Maven 2 features
Maven 2 featuresMaven 2 features
Maven 2 features
 
Maven
MavenMaven
Maven
 
Approaching package manager
Approaching package managerApproaching package manager
Approaching package manager
 
OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)
 
Ephemeral DevOps: Adventures in Managing Short-Lived Systems
Ephemeral DevOps: Adventures in Managing Short-Lived SystemsEphemeral DevOps: Adventures in Managing Short-Lived Systems
Ephemeral DevOps: Adventures in Managing Short-Lived Systems
 
PHP Dependency Management with Composer
PHP Dependency Management with ComposerPHP Dependency Management with Composer
PHP Dependency Management with Composer
 
Open Source tools overview
Open Source tools overviewOpen Source tools overview
Open Source tools overview
 
RDO and Ceph meetup BCN - Testing in RDO
RDO and Ceph meetup BCN - Testing in RDORDO and Ceph meetup BCN - Testing in RDO
RDO and Ceph meetup BCN - Testing in RDO
 
Mavenized RCP
Mavenized RCPMavenized RCP
Mavenized RCP
 
Introduction maven3 and gwt2.5 rc2 - Lesson 01
Introduction maven3 and gwt2.5 rc2 - Lesson 01Introduction maven3 and gwt2.5 rc2 - Lesson 01
Introduction maven3 and gwt2.5 rc2 - Lesson 01
 
Puzzle ITC Talk @Docker CH meetup CI CD_with_Openshift_0.2
Puzzle ITC Talk @Docker CH meetup CI CD_with_Openshift_0.2Puzzle ITC Talk @Docker CH meetup CI CD_with_Openshift_0.2
Puzzle ITC Talk @Docker CH meetup CI CD_with_Openshift_0.2
 
OWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionOWASP Dependency-Track Introduction
OWASP Dependency-Track Introduction
 

Mehr von Revolution Analytics

Mehr von Revolution Analytics (9)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 

Simple Reproducibility with the checkpoint package

  • 1. Simple Reproducibility with the checkpoint package David Smith useR! 2015, July 1 2015 R Community Lead Revolution Analytics, a Microsoft company @revodavid
  • 2. 2 An R Reproducibility Problem Adapted from http://xkcd.com/234/ CC BY-NC 2.5
  • 3. Package dependency explosion  R script file using 6 most popular packages 3 Any updated package = potential reproducibility error! http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html
  • 4. 4 Using the checkpoint package  Install checkpoint package from CRAN – Or use Revolution R Open  Add 2 lines to the top of your script library(checkpoint) checkpoint("2015-01-28")  Err, that’s it.  Optionally, check the R version as well library(checkpoint) checkpoint("2015-01-28", R.version="3.1.3") Use package versions as of this date
  • 6. Checkpoint tips for script authors  Work within a project – Dedicated folder with scripts, data and output – eg /Users/david/R/weather  Create a master .R script file beginning with library(checkpoint) checkpoint("DATE") – package versions used will be as of this date  Don’t use install.packages directly – Use library() and checkpoint does the rest – You can have different package versions installed for different projects at the same time! 6
  • 7. Sharing projects with checkpoint  Just share your script or project folder!  Recipient only needs: – compatible R version – checkpoint package (installed with RRO) – Internet connection to MRAN (at least first time)  Checkpoint takes care of: – Installing CRAN packages • Binaries (ease of installation) • Correct versions (reproducibility) • Dependencies (ease of installation) – Eliminating conflicts with other installed packages 7
  • 8. The checkpoint magic The checkpoint() call does all this:  Scans project for required packages  Installs required packages and dependencies – Packages installed specific to project – Versions specific to checkpoint date • Installed in ~/.checkpoint/DATE • Skips packages if already installed (2nd run through)  Reconfigures package search path – Points only to project-specific library 8
  • 9. MRAN checkpoint server Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots. 9 CRAN RRDaily snapshots checkpoint package library(checkpoint) checkpoint("2015-01-28") CRAN mirror cran.revolutionanalytics.com (Windows binaries virus-scanned) checkpoint server Midnight UTC mran.revolutionanalytics.com/snapshot/
  • 10. checkpoint server - implementation Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.  rsync to mirror CRAN daily – Only downloads changed packages  zfs to store incremental snapshots – Storage only required for new packages  Organizes snapshots into a labelled hierarchy – mran.revolutionanalytics.com/snapshot/YYYY-MM-DD  MRAN hosted by high-performance cloud provider – Provisioned for availability and latency 10 https://github.com/RevolutionAnalytics/checkpoint-server
  • 11. 11 Using non-CRAN packages Reproducibly  Today, checkpoint only manages packages from CRAN  GitHub: use install_github with a specific checkin hash install_github("ramnathv/rblocks", ref="a85e748390c17c752cc0ba961120d1e784fb1956")  BioConductor: use packages from a specific BioConductor release – Not as easy as it seems!  Private packages / behind the firewall – use miniCRAN to create a local, static repository
  • 12. Comparison with packrat  Packrat is flexible and powerful – Supports non-CRAN packages (e.g. github) – Allows mix-and-matching package versions – Requires shipping all package source – Requires recipients to build packages from source  Checkpoint is simple – Reproducibility from one script – Simple for recipients to reproduce results – Only allows use of CRAN packages versions that have been tested together – Requires Web access (and availability of MRAN) 12 rstudio.github.io/packrat/
  • 13. 13 Revolution R Open includes checkpoint  Enhanced Open Source R distribution  100% compatible with all R-related software  Multi-threaded for performance  Built-in reproducibility  Open source (GPLv2 license)  Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE  Free download at mran.revolutionanalytics.com
  • 14. 14 MRAN The Managed R Archive Network  Download Revolution R Open  Learn about R and RRO  Explore R Packages  Explore Task Views  R tips and applications  Daily CRAN snapshots mran.revolutionanalytics.com
  • 15. 15 Why use checkpoint?  Write and share code R whose results can be reproduced, even if new (and possibly incompatible) package versions are released later.  Share R scripts with others that will automatically install the appropriate package versions (no need to manually install CRAN packages).  Write R scripts that use older versions of packages, or packages that are no longer available on CRAN.  Install packages (or package versions) visible only to a specific project, without affecting other R projects or R users on the same system.  Manage multiple projects that use different package versions.
  • 16. Thank you Contribute: github.com/RevolutionAnalytics/checkpoint Download Revolution R Open mran.revolutionanalytics.com/download www.revolutionanalytics.com Twitter: @RevolutionR David Smith R Community Lead Revolution Analytics @revodavid davidsmi@microsoft.com
  • 17. 17 Multi-threaded performance  Intel MKL replaces standard BLAS/LAPACK algorithms (Windows/Linux)  Pipelined operations – Optimized for Intel, works for all archs  High-performance algorithms  Sequential  Parallel – Uses as many threads as there are available cores – Control with: setMKLthreads(<value>)  No need to change any R code  Included in RRO binary distribution More at Revolutions blog