R and Reproducibility
A Proposal
David Smith
useR! 2014
What is Reproducibility?
“The goal of reproducible research is to tie
specific instructions to data analysis and
experimental data so that scholarship can be
recreated, better understood and verified.”
CRAN Task View on Reproducible Research (Kuhn)
• Method + Environment -> Results
• A process for:
– Sharing the method
– Describing the environment
– Recreating the results
xkcd.com/242/
Why care about reproducibility?
Academic / Research
• Verify results
• Advance Research
Business
• Production code
• Reliability
• Reusability
• Regulation
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
R and Reproducibility
Results
• Hand-assembled
• Sweave/knitr/DeployR/Shiny
Interfaces
• R GUI / DevelopR / RStudio
• Batch / Web Services
Platform
• OS / Virtualization
• Hardware Architecture
Packages
• CRAN
• BioConductor / GitHub / …
R Engine
• R Version
• Base + Recommended pkgs
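Describing the environment means recording each of these layers alongside the results. A minimal sketch in base R, assuming a hypothetical helper name describe_environment() (not part of any package mentioned in this talk):

```r
# Sketch: capture the layers that determine results (R engine, platform, packages).
# describe_environment() is a hypothetical helper name used for illustration.
describe_environment <- function() {
  info <- sessionInfo()
  list(
    r_version     = R.version.string,             # R engine
    platform      = R.version$platform,           # hardware architecture / OS triplet
    os            = unname(Sys.info()["sysname"]),
    base_pkgs     = info$basePkgs,                # base packages attached
    # versions of attached add-on packages (empty in a fresh session)
    attached_pkgs = vapply(info$otherPkgs, function(p) p$Version, character(1))
  )
}

env <- describe_environment()
str(env)
```

Saving this list next to the output of an analysis gives a reader something concrete to compare against when results differ.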
Observations
• R versions are pretty manageable
– Major versions just once a year
– Patches rarely introduce incompatible changes
• Good solutions for literate programming
– Interfaces help
• OS/Hardware not the major cause of problems
• The big problem is with packages
– CRAN is in a state of continual flux
Package Problem #1 : The User
http://xkcd.com/234/
I heard you need to create a TPS Report. Here, I’ve got an R script that does that already.
Oh, you need to download these 5 packages first.
I already did, and it still doesn’t work!
Well, it worked when I wrote it 3 weeks ago.
YOUR
Grr.
Package
updates…
Package Problem #2: The Author
http://xkcd.com/970/
Time to update my package on CRAN!
>> Dependent packages that now fail to build: 67
>> Resubmit your package and try again
Crap.
Package Problem #3 : The Update
http://xkcd.com/664/
3 days later…
Woot! A new version of R is out! I have 10 minutes now, time to download and install!
… package not found …
… can’t install package…
… error …
The Proposal
• Change the default way R handles packages
– Install packages local to projects
• “Snapshot” CRAN daily
– Make it easy to get & use package versions used in script development
Not a new idea!
– Ooms, “Possible Directions for Improving Dependency Versioning in R”, R Journal 5/1
– BioConductor Project
– Revolution R Enterprise
– Linux distros
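Installing packages local to a project can be approximated in base R today by putting a project folder at the front of the library search path. A minimal sketch, assuming a hypothetical helper use_project_library() and a "library" subfolder name chosen only for illustration:

```r
# Sketch: give a project its own package library.
# use_project_library() and the "library" subfolder are assumptions for illustration.
use_project_library <- function(project_dir) {
  lib <- file.path(project_dir, "library")
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  # Put the project library first, so install.packages() installs into it
  # and library() looks there before the user and system libraries.
  .libPaths(c(lib, .libPaths()))
  invisible(normalizePath(lib))
}

project <- file.path(tempdir(), "tps-report")
lib <- use_project_library(project)
.libPaths()[1]  # now the project-local library
```

Because the change only lasts for the session, each project can carry its own package versions without touching the ones other projects depend on.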
Example
• R script file using 6 most popular packages
Sharing a script reproducibly
… and simply
# Run with R 3.1.0
require(RRT)
mran_set(snapshot="2014-06-27")
# find packages used in this project
# get package versions used by script author
# install locally to this project
require(ggplot2)
require(data.table)
require(knitr) …
RRT: The R Reproducibility Toolkit
• Open Source R Package (GPLv2)
• From an R project folder:
– Detect packages & dependencies used in project
– Download and install from MRAN
– Versions selected according to script date
– Find and use packages from local install
github.com/RevolutionAnalytics/RRT
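The detection step can be pictured as a scan of the project's scripts for library() and require() calls. The sketch below is not RRT's implementation; find_project_packages() is a hypothetical helper, and its regular expression misses pkg::fun() references and indirect dependencies:

```r
# Sketch: list packages a project loads via library() or require().
# find_project_packages() is a hypothetical helper, not an RRT function.
find_project_packages <- function(project_dir) {
  files <- list.files(project_dir, pattern = "\\.[Rr]$",
                      recursive = TRUE, full.names = TRUE)
  code <- unlist(lapply(files, readLines, warn = FALSE))
  code <- sub("#.*$", "", code)  # crude comment stripping
  pattern <- "\\b(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
  hits <- unlist(regmatches(code, gregexpr(pattern, code, perl = TRUE)))
  # Keep only the captured package name from each matched call.
  sort(unique(sub(pattern, "\\1", hits, perl = TRUE)))
}

# Usage on a throwaway project containing one script.
demo_dir <- file.path(tempdir(), "rrt-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("# Run with R 3.1.0",
             "require(RRT)",
             "require(ggplot2)",
             "library(data.table)  # require(ignored)",
             "require(knitr)"),
           file.path(demo_dir, "analysis.R"))
pkgs_found <- find_project_packages(demo_dir)
pkgs_found
```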
MRAN - Implementation
A downstream CRAN mirror with daily snapshots
• Use rsync to mirror CRAN daily
– Only downloads changed packages
• Use zfs to store incremental snapshots
– Storage only required for new packages
• Organize snapshots into a labelled hierarchy
– Access package versions by date of use
• CRAN snapshot server hosted by cloud provider
– Provisioned for availability and latency
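Accessing package versions by date of use comes down to choosing the most recent snapshot taken on or before the script date. A minimal sketch of that lookup, assuming a hypothetical select_snapshot() helper and made-up snapshot dates:

```r
# Sketch: choose the snapshot label that matches a script's date.
# select_snapshot() and the dates below are assumptions for illustration.
select_snapshot <- function(script_date, snapshot_dates) {
  script_date <- as.Date(script_date)
  snapshot_dates <- as.Date(snapshot_dates)
  eligible <- snapshot_dates[snapshot_dates <= script_date]
  if (length(eligible) == 0L) {
    stop("no snapshot on or before ", format(script_date))
  }
  format(max(eligible))  # label of the latest eligible snapshot
}

snapshots <- c("2014-06-25", "2014-06-26", "2014-06-27", "2014-06-28")
select_snapshot("2014-06-27", snapshots)
```

With daily snapshots the match is normally exact; the fallback to the latest earlier snapshot only matters if a day is missing.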
Future work
• Just getting started!
• Snapshot binaries and source packages
• Other repos (BioConductor, GitHub, user)
• Institution-level package duplication
– CRAN “behind the firewall”
• User-defined package versions
• Checks on R versions
• Suggestions welcome!
github.com/RevolutionAnalytics/RRT
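A check on R versions could start as simply as comparing the running R against the version noted in the script. This is a sketch of one possible check, not a planned RRT feature; check_r_version() is a hypothetical name, and it compares only the major.minor series on the assumption that patch releases rarely break things:

```r
# Sketch: warn when the running R series differs from the one a script targets.
# check_r_version() is a hypothetical helper used for illustration.
check_r_version <- function(expected, running = getRversion()) {
  expected <- as.package_version(expected)
  running <- as.package_version(running)
  # Compare major.minor only, e.g. 3.1.0 and 3.1.2 are treated as compatible.
  same_series <- identical(unlist(expected)[1:2], unlist(running)[1:2])
  if (!same_series) {
    warning("script was written for R ", format(expected),
            " but is running on R ", format(running))
  }
  invisible(same_series)
}

check_r_version("3.1.0")  # warns unless this session is an R 3.1.x release
```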
Thank You!
David Smith
david@revolutionanalytics.com
blog.revolutionanalytics.com
Possible Solution
• Bundle all packages with scripts
• Packrat solves this very well
– Project + package dependencies stored in GitHub
• But:
– Contributes to package fragmentation
– Adds friction to the sharing process
– Doesn’t address the problem for R generally
CRAN vs GitHub
CRAN
• “Repository of Record”
– Default for R users
• Strict quality checking
• Handles dependencies
• Binaries built
– But only current versions saved
• Manual update process
• Dependent on volunteer support
GitHub
• Frictionless publishing / updates
– RStudio integration
• Social development
– Pull requests FTW
• Ease of updates
• Fragmented – no unified directory of packages
• Permanence – accounts closed / repos deleted
A downstream CRAN solution?
“I don't see why CRAN needs to be involved in
this effort at all. A third party could take
snapshots of CRAN at R release dates, and make
those available to package users in a separate
repository. It is not hard to set a different
repository than CRAN as the default location
from which to obtain packages.”
-- R-core member, r-devel, March 2014
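Setting a different default repository is a one-line change to the repos option. The sketch below assumes a placeholder server, https://snapshots.example.org, and a one-folder-per-date URL layout; both are illustrative and do not describe a real service:

```r
# Sketch: make a dated snapshot repository the default package source.
# The base URL and URL layout are placeholders, not a real server.
snapshot_repo <- function(date, base_url = "https://snapshots.example.org") {
  paste0(base_url, "/", format(as.Date(date), "%Y-%m-%d"))
}

use_snapshot <- function(date, ...) {
  # options() returns the previous settings, so the caller can restore them.
  invisible(options(repos = c(CRAN = snapshot_repo(date, ...))))
}

old <- use_snapshot("2014-06-27")
getOption("repos")  # install.packages() would now use the snapshot URL
options(old)        # restore the previous repositories
```

Since install.packages() reads the repos option by default, nothing else in a script has to change once the option is set.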
Snapshot CRAN repository :
requirements
• Availability
• Latency
• Bandwidth
• Storage
• Binary package archives
• Other enhancements?
Proposal
[Diagram: CRAN and a downstream MRAN, labelled “Development Branch”, “Stable Branch” and “Reproducible”]
Defaults are important!!

Editor's notes

  1. http://xkcd.com/242/
  2. https://stat.ethz.ch/pipermail/r-devel/2014-March/068552.html