Reproducible Data Science with R

•

10 gefällt mir•16,097 views

Presented by David Smith at The Data Science Summit, Chicago, April 20 2017. The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.

Software

Reproducible Data Science with R
David Smith
R Community Lead, Microsoft
@revodavid

What is reproducibility?
“Two honest researchers would
get the same result”
– John Mount
• Transparent data sourcing and availability
• Fully automated analysis pipeline (code provided)
• Traceability from published results back to data
2

Why Reproducibility?
• Save time
• Better science
• More authoritative research
• Reduce risk of errors
• Facilitate collaboration
3

Why Reproducibility?
• Save time
• Better science
• More authoritative research
• Reduce risk of errors
• Facilitate collaboration
4

Why Reproducibility?
• Save time
• Better science
• More authoritative research
• Reduce risk of errors
• Facilitate collaboration
5

Why Reproducibility?
• Save time
• Better science
• More authoritative research
• Reduce risk of errors
• Facilitate collaboration
6

Why Reproducibility?
• Save time
• Better science
• More authoritative research
• Reduce risk of errors
• Facilitate collaboration
7

Accessing Data
• Data sources
– databases, sensor logs, spreadsheets, file download, APIs, …
– Remember, these are sources: don’t modify them!
• Snapshot data into local static files (text is good)
– You will likely include some ETL steps here
– Record a timestamp of when data was extracted: Sys.time()
– With big data, sample during development, scale up to finalize
– Document how this was all done, preferably with code (source file)!
• Import text files using R functions
– Recommended package: readr (avoids issues with locale, dates, etc)
8

Analysis process
• Interactively explore data & develop analyses as usual
• Capture the entire process in an R script
– library(tidyverse) is helpful for cleaning, feature generation, etc.
– Re-usable components can be shared as (private) packages
• Generate artifacts using scripts
– Graphics (please, no JPGs!)
– Tables
– Documents
• Include timestamps in output: Sys.time()
9

A reproducible analysis environment
• Operating system and R version
– For most purposes, not the biggest cause of issues
• But do document your R session info: sessionInfo()
– For production, consider VM / container
• A clean R environment
– Organize work into independent projects (directories)
– Use relative paths in scripts
– Avoid use of .Rprofile
– Set explicit random seeds
– Do not save R workspace
10

Managing a changing R package ecosystem
• One command to lock package versions to a specific date:
checkpoint("2017-04-11")
(For collaborators, same command downloads required package versions)
• Applies just to this project
– Global package upgrades won’t break reproducibility in other projects
• checkpoint package avail on CRAN, included with Microsoft R
11

Presenting results
• Eliminate manual processes
(as far as possible)
– Annotations (graphs / tables)
– Cut-and-paste into documents
• Notebooks
– Combines code, output and
narrative
– Good for collaboration with
other researchers
• Document Generation
– Best for automating reports
12

knitr / Rmarkdown
• Generate HTML, Word,
or PDF reports
– Or books, blogs, slides, …
• Combine narrative and
R code in a single
document
• Human-readable, easy
to edit
• Single-click update!
13

Collaboration and sharing
• Just share R project folder
• Publish on Github
– Happy Git and GitHub for the
useR, Jenny Bryan
http://happygitwithr.com/
– Version retention and tracking
– Collaboration (code and
comments)
14

Take-Aways
Reproducibility is Beneficial
• Saves time
• Produces better science
• More trusted research
• Reduced risk of errors
• Encourages collaboration
Reproducibility is Simple
• Document and automate processes
with R scripts
• Read and clean data with tidyverse
• Use checkpoint to manage package
versions
• Generate documents with knitr
• Share reproducible projects with
Github
15

Weitere ähnliche Inhalte

Was ist angesagt?

Insights Without Tradeoffs: Using Structured StreamingDatabricks

A Step Towards Reproducibility in RRevolution Analytics

Introduction to Microsoft R ServicesGregg Barrett

Bay Area Apache Flink Meetup Community Update August 2015Henry Saputra

The R EcosystemRevolution Analytics

Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics

Spark Summit EU talk by Tug GrallSpark Summit

Data Science at Scale by Sarah GuidoSpark Summit

Dogfooding data at Lyftmarkgrover

From R Script to Production Using rsparkling with Navdeep GillDatabricks

Managing the Complete Machine Learning Lifecycle with MLflowDatabricks

Apache Flink and what it is used forAljoscha Krettek

Kamal Hakimzadeh – Reproducible Distributed ExperimentsFlink Forward

Cascalog at May Bay Area Hadoop User Groupnathanmarz

Demystifying Data Engineeringnathanmarz

Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinDatabricks

Airflow at lyft for Airflow summit 2020 conferenceTao Feng

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit

Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics

Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks

Was ist angesagt? (20)

Insights Without Tradeoffs: Using Structured Streaming

A Step Towards Reproducibility in R

Introduction to Microsoft R Services

Bay Area Apache Flink Meetup Community Update August 2015

The R Ecosystem

Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed

Spark Summit EU talk by Tug Grall

Data Science at Scale by Sarah Guido

Dogfooding data at Lyft

From R Script to Production Using rsparkling with Navdeep Gill

Managing the Complete Machine Learning Lifecycle with MLflow

Apache Flink and what it is used for

Kamal Hakimzadeh – Reproducible Distributed Experiments

Cascalog at May Bay Area Hadoop User Group

Demystifying Data Engineering

Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin

Airflow at lyft for Airflow summit 2020 conference

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

Revolution R Enterprise - Portland R User Group, November 2013

Rental Cars and Industrialized Learning to Rank with Sean Downes

Ähnlich wie Reproducible Data Science with R

ODSC and iRODSRaminder Singh

Research Data (and Software) Management at Imperial: (Everything you need to ...Sarah Anna Stewart

Collaborative Data Analysis with Taverna WorkflowsAndrea Wiggins

Bren - UCSB - Spooky spreadsheetsCarly Strasser

In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics

A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.

From lots of reports (with some data Analysis)  to Massive Data Analysis (Wit...Mark Rittman

Data Archiving and SharingC. Tobin Magle

Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesMongoDB

Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Anita de Waard

Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.

Empowering Transformational ScienceChelle Gentemann

Hadoop meets Agile! - An Agile Big Data ModelUwe Printz

Reproducible research concepts and toolsC. Tobin Magle

Data Visibility and Protection at the Scale of Life SciencesAdam Marko

Apache Spark sqlaftab alam

microsoft r server for distributed computingBAINIDA

OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfAltinity Ltd

Software Analytics: Data Analytics for Software EngineeringTao Xie

Database part1-Taymoor Nazmy

Ähnlich wie Reproducible Data Science with R (20)

ODSC and iRODS

Research Data (and Software) Management at Imperial: (Everything you need to ...

Collaborative Data Analysis with Taverna Workflows

Bren - UCSB - Spooky spreadsheets

In-Database Analytics Deep Dive with Teradata and Revolution

A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...

From lots of reports (with some data Analysis)  to Massive Data Analysis (Wit...

Data Archiving and Sharing

Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes

Research Object Composer: A Tool for Publishing Complex Data Objects in the C...

Cloudera Breakfast Series, Analytics Part 1: Use All Your Data

Empowering Transformational Science

Hadoop meets Agile! - An Agile Big Data Model

Reproducible research concepts and tools

Data Visibility and Protection at the Scale of Life Sciences

Apache Spark sql

microsoft r server for distributed computing

OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf

Software Analytics: Data Analytics for Software Engineering

Database part1-

Mehr von Revolution Analytics

Speeding up R with Parallel Programming in the CloudRevolution Analytics

Migrating Existing Open Source Machine Learning to AzureRevolution Analytics

R in Minecraft Revolution Analytics

The case for R for AI developersRevolution Analytics

Speed up R with parallel programming in the CloudRevolution Analytics

The R EcosystemRevolution Analytics

The Value of Open Source CommunitiesRevolution Analytics

R at Microsoft (useR! 2016)Revolution Analytics

Building a scalable data science platform with RRevolution Analytics

The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics

Taking R Analytics to SQL and the CloudRevolution Analytics

The Network structure of R packages on CRAN & BioConductorRevolution Analytics

The network structure of cran 2015 07-02 finalRevolution Analytics

Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics

Warranty Predictive Analytics solutionRevolution Analytics

Reproducibility with Checkpoint & RRO - NYC R ConferenceRevolution Analytics

Reproducibility with Revolution R Open and the Checkpoint PackageRevolution Analytics

Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics

Reproducibility with Revolution R OpenRevolution Analytics

Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics

Mehr von Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud

Migrating Existing Open Source Machine Learning to Azure

R in Minecraft

The case for R for AI developers

Speed up R with parallel programming in the Cloud

The R Ecosystem

The Value of Open Source Communities

R at Microsoft (useR! 2016)

Building a scalable data science platform with R

The Business Economics and Opportunity of Open Source Data Science

Taking R Analytics to SQL and the Cloud

The Network structure of R packages on CRAN & BioConductor

The network structure of cran 2015 07-02 final

Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15

Warranty Predictive Analytics solution

Reproducibility with Checkpoint & RRO - NYC R Conference

Reproducibility with Revolution R Open and the Checkpoint Package

Performance and Scale Options for R with Hadoop: A comparison of potential ar...

Reproducibility with Revolution R Open

Batter Up! Advanced Sports Analytics with R and Storm

Kürzlich hochgeladen

UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz

2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin

Ronisha Informatics Private Limited Catalogueitservices996

eSoftTools IMAP Backup Software and migration toolsosttopstonverter

Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions

What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics

Copilot para Microsoft 365 y Power Platform CopilotEdgard Alejos

Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app

Introduction to Firebase Workshop Slidesvaideheekore1

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp

Strategies for using alternative queries to mitigate zero resultsJean Silva

Data modeling 101 - Basics - Software DomainAbdul Ahad

GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j

Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea

Understanding Flamingo - DeepMind's VLM Architecturerahul_net

Zer0con 2024 final share short version.pdfmaor17

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver

Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171

Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions

Kürzlich hochgeladen (20)

UI5ers live - Custom Controls wrapping 3rd-party libs.pptx

2024 DevNexus Patterns for Resiliency: Shuffle shards

Ronisha Informatics Private Limited Catalogue

eSoftTools IMAP Backup Software and migration tools

Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...

What’s New in VictoriaMetrics: Q1 2024 Updates

Copilot para Microsoft 365 y Power Platform Copilot

Effectively Troubleshoot 9 Types of OutOfMemoryError

Introduction to Firebase Workshop Slides

Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf

Strategies for using alternative queries to mitigate zero results

Data modeling 101 - Basics - Software Domain

GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j

Keeping your build tool updated in a multi repository world

Understanding Flamingo - DeepMind's VLM Architecture

Zer0con 2024 final share short version.pdf

JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...

Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf

Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...

Reproducible Data Science with R

1. Reproducible Data Science with R David Smith R Community Lead, Microsoft @revodavid

2. What is reproducibility? “Two honest researchers would get the same result” – John Mount • Transparent data sourcing and availability • Fully automated analysis pipeline (code provided) • Traceability from published results back to data 2

3. Why Reproducibility? • Save time • Better science • More authoritative research • Reduce risk of errors • Facilitate collaboration 3

4. Why Reproducibility? • Save time • Better science • More authoritative research • Reduce risk of errors • Facilitate collaboration 4

5. Why Reproducibility? • Save time • Better science • More authoritative research • Reduce risk of errors • Facilitate collaboration 5

6. Why Reproducibility? • Save time • Better science • More authoritative research • Reduce risk of errors • Facilitate collaboration 6

7. Why Reproducibility? • Save time • Better science • More authoritative research • Reduce risk of errors • Facilitate collaboration 7

8. Accessing Data • Data sources – databases, sensor logs, spreadsheets, file download, APIs, … – Remember, these are sources: don’t modify them! • Snapshot data into local static files (text is good) – You will likely include some ETL steps here – Record a timestamp of when data was extracted: Sys.time() – With big data, sample during development, scale up to finalize – Document how this was all done, preferably with code (source file)! • Import text files using R functions – Recommended package: readr (avoids issues with locale, dates, etc) 8

9. Analysis process • Interactively explore data & develop analyses as usual • Capture the entire process in an R script – library(tidyverse) is helpful for cleaning, feature generation, etc. – Re-usable components can be shared as (private) packages • Generate artifacts using scripts – Graphics (please, no JPGs!) – Tables – Documents • Include timestamps in output: Sys.time() 9

10. A reproducible analysis environment • Operating system and R version – For most purposes, not the biggest cause of issues • But do document your R session info: sessionInfo() – For production, consider VM / container • A clean R environment – Organize work into independent projects (directories) – Use relative paths in scripts – Avoid use of .Rprofile – Set explicit random seeds – Do not save R workspace 10

11. Managing a changing R package ecosystem • One command to lock package versions to a specific date: checkpoint("2017-04-11") (For collaborators, same command downloads required package versions) • Applies just to this project – Global package upgrades won’t break reproducibility in other projects • checkpoint package avail on CRAN, included with Microsoft R 11

12. Presenting results • Eliminate manual processes (as far as possible) – Annotations (graphs / tables) – Cut-and-paste into documents • Notebooks – Combines code, output and narrative – Good for collaboration with other researchers • Document Generation – Best for automating reports 12

13. knitr / Rmarkdown • Generate HTML, Word, or PDF reports – Or books, blogs, slides, … • Combine narrative and R code in a single document • Human-readable, easy to edit • Single-click update! 13

14. Collaboration and sharing • Just share R project folder • Publish on Github – Happy Git and GitHub for the useR, Jenny Bryan http://happygitwithr.com/ – Version retention and tracking – Collaboration (code and comments) 14

15. Take-Aways Reproducibility is Beneficial • Saves time • Produces better science • More trusted research • Reduced risk of errors • Encourages collaboration Reproducibility is Simple • Document and automate processes with R scripts • Read and clean data with tidyverse • Use checkpoint to manage package versions • Generate documents with knitr • Share reproducible projects with Github 15

16. Reproducible Data Science with R David Smith R Community Lead, Microsoft @revodavid

Reproducible Data Science with R

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Reproducible Data Science with R

Ähnlich wie Reproducible Data Science with R (20)

Mehr von Revolution Analytics

Mehr von Revolution Analytics (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Reproducible Data Science with R