Fostering a Reproducible
Research Culture
Jeremy Leipzig
3.19.15
RR in the Field
Why’s of Reproducibility
Sanity
Reuse
Redundancy
Evaluation
Reproducible interests
Sweave & knitr
Ties that bind
Why Snakemake?
Addresses Makefile weaknesses:
  Difficult to implement control flow
  No cluster support
  Inflexible wildcards
  Too much reliance on sentinel files
  No reporting mechanism
Keeps the good stuff:
  Implicit dependency resolution
Johannes Köster
Self-Reporting Workflows
The Future
Virtual Machine Container
Dockerized analyses?
Sweave
Snakemake
Container
Acknowledgements
 BiG
 Deanne Taylor
 Batsal Devkota
 Noor Dawany
 Perry Evans
 Juan Perin
 Pichai Raman
 Ariella Sasson
 Hongbo Xie
 Zhe Zhang
 TIU
 Jeff Pennington
 Byron Ruth
 Kevin Murphy
 DBHi
 Bryan Wolf
 Mark Porter


Editor's notes

  1. ---
  2. --- Slide 2 Because I'm giving a talk on reproducible research, I am legally obligated to open with the cautionary tale of the Anil Potti scandal. This is the Duke researcher who from 2006 to 2009 used microarray expression data to predict the sensitivity of clinical tumor samples to various chemotherapy treatments. As early as 2007, Keith Baggerly and Kevin Coombes from MD Anderson tried to reproduce the initial analyses and obtained radically different results. Lacking much of the code, they had to reverse engineer what they believed had occurred, and they discovered a number of off-by-one errors (one of which was induced by a program that didn't want a header row), genes missing from the test set that were in the training set, genes in a later email from the authors that were not even in the probeset, and a huge number of duplicated samples, some of which were labeled both sensitive and resistant. Despite years of efforts by Baggerly and Coombes to have the authors come clean with a working analysis, this went all the way to clinical trials, which were halted only when it was discovered that Anil Potti had lied on his CV and was not in fact a Rhodes Scholar. I really encourage everyone to view the presentation by Keith Baggerly on YouTube, “The Importance of Reproducible Research in High-Throughput Biology”. The real point of the story is that the manipulation of data might have begun with a cell shift and off-by-one errors: stupid mistakes that anyone can make but that are virtually impossible to detect unless you submit a reproducible workflow with your paper and allow reviewers to run it on a novel dataset. Withholding of data was likely active obfuscation, but I suspect many of the initial errors in these papers were due to stupid mistakes. So for me, preventing the outright falsification of data is not even among my top reasons for reproducible research. If someone is determined to lie, they're going to find a way to do it.
The thing is, Duke fired Anil Potti, but I don't think they fired Microsoft Excel. Excel still has tenure at Duke for all I know. And that's a shame, because Excel was likely a partner in this crime.
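To make the header-row failure mode in the Slide 2 notes concrete, here is a minimal Python sketch. It is not a reconstruction of the actual Duke analysis; the file contents, the sample IDs, and the function names are all hypothetical. It shows how pairing labels with samples by row position silently shifts every label when a header row is read as data, and how joining on the ID column avoids that.

```python
import csv
import io

# Hypothetical toy file: one header row, then sample/label pairs.
RAW = "sample,label\nS1,sensitive\nS2,resistant\nS3,sensitive\n"
SAMPLES = ["S1", "S2", "S3"]


def labels_by_position(text, skip_header):
    """Pair labels with SAMPLES purely by row position (the fragile way)."""
    rows = list(csv.reader(io.StringIO(text)))
    if skip_header:
        rows = rows[1:]
    return dict(zip(SAMPLES, (row[1] for row in rows)))


def labels_by_id(text):
    """Pair labels with samples via the ID column and check that none are missing."""
    reader = csv.DictReader(io.StringIO(text))
    table = {row["sample"]: row["label"] for row in reader}
    missing = [s for s in SAMPLES if s not in table]
    if missing:
        raise ValueError(f"samples missing from label file: {missing}")
    return table


correct = labels_by_position(RAW, skip_header=True)
shifted = labels_by_position(RAW, skip_header=False)  # header row read as data

# Print each sample with its correct label and its shifted label side by side.
for sample in SAMPLES:
    print(sample, correct[sample], shifted[sample])
```

In the shifted case no error is raised: every sample still gets a value, it is just the wrong one. That is why this kind of mistake tends to surface only when someone reruns the whole workflow.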
  3. --- Slide 3 Reproducible research also tends to get conflated with a bunch of hot topics: open access, open data, software carpentry, and a bunch of other stuff people don't want to do. This is what I call the “reproducible research guilt trip”. In the big bad mean world, the relationship between journals, funding institutions, and reviewers is essentially adversarial. Inside a group like ours there are a lot more incentives and ways of enforcing reproducibility, and if an investigator wants to, say, publish an open data set and a fully reproducible analysis, it should be possible. If not, that's fine; these things still benefit us. I tend to blur the lines between good practice, automation, and reproducibility. I consider this a branch of software or process engineering rather than ethics. It goes hand in hand with optimizing our practices, becoming a more efficient and productive group, and producing analyses that will live up to the increased scrutiny that is coming from the journals. So this is not feel-good reproducibility, and it is not just for the benefit of the people we work with. So I want to talk about our values, our practices, and our habits. If I were a biologist, I wouldn't work with a core that didn't have reproducibility as a standard. That might be because I've seen how the sausage is made, but I think there are sound reasons why this should be a guiding principle for how we do work.
  4. --- Slide 4 The whys of reproducible research, other than “umm, this is science”, are for me: Sanity: being able to reliably derive results from raw data, because if I can't do that then I don't have a leg to stand on. Reuse: if others or my future self want to reproduce an analysis from the start, that should be possible. Redundancy: if someone gets hit by a bus, the rest of us can carry on. Evaluation: we have a group with a lot of different strengths and weaknesses, spanning software development, statistics, systems biology, sequencing, and disease domains. If we don't have a codebase that is shared, open, and reproducible, I really don't see any reason to have a group at all. We might as well just be fully embedded analysts. I'm sure there are some people here who feel this would be a better arrangement, and I can respect that viewpoint, but I would think there is some real synergistic benefit to having a group of analysts rather than 8 scattered throughout an organization.
  5. --- Slide 5 I'm here today to speak about two very domestic areas of interest to me, where I think DBHi can be a standard bearer and innovator. First is the marriage or “bondage” between code, data, tracking metadata (which is called data provenance), and results (or what are sometimes called deliverables): version control, reproducible reports, MyBiC, tools that are more or less in place. My other interest is in accelerating what I call the “edit cycle”, the spin cycle that occurs when you present results to an investigator and they want to tweak parameters and redo everything. The challenge is keeping the work in a reproducible context while still allowing people to explore their own data.
  6. --- Slide 6 So a couple of years ago there was this paper outlining ten simple rules for reproducible computational research. And these are great rules; I can't argue with any of them: Rule 1: For Every Result, Keep Track of How It Was Produced. Rule 2: Avoid Manual Data Manipulation Steps. Rule 3: Archive the Exact Versions of All External Programs Used. Rule 4: Version Control All Custom Scripts. Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Et cetera. Ten rules is actually a bit much to take in at once, but I think we can melt these down to one principal component, which is…
  7. --- Slide 7 You wouldn't think you'd need to say that. But a lot of people who come from life science disciplines with very strict traditions of keeping lab notebooks, when they come to bioinformatics, just say “well, I'll do whatever the hell I want”. This is quite dangerous, because the less experienced you are as a programmer, the more prone you are to making stupid mistakes: you write repetitive code, and you don't write sanity checks or test cases. So where do we write “stuff” down?
  8. --- Slide 8 So where do we write “stuff” down? In version control. We use Git for version control, and we're lucky to have a 40-seat GitHub Enterprise instance behind the firewall. If your code is not in GitHub, it doesn't exist. You've probably used something akin to automatic version control in Google Docs…
  9. --- Slide 9 The difference with Git is that the changes are explicit: they are manually denoted by commits, to which you attach a message that says what the significance of the change was. Repositories are distributed, so you can work on something without a constant internet connection to the server. Commits can be organized into branches. This is a branching pattern we use in software development, where you have a master branch of major releases, a development branch, and then feature branches that become more experimental as you move left here. What distinguishes Git from older version control systems is the ability to do very graceful merges. So person A and person B can both make changes and then merge those changes into a common branch; if the changes don't overlap, the merge will be transparent. If there are conflicts, say two people modify the exact same line of code, then Git will flag that and ask you to decide how to proceed. It's not unusual for projects to have 20 or 30 branches that come and go, one for each feature and one for each bug.
  10. --- Slide 10 This is a non-normalized heatmap of Git commits on our enterprise GitHub from members of the bioinformatics group. What I like most is to see who is committing stuff on Saturdays and Sundays. Typically it's people without kids. I want to discuss this because reproducible research culture has to come from the top down. The previous director gave us zero support for this initiative; for him, enforcing any software development standards was just a brake on the wheel. He said, “You don't want to get bogged down in process.” This ball didn't get rolling until Mark Porter stepped in as interim director. He kind of held people's feet to the fire and said, hey, you need to push your damn code. Despite being thrust unwillingly into the director position, he actually wound up being one of the best directors we ever had. And pretty soon cores won't be hiring programmers and analysts who don't have a body of work on GitHub. This does factor into hiring. That's just where it's headed. We had one applicant who, for whatever reason, just to acquiesce to the journal, put her code in the README section of the repository, where you describe your repository. And the code was awful. So for me this is fundamental. If you value reproducibility, it starts here. I'd like to hear your thoughts if you would like to join me at the roundtable.
  11. --- Slide 11 So it's not enough that we just put out reproducible code if it's some inscrutable black box. We need something that is both reproducible and literate. That's where Sweave comes in. Sweave is actually pronounced S-weave (S is the predecessor to R). In Sweave you wrap your R code chunks in these tags, and the code is embedded or “weaved” into LaTeX-formatted markup that describes what the code is doing; we use this to produce PDF reports. This is generally the last step in an analysis pipeline, but the one that involves the vast majority of tweaks and edits and actual analysis and statistics. So people ask why Excel is such a pariah in the bioinformatics community, why no self-respecting analyst would use Excel. Is it because it mangles files or turns gene names into dates? No, they could fix that and it would still suck. The reason Excel is not a tool we use for scientific research is that what you do in it is not reproducible, it's not automated, and it's not literate. Jim has really run with the report concept and made great use of it for both NGS and microarray expression reports that he has developed with Deborah Watson. In the R community Sweave has been supplanted by a package called knitr, which has a lot more features in terms of caching and conditional execution of chunks, as well as support for Markdown so you can easily produce web reports. I'm still somewhat partial to PDFs because they have a beginning and an end, you can print them out, you can time stamp them, and you can stamp them with git commit hashes.
  12. --- Slide 12 This git hash looks pretty hairy, but it actually provides a hook by which we can really hold provenance. So we can include metadata we get from the LIMS, the alignments and variant calls from CBMi-Seq, and everything downstream from there.
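The commit-hash stamping mentioned in the Slide 11 and Slide 12 notes can be sketched as follows. This is an illustrative Python version rather than the Sweave code the group actually uses; the function names and the footer wording are assumptions. The only external call is the standard `git rev-parse --short HEAD` command, and the sketch falls back to "unknown" when git or a repository is not available.

```python
import subprocess
from datetime import datetime, timezone


def current_commit(repo_dir="."):
    """Return the short hash of HEAD, or 'unknown' if git or a repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def provenance_footer(commit, generated_at):
    """Build a one-line stamp tying a report to the code version that produced it."""
    return f"Generated {generated_at.isoformat()} from commit {commit}"


# Stamp a report with the current time and whatever commit is checked out.
footer = provenance_footer(current_commit(), datetime.now(timezone.utc))
print(footer)
```

A report carrying this line can be traced back to an exact state of the repository, which is the hook for provenance the notes describe. A fuller version would probably also record whether the working tree had uncommitted changes.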
  13. --- Slide 13 So what's the glue that brings us from raw data to Sweave reports? For a long time it was Make. Make is a build system designed to compile C programs, but it has been really useful for bioinformatics because it provides a syntax for describing how to convert one type of file to another based on filename suffixes. That is 95% of all analytical pipelines. But Make has some limitations that are frustrating to work with. And that's where this genius Johannes Köster basically solved all of those by subsuming the entirety of Python into the domain-specific language.
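The suffix-based conversion and implicit dependency resolution that Make and Snakemake share can be illustrated with a toy planner in plain Python. This is not Snakemake syntax or its API. The rule table, the tool names, and the file names are all hypothetical, and the sketch only works out which steps are needed without running anything.

```python
from pathlib import PurePath

# Hypothetical suffix rules: target suffix -> (source suffix, tool name).
RULES = {
    ".bam": (".fastq", "align"),
    ".vcf": (".bam", "call_variants"),
    ".pdf": (".vcf", "render_report"),
}


def plan(target, existing):
    """Return the ordered (tool, source, target) steps needed to build target.

    Files listed in `existing` are treated as already present, so the
    recursion stops there, much as a build system skips finished work.
    """
    if target in existing:
        return []
    suffix = PurePath(target).suffix
    if suffix not in RULES:
        raise ValueError(f"no rule to make {target}")
    source_suffix, tool = RULES[suffix]
    source = str(PurePath(target).with_suffix(source_suffix))
    # Resolve the source first, then append the step that consumes it.
    return plan(source, existing) + [(tool, source, target)]


steps = plan("sample1.pdf", existing={"sample1.fastq"})
for tool, source, target in steps:
    print(f"{tool}: {source} -> {target}")
```

Asking only for the final report is enough: the planner works backward from the requested suffix to the raw data. If `sample1.bam` were already present, the alignment step would drop out of the plan.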
  14. --- Slide 14 But what is really attractive for me is the ability to keep an entire workflow encapsulated in the Snakefile. All the inputs, outputs, and intermediates are first-class citizens in the Snakefile, so the same code that runs the alignments can also kick off the Sweave scripts and produce Markdown web pages that we can display in a portal.
  15. --- Slide 15 So what is that portal? Where do we put the deliverables? For a long time they were being emailed or put in Dropbox. Both present big disadvantages in terms of persistence and access, and technically I'm not sure we're supposed to use Dropbox. We could in theory use GitHub, since it has a lot of utility for displaying Markdown and it has issue tracking, but the authentication in GitHub is crude, you can't really put big files in there, and it's just not organized the way an investigator would want to use it. MyBiC is a Django-based portal that I created to deliver analyses. MyBiC provides a Users/Labs/Groups authentication scheme, search, news, and tracking. Projects within MyBiC can be created in Markdown or HTML and can be loaded directly from GitHub or from disk. The MyBiC server has a read-only mount to the Isilon, so even very large files can be served.
  16. --- Slide 16 OS-level virtualization. Unlike a virtual machine, which talks to the hardware through a translator, the Docker engine is much more lightweight. And unlike a VM, which is kind of a stateful machine you massage into place, a Docker container is run from an image built by a reproducible configuration script called a Dockerfile, which makes it “literate”, if that were a term DevOps people used. Once a program is dockerized, it can be run without installing it. It lives inside a container, and it only knows what you tell it as far as network ports, permissions, and file volumes. This has great appeal for anyone who has ever tried to install software.
  17. --- Slide 17 The question is, what do we get from dockerizing an entire analysis? For one, it will be trivial to send an entire workflow to a colleague or a journal and say “have at it”, and they can hit the ground running. But even if we don't do that, it should be much easier for a website like MyBiC to execute a workflow. In my mind I can see how this could accelerate the edit cycle I mentioned earlier. Sometimes investigators are hunting for p-values, but often they are just exploring the data. The problem is that it ties up a lot of our time just reading emails and re-running analyses. This is what I call the “parameter purgatory”. Traditionally we could build full-scale web apps in Shiny (Pichai and Jim have done this), but that's really better for traditional database-driven portals; sometimes we just want to write an analysis once and then choose parameters from there. You don't want to build an analysis to do some kind of meta-analysis.
  18. --- Slide 18