Bram Zandbelt
Best Practices for Scientific Computing:
Principles & MATLAB Examples
Community Page
Best Practices for Scientific Computing
Greg Wilson1*, D. A. Aruliah2, C. Titus Brown3, Neil P. Chue Hong4, Matt Davis5, Richard T. Guy6, Steven H. D. Haddock7, Kathryn D. Huff8, Ian M. Mitchell9, Mark D. Plumbley10, Ben Waugh11, Ethan P. White12, Paul Wilson13
1 Mozilla Foundation, Toronto, Ontario, Canada, 2 University of Ontario Institute of Technology, Oshawa, Ontario, Canada, 3 Michigan State University, East Lansing,
Michigan, United States of America, 4 Software Sustainability Institute, Edinburgh, United Kingdom, 5 Space Telescope Science Institute, Baltimore, Maryland, United
States of America, 6 University of Toronto, Toronto, Ontario, Canada, 7 Monterey Bay Aquarium Research Institute, Moss Landing, California, United States of America,
8 University of California Berkeley, Berkeley, California, United States of America, 9 University of British Columbia, Vancouver, British Columbia, Canada, 10 Queen Mary
University of London, London, United Kingdom, 11 University College London, London, United Kingdom, 12 Utah State University, Logan, Utah, United States of America,
13 University of Wisconsin, Madison, Wisconsin, United States of America
Introduction
Scientists spend an increasing amount of time building and
using software. However, most scientists are never taught how to
do this efficiently. As a result, many are unaware of tools and
practices that would allow them to write more reliable and
maintainable code with less effort. We describe a set of best
practices for scientific software development that have solid
foundations in research and experience, and that improve
scientists’ productivity and the reliability of their software.
Software is as important to modern scientific research as
telescopes and test tubes. From groups that work exclusively on
computational problems, to traditional laboratory and field
scientists, more and more of the daily operation of science revolves
around developing new algorithms, managing and analyzing the
large amounts of data that are generated in single research
projects, combining disparate datasets to assess synthetic problems,
and other computational tasks.
Scientists typically develop their own software for these purposes
because doing so requires substantial domain-specific knowledge.
As a result, recent studies have found that scientists typically spend
30% or more of their time developing software [1,2]. However,
90% or more of them are primarily self-taught [1,2], and therefore
lack exposure to basic software development practices such as
writing maintainable code, using version control and issue
trackers, code reviews, unit testing, and task automation.
We believe that software is just another kind of experimental
apparatus [3] and should be built, checked, and used as carefully
as any physical apparatus. However, while most scientists are
careful to validate their laboratory and field equipment, most do
not know how reliable their software is [4,5]. This can lead to
serious errors impacting the central conclusions of published
research [6]: recent high-profile retractions, technical comments,
and corrections have resulted from errors in computational methods,
and in at least one case an error in another group's code was not
discovered until after publication [6]. As with bench experiments, not everything must be
done to the most exacting standards; however, scientists need to be
aware of best practices both to improve their own approaches and
for reviewing computational work by others.
This paper describes a set of practices that are easy to adopt and
have proven effective in many research settings. Our recommenda-
tions are based on several decades of collective experience both
building scientific software and teaching computing to scientists
[17,18], reports from many other groups [19–25], guidelines for
commercial and open source software development [26,27], and on
empirical studies of scientific computing [28–31] and software
development in general (summarized in [32]). None of these practices
will guarantee efficient, error-free software development, but used in
concert they will reduce the number of errors in scientific software,
make it easier to reuse, and save the authors of the software time and
effort that can be used for focusing on the underlying scientific questions.
Our practices are summarized in Box 1; labels in the main text
such as ‘‘(1a)’’ refer to items in that summary. For reasons of space,
we do not discuss the equally important (but independent) issues of
reproducible research, publication and citation of code and data,
and open science. We do believe, however, that all of these will be
much easier to implement if scientists have the skills we describe.
Citation: Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, et
al. (2014) Best Practices for Scientific Computing. PLoS Biol 12(1): e1001745.
doi:10.1371/journal.pbio.1001745
Academic Editor: Jonathan A. Eisen, University of California Davis, United States
of America
Published January 7, 2014
Copyright: © 2014 Wilson et al. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Best Practices for Scientific Computing
Wilson et al. (2014) Best Practices for Scientific Computing. PLoS Biol, 12(1), e1001745.
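The introduction above names unit testing among the basic practices most self-taught scientists never encounter. As a first taste of the MATLAB examples this deck builds toward, here is a minimal sketch of a function-based unit test; the function under test, normalizeScores, and its error identifier are hypothetical.

    % test_normalizeScores.m : minimal function-based unit tests (MATLAB R2013b or later).
    % normalizeScores is a hypothetical helper that z-scores a vector and is assumed
    % to throw the error id 'normalizeScores:emptyInput' on empty input.
    function tests = test_normalizeScores
        tests = functiontests(localfunctions);   % collect the local test functions below
    end

    function testZeroMeanUnitStd(testCase)
        z = normalizeScores([1 2 3 4 5]);
        verifyEqual(testCase, mean(z), 0, 'AbsTol', 1e-12);
        verifyEqual(testCase, std(z),  1, 'AbsTol', 1e-12);
    end

    function testRejectsEmptyInput(testCase)
        verifyError(testCase, @() normalizeScores([]), 'normalizeScores:emptyInput');
    end

    % Run the tests with:  results = runtests('test_normalizeScores')

Keeping such a test file next to the analysis code means the check reruns automatically whenever the code changes, rather than being done once by hand.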
Best Practices for Scientific Computing
Why is this important?
We rely heavily on research software
No software, no research
http://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers
2014 survey
among 417 UK researchers
Best Practices for Scientific Computing
https://innoscholcomm.silk.co/
Tools that researchers
use in practice
Best Practices for Scientific Computing
Why is this important?
We rely heavily on research software
No software, no research
Many of us write software ourselves
From single-command-line scripts to multigigabyte code repositories
Best Practices for Scientific Computing
adapted from Leek & Peng (2015) Nature, 520 (7549), 612.
[Embedded article page: Leek & Peng (2015), "P values are just the tip of the iceberg", Nature 520, 612. Standfirst: "Ridding science of shoddy statistics will require scrutiny of every step, not merely the last one, say Jeffrey T. Leek and Roger D. Peng." The accompanying 'Data pipeline' figure lists the stages of a study (experimental design, data collection, raw data, data cleaning, tidy data, exploratory data analysis, potential statistical models, statistical modelling, summary statistics, inference, P value) and notes that the early stages attract little debate while the final P value receives extreme scrutiny.]
Best Practices for Scientific Computing
Why is this important?
2014 survey
among 417 UK researchers
http://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers
We rely heavily on research software
No software, no research
Many of us write software ourselves
From single-command-line scripts to multigigabyte code repositories
Best Practices for Scientific Computing
Why is this important?
2014 survey
among 417 UK researchers
http://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers
We rely heavily on research software
No software, no research
Many of us write software ourselves
From single-command-line scripts to multigigabyte code repositories
Not all of us have had formal training
Most of us have learned it “on the street”
Best Practices for Scientific Computing
Why is this important?
[Embedded article page: Miller (2006), "A Scientist's Nightmare: Software Problem Leads to Five Retractions", Science 314, 1856-1857. A homemade data-analysis program, inherited from another lab, had flipped two columns of data, inverting the electron-density maps from which Geoffrey Chang's group derived several membrane-protein structures; the error led to the retraction of three Science papers and two papers in other journals. Figure caption: "Flipping fiasco. The structures of MsbA (purple) and Sav1866 (green) overlap little (left) until MsbA is inverted (right)."]
Miller (2006) Science, 314 (5807), 1856-1857.
We rely heavily on research software
No software, no research
Many of us write software ourselves
From single-command-line scripts to multigigabyte code repositories
Not all of us have had formal training
Most of us have learned it “on the street”
Poor coding practices can lead to serious problems
Worst case perhaps: 1 coding error led to 5 retractions:
3 in Science, 1 in PNAS, 1 in the Journal of Molecular Biology
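The Chang case described above came down to a data-analysis program silently swapping two columns of data. A minimal, entirely hypothetical MATLAB sketch of that failure mode, and of the kind of cheap sanity check that can catch it before results leave the lab:

    % Hypothetical illustration of a silent column swap of the kind described in
    % Miller (2006): two coordinate columns are exchanged during data conversion.
    raw = [ ...                     % toy data; columns assumed to be [x y z intensity]
        1.0  2.0  0.5  10 ;
        1.5  2.5  0.7  12 ;
        2.0  3.0  0.9  14 ];

    processed = raw(:, [2 1 3 4]);  % BUG: x and y swapped; every downstream map is mirrored

    % A cheap sanity check against an independently verified reference row
    % (e.g., one record checked by hand) flags the problem immediately.
    expectedFirstRow = [1.0 2.0 0.5 10];
    assert(isequal(processed(1, :), expectedFirstRow), ...
        'Column order changed during conversion; check the import step before analysing.');

Run as written, the assert stops the script at the swapped row instead of letting the inverted data propagate into published results.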
Best Practices for Scientific Computing
Why now?
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
Best Practices for Scientific Computing
Why now?
Ince et al. (2012) Nature, 482 (7386), 485-488.
[Embedded article page: Ince, Hatton & Graham-Cumming (2012), "The case for open computer programs", Nature 482, 485-488. Abstract: scientific communication relies on evidence that cannot be entirely included in publications, and the rise of computational science has added a new layer of inaccessibility; although it is now accepted that data should be made available on request, regulations on the availability of software are inconsistent. The authors argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation, because ambiguity in natural-language descriptions of code and unavoidable numerical issues mean that withholding code increases the chances that efforts to reproduce results will fail.]
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
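Ince et al.'s point about ambiguity is easy to demonstrate: two defensible readings of the same written methods sentence can produce different numbers. A small MATLAB sketch with invented data, for the description "scores were normalized and averaged across subjects":

    % Two defensible implementations of one methods sentence:
    % "scores were normalized and averaged across subjects".
    % The matrix is invented: rows are subjects, columns are trials.
    scores = [10 12 14;
               3  6  9;
              20 25 30];

    % Reading 1: normalize within each subject first, then average.
    perSubject = scores ./ max(scores, [], 2);      % implicit expansion (R2016b or later)
    resultA = mean(perSubject(:));

    % Reading 2: average across subjects first, then normalize the group mean.
    groupMean = mean(scores, 1);
    resultB = mean(groupMean ./ max(groupMean));

    fprintf('Reading 1: %.4f   Reading 2: %.4f\n', resultA, resultB);
    % The two readings disagree, so prose alone does not pin down the computation;
    % releasing the source code does.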
Best Practices for Scientific Computing
Why now?
Peng (2011) Science, 334 (6060), 1226-1227.
[Embedded article page: Peng (2011), "Reproducible Research in Computational Science", Science 334, 1226-1227. Abstract: computational science has led to exciting new developments, but the nature of the work has exposed limitations in our ability to evaluate published findings; reproducibility can serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible. Fig. 1 shows the reproducibility spectrum, running from publication only (not reproducible), through publication plus code, code and data, and linked and executable code and data, to full replication (the gold standard).]
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
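Peng's "linked and executable code and data" end of the spectrum is concrete: one script that anyone can rerun to regenerate a published figure from the raw data. A minimal MATLAB sketch under assumed file and column names (raw_measurements.csv with columns dose and response; fitlm needs the Statistics and Machine Learning Toolbox):

    % run_analysis.m : a minimal, rerunnable analysis script. File and variable
    % names are hypothetical; the point is that the figure is regenerated from
    % raw data by code kept under version control, not assembled by hand.
    rng(20111202);                                 % fixed seed so any resampling is repeatable

    data  = readtable('raw_measurements.csv');     % raw data shipped alongside the code
    clean = data(~isnan(data.response), :);        % every cleaning step is visible in code

    mdl = fitlm(clean, 'response ~ dose');         % the statistical model, stated explicitly

    figure;
    plot(clean.dose, clean.response, 'o'); hold on;
    plot(clean.dose, predict(mdl, clean), '-');
    xlabel('dose'); ylabel('response');
    title(sprintf('Slope = %.3f', mdl.Coefficients.Estimate(2)));

    saveas(gcf, 'figure1.png');                    % the published figure is regenerated, not pasted in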
Best Practices for Scientific Computing
Why now?
[Embedded article page: Nosek et al. (2015), "Promoting an open research culture", Science 348, 1422-1425. Standfirst: "Author guidelines for journals could help to promote transparency, openness, and reproducibility." The excerpt argues that although most scientists embrace transparency, openness, and reproducibility as disciplinary norms, a growing body of evidence suggests these practices are not routine in daily work, a disconnect likely driven by an academic reward system that does not sufficiently incentivize open practices. The same page carries the closing section of a companion essay on safeguarding the integrity of science, including the pull quote "Instances in which scientists detect and address flaws in work constitute evidence of success, not failure."]
Nosek et al. (2015) Science, 348 (6242), 1422-1425.
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
Best Practices for Scientific Computing
Why now?
INSIGHTS | PERSPECTIVES
1422 26 JUNE 2015 • VOL 348 ISSUE 6242 sciencemag.org SCIENCE
and neutral resource that supports and
complements efforts of the research enter-
prise and its key stakeholders.
Universities should insist that their fac-
ulties and students are schooled in the eth-
ics of research, their publications feature
neither honorific nor ghost authors, their
public information offices avoid hype in
publicizing findings, and suspect research
is promptly and thoroughly investigated.
All researchers need to realize that the
best scientific practice is produced when,
like Darwin, they persistently search for
flaws in their arguments. Because inherent
variability in biological systems makes it
possible for researchers to explore differ-
ent sets of conditions until the expected
(and rewarded) result is obtained, the need
for vigilant self-critique may be especially
great in research with direct application to
human disease. We encourage each branch
of science to invest in case studies identify-
ing what went wrong in a selected subset
of nonreproducible publications—enlisting
social scientists and experts in the respec-
tive fields to interview those who were
involved (and perhaps examining lab note-
books or redoing statistical analyses), with
the hope of deriving general principles for
improving science in each field.
Industry should publish its failed efforts
to reproduce scientific findings and join
scientists in the academy in making the
case for the importance of scientific work.
Scientific associations should continue to
communicate science as a way of know-
ing, and educate their members in ways to
more effectively communicate key scien-
tific findings to broader publics. Journals
should continue to ask for higher stan-
dards of transparency and reproducibility.
We recognize that incentives can backfire.
Still, because those such as enhanced social
image and forms of public recognition (10,
11) can increase productive social behavior
(12), we believe that replacing the stigma of
retraction with language that lauds report-
ing of unintended errors in a publication will
increase that behavior. Because sustaining a
good reputation can incentivize cooperative
behavior (13), we anticipate that our pro-
posed changes in the review process will not
only increase the quality of the final product
but also expose efforts to sabotage indepen-
dent review. To ensure that such incentives
not only advance our objectives but above
all do no harm, we urge that each be scru-
tinized and evaluated before being broadly
implemented.
Will past be prologue? If science is to
enhance its capacities to improve our un-
derstanding of ourselves and our world,
protect the hard-earned trust and esteem
in which society holds it, and preserve its
role as a driver of our economy, scientists
must safeguard its rigor and reliability in
the face of challenges posed by a research
ecosystem that is evolving in dramatic and
sometimes unsettling ways. To do this, the
scientific research community needs to be
involved in an ongoing dialogue. We hope
that this essay and the report The Integrity
of Science (14), forthcoming in 2015, will
serve as catalysts for such a dialogue.
Asked at the close of the U.S. Consti-
tutional Convention of 1787 whether the
deliberations had produced a republic or
a monarchy, Benjamin Franklin said “A
Republic, if you can keep it.” Just as pre-
serving a system of government requires
ongoing dedication and vigilance, so too
does protecting the integrity of science. ■
REFERENCES AND NOTES
1. Troubleatthelab,TheEconomist,19October2013;
www.economist.com/news/briefing/
21588057-scientists-think-science-self-correcting-
alarming-degree-it-not-trouble.
2. R.Merton,TheSociologyofScience:Theoreticaland
EmpiricalInvestigations(UniversityofChicagoPress,
Chicago,1973),p.276.
3. K.Popper,ConjecturesandRefutations:TheGrowthof
ScientificKnowledge(Routledge,London,1963),p.293.
4. EditorialBoard,Nature511,5(2014);www.nature.com/
news/stap-retracted-1.15488.
5. B.A.Noseketal.,Science348,1422(2015).
6. InstituteofMedicine,DiscussionFrameworkforClinical
TrialDataSharing:GuidingPrinciples,Elements,and
Activities(NationalAcademiesPress,Washington,DC,
2014).
7. B.Nosek,J.Spies,M.Motyl,Perspect.Psychol.Sci.7,615
(2012).
8. C.Franzoni,G.Scellato,P.Stephan,Science333,702
(2011).
9. NationalAcademyofSciences,NationalAcademyof
Engineering,andInstituteofMedicine,Responsible
Science,VolumeI:EnsuringtheIntegrityoftheResearch
Process(NationalAcademiesPress,Washington,DC,
1992).
10. N.Lacetera,M.Macis,J.Econ.Behav.Organ.76,225
(2010).
11. D.Karlan,M.McConnell,J.Econ.Behav.Organ.106,402
(2014).
12. R.Thaler,C.Sunstein,Nudge:ImprovingDecisionsAbout
Health,WealthandHappiness(YaleUniv.Press,New
Haven,CT,2009).
13. T.Pfeiffer,L.Tran,C.Krumme,D.Rand,J.R.Soc.Interface
2012,rsif20120332(2012).
14. CommitteeonScience,Engineering,andPublicPolicy
oftheNationalAcademyofSciences,NationalAcademy
ofEngineering,andInstituteofMedicine,TheIntegrity
ofScience(NationalAcademiesPress,forthcoming).
http://www8.nationalacademies.org/cp/projectview.
aspx?key=49387.
10.1126/science.aab3847
“Instances in which
scientists detect and
address flaws in work
constitute evidence
of success, not failure.”
T
ransparency, openness, and repro-
ducibility are readily recognized as
vital features of science (1, 2). When
asked, most scientists embrace these
features as disciplinary norms and
values (3). Therefore, one might ex-
pect that these valued features would be
routine in daily practice. Yet, a growing
body of evidence suggests that this is not
the case (4–6).
A likely culprit for this disconnect is an
academic reward system that does not suf-
ficiently incentivize open practices (7). In the
present reward system, emphasis on innova-
tion may undermine practices
that support verification. Too
often, publication requirements
(whether actual or perceived) fail to encour-
age transparent, open, and reproducible sci-
ence (2, 4, 8, 9). For example, in a transparent
science, both null results and statistically
significant results are made available and
help others more accurately assess the evi-
dence base for a phenomenon. In the present
culture, however, null results are published
less frequently than statistically significant
results (10) and are, therefore, more likely
inaccessible and lost in the “file drawer” (11).
The situation is a classic collective action
problem. Many individual researchers lack
Promoting an
open research
culture
By B. A. Nosek,* G. Alter, G. C. Banks,
D. Borsboom, S. D. Bowman,
S. J. Breckler, S. Buck, C. D. Chambers,
G. Chin, G. Christensen, M. Contestabile,
A. Dafoe, E. Eich, J. Freese,
R. Glennerster, D. Goroff, D. P. Green, B.
Hesse, M. Humphreys, J. Ishiyama,
D. Karlan, A. Kraut, A. Lupia, P. Mabry,
T. Madon, N. Malhotra,
E. Mayo-Wilson, M. McNutt, E. Miguel,
E. Levy Paluck, U. Simonsohn,
C. Soderberg, B. A. Spellman,
J. Turitto, G. VandenBos, S. Vazire,
E. J. Wagenmakers, R. Wilson, T. Yarkoni
Author guidelines for
journals could help to
promote transparency,
openness, and
reproducibility
SCIENTIFIC STANDARDS
POLICY
DA_0626PerspectivesR1.indd 1422 7/24/15 10:22 AM
Published by AAAS
onSeptember17,2015www.sciencemag.orgDownloadedfromonSeptember17,2015www.sciencemag.orgDownloadedfromonSeptember17,2015www.sciencemag.orgDownloadedfromonSeptember17,2015www.sciencemag.orgDownloadedfrom
PERSPECTIVE doi:10.1038/nature10836
The case for open computer programs
Darrel C. Ince1, Leslie Hatton2 & John Graham-Cumming3

Scientific communication relies on evidence that cannot be entirely included in publications, but the rise of computational science has added a new layer of inaccessibility. Although it is now accepted that data should be made available on request, the current regulations regarding the availability of software are inconsistent. We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail.

The rise of computational science has led to unprecedented opportunities for scientific advance. Ever more powerful computers enable theories to be investigated that were thought almost intractable a decade ago, robust hardware technologies allow data collection in the most inhospitable environments, more data are collected, and an increasingly rich set of software tools are now available with which to analyse computer-generated data.

However, there is the difficulty of reproducibility, by which we mean the reproduction of a scientific paper’s central finding, rather than exact replication of each specific numerical result down to several decimal places. We examine the problem of reproducibility (for an early attempt at solving it, see ref. 1) in the context of openly available computer programs, or code. Our view is that we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.

At present, debate rages on the need to release computer programs associated with scientific experiments2–4, with policies still ranging from mandatory total release to the release only of natural language descriptions, that is, written descriptions of computer program algorithms. Some journals have already changed their policies on computer program openness; Science, for example, now includes code in the list of items that should be supplied by an author5. Other journals promoting code availability include Geoscientific Model Development, which is devoted, at least in part, to model description and code publication, and Biostatistics, which has appointed an editor to assess the reproducibility of the software and data associated with an article6.

In contrast, less stringent policies are exemplified by statements such as7 “Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do similar analysis.” Although Nature’s broader policy states that “...authors are required to make materials, data and associated protocols promptly available to readers...”, and editors and referees are fully empowered to demand and evaluate any specific code, we believe that its stated policy on code availability actively hinders reproducibility.

Much of the debate about code transparency involves the philosophy of science, error validation and research ethics8,9, but our contention is more practical: that the cause of reproducibility is best furthered by focusing on the dissection and understanding of code, a sentiment already appreciated by the growing open-source movement10. Dissection and understanding of open code would improve the chances of both direct and indirect reproducibility. Direct reproducibility refers to the recompilation and rerunning of the code on, say, a different combination of hardware and systems software, to detect the sort of numerical computation11,12 and interpretation13 problems found in programming languages, which we discuss later. Without code, direct reproducibility is impossible. Indirect reproducibility refers to independent efforts to validate something other than the entire code package, for example a subset of equations or a particular code module. Here, before time-consuming reprogramming of an entire model, researchers may simply want to check that incorrect coding of previously published equations has not invalidated a paper’s result, to extract and check detailed assumptions, or to run their own code against the original to check for statistical validity and explain any discrepancies.

Any debate over the difficulties of reproducibility (which, as we will show, are non-trivial) must of course be tempered by recognizing the undeniable benefits afforded by the explosion of internet facilities and the rapid increase in raw computational speed and data-handling capability that has occurred as a result of major advances in computer technology14. Such advances have presented science with a great opportunity to address problems that would have been intractable in even the recent past. It is our view, however, that the debate over code release should be resolved as soon as possible to benefit fully from our novel technical capabilities. On their own, finer computational grids, longer and more complex computations and larger data sets—although highly attractive to scientific researchers—do not resolve underlying computational uncertainties of proven intransigence and may even exacerbate them.

Although our arguments are focused on the implications of Nature’s code statement, it is symptomatic of a wider problem: the scientific community places more faith in computation than is justified. As we outline below and in two case studies (Boxes 1 and 2), ambiguity in its many forms and numerical errors render natural language descriptions insufficient and, in many cases, unintentionally misleading.

The failure of code descriptions
The curse of ambiguity
Ambiguity in program descriptions leads to the possibility, if not the certainty, that a given natural language description can be converted into computer code in various ways, each of which may lead to different numerical outcomes. Innumerable potential issues exist, but might include mistaken order of operations, reference to different model versions, or unclear calculations of uncertainties. The problem of ambiguity has haunted software development from its earliest days. Ambiguity can occur at the lexical, syntactic or semantic level15 and is not necessarily the result of incompetence or bad practice. It is a natural consequence of using natural language16 and is unavoidable.

1 Department of Computing, Open University, Walton Hall, Milton Keynes MK7 6AA, UK. 2 School of Computing and Information Systems, Kingston University, Kingston KT1 2EE, UK. 3 83 Victoria Street, London SW1H 0HW, UK.
Nosek et al. (2015) Science, 348 (6242) 1422-1425.
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
Best Practices for Scientific Computing
Why now?
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
Best Practices for Scientific Computing
Why now?
Call for increased scrutiny and openness
Following cases of scientific misconduct, widespread
questionable research practices, and reproducibility
crises
Netherlands Organisation for Scientific Research: “Open where possible, protected where needed” (NWO and Open Access)
Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020, Version 1.0, 11 December 2013
http://www.nwo.nl/binaries/content/assets/nwo/documents/nwo/open-science/open-access-flyer-2012-def-eng-scherm.pdf
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf
Best Practices for Scientific Computing
What should we (try to) do?
1. Write programs for people
… so you and others quickly understand the code
2. Let the computer do the work
… to increase speed and reproducibility
3. Make incremental changes
… to increase productivity and reproducibility
4. Don’t repeat yourself (or others)
… repetitions are vulnerable to errors and a waste of time
5. Plan for mistakes
… almost all code has bugs, but not all bugs cause errors
6. Optimize software only after it works
… and perhaps only if it is worth the effort
7. Document design and purpose, not implementation
… to clarify understanding and usage
8. Collaborate
… to maximize reproducibility, clarity, and learning
Best Practices for Scientific Computing
1. Write programs for people, not computers
… so you and others quickly understand the code
A program should not require its readers to hold
more than a handful of facts in memory at once
Break up programs in easily understood functions, each
of which performs an easily understood task
Best Practices for Scientific Computing
1. Write programs for people, not computers
… so you and others quickly understand the code
[Diagram: a monolithic SCRIPT X consisting of long blocks of statements, versus a refactored SCRIPT X that calls functions a, b, and c, each of which contains its own short block of statements]
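As a concrete MATLAB sketch of the refactoring shown in this diagram (a minimal example; all function, variable, and file names are illustrative, not taken from the slides), a long script becomes a short top-level function that calls small, well-named helpers:

function meanRT = analyzeSession(fileName)
% The top-level function reads like a summary of the analysis
rt      = loadReactionTimes(fileName);    % read trial-level data
rtClean = removeImplausibleTrials(rt);    % discard outlier trials
meanRT  = mean(rtClean);                  % compute the summary statistic
end

function rt = loadReactionTimes(fileName)
% Assumes a CSV file with a column named 'rt' (reaction times in seconds)
data = readtable(fileName);
rt   = data.rt;
end

function rtClean = removeImplausibleTrials(rt)
% Keep only reaction times between 0.1 and 2 seconds
rtClean = rt(rt > 0.1 & rt < 2.0);
end

Each helper performs one easily understood task, so a reader only has to keep a handful of facts in mind at any point.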
Best Practices for Scientific Computing
1. Write programs for people, not computers
… so you and others quickly understand the code
A program should not require its readers to hold
more than a handful of facts in memory at once
Break up programs in easily understood functions, each
of which performs an easily understood task
Make names consistent, distinctive, and meaningful
e.g. fileName, meanRT rather than tmp, result
e.g. camelCase, pothole_case, UPPERCASE
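For instance (a hypothetical snippet, purely to illustrate the naming advice; camelCase is used consistently):

reactionTimes = [0.42 0.37 0.51];     % rather than: data1
meanRT        = mean(reactionTimes);  % rather than: result
rawFileName   = 'subject01_rt.csv';   % rather than: tmp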
Best Practices for Scientific Computing
http://nl.mathworks.com/matlabcentral/fileexchange/46056-matlab-style-guidelines-2-0
http://nl.mathworks.com/matlabcentral/fileexchange/48121-matlab-coding-checklist
MATLAB
Guidelines 2.0
Richard Johnson
A program should not require its readers to hold
more than a handful of facts in memory at once
Break up programs in easily understood functions, each
of which performs an easily understood task
Make names consistent, distinctive, and meaningful
e.g. fileName, meanRT rather than tmp, result
e.g. camelCase, pothole_case, UPPERCASE
Make code style and formatting consistent
Use existing guidelines and check lists
MATLAB® @Work
richj@datatool.com www.datatool.com
CODING CHECKLIST
GENERAL
DESIGN
Do you have a good design?
Is the code traceable to requirements?
Does the code support automated use and testing?
Does the code have documented test cases?
Does the code adhere to your designated coding style and standard?
Is the code free from Code Analyzer messages?
Does each function perform one well-defined task?
Is each class cohesive?
Have you refactored out any blocks of code repeated unnecessarily?
Are there unnecessary global constants or variables?
Does the code have appropriate generality?
Does the code make naïve assumptions about the data?
Does the code use appropriate error checking and handling?
If performance is an issue, have you profiled the code?
UNDERSTANDABILITY
Is the code straightforward and does it avoid “cleverness”?
Has tricky code been rewritten rather than commented?
Do you thoroughly understand your code? Will anyone else?
Is it easy to follow and understand?
CODE CHANGES
Have the changes been reviewed as thoroughly as initial development would be?
Do the changes enhance the program’s internal quality rather than degrading it?
Have you improved the system’s modularity by breaking functions into smaller functions?
Have you improved the programming style--names, formatting, comments?
Have you introduced any flaws or limitations?
Best Practices for Scientific Computing
2. Let the computer do the work
… to increase speed and reproducibility
Make the computer repeat tasks
Go beyond the point-and-click approach
Best Practices for Scientific Computing
2. Let the computer do the work
… to increase speed and reproducibility
Make the computer repeat tasks
Go beyond the point-and-click approach
Save recent commands in a file for re-use
Log session’s activity to file
In MATLAB: diary
Best Practices for Scientific Computing
2. Let the computer do the work
… to increase speed and reproducibility
Make the computer repeat tasks
Go beyond the point-and-click approach
Save recent commands in a file for re-use
Log session’s activity to file
In MATLAB: diary
Use a build tool to automate workflows
e.g. SPM’s matlabbatch, IPython/Jupyter notebook,
SPSS syntax
SPM’s matlabbatch
Best Practices for Scientific Computing
2. Let the computer do the work
http://jupyter.org/
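A minimal MATLAB sketch combining these points (the data folder, file pattern, and preprocessSubject function are assumptions made for illustration): diary logs everything typed and printed in the command window, and a simple loop lets MATLAB repeat the same analysis over all subjects instead of re-running it by hand:

diary('session_log.txt')                        % log command-window activity to a file
files = dir(fullfile('data', 'subject*.mat'));  % find all subject files
for iFile = 1:numel(files)
    % hypothetical per-subject analysis function
    preprocessSubject(fullfile(files(iFile).folder, files(iFile).name));
end
diary off                                       % stop logging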
Best Practices for Scientific Computing
3. Make incremental changes
Work in small steps with frequent feedback and
course correction
Agile development: add/change one thing at a time, aim
for working code at completion
… to increase productivity and reproducibility
Best Practices for Scientific Computing
3. Make incremental changes
… to increase productivity and reproducibility
Work in small steps with frequent feedback and
course correction
Agile development: add/change one thing at a time, aim
for working code at completion
Use a version control system
Documents who made what contributions when
Provides access to entire history of code + metadata
Many free options to choose from: e.g. git, svn
http://www.phdcomics.com/comics/archive.php?comicid=1531
Best Practices for Scientific Computing
http://swcarpentry.github.io/git-novice/01-basics.html
Different versions can be saved
Multiple versions can be merged
Changes are saved sequentially (“snapshots”)
Best Practices for Scientific Computing
3. Make incremental changes
http://www.phdcomics.com/comics/archive.php?comicid=1531
Work in small steps with frequent feedback and
course correction
Agile development: add/change one thing at a time, aim
for working code at completion
Use a version control system
Documents who made what contributions when
Provides access to entire history of code + metadata
Many free options to choose from: e.g. git, svn
Put everything that has been created manually in
version control
Advantages of version control also apply to manuscripts,
figures, presentations
… to increase productivity and reproducibility
Best Practices for Scientific Computing
…
https://github.com/smathot/materials_for_P0001/commits/master/manuscript
Best Practices for Scientific Computing
4. Don’t repeat yourself (or others)
Every piece of data must have a single authoritative
representation in the system
Raw data should have a single canonical version
… repetitions are vulnerable to errors and a waste of time
http://swcarpentry.github.io/v4/data/mgmt.pdf
Best Practices for Scientific Computing
4. Don’t repeat yourself (or others)
… repetitions are vulnerable to errors and a waste of time
http://swcarpentry.github.io/v4/data/mgmt.pdf
Best Practices for Scientific Computing
4. Don’t repeat yourself (or others)
… repetitions are vulnerable to errors and a waste of time
Every piece of data must have a single authoritative
representation in the system
Raw data should have a single canonical version
Modularize code rather than copying and pasting
Avoid “code clones”: they make it difficult to change
code, because you have to find and update multiple fragments
Best Practices for Scientific Computing
4. Don’t repeat yourself (or others)
… repetitions are vulnerable to errors and a waste of time
[Diagram: SCRIPT X and SCRIPT Y each contain an identical block of statements (a “code clone”); after refactoring, SCRIPT X calls functions a and b and SCRIPT Y calls functions a and c, so the shared code lives in a single function]
Best Practices for Scientific Computing
4. Don’t repeat yourself (or others)
… repetitions are vulnerable to errors and a waste of time
Every piece of data must have a single authoritative
representation in the system
Raw data should have a single canonical version
Modularize code rather than copying and pasting
Avoid “code clones”: they make it difficult to change
code, because you have to find and update multiple fragments
Re-use code instead of rewriting it
Chances are somebody else has already made it: e.g. look at SPM
Extensions (~200), NITRC (810), CRAN (7395),
MATLAB File Exchange (25297), PyPI (68247)
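As a minimal sketch of such modularization (the function name and the computation are illustrative), the cloned block from the diagram above becomes one shared function that both scripts call:

function z = zscoreColumns(x)
% Standardize each column of x to zero mean and unit standard deviation.
% Requires implicit expansion (MATLAB R2016b or later).
z = (x - mean(x)) ./ std(x);
end

SCRIPT X and SCRIPT Y then each contain a single call such as z = zscoreColumns(data), so a future change has to be made in only one place.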
Best Practices for Scientific Computing
5. Plan for mistakes
Add assertions to programs to check their operation
Assertions are internal self-checks that verify whether
inputs, outputs, etc. are in line with the programmer’s
expectations
In MATLAB: assert
… almost all code has bugs, but not all bugs cause errors
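A minimal sketch of such a self-check (the variable and the expectations are illustrative):

rt = [0.42 0.37 0.51];   % example reaction times in seconds
assert(isnumeric(rt) && ~isempty(rt), 'Expected a non-empty numeric vector of reaction times.')
assert(all(rt > 0), 'Reaction times must be positive.')
assert(~any(isnan(rt)), 'Reaction times must not contain NaNs.')

If any condition is false, assert stops execution with the given message, so problems surface where they arise rather than propagating silently.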
Best Practices for Scientific Computing
5. Plan for mistakes
Add assertions to programs to check their operation
Assertions are internal self-checks that verify whether
inputs, outputs, etc. are in line with the programmer’s
expectations
In MATLAB: assert
Use an off-the-shelf unit testing library
Unit tests check whether code returns correct results
In MATLAB: runtests
… almost all code has bugs, but not all bugs cause errors
Best Practices for Scientific Computing
5. Plan for mistakes
Add assertions to programs to check their operation
Assertions are internal self-checks that verify whether
inputs, outputs, etc. are in line with the programmer’s
expectations
In MATLAB: assert
Use an off-the-shelf unit testing library
Unit tests check whether code returns correct results
In MATLAB: runtests
Turn bugs into test cases
To ensure that they will not happen again
… almost all code has bugs, but not all bugs cause errors
Best Practices for Scientific Computing
5. Plan for mistakes
Add assertions to programs to check their operation
Assertions are internal self-checks that verify whether
inputs, outputs, etc. are in line with the programmer’s
expectations
In MATLAB: assert
Use an off-the-shelf unit testing library
Unit tests check whether code returns correct results
In MATLAB: runtests
Turn bugs into test cases
To ensure that they will not happen again
Use a symbolic debugger
Symbolic debuggers make it easy to improve code
Can detect bugs and sub-optimal code, run code step-by-
step, track values of variables
… almost all code has bugs, but not all bugs cause errors
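A minimal sketch of a script-based test file for MATLAB's runtests (it assumes the illustrative zscoreColumns helper from the earlier sketch is on the path; save the code as test_zscoreColumns.m and run results = runtests('test_zscoreColumns')). The last section shows how a previously encountered bug can be kept as a test case:

%% Test 1: output columns have (near-)zero mean
x = rand(50, 3);
z = zscoreColumns(x);
assert(all(abs(mean(z)) < 1e-10))

%% Test 2: output columns have (near-)unit standard deviation
x = rand(50, 3);
z = zscoreColumns(x);
assert(all(abs(std(z) - 1) < 1e-10))

%% Test 3: regression test - a constant column should yield NaNs, not an error
x = [ones(10, 1), rand(10, 1)];
z = zscoreColumns(x);
assert(all(isnan(z(:, 1))) && ~any(isnan(z(:, 2))))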
Best Practices for Scientific Computing
Best Practices for Scientific Computing
6. Optimize software only after it works correctly
Use a profiler to identify bottlenecks
Useful if your code takes a long time to run
In MATLAB: profile
… and perhaps only if it is worth the effort
Best Practices for Scientific Computing
6. Optimize software only after it works correctly
Use a profiler to identify bottlenecks
Useful if your code takes a long time to run
In MATLAB: profile
Write code in the highest-level language possible
Programmers produce roughly the same number of lines of
code per day, regardless of language
High-level languages achieve complex computations in
a few lines of code
… and perhaps only if it is worth the effort
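A minimal sketch of profiling in MATLAB (runFullAnalysis is a hypothetical placeholder for your own slow entry-point function):

profile on                     % start collecting timing data
runFullAnalysis('subject01')   % hypothetical long-running analysis
profile viewer                 % stop collecting and open the report with per-function and per-line timings

The report shows where the time actually goes, so optimization effort can be limited to the genuine bottlenecks.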
Best Practices for Scientific Computing
7. Document design and purpose, not mechanics
Document interfaces and reasons, not
implementations
Write structured and informative function headers
Write a README file for each set of functions
… good documentation helps people understand code
see: http://www.mathworks.com/matlabcentral/fileexchange/4908-m-file-header-template/content/template_header.m
Best Practices for Scientific Computing
7. Document design and purpose, not mechanics
Document interfaces and reasons, not
implementations
Write structured and informative function headers
Write a README file for each set of functions
Refactor code in preference to explaining how it
works
Restructuring provides clarity, reducing space/time
needed for explanation
… good documentation helps people understand code
Best Practices for Scientific Computing
7. Document design and purpose, not mechanics
Document interfaces and reasons, not
implementations
Write structured and informative function headers
Write a README file for each set of functions
Refactor code in preference to explaining how it
works
Restructuring provides clarity, reducing space/time
needed for explanation
Embed the documentation for a piece of software in
that software
Embed in function headers, README files, but also
interactive notebooks (e.g. knitr, IPython/Jupyter)
Increases likelihood that code changes will be reflected in
the documentation
… good documentation helps people understand code
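A minimal sketch of a structured function header in MATLAB's conventional help format (the function, file format, and threshold are illustrative):

function meanRT = computeMeanRT(fileName)
%COMPUTEMEANRT Compute the mean reaction time from a trial-level data file.
%
%   meanRT = COMPUTEMEANRT(fileName) reads the file fileName (a CSV file
%   with a column named 'rt', reaction times in seconds) and returns the
%   mean reaction time after removing trials faster than 0.1 s.
%
%   Example:
%       meanRT = computeMeanRT('subject01_rt.csv');
%
%   See also MEAN, READTABLE.

data   = readtable(fileName);
rt     = data.rt;
meanRT = mean(rt(rt > 0.1));
end

Typing help computeMeanRT at the command prompt prints this header, so the documentation travels with the code it describes.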
Best Practices for Scientific Computing
8. Collaborate
Use (pre-merge) code reviews
Like proofreading manuscripts, but for code
Aims to identify bugs and improve clarity; other benefits
are spreading knowledge and learning to code
Best Practices for Scientific Computing
8. Collaborate
Use (pre-merge) code reviews
Like proofreading manuscripts, but for code
Aims to identify bugs and improve clarity; other benefits
are spreading knowledge and learning to code
Use pair programming when bringing someone new
up to speed and when tackling tricky problems
Intimidating, but really helpful
Best Practices for Scientific Computing
8. Collaborate
Use (pre-merge) code reviews
Like proofreading manuscripts, but for code
Aims to identify bugs and improve clarity; other benefits
are spreading knowledge and learning to code
Use pair programming when bringing someone new
up to speed and when tackling tricky problems
Intimidating, but really helpful
Use an issue tracking tool
Perhaps mainly useful for large, collaborative coding
efforts, but many code-sharing platforms (e.g. GitHub,
Bitbucket) come with these tools
Best Practices for Scientific Computing
Useful resources
www.ru.nl/donders
www.bramzandbelt.com
Journal Club - Best Practices for Scientific Computing

  • 1. Bram Zandbelt Best Practices for Scientific Computing: Principles & MATLAB Examples
  • 2. Community Page Best Practices for Scientific Computing Greg Wilson1 *, D. A. Aruliah2 , C. Titus Brown3 , Neil P. Chue Hong4 , Matt Davis5 , Richard T. Guy6¤ , Steven H. D. Haddock7 , Kathryn D. Huff8 , Ian M. Mitchell9 , Mark D. Plumbley10 , Ben Waugh11 , Ethan P. White12 , Paul Wilson13 1 Mozilla Foundation, Toronto, Ontario, Canada, 2 University of Ontario Institute of Technology, Oshawa, Ontario, Canada, 3 Michigan State University, East Lansing, Michigan, United States of America, 4 Software Sustainability Institute, Edinburgh, United Kingdom, 5 Space Telescope Science Institute, Baltimore, Maryland, United States of America, 6 University of Toronto, Toronto, Ontario, Canada, 7 Monterey Bay Aquarium Research Institute, Moss Landing, California, United States of America, 8 University of California Berkeley, Berkeley, California, United States of America, 9 University of British Columbia, Vancouver, British Columbia, Canada, 10 Queen Mary University of London, London, United Kingdom, 11 University College London, London, United Kingdom, 12 Utah State University, Logan, Utah, United States of America, 13 University of Wisconsin, Madison, Wisconsin, United States of America Introduction Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software. Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around developing new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, combining disparate datasets to assess synthetic problems, and other computational tasks. Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [1,2]. However, 90% or more of them are primarily self-taught [1,2], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation. We believe that software is just another kind of experimental apparatus [3] and should be built, checked, and used as carefully as any physical apparatus. However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is [4,5]. This can lead to serious errors impacting the central conclusions of published research [6]: recent high-profile retractions, technical comments, and corrections because of errors in computational methods error from another group’s code was not discovered until after publication [6]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others. 
This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommenda- tions are based on several decades of collective experience both building scientific software and teaching computing to scientists [17,18], reports from many other groups [19–25], guidelines for commercial and open source software development [26,27], and on empirical studies of scientific computing [28–31] and software development in general (summarized in [32]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can used for focusing on the underlying scientific questions. Our practices are summarized in Box 1; labels in the main text such as ‘‘(1a)’’ refer to items in that summary. For reasons of space, we do not discuss the equally important (but independent) issues of reproducible research, publication and citation of code and data, and open science. We do believe, however, that all of these will be much easier to implement if scientists have the skills we describe. Citation: Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, et al. (2014) Best Practices for Scientific Computing. PLoS Biol 12(1): e1001745. doi:10.1371/journal.pbio.1001745 Academic Editor: Jonathan A. Eisen, University of California Davis, United States of America Published January 7, 2014 Copyright: ß 2014 Wilson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Best Practices for Scientific Computing Wilson et al. (2014) Best Practices for Scientific Computing. PLoS Biol, 12(1), e1001745.
  • 3. Best Practices for Scientific Computing Why is this important? We rely heavily on research software No software, no research http://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers 2014 survey among 417 UK researchers
  • 4. Best Practices for Scientific Computing https://innoscholcomm.silk.co/ Tools that researchers use in practice
  • 5. Best Practices for Scientific Computing Why is this important? We rely heavily on research software No software, no research Many of us write software ourselves From single command-line scripts to multi-gigabyte code repositories
  • 6. Best Practices for Scientific Computing adapted from Leek & Peng (2015) Nature, 520(7549), 612. [slide shows the 'Data pipeline' figure from the comment 'P values are just the tip of the iceberg': Experimental design → Data collection → Raw data → Data cleaning → Tidy data → Exploratory data analysis → Potential statistical models → Statistical modelling → Summary statistics → Inference → P value; the final inferential step draws extreme scrutiny, the earlier steps little debate]
  • 7. Best Practices for Scientific Computing Why is this important? 2014 survey among 417 UK researchers http://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers We rely heavily on research software No software, no research Many of us write software ourselves From single command-line scripts to multi-gigabyte code repositories
  • 8. Best Practices for Scientific Computing Why is this important? 2014 survey among 417 UK researchers http://www.software.ac.uk/blog/2014-12-04-its-impossible-conduct-research-without-software-say-7-out-10-uk-researchers We rely heavily on research software No software, no research Many of us write software ourselves From single command-line scripts to multi-gigabyte code repositories Not all of us have had formal training Most of us have learned it “on the street”
  • 9. Best Practices for Scientific Computing Why is this important? [slide shows the news article 'A Scientist’s Nightmare: Software Problem Leads to Five Retractions'] Miller (2006) Science, 314(5807), 1856-1857. We rely heavily on research software No software, no research Many of us write software ourselves From single command-line scripts to multi-gigabyte code repositories Not all of us have had formal training Most of us have learned it “on the street” Poor coding practices can lead to serious problems Worst case perhaps: 1 coding error led to 5 retractions: 3 in Science, 1 in PNAS, 1 in the Journal of Molecular Biology
  • 10. Best Practices for Scientific Computing Why now? Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises
  • 11. Best Practices for Scientific Computing Why now? Ince et al. (2012) Nature, 482(7386), 485-488. [slide shows the first page of the perspective 'The case for open computer programs'] Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises
  • 12. Best Practices for Scientific Computing Why now? Peng (2011) Science, 334(6060), 1226-1227. [slide shows the first page of the perspective 'Reproducible Research in Computational Science', including Fig. 1, the reproducibility spectrum: publication only → publication + code → code and data → linked and executable code and data → full replication] Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises
  • 13. Best Practices for Scientific Computing Why now? [slide shows the first pages of Nosek et al. (2015) 'Promoting an open research culture', Science, 348(6242), 1422-1425, and Ince et al. (2012) 'The case for open computer programs', Nature, 482(7386), 485-488] Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises
  • 14. Best Practices for Scientific Computing Why now? [slide repeats the article pages from slide 13: Nosek et al. (2015) Science, 348(6242), 1422-1425, and Ince et al. (2012) Nature, 482(7386), 485-488] Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises
  • 15. Best Practices for Scientific Computing Why now? Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises
  • 16. Best Practices for Scientific Computing Why now? Call for increased scrutiny and openness Following cases of scientific misconduct, widespread questionable research practices, and reproducibility crises Netherlands Organisation for Scientific Research Open where possible, protected where needed NWO and Open Access Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020 Version 1.0 11 December 2013 http://www.nwo.nl/binaries/content/assets/nwo/documents/nwo/open-science/open-access-flyer-2012-def-eng-scherm.pdf http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf
  • 17. Best Practices for Scientific Computing What should we (try to) do?
1. Write programs for people … so you and others quickly understand the code
2. Let the computer do the work … to increase speed and reproducibility
3. Make incremental changes … to increase productivity and reproducibility
4. Don't repeat yourself (or others) … repetitions are vulnerable to errors and a waste of time
5. Plan for mistakes … almost all code has bugs, but not all bugs cause errors
6. Optimize software only after it works … and perhaps only if it is worth the effort
7. Document design and purpose, not implementation … to clarify understanding and usage
8. Collaborate … to maximize reproducibility, clarity, and learning
  • 18. Best Practices for Scientific Computing 1. Write programs for people, not computers … so you and others quickly understand the code A program should not require its readers to hold more than a handful of facts in memory at once Break up programs into easily understood functions, each of which performs an easily understood task
  • 19. Best Practices for Scientific Computing 1. Write programs for people, not computers … so you and others quickly understand the code [Diagram: a monolithic script X consisting of one long list of statements, contrasted with a script X that calls functions a, b, and c, each of which contains its own short block of statements]
  • 20. Best Practices for Scientific Computing 1. Write programs for people, not computers … so you and others quickly understand the code A program should not require its readers to hold more than a handful of facts in memory at once Break up programs into easily understood functions, each of which performs an easily understood task Make names consistent, distinctive, and meaningful e.g. fileName, meanRT rather than tmp, result e.g. camelCase, pothole_case, UPPERCASE
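A minimal sketch of these two points in MATLAB; the function name, file format, and column name rt are hypothetical and not taken from the slides:

function meanRT = computeMeanRT(fileName)
% COMPUTEMEANRT Compute the mean reaction time from a CSV data file.
%   meanRT = computeMeanRT(fileName) reads the column 'rt' from the file
%   and returns its mean, ignoring missing trials.
dataTable = readtable(fileName);            % load trial-by-trial data
meanRT    = mean(dataTable.rt, 'omitnan');  % average reaction time, NaNs ignored
end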
  • 21. Best Practices for Scientific Computing 1. Write programs for people, not computers … so you and others quickly understand the code
A program should not require its readers to hold more than a handful of facts in memory at once Break up programs into easily understood functions, each of which performs an easily understood task Make names consistent, distinctive, and meaningful e.g. fileName, meanRT rather than tmp, result e.g. camelCase, pothole_case, UPPERCASE Make code style and formatting consistent Use existing guidelines and checklists
MATLAB Guidelines 2.0, Richard Johnson (MATLAB® @Work, richj@datatool.com, www.datatool.com)
http://nl.mathworks.com/matlabcentral/fileexchange/46056-matlab-style-guidelines-2-0
http://nl.mathworks.com/matlabcentral/fileexchange/48121-matlab-coding-checklist
CODING CHECKLIST
GENERAL DESIGN: Do you have a good design? Is the code traceable to requirements? Does the code support automated use and testing? Does the code have documented test cases? Does the code adhere to your designated coding style and standard? Is the code free from Code Analyzer messages? Does each function perform one well-defined task? Is each class cohesive? Have you refactored out any blocks of code repeated unnecessarily? Are there unnecessary global constants or variables? Does the code have appropriate generality? Does the code make naïve assumptions about the data? Does the code use appropriate error checking and handling? If performance is an issue, have you profiled the code?
UNDERSTANDABILITY: Is the code straightforward and does it avoid "cleverness"? Has tricky code been rewritten rather than commented? Do you thoroughly understand your code? Will anyone else? Is it easy to follow and understand?
CODE CHANGES: Have the changes been reviewed as thoroughly as initial development would be? Do the changes enhance the program's internal quality rather than degrading it? Have you improved the system's modularity by breaking functions into smaller functions? Have you improved the programming style (names, formatting, comments)? Have you introduced any flaws or limitations?
  • 22. Best Practices for Scientific Computing 2. Let the computer do the work … to increase speed and reproducibility Make the computer repeat tasks Go beyond the point-and-click approach
  • 23. Best Practices for Scientific Computing 2. Let the computer do the work … to increase speed and reproducibility Make the computer repeat tasks Go beyond the point-and-click approach Save recent commands in a file for re-use Log session’s activity to file In MATLAB: diary
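A minimal sketch of logging a session with diary; the log file name and the analysis commands are illustrative:

diary('analysis_session.log')   % start logging typed commands and their output to a file
x = rand(10, 1);                % analysis commands are recorded in the log as they run
disp(mean(x))
diary off                       % stop logging; keep the log file with the results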
  • 24. Best Practices for Scientific Computing 2. Let the computer do the work … to increase speed and reproducibility Make the computer repeat tasks Go beyond the point-and-click approach Save recent commands in a file for re-use Log session’s activity to file In MATLAB: diary Use a build tool to automate workflows e.g. SPM’s matlabbatch, IPython/Jupyter notebook, SPSS syntax SPM’s matlabbatch
  • 25. Best Practices for Scientific Computing 2. Let the computer do the work http://jupyter.org/
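Build tools such as matlabbatch and Jupyter have their own interfaces; as a more generic, hedged sketch of the same idea in plain MATLAB, a scripted loop can replace repeated point-and-click work. The folder, file pattern, and computeMeanRT function below are hypothetical:

dataFiles = dir(fullfile('data', 'subject*.csv'));    % find all input files
nFiles    = numel(dataFiles);
meanRT    = nan(nFiles, 1);
for iFile = 1:nFiles
    inputFile     = fullfile('data', dataFiles(iFile).name);
    meanRT(iFile) = computeMeanRT(inputFile);         % identical steps for every file
end
save(fullfile('data', 'meanRT_all_subjects.mat'), 'meanRT')   % one reproducible output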
  • 26. Best Practices for Scientific Computing 3. Make incremental changes Work in small steps with frequent feedback and course correction Agile development: add/change one thing at a time, aim for working code at completion … to increase productivity and reproducibility
  • 27. Best Practices for Scientific Computing 3. Make incremental changes … to increase productivity and reproducibility Work in small steps with frequent feedback and course correction Agile development: add/change one thing at a time, aim for working code at completion Use a version control system Documents who made what contributions when Provides access to entire history of code + metadata Many free options to choose from: e.g. git, svn http://www.phdcomics.com/comics/archive.php?comicid=1531
  • 28. Best Practices for Scientific Computing http://swcarpentry.github.io/git-novice/01-basics.html Different versions can be saved. Multiple versions can be merged. Changes are saved sequentially ("snapshots").
  • 29. Best Practices for Scientific Computing 3. Make incremental changes http://www.phdcomics.com/comics/archive.php?comicid=1531 Work in small steps with frequent feedback and course correction Agile development: add/change one thing at a time, aim for working code at completion Use a version control system Documents who made what contributions when Provides access to entire history of code + metadata Many free options to choose from: e.g. git, svn Put everything that has been created manually in version control Advantages of version control also apply to manuscripts, figures, presentations … to increase productivity and reproducibility
  • 30. Best Practices for Scientific Computing … https://github.com/smathot/materials_for_P0001/commits/master/manuscript
  • 31. Best Practices for Scientific Computing 4. Don’t repeat yourself (or others) Every piece of data must have a single authoritative representation in the system Raw data should have a single canonical version … repetitions are vulnerable to errors and a waste of time http://swcarpentry.github.io/v4/data/mgmt.pdf
  • 32. Best Practices for Scientific Computing 4. Don’t repeat yourself (or others) … repetitions are vulnerable to errors and a waste of time http://swcarpentry.github.io/v4/data/mgmt.pdf
  • 33. Best Practices for Scientific Computing 4. Don’t repeat yourself (or others) … repetitions are vulnerable to errors and a waste of time Every piece of data must have a single authoritative representation in the system Raw data should have a single canonical version Modularize code rather than copying and pasting Avoid “code clones”, they make it difficult to change code as you have to find and update multiple fragments
  • 34. Best Practices for Scientific Computing 4. Don't repeat yourself (or others) … repetitions are vulnerable to errors and a waste of time [Diagram: scripts X and Y each containing their own pasted copy of the same block of statements (a "code clone"), contrasted with scripts X and Y that call shared functions instead (script X calls functions a and b; script Y calls functions a and c)]
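A hedged sketch of such a refactoring in MATLAB; the preprocessing steps and names are made up for illustration. The duplicated statements move into one shared function, so a bug fix or change has to be made in only one place:

function cleanData = preprocessData(rawData)
% PREPROCESSDATA Drop missing values and z-score the remaining data.
cleanData = rawData(~isnan(rawData));                           % remove missing trials
cleanData = (cleanData - mean(cleanData)) ./ std(cleanData);    % z-score
end

% scriptX.m and scriptY.m now both contain the single call
%   cleanData = preprocessData(rawData);
% instead of their own pasted copies of these statements.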
  • 35. Best Practices for Scientific Computing 4. Don’t repeat yourself (or others) … repetitions are vulnerable to errors and a waste of time Every piece of data must have a single authoritative representation in the system Raw data should have a single canonical version Modularize code rather than copying and pasting Avoid “code clones”, they make it difficult to change code as you have to find and update multiple fragments Re-use code instead of rewriting it Chances are somebody else made it: e.g. look at SPM Extensions (~200), NITRC (810), CRAN (7395), MATLAB File Exchange (25297), PyPI (68247)
  • 36. Best Practices for Scientific Computing 5. Plan for mistakes Add assertions to programs to check their operation Assertions are internal self-checks that verify whether inputs, outputs, etc. are in line with the programmer's expectations In MATLAB: assert … almost all code has bugs, but not all bugs cause errors
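A minimal sketch of assert in use; the function and variable names are illustrative:

function meanRT = averageRT(rt)
assert(isnumeric(rt) && isvector(rt), 'rt must be a numeric vector')   % check input type
assert(all(rt > 0), 'reaction times must be positive')                 % check input values
meanRT = mean(rt);
assert(isfinite(meanRT), 'mean reaction time is not finite')           % check output
end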
  • 37. Best Practices for Scientific Computing 5. Plan for mistakes Add assertions to programs to check their operation Assertions are internal self-checks that verify whether inputs, outputs, etc. are in line with the programmer's expectations In MATLAB: assert Use an off-the-shelf unit testing library Unit tests check whether code returns correct results In MATLAB: runtests … almost all code has bugs, but not all bugs cause errors
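A sketch of a function-based test file that MATLAB's runtests can discover and run; averageRT is the hypothetical function from the assertion example, and the expected values are made up:

% test_averageRT.m (the file name must match the main function name)
function tests = test_averageRT
tests = functiontests(localfunctions);
end

function testSimpleMean(testCase)
verifyEqual(testCase, averageRT([300 400 500]), 400)   % known input, known answer
end

function testSingleTrial(testCase)
verifyEqual(testCase, averageRT(250), 250)             % a scalar input is also valid
end

% Run all tests in the current folder with:
%   results = runtests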
  • 38. Best Practices for Scientific Computing 5. Plan for mistakes Add assertions to programs to check their operation Assertions are internal self-checks that verify whether inputs, outputs, etc. are in line with the programmer's expectations In MATLAB: assert Use an off-the-shelf unit testing library Unit tests check whether code returns correct results In MATLAB: runtests Turn bugs into test cases To ensure that they will not happen again … almost all code has bugs, but not all bugs cause errors
  • 39. Best Practices for Scientific Computing 5. Plan for mistakes Add assertions to programs to check their operation Assertions are internal self-checks that verify whether inputs, outputs, etc. are in line with the programmer's expectations In MATLAB: assert Use an off-the-shelf unit testing library Unit tests check whether code returns correct results In MATLAB: runtests Turn bugs into test cases To ensure that they will not happen again Use a symbolic debugger Symbolic debuggers make it easy to improve code Can detect bugs and sub-optimal code, run code step-by-step, track values of variables … almost all code has bugs, but not all bugs cause errors
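A few of MATLAB's built-in debugging commands, shown as a sketch of typical use; the function name and line number in the breakpoint are illustrative:

dbstop if error             % pause execution whenever an uncaught error occurs
dbstop in averageRT at 3    % set a breakpoint at a specific line of a function
% run the analysis; when MATLAB pauses, inspect variables at the K>> prompt
dbstep                      % execute the next line
dbcont                      % continue to the next breakpoint
dbquit                      % leave debug mode
dbclear all                 % remove all breakpoints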
  • 40. Best Practices for Scientific Computing
  • 41. Best Practices for Scientific Computing
  • 42. Best Practices for Scientific Computing
  • 43. Best Practices for Scientific Computing 6. Optimize software only after it works correctly Use a profiler to identify bottlenecks Useful if your code takes a long time to run In MATLAB: profile … and perhaps only if it is worth the effort
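A minimal sketch of profiling a long-running analysis; runAnalysis is a hypothetical name for your own script:

profile on         % start collecting timing data
runAnalysis        % the code you want to speed up
profile off        % stop collecting
profile viewer     % open the report: time spent per function and per line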
  • 44. Best Practices for Scientific Computing 6. Optimize software only after it works correctly Use a profiler to identify bottlenecks Useful if your code takes a long time to run In MATLAB: profile Write code in the highest-level language possible Programmers produce roughly the same number of lines of code per day, regardless of language High-level languages achieve complex computations in a few lines of code … and perhaps only if it is worth the effort
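The same point can be illustrated within MATLAB itself: a high-level, vectorized expression replaces an explicit loop. A small made-up example:

rt = [350 412 NaN 298 505];         % made-up reaction times, NaN = missed trial

% Low-level, loop-based version
total = 0; n = 0;
for i = 1:numel(rt)
    if ~isnan(rt(i))
        total = total + rt(i);
        n = n + 1;
    end
end
meanLoop = total / n;

% High-level, one-line version
meanVec = mean(rt(~isnan(rt)));     % same result, far less code to read and maintain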
  • 45. Best Practices for Scientific Computing 7. Document design and purpose, not mechanics Document interfaces and reasons, not implementations Write structured and informative function headers
 Write a README file for each set of functions … good documentation helps people understand code see: http://www.mathworks.com/matlabcentral/fileexchange/4908-m-file-header-template/content/template_header.m
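A hedged sketch of such a header, loosely in the spirit of the template linked above; the function, inputs, and outputs are illustrative:

function [meanRT, sdRT] = summarizeRT(rt)
% SUMMARIZERT Summarize reaction time data.
%
%   [meanRT, sdRT] = SUMMARIZERT(rt) returns the mean and standard
%   deviation of the reaction times in the numeric vector rt (in ms).
%
%   Inputs:
%     rt      numeric vector of reaction times in milliseconds
%   Outputs:
%     meanRT  mean reaction time
%     sdRT    standard deviation of reaction times
%
%   Author: <name>, <date>

meanRT = mean(rt);
sdRT   = std(rt);
end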
  • 46. Best Practices for Scientific Computing 7. Document design and purpose, not mechanics Document interfaces and reasons, not implementations Write structured and informative function headers
 Write a README file for each set of functions Refactor code in preference to explaining how it works Restructuring provides clarity, reducing space/time needed for explanation … good documentation helps people understand code
  • 47. Best Practices for Scientific Computing 7. Document design and purpose, not mechanics Document interfaces and reasons, not implementations Write structured and informative function headers
 Write a README file for each set of functions Refactor code in preference to explaining how it works Restructuring provides clarity, reducing space/time needed for explanation Embed the documentation for a piece of software in that software Embed in function headers, README files, but also interactive notebooks (e.g. knitr, IPython/Jupyter) Increases likelihood that code changes will be reflected in the documentation … good documentation helps people understand code
  • 48. Best Practices for Scientific Computing 8. Collaborate Use (pre-merge) code reviews Like proof-reading manuscripts, but for code Aims to identify bugs and improve clarity; other benefits are spreading knowledge and learning to code
  • 49. Best Practices for Scientific Computing 8. Collaborate Use (pre-merge) code reviews Like proof-reading manuscripts, but for code Aims to identify bugs and improve clarity; other benefits are spreading knowledge and learning to code Use pair programming when bringing someone new up to speed and when tackling tricky problems Intimidating, but really helpful
  • 50. Best Practices for Scientific Computing 8. Collaborate Use (pre-merge) code reviews Like proof-reading manuscripts, but for code Aims to identify bugs and improve clarity; other benefits are spreading knowledge and learning to code Use pair programming when bringing someone new up to speed and when tackling tricky problems Intimidating, but really helpful Use an issue tracking tool Perhaps mainly useful for large, collaborative coding efforts, but many code sharing platforms (e.g. Github, Bitbucket) come with these tools
  • 51. Best Practices for Scientific Computing Useful resources