a presentation explaining the what, how, and why of some features of science 2.0 (replication, registration, high power, bayesian statistics, estimation, the co-pilot multi-software approach, the distinction between confirmatory and exploratory analyses, and open science), using Steegen et al. (2014) as a running example.
3. Why can we definitively say that? Because psychology often does not meet the
five basic requirements for a field to be considered scientifically rigorous:
clearly defined terminology, quantifiability, highly controlled experimental
conditions, reproducibility and, finally, predictability and testability.
8. - mundane 'regular' misbehaviours present greater threats to the scientific
enterprise than those posed by high-profile misconduct cases such as
fraud.
- first assessment of questionable research practices (QRP)
- 2002 assessment: NIH-funded researchers
1768 mid-career respondents (52% response rate)
1479 early-career respondents (43% response rate)
10. - first assessment of QRP in psychology
- 2155 respondents (36% response rate)
12. the problems of QRP are widespread and have very severe
consequences
why is that the case?
“never attribute to malice what can be adequately explained
by incompetence”
the main reasons are a lack of guidelines and high
publication pressure
13. i’m not interested in fraud (e.g., diederik stapel, who made
up his own data)
preventing fraud requires a different approach
15. a new way of doing science that aims to increase the
confidence in research results
not one single, coherent whole
17. a demonstration of science 2.0 with a real study
reference:
Steegen, S., Dewitte, L., Tuerlinckx, F., & Vanpaemel, W.
(2014). Measuring the crowd within again: A pre-registered
replication study. Frontiers in Psychology, 5, 786, 1-8.
doi:10.3389/fpsyg.2014.00786
paper:
http://ppw.kuleuven.be/okp/_pdf/Steegen2014MTCWA.pdf
OSF page:
https://osf.io/ivfu6/
19. based on some recommendations on good research practices made in the
literature
• not exhaustive
• non-directive examples
• for inspiration
most recommendations can be implemented separately from each other
• not an all or none package deal
22. crowd within effect (vul & pashler, 2008)
• averaging multiple guesses from one
person provides a better estimate than
either guess alone
experiment
• 8 general knowledge questions
e.g., what percent of the world's roads
are in India?
• guess 1
guess 2
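a minimal sketch in R (simulated numbers, not the real data) of why averaging can help: if two guesses scatter independently around the truth, the noise partly cancels in the average; real guesses from one person are correlated, so the actual effect is smaller

# crowd within, simulated: two noisy guesses around a true value
set.seed(42)
truth <- 50
guess1 <- truth + rnorm(1000, sd = 10)  # independent noise: an idealization
guess2 <- truth + rnorm(1000, sd = 10)
avg <- (guess1 + guess2) / 2
mean(abs(guess1 - truth))               # mean error of a single guess
mean(abs(avg - truth))                  # mean error of the average: smaller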
23. 1. replication
2. registration
3. high power
4. bayesian statistics
5. alpha level
6. estimation
7. co-pilot multi-software approach
8. distinction between confirmatory and exploratory analyses
9. open science
what? how? why?
features of science 2.0
before data collection
after data collection/during data analysis
after data analysis
27. replication
how?
communicate with the original authors; ask for information and
feedback
ideal for a master's thesis
not much focus on creativity, but more on skill building
28. replication
why?
- lots of variability between studied phenomena
- lots of variability between labs/replications
- what can we learn from a single study?
39. registration
what?
we specified all research details before data
collection
data collection
• sample size planning (stopping rule; see
below)
• recruitment: how to recruit participants
(e.g., pool)
data analysis
• data cleaning plan (when to delete data)
• analysis plan
- which exact hypotheses to test
- which variables to use
- analyses for testing the hypotheses
• code for the analyses
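to make the analysis plan concrete, the analysis code can be written and tested on simulated data before any real data exist; a minimal sketch in R with hypothetical variable names (not the actual Steegen et al. script):

# pre-registered confirmatory analysis, written before data collection
run_confirmatory <- function(dat) {
  # planned test: is the error of guess 1 larger than the error of the average?
  t.test(dat$err_guess1, dat$err_avg, paired = TRUE, alternative = "greater")
}
# verify the code runs on placeholder data before collecting real data
sim <- data.frame(err_guess1 = rexp(60), err_avg = rexp(60))
run_confirmatory(sim)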
41. registration
what?
we specified all research details before data
collection
experimental details (optional)
• experimental materials
- stimuli (questions)
- exact instructions
• experimental procedure
- randomization etc
42. registration
how?
• Registered Report
- new format of publishing
- review prior to data collection
- accepted papers are then (almost)
guaranteed publication if the authors
follow through with the registered
methodology
AIMS Neuroscience; Attention,
Perception & Psychophysics; Cortex;
Drug and Alcohol Dependence;
Experimental Psychology; Frontiers in
Cognition; Perspectives on Psychological
Science; Social Psychology; …
50. registration
why?
prevent readers from thinking you might have exploited your
researcher degrees of freedom
extreme flexibility in
• data collection
- e.g., data peeking
• data analysis
- what is an outlier?
- when to add covariates?
- when to transform the data?
• reporting
- did you report all variables, conditions, experiments, analyses?
52. registration
why?
prevent readers from thinking you might have exploited your
researcher degrees of freedom
exploiting researcher degrees of freedom can lead to an increase in
false positives
-- without adjustment, a true null hypothesis will always be
rejected if sampling continues long enough
if you can convince readers that you didn’t exploit the researcher
degrees of freedom, they will put more confidence in your result; it
will be seen as more trustworthy
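a minimal sketch in R of the data-peeking point: even when the null hypothesis is true, testing after every batch and stopping at the first p < .05 pushes the false positive rate far above the nominal 5%

# optional stopping under a true null hypothesis
set.seed(123)
false_pos <- replicate(1000, {
  x <- numeric(0)
  hit <- FALSE
  while (length(x) < 200 && !hit) {
    x <- c(x, rnorm(10))             # collect 10 more participants
    hit <- t.test(x)$p.value < .05   # peek at the data
  }
  hit
})
mean(false_pos)                      # well above .05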
54. high power
what?
among the decisions you have to make and
register in advance is when you’ll stop
collecting data
our stopping rule was based on fixing the
sample size
fixing the sample size was based on a
power calculation
power = P(reject null hypothesis | null
hypothesis is false)
55. high power
what?
as far as constraining the researcher
degrees of freedom is concerned, low power
is as good as high power
we aimed for high power (95%)
57. high power
how?
compute sample size needed to achieve
desired power level
- given the statistical test
- given the significance level
- given the effect size (e.g., based on previous
studies)
G*Power, R packages (pwr), …
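a minimal sketch with the pwr package (the effect size d = 0.45 is an illustrative assumption, not the value used in the study):

library(pwr)
# sample size for a paired t test at alpha = .05 and power = .95
pwr.t.test(d = 0.45, sig.level = .05, power = .95, type = "paired")
# the n in the output is the required number of pairs; round up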
58. high power
why?
• low power reduces the probability of discovering effects that are
there
• low power reduces the probability that a significant result reflects a
true effect (button et al., 2013)
• low power leads to an inflation of estimated effect sizes
• only overestimates will be significant
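a minimal sketch in R of that last point: when power is low, the studies that happen to reach significance are exactly those that overestimated the effect

# true effect d = 0.3, but only n = 20: an underpowered design
set.seed(7)
res <- replicate(5000, {
  x <- rnorm(20, mean = 0.3)
  c(d = mean(x) / sd(x), sig = t.test(x)$p.value < .05)
})
mean(res["d", ])                     # across all studies: close to 0.3
mean(res["d", res["sig", ] == 1])    # significant studies only: inflated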
59. there are other stopping rules!
ways to decide when to stop collecting data:
- when I have a participant with the name of my mother
- availability
--- when the day/test week is over
- when I have a fixed number of participants
--- 100
--- based on power calculations
--- based on accuracy in parameter estimation
60. in general, the most important thing is that you do it, more
than how you do it
all these stopping rules are equally valid to constrain the
researcher degrees of freedom
but some will lead to better research than others
--- more informative
--- more precise and less biased estimates of, e.g.,
effect size
62. NHST & Bayesian testing
what?
we did not just use Null Hypothesis
Significance Testing (NHST, i.e., p-values) but
also Bayes factors (the p-value of Bayesian
statistics)
the core of bayesian statistics is bayes’ rule
p(a|b) = p(b|a) p(a) / p(b)
bayes treats probabilities as degrees of
belief
63. NHST & Bayesian testing
what?
we can use bayes to compute the belief in
our hypothesis H, given the data d
p(H|d) = p(d|H) p(H) / p(d)
bayes’ rule tells us how we should update
our belief about H after observing data d
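a minimal worked example with made-up numbers: start undecided about H and observe data that are four times more likely under H than under not-H

p_H  <- 0.5                               # prior belief in H
p_dH <- 0.8                               # p(d | H)
p_dn <- 0.2                               # p(d | not H)
p_d  <- p_dH * p_H + p_dn * (1 - p_H)     # p(d), by total probability
p_dH * p_H / p_d                          # posterior p(H | d) = 0.8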
64. NHST & Bayesian testing
how?
• several online tools (e.g., Rouder’s
website)
• BayesFactor package in R (Morey &
Rouder, 2014)
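a minimal sketch with the BayesFactor package (simulated guess errors, hypothetical variable names):

library(BayesFactor)
set.seed(1)
err_guess1 <- rexp(60, rate = 0.1)     # made-up error scores
err_avg <- err_guess1 - rnorm(60, 1, 2)
# paired Bayesian t test; the output is the Bayes factor BF10
ttestBF(x = err_guess1, y = err_avg, paired = TRUE)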
65. NHST & Bayesian testing
why?
• p(H|d) seems exactly what science
needs
• evidence for null hypothesis
• intuitive to interpret
• consistent: correct answer in large
sample limit
• exact for small sample size
• clear interpretation of evidence
• based on the observed data, not on
hypothetical replications of experiments
69. NHST & estimation
what?
we did not just use p-values and Bayes
factors but also effect size estimates and
their confidence intervals
how?
MATLAB, R, SPSS, ESCI (Cumming, 2013), …
why?
diverts focus from the presence of an effect
to the more informative size of an effect
and its precision
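a minimal sketch in base R (same simulated errors as above): t.test already returns the mean difference with its confidence interval, and a standardized effect size takes one more line

set.seed(1)
err_guess1 <- rexp(60, rate = 0.1)
err_avg <- err_guess1 - rnorm(60, 1, 2)
tt <- t.test(err_guess1, err_avg, paired = TRUE)
tt$estimate                            # mean difference
tt$conf.int                            # its 95% confidence interval
mean(err_guess1 - err_avg) / sd(err_guess1 - err_avg)  # Cohen's d, paired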
71. co-pilot multi-software approach
what/how?
• two people independently processed and
analyzed the same data …
• … using different software (MATLAB,
SPSS)
why?
decreases the likelihood of errors
errors are easily made:
50% of published papers in psychology
contain reporting errors (bakker &
wicherts, 2011)
e.g., an error in sample size planning (G*Power)
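a minimal sketch of how the final co-pilot check might look (hypothetical file names): pilot A exports the test statistic from MATLAB, pilot B recomputes it independently in R, and the two values are compared

dat <- read.csv("guesses.csv")         # shared raw data, hypothetical file
t_r <- t.test(dat$err_guess1, dat$err_avg, paired = TRUE)$statistic
t_matlab <- scan("t_from_matlab.txt")  # value exported by pilot A
all.equal(unname(t_r), t_matlab, tolerance = 1e-6)  # TRUE if pipelines agree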
73. clear distinction between confirmatory and
exploratory (post hoc) analyses
what?
we indicated whether the analyses were
specified before seeing the data, or based
on the data (see registration)
how?
be transparent
easy when you have registered
why?
you still want to report analyses you
thought about too late! they can be useful
for generating hypotheses
75. open science
what?
we made our full research output
publicly available to everybody
- experimental materials (stimuli,
questionnaire items, instructions, and so
on)
- raw data
- processed data
- code for data processing
- code for confirmatory analyses
- code for post-hoc analyses
- paper
76. open science
how?
Open Science Framework (public)
-online repository
-free
-under development
goal: share and find research materials
make study materials (experimental
material, data, code, …) public so that
other researchers can find, use and cite
them
several other sharing possibilities
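a minimal sketch with the osfr R package (assuming it is installed; the GUID comes from the project URL above):

library(osfr)
proj <- osf_retrieve_node("ivfu6")     # the Steegen et al. OSF project
osf_ls_files(proj)                     # list the shared materials, data and code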
78. open science
how?
Open Science Framework (public)
make sure OSF is not the only place
where your stuff is!
who knows what will happen with these
servers in 20 years?
unclear what the best data format is
79. open science
why?
• the current standards of what is
considered research output (paper with
summary statistics and conclusion) are
not inspired by desiderata for good
science, but rather by arbitrary and
outdated technical constraints (paper +
publishing costs)
• if we were starting science from scratch right
now, in the computer and internet age,
we would probably set a completely
different standard
80. open science
why?
• facilitates
- replication studies
- follow-up studies (e.g., use same
stimuli)
- new or re-analyses
- meta-analyses
- accumulation of scientific
knowledge
- detection of errors or fraud
• yields useful teaching material
81. open science
why?
• increases visibility
• increases citability
• decreases number of emails about
experiments, data or analyses, …
• is a moral obligation to the taxpayer
(publicly funded research is a public
good)
85. 1. replication
2. registration
3. high power
4. bayesian statistics
5. alpha level
6. estimation
7. co-pilot multi-software approach
8. distinction between confirmatory and exploratory analyses
9. open science
what? how? why?
why not?
features of science 2.0
before data collection
after data collection/during data analysis
after data analysis
86. replication
why not
- it is impossible!
--- things are never exactly the same (e.g., the population)
--- the details of the original study are lost (e.g., which questions
were used in a post-experimental interview)
- it is a waste of time and resources!
--- should we value novelty more than truth?
- it is not good for my career
--- can I publish this?
87. registration
why not?
• it takes time, thought and effort
• it is harder than it seems!
• writing the code helps a lot
• exploration might be the only possibility
• domain specific (qualitative studies? complex studies?)
88. high power
why not?
• can be hard to guess expected effect size or trust published effect
size
• often requires large sample size
• collaborate!
• restricted to NHST framework
94. this illustration used a very simple study
• replication study
• easily administered 8-item questionnaire
• basic t test
this made pre-registration, sample size planning, high power,
estimation, bayesian statistics, sharing protocol, code and data,
the co-pilot multi-software approach, etc. probably much easier than
in most other studies
but everything is also possible (though harder) for
non-replication studies!
feasibility will depend on the type and scope of your research
95. science 2.0 is no package deal
--- you can register, but not share
--- you can share, but not use bayes
some practices are graded
--- you can register without code
--- you can estimate without reporting CI
97. • the (psychological) literature is littered with spurious
findings
• which results can you trust?
– has this result been replicated?
– did the researchers exploit their researcher degrees of
freedom?
– is the evidence based on NHST with a liberal alpha level?
– was the analysis correct? (e.g., at least check the dfs; better,
redo the analysis yourself with the shared data and code)
– ???