© 2015, Vadim Tantsyura, ALL RIGHTS RESERVED.
Impact of Study Size on Data Quality in Regulated Clinical Research:
Analysis of Probabilities of Erroneous Regulatory Decisions in the Presence of Data Errors
Vadim Tantsyura
A Doctoral Dissertation in the Program in Health Policy and Management
Submitted to the Faculty of the Graduate School of Health Sciences and Practice
In Partial Fulfillment of the Requirements
for the Degree of Doctor of Public Health
at New York Medical College
2015
© 2015, Vadim Tantsyura, ALL RIGHTS RESERVED.
Acknowledgements
I would like to thank my colleagues, friends and family who have provided support
throughout this journey. Thank you, Dr. Kenneth Knapp, for keeping this research focused.
With your guidance and foresight, goals became reachable. Dr. Imogene McCanless Dunn,
you stepped into my life many years ago and introduced me to the true meaning of scientific
inquiry. You were the first one who planted the seeds more than a decade ago that led to this
research. Your wisdom and ethics have helped me pull the pieces of this project together. I
thank you, Kaye Fendt, for introducing this research area to me, for ideas, and for bringing
this important issue forward so that many people may benefit. Your gentle touch changed
the direction of my thinking multiple times over the past ten years. I would like to express
my deepest gratitude to you, Dr. Jules Mitchel, for supporting me from my earlier years in
the industry, for shaping my views and writing style, and for your long-time mentorship and
regular encouragement. Your frame of reference allowed me to finalize my own thoughts,
and your impact on my work is much greater than you might think. Dr. Rick Ittenbach, I
greatly appreciate the time you took to listen, and I aspire to be someday as fair and wise as
you are. I would also like to thank Joel Waters and Amy Liu for “turning things around” for
me. I would especially like to thank my wonderful family, who allowed me to step away to
complete my journey. I will make up the lost time, I promise. I hope my children, Eva,
Daniel and Joseph, understand their education will go on for the rest of their lives. I hope
they enjoy their journeys as much as I have mine. I thank my parents, Volodymyr and Lyuba,
for their commitment to being role models for lifelong learning. Last, I would like to thank
my wife, Nadia. She has been a constant source of support through this process and fourteen
years of my other brainstorms. Thanks for understanding why I needed to do this.
Abstract.
Background: Data errors are undesirable in clinical research because they tend to increase the
chance of false-negative and false-positive study conclusions. While recent literature points out
the inverse relationship between the study size and the impact of errors on data quality and study
conclusions, no systematic assessment of this phenomenon has been conducted. The IOM (1999)
definition of high quality data, described as “…data strong enough to support conclusions and
interpretations equivalent to those derived from error-free data”, is used for this study.
Purpose: To assess the statistical impact of the study size on data quality, and identify the areas
for potential policy changes.
Methods: A formal comparison between an error-free dataset and the same dataset with data
errors induced via the replacement of several data points with "erroneous" values was implemented
in this study. The data were simulated for one hundred and forty-seven hypothetical scenarios
using the Monte Carlo method. Each scenario was repeated 200 times, resulting in an equivalent
of 29,400 hypothetical clinical trials. The probabilities of correct, false-positive and false-
negative conclusions were calculated. Subsequently, the trend analysis of the simulated
probabilities was conducted and the best fit regression lines were identified.
Results: The data demonstrate a monotonic, logarithmic-like asymptotic increase towards 100%
in the probability of a correct study conclusion as the sample size increases. The strength of
association between study size and probabilities is high (R² = 0.84-0.93). The median probability
of the correct study conclusion equals or exceeds 99% for all larger studies, i.e., those with 200
observations per arm or more. The marginal effects of additional errors on the study conclusions
have also been demonstrated. For
the smaller studies (up to 200 subjects per arm), the variability of changes is high with
essentially no trend. The range of fluctuation in the median probability of the correct study
conclusion is within 3% (ΔPcorrect ∈ [-3%, 0%]). For the larger studies (n ≥ 200 per arm),
on the other hand, the variability of ΔPcorrect and the negative impact of errors on ΔPcorrect are
minimal (within a 0.5% range).
Limitations: The number of errors and study size are considered independent variables in this
experiment. However, non-zero correlation between the sample size and the number of errors
can be found in real trials.
Conclusions: (1) The "sample size effect," i.e., the neutralizing effect of the sample size on the
noise from data errors, was consistently observed in the case of a single data error as well as in
the case of an incremental increase in the number of errors. The data cleaning threshold
methodology as suggested by this manuscript can lead to a reduction in resource utilization. (2)
Error rates have been considered the gold standard of DQ assessment, but have not been widely
utilized in practice due to the prohibitively high cost. The proposed method suggests estimating
DQ using the simulated probability of correct/false-negative/false-positive study conclusions as
an outcome variable and the rough estimates of error rates as input variables.
TABLE OF CONTENTS:
Section I. Introduction ................................................................................................................. 9
Section II. Background and Literature Review .......................................................................... 13
Basic Concepts and Terminology .................................................................................. 13
Measuring Data Quality ................................................................................................. 24
DQ in Regulatory and Health Policy Decision-Making ................................................ 28
Error-Free-Data Myth and the Multi-Dimensional Nature of DQ ................................. 31
Evolution of DQ Definition and Emergence of Risk-Based Approach to DQ .............. 32
Data Quality and Cost .................................................................................................... 34
Magnitude of Data Errors in Clinical Research ............................................................. 41
Source Data Verification, Its Minimal Impact on DQ and the Emergence
of Central/Remote Review of Data ................................................................................ 44
Effect of Data Errors on Study Results .......................................................................... 46
Study Size, Effect and Rationale for the Study ............................................................... 46
Section III. Methods of Analysis ............................................................................................... 48
Objectives and End-Points ............................................................................................. 48
Synopsis ......................................................................................................................... 49
Main Assumptions ......................................................................................................... 50
Design of the Experiment .............................................................................................. 51
Data Generation ............................................................................................................. 53
Data Verification ............................................................................................................ 54
Analysis Methods ........................................................................................................... 55
Section IV. Results and Findings ............................................................................................... 57
Section V. Discussion ................................................................................................................ 66
Practical Implications ..................................................................................................... 67
Economic Impact ........................................................................................................... 72
Policy Recommendations ............................................................................................... 77
Study Limitations and Suggestions for Future Research ............................................... 80
Conclusions .................................................................................................................... 82
References .................................................................................................................................. 84
LIST OF APPENDIXES.
Appendix 1. Dimensions of DQ ..................................................................................... 93
Appendix 2. Simulation Algorithm ................................................................................ 94
Appendix 3. Excel VBA Code for Verification Program .............................................. 99
Appendix 4. Verification Program Output ................................................................... 105
Appendix 5. Simulation Results (SAS Output) ........................................................... 108
Appendix 6. Additional Verification Programming Results ........................................ 110
Appendix 7. Comparing the Mean in Two Samples .................................................... 113
LIST OF TABLES.
Table 1. Hypothesis Testing in the Presence of Errors .................................................. 15
Table 2. Risk-Based Approach to DQ – Principles and Implications ............................ 33
Table 3. R&D Cost per Drug ......................................................................................... 37
Table 4. Coding For “Hits” and “Misses” ..................................................................... 51
Table 5. Input Variables and Covariates ........................................................................ 52
Table 6. Summary of Data Generation .......................................................................... 52
Table 7. Analysis Methods ............................................................................................ 56
Table 8. Descriptive Statistics ........................................................................................ 58
Table 9. Descriptive Statistics by Sample Size Per Study Arm ..................................... 58
Table 10. Example of Adjustment in Type I and Type II Errors ................................... 68
Table 11. Sample Size Increase Associated with Reduction in Alpha .......................... 69
Table 12. Data Cleaning Cut-Off Estimates .................................................................. 71
Table 13. Proposed Source Data Verification Approach ............................................... 73
Table 14. Estimated Cost Savings ................................................................................. 76
I. Introduction.
Because clinical research, public health, regulatory and business decisions are largely
determined by the quality of the data these decisions are based on, individuals, businesses,
academic researchers, and policy makers all suffer when data quality (DQ) is poor. Many cases
published in the news media and scientific literature exemplify the magnitude of the DQ
problems that health care organizations and regulators face every day.1 Gartner estimates that
data quality problems cost the U.S. economy $600 billion a year. With regard to the
pharmaceutical industry, the issues surrounding DQ carry significant economic implications and
contribute to the large cost of pharmaceutical drug and device development. In fact, the cost of
clinical trials for medical product development has become prohibitive, and presents a significant
problem for future pharmaceutical research, which, in turn, may pose a threat to the public
health. This is why the National Institutes of Health (NIH) and the Patient-Centered Outcomes
Research Institute (PCORI) make investments in DQ research (Kahn et al., 2013; Zozus et al., 2015).
Data are collected during clinical trials on the safety and efficacy of new pharmaceutical
products, and the regulatory decisions that follow are based on these data. Ultimately,
reproducibility, the credibility of research, and regulatory decisions are only as good as the
underlying data. Moreover, decision makers and regulators often depend on the investigator’s
demonstration that the data on which conclusions are based are of sufficient quality to support
1 For example, the medical provider may change the diagnosis to a more serious one (especially when
doctors/hospitals feel that insurance companies do not give fair value for a treatment). Fisher, Lauria, Chengalur-
Smith and Wang (2006) cite a study where "40% of physicians reported that they exaggerated the severity of patient
condition, changed billing diagnoses, and reported non-existent symptoms to help patients recover medical
expenses. The reimbursement for bacterial pneumonia averages $2500 more than for viral pneumonia. Multiplying
this by the thousands of cases in hundreds of hospitals will give you an idea of how big the financial errors in the
public health assessment field could be”. A new edition of the book (Fisher, et al., 2012) presents multiple examples
of poor quality: “In industry, error rates as high as 75% are often reported, while error rates to 30% are typical. Of
the data in mission-critical databases, 1% to 10% may be inaccurate. More than 60% of surveyed firms had
problems with DQ. In one survey, 70% of the respondents reported their jobs had been interrupted at least once by
poor-quality data, 32% experienced inaccurate data entry, 25% reported incomplete data entry, 69% described the
overall quality of their data as unacceptable, and 44% had no system in place to check the quality of their data.”
them (Zozus, et al., 2015). However, with the exception of newly emerging NIH Collaboratory
DQ assessment standards, no clear operational definition of DQ has been adopted to date. As a
direct consequence of the lack of agreement on such a fundamental component of the quality
system, data cleaning processes vary, conservative approaches dominate, and too many resources
are devoted to meeting unnecessarily stringent quality thresholds. According to DQ expert Kaye
Fendt (2004), the considerable financial impact of DQ-related issues on the U.S. health care
system is due to the following: (1) “Our medical-scientific-regulatory system is built upon the
validity of clinical observation,” and (2) “The clinical trials and drug regulatory processes must
be trusted by our society”. It would therefore be reasonable to conclude that the importance of
DQ in modern healthcare, health research, and health policy will continue to grow.
Several experts in the field have been emphasizing the importance of DQ research for
more than a decade. For instance, K. Fendt stated at the West Coast Annual DIA (2004): “We
need a definition (Target) of data quality that allows the Industry to know when the data are
clean enough.” Dr. J. Woodcock, FDA, Deputy Commissioner for Operations and Chief
Operating Officer, echoed this sentiment at the Annual DIA (2006): “…we [the industry] need
consensus on the definition of high quality data.” As a result of combined efforts over the past 20
years, the definition of high-quality data has evolved from simple "error-free data" to the much
more sophisticated notion of an "absence of errors that matter" in data that are "fit for purpose" (CTTI, 2012).
The industry, however, is slow in adopting this new definition of DQ. Experts agree that
the industry is so big that, for numerous reasons, it is always slow to change. Some researchers
“in the trenches” are simply not aware of the new definition. Others are not yet ready to embrace
it due to organizational inertia and rigidity of the established processes, as well as because of
insufficient knowledge, misunderstanding, and misinterpretation caused by the complexity of the
new paradigm. In addition, because vendors and CROs make a lot of money on site monitoring,
monitoring has long been treated as the "holy grail" of QC and regulatory compliance at the site
level, despite its labor-intensive nature and low effectiveness. The fact that monitoring findings
have "no denominator" is another reason: each individual monitoring finding is given great
attention and is not put in perspective against the hundreds of other forms and data points that are
good. Also, the prevailing industry belief in the training and certification of monitors might need
reexamination; perhaps it is time to shift resources toward training and certifying coordinators
and investigators. Finally, the industry has not agreed on
what errors can be spotted by smart programs vs. what errors require review by monitors. In fact,
only one publication to date addressed this topic. The review by Bakobaki and colleagues (2012)
“determined that centralized [as opposed to manual / on-site] monitoring activities could have
identified more than 90% of the findings identified during on-site monitoring visits.”
Consequently, the industry continues to spend a major portion of the resources allocated for
clinical research on the processes and procedures related to the quality of secondary data – which
often have little direct impact on the study conclusions – while allowing some other critical
quality components (e.g., quality planning, triple-checking of critical variables, data
standardization, and documentation) to fall through the cracks. In my fifteen years in the
industry, I have heard numerous anecdotal accounts of similar situations and have witnessed
several examples firsthand. In a landmark new study, TransCelerate revealed what many clinical
research professionals suspected for a long time: that there is a relatively low proportion (less
than one-third) of data clarification queries that are related to “critical data” (Scheetz et al.,
2014). How much do the current processes that arise from imprecision in defining and
interpreting DQ cost pharmaceutical and device companies? Some publications encourage
policy changes (Ezekiel, 2003; Lorstad, 2004) and others estimate the potential savings in the
neighborhood of $4-9 billion annually in the U.S. (Ezekiel & Fuchs, 2008; Funning, Grahnén,
Eriksson, & Kettis-Lindblad, 2009; Getz et al., 2013; Tantsyura et al., 2015). Some cost savings are
likely to come from leveraging innovative computerized data validation checks and reduction in
manual efforts, such as source data verification (SDV). In previous decades, when paper case
report forms (CRFs) were used, the manual review component was essential in identifying and
addressing DQ issues. But with the availability of modern technology, this step of the process is
largely a wasted effort because the computer-enabled algorithms are able to identify data issues
much more efficiently.
Undoubtedly, an in-depth discussion of the definition of DQ is essential and timely.
Study size needs to be a key component of any DQ discussion. Recent literature points out the
inverse relationship between the study size and the impact of errors on the quality of the
decisions based on the data. However, the magnitude of this impact has not been systematically
evaluated. The intent of the proposed study is to fill this knowledge gap. The Monte Carlo
method was used to generate the data for the analysis. Because DQ is a severely under-
researched area, a thorough analysis of this important concept could lead to new economic and
policy breakthroughs. More specifically, if DQ thresholds can be established, or minimal impact
of data errors under certain conditions uncovered, then these advancements might potentially
simplify the data cleaning processes, free resources, and ultimately reduce the cost of clinical
research operations.
The primary focus of the current research is the sufficient assurance of data quality in clinical
trials for regulatory decision-making. Statistical, economic, regulatory, and policy aspects of the
issue will be examined and discussed.
II. Background and Literature Review.
Basic Concepts and Terminology
A data error occurs when “a data point inaccurately represents a true value…” (GCDMP,
v4, 2005 p. 77). This definition is intentionally broad and includes errors with root causes of
misunderstanding, mistakes, mismanagement, negligence, and fraud. Similarly to GCDMP, the
NIH Collaboratory white paper (Zozus et al., 2015) uses the term “error” to denote “any
deviation from accuracy regardless of the cause.” Not every inaccuracy or deviation from the
plan or from a regulatory standard in a clinical trial constitutes a “data error.” Noncompliance of
procedures with regulations or an incomplete adherence of practices to written documentation
cannot be considered examples of data errors.
Frequently, for the monitoring step of Quality Control (QC), clinical researchers
“conveniently” choose to define data error as a mismatch between the database and the source
record (the document from which the data were originally captured). In the majority of cases,
perhaps in as many as 99% or more, the source represents a “true value,” and that prompts many
clinical researchers to assume mistakenly that the “source document” represents the true value,
as well. One should keep in mind that this is not always the case, as can be clearly demonstrated
in a situation where the recorded value is incompatible with life. This erroneous definition and
the conventional belief that it reinforces are not only misleading, but also bear costs for society.
From the statistical point of view, data errors are undesirable because they introduce
variability2 into the analysis, and this variability makes it more difficult to establish statistical
significance and to observe a "treatment effect," if one exists, because it reduces the standardized
treatment effect size (difference in means divided by standard deviation).

2 Generally speaking, variability is viewed from three different angles: science, statistics and experimental design.
More specifically, science is concerned with understanding variability in nature, statistics is concerned with making
decisions about nature in the presence of variability, and experimental design is concerned with reducing and
controlling variability in ways which make statistical theory applicable to decisions about nature (Winer, 1971).

When superiority trials
are affected by data errors, it becomes more difficult to demonstrate the superiority of one
treatment over another. Without loss of generality, to describe the fundamental framework for
hypothesis testing, it is assumed that there are two treatments, and the goal of the clinical study is
to declare that an active drug is superior to a placebo. In clinical trial research, generally the null
hypothesis expresses the hypothesis that two treatments are equivalent (or equal), and the
alternative hypothesis (sometimes called the motivating hypothesis) expresses that the two
treatments are not equivalent (or equal). Assuming an Aristotelian, 2-value logic system in
support, in the true, unknown state of nature, one of the hypotheses is true and the other one is
false. The goal of the clinical trial is to make a decision based on statistical testing that one
hypothesis is true and the other is false. The 2-by-2 configuration of the possible outcomes has
two correct decisions and two incorrect decisions. Table 1 displays the possible outcomes.
Table 1: Hypothesis testing in the presence of errors

                        H0 true (treatments equivalent)           H0 false (treatments not equivalent)
Reject H0               Type I error (increased due to errors)    Correct decision
Fail to reject H0       Correct decision                          Type II error (increased due to errors)

In statistical terms, the presence of data errors increases variability for the study drug arm
and the comparator, while increasing the proportion of false-positive results (Type I errors) and
false-negative results (Type II errors), thus reducing the power of the statistical test and the
probability of the right conclusion (as shown in Table 1). Because the definitions of the Type I
and Type II errors are reversed in non-inferiority trials relative to the superiority trials, the
treatment of interest in non-inferiority trials becomes artificially more similar to the comparator
in the presence of data errors.
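To make the variance-inflation argument above concrete, the short simulation sketch below (illustrative Python; the dissertation's own programs were written in Excel VBA and SAS, per the appendix listings) adds Gaussian "data error" noise to both arms of a simulated superiority trial and reports the empirical power of a two-sample t-test. The sample size, effect size, and error magnitudes are illustrative assumptions, not values taken from the study.

# Illustrative sketch only: additive error noise inflates within-arm variability,
# shrinks the standardized effect size, and lowers the power of the test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def empirical_power(n_per_arm, true_diff, sd, error_sd, n_trials=2000, alpha=0.05):
    """Fraction of simulated superiority trials that correctly reject H0."""
    rejections = 0
    for _ in range(n_trials):
        placebo = rng.normal(0.0, sd, n_per_arm)
        active = rng.normal(true_diff, sd, n_per_arm)
        # data errors modeled as Gaussian white noise added to the observations
        placebo += rng.normal(0.0, error_sd, n_per_arm)
        active += rng.normal(0.0, error_sd, n_per_arm)
        _, p_value = stats.ttest_ind(active, placebo)
        rejections += p_value < alpha
    return rejections / n_trials

for error_sd in (0.0, 0.5, 1.0):  # no errors, moderate noise, heavy noise
    print(f"error SD = {error_sd}: power ~ {empirical_power(50, 0.5, 1.0, error_sd):.2f}")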
DQ experts state that “DQ is a matter of degree, not an absolute” (Fendt, 2004) and
suggest reaching an agreement with regulators on acceptable DQ “cut-off” levels – similar to the
commonly established alpha and beta levels, 0.05 and 0.20 respectively. This thesis offers a
methodology to facilitate such discussion.
Each additional data error gradually increases the probability of a wrong study conclusion
(relative to the conclusion derived from the error-free dataset), but the magnitude of this effect is
inconsistent. For one study, under one set of study parameters, a single data error could lead to a
reduction in the chance of the correct study conclusion from 100% to 99%. For another study of
a much smaller sample size, for example, the same error could lead to a reduction from 100% to
93.5%. For this reason, DQ can be viewed as a continuous variable characterized by the
probability of the correct study conclusion in the presence of errors. Obviously, such an approach
to DQ is in direct contradiction with the common-sense belief that DQ is a dichotomous,
"good"/"poor" quality variable. This alternative (continuous and probabilistic) view of DQ reduces
the subjectivity of "good/poor" quality assessments, quantifies the effect of data errors on the study
conclusions, and leads to establishing a probability cut-off level (X%) that distinguishes
between acceptable and unacceptable levels of correct/false-positive/false-negative study
conclusions and satisfies regulators and all stakeholders.
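A minimal sketch of this probabilistic view of DQ is shown below (illustrative Python; the dissertation's own simulations were implemented in Excel VBA and SAS, per the appendix listings). For each simulated trial, the conclusion reached after inducing a single erroneous value is compared with the conclusion from the error-free data, and the proportion of matching conclusions is reported as the probability of a correct conclusion. The effect size, error magnitude, and sample sizes below are assumptions chosen only for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_correct_given_one_error(n_per_arm, true_diff=0.5, sd=1.0, error_shift=3.0,
                              n_trials=2000, alpha=0.05):
    """Share of trials whose conclusion is unchanged by one induced data error."""
    matches = 0
    for _ in range(n_trials):
        placebo = rng.normal(0.0, sd, n_per_arm)
        active = rng.normal(true_diff, sd, n_per_arm)
        clean_reject = stats.ttest_ind(active, placebo).pvalue < alpha
        dirty = active.copy()
        dirty[rng.integers(n_per_arm)] += error_shift   # induce a single erroneous value
        dirty_reject = stats.ttest_ind(dirty, placebo).pvalue < alpha
        matches += (clean_reject == dirty_reject)
    return matches / n_trials

for n in (25, 50, 200):
    print(f"n per arm = {n}: P(correct | one error) ~ {p_correct_given_one_error(n):.3f}")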
Numerous sources and origins of data errors exist. There are errors in source documents,
such as an ambiguous question resulting in unintended responses, data collection outside of the
required time window, or the lack of inter-rater reliability (which occurs in cases where
subjectivity is inherently present or more than one individual is assessing the data). Transcription
errors occur in the process of extraction from a source document to the CRF or the electronic
case report form (eCRF). Additional errors occur during data processing, such as keying errors
and errors related to data integration issues. There are also database errors, which include the
data being stored in the wrong place. Multi-step data processing introduces even more data
errors. Because the clinical research process can be complex, involving many possible process
steps, each step at which the data are transcribed, transferred, or otherwise manipulated has an
error rate associated with it. Each subsequent data processing step can create or correct errors as
demonstrated in Figure 1.
Figure 1. Sources of Errors in Data Processing (GCDMP, v4, 2005 p. 78)
Data errors can be introduced even after a study is completed and the database is
“locked.” Data errors and loss of information often occur at the data integration step of data
preparation, in cases of meta-analysis, or during the compilation of the integrated summary of
efficacy (ISE)/integrated safety summary (ISS) for a regulatory submission. This could be a
direct result of the lack of specific data standards. For example, if the code list for a variable has
two choices in one study and five choices in the other, considerable data quality reduction would
occur during the data integration step.
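As a hypothetical illustration of this integration problem, suppose one study captured pain severity with a five-level code list and another with a two-level list; pooling the data forces a lossy mapping (the category labels below are assumptions, not taken from any actual study):

# Hypothetical example: collapsing a five-level code list to a two-level one
# during data integration discards information that can never be recovered.
five_level = ["none", "mild", "moderate", "severe", "very severe"]
to_two_level = {"none": "no pain", "mild": "pain", "moderate": "pain",
                "severe": "pain", "very severe": "pain"}   # severity detail is lost
print([to_two_level[value] for value in five_level])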
How data are collected plays an important role not only in what is collected but also in "with
what quality." It is known that if adverse events are elicited via checklists, minor events are
over-reported. If no checklist and no elicitation are used, adverse events may be under-reported,
although important medical events tend to get reported regardless. The FDA guidance on safety
reviews expresses views consistent with this known
result. For example, a patient may not recall episodes of insomnia or headaches that were not
particularly bothersome, but they may recall them when probed. Important, bothersome events,
however, tend to be remembered and reported as a result of open-ended questions. Checklists can
also create a mind-set: if questions relate to common ailments of headache, nausea, vomiting,
then the signs and symptoms related to energy, sleep patterns, mood, etc., may be lost, due to the
focus on physical findings.
Variability in the sources of data errors is accompanied by the substantial variation in
data error detection techniques. Some errors, such as discrepancies between the source
documents and the CRF, are easily detected at the SDV step of the data cleaning process. Other
errors, such as misunderstandings, non-compliance, protocol violations, and fraud are more
difficult to identify. The most frequently missed data error types include the data recorded under
the wrong subject number and data captured incorrectly from source records. Thus, data error
detection in clinical research is not as trivial as it may appear on the surface. Computerized edit
checks help identify illogical (such as a visit two date prior to the visit one date), missing, or
out-of-expected-range values much more efficiently than manual data review.
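A minimal, hypothetical sketch of such computerized edit checks is given below (illustrative Python; the field names and range limits are assumptions rather than the checks used in any particular study):

from datetime import date

def edit_checks(record):
    """Return query messages for one record: illogical dates, missing or out-of-range values."""
    queries = []
    v1, v2 = record.get("visit1_date"), record.get("visit2_date")
    if v1 and v2 and v2 < v1:                        # illogical: visit 2 before visit 1
        queries.append("Visit 2 date is prior to Visit 1 date")
    sbp = record.get("systolic_bp")
    if sbp is None:                                  # missing value
        queries.append("Systolic BP is missing")
    elif not 60 <= sbp <= 260:                       # outside expected range
        queries.append(f"Systolic BP {sbp} outside expected range 60-260")
    return queries

print(edit_checks({"visit1_date": date(2015, 4, 1),
                   "visit2_date": date(2015, 3, 28),
                   "systolic_bp": 300}))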
Variability in the sources of data errors and the general complexity of this topic have
created substantial impediments to reaching a consensus as to what constitutes “quality” in
clinical research. The ISO 8000 (2012) focuses on data quality and defines “quality” as the
“…degree to which a set of inherent characteristics fulfills requirements.” Three distinct
approaches to defining data quality are found in the literature and in practice. “Error-free” or
“100% accurate” is the first commonly used definition that has dominated clinical research.
According to this definition, the data are deemed to be of high quality only if they correctly
represent the real-world construct to which they refer. There are many weaknesses to this
approach. The lack of objective evidence supporting this definition and the prohibitively high cost
of achieving "100% accuracy" are two major ones. Additionally, the
tedium associated with source-document verification and general quality control procedures
involved in the process of trying to identify and remove all errors is actually a distraction from a
focus on important issues. Humans are not very good at fending off tedium, and this reality
results in errors going undetected in the vast platforms of errorless data.
The “fit to use” definition, introduced by J.M. Juran (1986), is considered the gold
standard in many industries. It states that the data are of high quality "if they are fit for their
intended uses in operations, decision making and planning". This definition has been interpreted
by many clinical researchers as “the degree to which a set of inherent characteristics of the data
fulfills requirements for the data” (Zozus et al, 2015) or, simply, “meeting protocol-specified
parameters.” Ambiguity and implementation challenges are the main impediments to a wider
acceptance of this definition. Unlike service or manufacturing environments, clinical trials vary
dramatically and are consequently more difficult to standardize. These inherent limitations have
resulted in the practice by many clinical research companies in the past three decades of using a
third definition – namely arbitrarily acceptable levels of variation per explicit protocol
specification (GCDMP, 2005). The development of objective data quality standards is more
important today than ever before. The fundamental question of “how should adequate data
quality be defined?” will require taking into consideration the study-specific scientific, statistical,
economic, and technological data collection and cleaning context, as well as the DQ indicators.3
As mentioned above, data cleaning and elimination of data errors reduces variability and
helps detect the “treatment effect.” On the other hand, the data cleaning process is not entirely
without flaws or unintended consequences. Not only does it add considerable cost to the clinical
trial conduct, it may potentially introduce bias and shift study conclusions. There are several
possible scenarios of bias being introduced via data cleaning. Systematically focusing on the
extreme values, when as many errors are likely to exist in the expected range (as shown in Figure
2), is one such scenario. Selectively prompting to modify or add non-numeric data to make the
(edited) data appear “correct,” even when the cleaning is fully blinded and well-intended, is
another example. Before selective cleaning, the data may be flawed, but the data errors are not
systematically concentrated, and, thus, introduce no bias. However, when the data cleaning is
non-random for any reason, it leads to a reduction of variance, and increases the Type I Error
Rate and the risk of making incorrect inferences, i.e., finding statistically significant differences
3 Multiple terms are used in the literature to describe DQ indicators – "quality criteria," "attributes," and
“dimensions.”
due to chance alone or failing to find differences in a non-inferiority comparison (declaring that a
drug works when it does not work). Finally, new (processing) data errors are often introduced
into the data during cleaning.
Figure 2. Statistical Impact of “out-of-range” Edit Checks
In the experiment described later in this manuscript, the data cleaning is reduced to symmetrical
"out-of-range" checks. This eliminates all out-of-range errors while introducing no bias. The data
errors themselves are assumed to be distributed as Gaussian white noise.
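The sketch below illustrates this simplification (illustrative Python, not the study's own code; the variable, error rate, and range limits are assumptions): Gaussian white-noise errors are injected into a small fraction of values, and "cleaning" consists only of removing values outside a range symmetric about the true mean, which strips extreme errors without biasing the estimate of the mean.

import numpy as np

rng = np.random.default_rng(7)
true_values = rng.normal(120.0, 10.0, 10_000)            # e.g., systolic-BP-like data

noise = rng.normal(0.0, 25.0, 10_000)                     # Gaussian white-noise errors
dirty = true_values + np.where(rng.random(10_000) < 0.01, noise, 0.0)   # ~1% of fields in error

low, high = 70.0, 170.0                                    # symmetric about the true mean of 120
cleaned = dirty[(dirty >= low) & (dirty <= high)]          # out-of-range values queried/removed

print(f"true mean = {true_values.mean():.2f}, dirty mean = {dirty.mean():.2f}, "
      f"cleaned mean = {cleaned.mean():.2f}")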
Data cleaning is an important component of data management in regulated clinical
development. It is typically handled by clinical data managers and involves many different
scenarios. Impossible (e.g., incompatible with life) values are typically detected and corrected.
Medical inconsistencies are usually resolved. For example, if the recorded diastolic blood
pressure (BP) appears to be higher than the systolic BP, the values might be reversed (if
confirmed by the investigator). Missing values are obtained or confirmed missing and extreme
values are confirmed or replaced with more acceptable values. And finally, unexpected data are
removed, modified, or explained away. For example, concomitant medications, lab tests, and
adverse events (AEs) reported with start dates later than 30 days post-treatment are generally
queried and removed. Subjects with AEs are probed for previously unreported medical history
details, which are then added to the database. Sites modify subjective assessments when the
direction of a scale is misinterpreted (e.g., one is selected instead of nine on a zero-to-ten scale).
All data corrections require the investigator’s endorsement and are subject to an audit trail.
Data cleaning is a time-consuming, monotonous, repetitive and uninspiring process. The
mind-numbing nature of this work makes it difficult for the data management professionals in
the trenches to maintain a broader perspective and refrain from over-analysing trivial details.
Data managers frequently ask the following question: If a site makes too many obvious
transcription errors in the case of outliers, why would one think that fewer errors are made within
the expected ranges? They use this as a justification to keep digging further, finding more and
more errors, irrespective of the relevance or the significance of those findings. The desire to
“fix” things that appear wrong is part of human nature, especially when one is prompted to look
for inconsistencies. Each new “finding” and data correction make data managers proud of their
detective abilities, but it costs their employers and society billions of dollars. Does the correction
of the respiratory rate from twenty-two to twenty-one make a substantial difference? More often
than not, the answer is no. As new research demonstrates, over two-thirds of data corrections are
clinically insignificant and do not add any value to study results (Mitchel, Kim, Choi, Park,
Cappi, & Horn, 2011; TransCelerate, 2013; Scheetz et al., 2014). Unlike a generation ago, many
researchers and practitioners in the modern economic environment are facing resource
constraints. Therefore, they legitimately ask these types of questions in an attempt to gain a
better understanding of the true value of data cleaning, and to improve effectiveness and
efficiency in clinical trial operations.
The process of data collection has markedly transformed over the past 30 years. Manual
processes dominated the 1980s and 1990s, accompanied by heavy reliance on verification steps.
As an example, John Cavalito (in his unpublished work at Burroughs
Wellcome in approximately 1985) found that a good data entry operator could key 99 of 100
items, without error. The conclusion was made that the data entry operator error rate was 1%.
With two data entry operators creating independent files, that would yield an error rate of
1/10,000 (.01 multiplied by .01 yields .0001, or 1/10,000). Similarly, Mei-Mei Ma’s dissertation
(1986) evaluated the impact of each step in the paper process, and discovered that having two
people work on the same file had a higher error rate than having the same person create two
different files (on different days); the rule for independence was unexpected. It was more useful
to separate the files than to use different people.
The introduction of electronic data capture technology (EDC) eliminated the need for
paper CRFs. Additionally, EDC made obsolete some types of extra work that were associated
with the traditional paper process.4 More recently, the introduction of eSource technologies such as
electronic medical records (EMRs) and direct data entry (DDE) has started the process of
eliminating paper source records, eradicating transcription errors and, as many believe, further
improving DQ. These events have also led to growing reliance on computer-enabled data
cleaning as opposed to manual processes such as SDV. While in the previous “paper” generation
the SDV component was essential in identifying and addressing issues, modern technology has
made this step of the process largely a redundant effort because the computer-enabled algorithms
4 Examples include (a) "None" was checked for "Any AEs?" on the AE CRF, but AEs were listed so the "None" is
removed, (b) data recorded on page 7a that belonged on page 4 is moved, (c) sequence numbers are updated, (d)
effort is spent cleaning data that will have little or no impact on conclusions and (e) comments recorded on the
margin that have no place in the database.
can identify over 90% of data issues much more efficiently (Bakobaki et al., 2012). Nevertheless,
some research professionals have expressed the opinion that the new technologies such as EDC,
EMR, DDE, and their corresponding data cleaning processes may result in a lower DQ,
compared to the traditional paper process, thus justifying their resistance to change. One
argument against the elimination of paper is that lay typists are not as accurate as trained data
entry operators (Nahm, Pieper, & Cunningham, 2008). The second argument, that EDC and DDE
use logical checks to “find & fix” additional “typing errors” by selectively challenging item
values (e.g., focusing on values outside of expected ranges) at the time of entry, also has some
merit. The clinical site personnel performing data entry, using one of the new technologies, are
vulnerable to the suggestion that a value may not be “right” and may unintentionally reject a true
value. In the final analysis, these skeptical arguments will not reverse the visible trend in the
evolution of clinical research that is characterized by greater reliance on technology. The
growing body of evidence confirms that, in spite of its shortcomings, DDE leads to higher
overall data quality.
The DQ debate is not new. It attracts attention through discussions led by the Institute of
Medicine (IOM),5 the Food and Drug Administration (FDA), DQRI,6 and MIT TDQM,7 but
consensus in the debate over a definition of DQ has not been reached, nor have the practical
problems standing in the way of implementing that definition been solved. According to K.
5 "The Institute of Medicine serves as adviser to the nation to improve health. Established in 1970 under the charter
of the National Academy of Sciences, the Institute of Medicine provides independent, objective, evidence-based
advice to policymakers, health professionals, the private sector, and the public.” (http://www.iom.edu/; Accessed in
November, 2007)
6 The Data Quality Research Institute (DQRI) is a non-profit organization that existed in the early 2000s and provided
an international scientific forum for academia, healthcare providers, industry, government, and other stakeholders to
research and develop the science of quality as it applies to clinical research data. The Institute also had been
assigned an FDA liaison, Steve Wilson. (http://www.dqri.org/; Accessed in November, 2007)
7 MIT Total Data Quality Management (TDQM) Program: "A joint effort among members of the TDQM Program,
MIT Information Quality Program, CITM at UC Berkeley, government agencies such as the U.S. Navy, and industry
partners.” (See http://web.mit.edu/tdqm/ for more details.)
Fendt, a former Director of DQRI, the following aspects of DQ are fundamentally important: (1)
DQ is “a matter of degree; not an absolute,” (2) “There is a cost for each increment,” (3) There is
a need to determine “what is acceptable,” (4) “It is important to provide a measure of confidence
in DQ,” and (5) “Trust in the process” must be established. And the consequences of mistrust
are: (1) "poor medical care based on non-valid data," (2) "poor enrollment in clinical trials," (3)
"public distrust" (of scientists/regulators), and (4) lawsuits (Fendt, 2004).
Measuring Data Quality
The purpose of measuring data quality is to identify, quantify, and interpret data errors,
so that quality checks can be added or deleted and the desired level of data quality can be
maintained (GCDMP, v 4, 2005, p. 80). The “error rate” is considered the gold standard metric
for measuring DQ, and for easier comparison across studies it is usually expressed as the number
of errors per 10,000 fields. It is expressed by the formula below. For the purpose of this thesis, it
is important to note that an increase in the denominator (i.e., the sample size) reduces the error
rate and dilutes the impact of data errors:

Error Rate = Number of Errors Found / Number of Fields Inspected
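For example, a small helper reflecting this formula, scaled to the customary "errors per 10,000 fields" reporting unit (illustrative Python; the numbers in the example are made up):

def error_rate_per_10000(errors_found: int, fields_inspected: int) -> float:
    """Error rate expressed as errors per 10,000 fields inspected."""
    return 10_000 * errors_found / fields_inspected

print(error_rate_per_10000(12, 48_000))   # 12 errors in 48,000 fields -> 2.5 per 10,000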
“Measuring” DQ should not be confused with “assuring” DQ or “quality assurance”
(QA), which focuses on infrastructures and practices used to assure data quality. According to
the International Standards Organization (ISO), QA is the part of quality management focused on
providing confidence that the quality requirements will be met8
(ISO 9000:2000 3.2.11). As an
8 ICH GCP interprets this requirement as "all those planned and systematic actions that are established to ensure that
the trial is performed and the data are generated, documented (recorded), and reported in compliance with Good
Clinical Practice (GCP) and the applicable regulatory requirement(s)." (ICH GCP 1.16). Thus, in order for ICH to be
compatible with current ISO concepts, the word "ensure" should be replaced with "assure". Hence the aim of
Clinical Quality Assurance should be to give assurance to management (and ultimately to the Regulatory
Authorities) that the processes are reliable and that no major system failures are expected to occur that would expose
patients to unnecessary risks, violate their legal or ethical rights, or result in unreliable data.
example, a QA plan for registries “should address: 1) structured training tools for data
abstractors; 2) use of data quality checks for ranges and logical consistency for key exposure and
outcome variables and covariates; and 3) data review and verification procedures, including
source data verification plans and validation statistics focused on the key exposure and outcome
variables and covariates for which sites may be especially challenged. A risk-based approach to
quality assurance is advisable, focused on variables of greatest importance” (PCORI, 2013).
Also, measuring DQ should not be confused with “quality control” (QC). According to ISO, QC
is the part of quality management focused on fulfilling requirements.9 (ISO 9000:2000 3.2.10)
Auditing is a crucial component of Quality Assurance. An audit is defined as “a
systematic, independent and documented process for obtaining audit evidence objectively to
determine the extent to which audit criteria are fulfilled”10
(ISO 19011:2002, 3.1). The FDA
audits are called “inspections.” The scope of an audit may vary from “quality system” to
“process” and to “product audit.” Similarly, audits of a clinical research database may vary from
the assessment of a one-step process (e.g., data entry) to a multi-step (e.g., source-to-database)
audit. The popularity of one-step audits comes from their low cost. However, the results of such
one-step audits are often over-interpreted. For example, if a data entry step audit produces an
acceptably low error rate, it is sometimes erroneously concluded that the database is of a “good
quality.” The reality is that the low error rate and high quality of one step of the data handling
9 ICH GCP interprets these requirements as "the operational techniques and activities undertaken within the quality
assurance systems to verify that the requirements for quality of the trial have been fulfilled.” (ICH GCP 1.47) In
order for ICH language to be compatible with current ISO concepts, the word “assurance” should be replaced with
“management.”
10 ICH GCP interprets this requirement as "A systematic and independent examination of trial related activities and
documents to determine whether the evaluated trial related activities were conducted, and the data were recorded,
analyzed and accurately reported according to the protocol, the sponsor’s Standard Operating Procedures (SOPs),
Good Clinical Practice (GCP), and the applicable regulatory requirement(s). (ICH GCP 1.6)
process does not necessarily equate to a high quality/low error rate associated with the other
steps of data handling.
Audits are popular in many industries, such as manufacturing or services, for their ability
to provide tangible statistical evidence about the quality and reliability of products. However, the
utility of audits in clinical research is hindered by the substantial costs associated with them,
especially in the case of the most informative “source-to-database” audits. This accounts for the
reduction in error rate reporting in the literature in the past seven to ten years, since the
elimination of paper CRFs and the domination of EDC on the data collection market. At the
same time, such a surrogate measure of DQ as the rate of data correction has been growing in
popularity and has been reported in all recent landmark DQ-related studies (Mitchel et al., 2011; Yong,
2013; TransCelerate, 2013; Scheetz et al., 2014). Data corrections are captured in all modern
research databases (as required by the “audit-trail” stipulation of the FDA), making the rate of
data correction calculation an easily automatable and inexpensive benchmark for the DQ
research. The rate of data corrections was originally coined by statisticians as a measure of the
"stability" of a variable, which is outside the scope of this discussion.
Theoretically speaking, error rates calculated at the end of a trial and the rates of data
corrections are not substitutes for one another, but rather complementary to each other. Ideally,
when all errors are captured via the data cleaning techniques, the error rate from a source-to-
database audit should be zero and the rate of corrections (data changes) should be equivalent to
the pre-data cleaning error rate. In a less than ideal scenario, when a small proportion of the total
errors is eliminated via data cleaning, the rate of data correction is less informative, leaving the
observer wondering how many errors are remaining in the database. The first (ideal) scenario is
closer to the real world of pharmaceutical clinical trials, where extensive (and extremely
expensive) data cleaning efforts lead to supposedly “error-free” databases. The second (less than
ideal) scenario is more typical of academic clinical research, where the amount of data cleaning
is limited for financial and other reasons. Thus, the rate of data changes is primarily a measure of
the reduction of the error rate (or DQ improvement) as a result of data cleaning, and a measure of
effectiveness of the data cleaning in a particular study. Low cost is the main benefit of the data
correction rates over the error rates, while the obvious incongruence of data correction rates with
the error rates calculated via audits is the main flaw of this fashionable metric.
DQ in Regulatory and Health Policy Decision-Making
The amount of clinical research data has been growing exponentially over the past half
century. ClinicalTrials.gov currently lists 188,494 studies with locations in all 50 States and in
190 countries. BCC Research predicted a growth rate of 5.8%, roughly doubling the number of
clinical trials every ten years (Fee, 2007).
Figure 3. The number of published trials, 1950 to 2007
This unprecedented growth, especially in the past decade, coupled with rising economic
constraints, places enormous pressure on regulators such as the FDA and the EMA. Regulators
ask and address a number of important questions in order to claim that the quality of decisions
made by policy makers and clinicians meets an acceptable level. There are numerous examples
of the regulatory documents that clarify and provide guidance in respect to many aspects of
Quality System and DQ (ICH Q8, 2009; ICH Q9, 2006; ICH Q10, 2009; FDA, 2013; EMA,
2013).
Fisher and colleagues reference Kingma’s finding that “even a suspicion of poor-quality
data influences decision making.” A lack of consumer confidence in data quality potentially
leads to a lack of trust in a study's conclusions (Kingma, 1996). Thus, one of the most
important questions policy-makers and regulators ask is to what extent the data limitations
prohibit one from having confidence in the study conclusions. Some of these questions include: What
methods were used to collect data? Are the sources of data relevant and credible? Are the data
reliable, valid and accurate? One hundred percent data accuracy is too costly, if ever achievable.
Would 99% accurate data be acceptable? Why or why not? Are there any other important
“dimensions” of quality, beyond “accuracy,” “validity,” and “reliability?” (Fisher, 2006; Fink
2005).
The DQ discussion (that is often spearheaded by regulators and regulatory experts) has
to date made progress on several fronts. There is a general consensus on the “hierarchy of errors”
(i.e. some errors are more important than others), and on the impossibility of achieving a
complete elimination of data errors. There will always be some errors that are not addressed by
quality checks, as well as those that slip through the quality check process undetected (IOM,
1999). Therefore, the clinical research practitioners’ primary goal should be to correct the errors
that will have an impact on study results (GCDMP, v4, 2005, p. 76). Quality checks performed
as part of data processing, such as data validation or edit checks, should target the fields that are
(1) critical to the analysis, (2) where errors frequently occur, and (3) where a high percent of
error resolution can be reasonably expected. “Because a clinical trial often generates millions of
data points, ensuring 100% completeness, for example, is often not possible; however it is not
necessary” (IOM, 1999) and “trial quality ultimately rests on having a well-articulated
investigational plan with clearly defined objectives and associated outcome measures” (CTTI,
2012) are the key messages from the regulatory experts to the industry.
Furthermore, the growing importance of pragmatic clinical trials (PCT),11 the difficulty in
estimating error rates using traditional “audits,” and a deeper understanding of the multi-
dimensional nature of DQ (that is discussed next) have led to the development over the last
couple of years of novel approaches to DQ assessment. An NIH Collaboratory white paper,
which was supported by a cooperative agreement (U54 AT007748) from the NIH Common Fund
for the NIH Health Care Systems Research Collaboratory (Zozus et al., 2015) is the best example
of such innovative methodology that stresses the multi-dimensional nature of DQ assessment and
the determination of “fitness for use of research data.” The following statement by the authors
summarizes the rationale and approach taken by this nationally recognized group of thought
leaders:
Pragmatic clinical trials in healthcare settings rely upon data generated during routine
patient care to support the identification of individual research subjects or cohorts as well
as outcomes. Knowing whether data are accurate depends on some comparison, e.g.,
comparison to a source of “truth” or to an independent source of data. Estimating an error
or discrepancy rate, of course, requires a representative sample for the comparison.
Assessing variability in the error or discrepancy rates between multiple clinical research
11 PCT is defined as "a prospective comparison of a community, clinical, or system-level intervention and a relevant
comparator in participants who are similar to those affected by the condition(s) under study and in settings that are
similar to those in which the condition is typically treated.” (Saltz, 2014)
sites likewise requires a sufficient sample from each site. In cases where the data used for
the comparison are available electronically, the cost of data quality assessment is largely
based on time required for programming and statistical analysis. However, when labor-
intensive methods such as manual review of patient charts are used, the cost is
considerably higher. The cost of rigorous data quality assessment may in some cases
present a barrier to conducting PCTs. For this reason, we seek to highlight the need for
more cost-effective methods for assessing data quality… Thus, the objective of this
document is to provide guidance, based on the best available evidence and practice, for
assessing data quality in PCTs conducted through the National Institutes of Health (NIH)
Health Care Systems Research Collaboratory. (Zozus et al., 2015)
Multiple recommendations provided by this white paper (conditional upon advances in data standards, or “common data elements”) point the evolution of data quality assessment in a new direction, in support of important public health policy decisions.
Error-Free-Data Myth and the Multi-dimensional Nature of DQ
In clinical research practice, DQ is often confused with “100% accuracy.” However,
“accuracy” is only one of very many attributes of DQ. For example, Wang and Strong (1996)
identified 16 dimensions and 156 attributes of DQ. Accuracy and precision, reliability, timeliness
and currency, completeness, and consistency are among the most commonly cited dimensions of
DQ in the literature (Wand & Wang, 1996; Kahn, Strong, & Wang, 2002; see Appendix 1 for
more details). The Code of Federal Regulations (21 CFR Part 11, initially published in 1997 and revised in 2003 and 2013) emphasized the importance of “accuracy, reliability, integrity, availability, and authenticity”12 (FDA, 1997; FDA, 2003). Additionally, some experts and
regulators emphasize “trustworthiness” (Wilson, 2006), “reputation” and “believability” (Fendt,
2004) as key dimensions of DQ. More recently, Weiskopf and Weng (2013) identified five
dimensions that are pertinent to electronic health record (EHR) data used for research:
12 The original version of the document listed different attributes – Attributable, Legible, Contemporaneous, Original, Accurate – often called the ALCOA standards for data quality and integrity; eSource and “direct data entry” are exponentially increasing their share and threaten to eliminate paper source documents, making the “Legible” dimension of DQ obsolete.
completeness, correctness, concordance, plausibility and currency, and the NIH Collaboratory stressed completeness, accuracy and consistency as the key DQ dimensions determining fitness for use of research data, where completeness is defined as “presence of the necessary data”, accuracy as “closeness of agreement between a data value and the true value,” and consistency as “relevant uniformity in data across clinical investigation sites, facilities, departments, units within a facility, providers, or other assessors” (Zozus et al., 2015).
Even with its multi-dimensional nature well established, DQ is still often interpreted exclusively as data accuracy13. For many clinical research professionals, high data quality simply means 100% accuracy, which is costly, unnecessary, and frequently unattainable. Funning et al. (2009), based on their survey of 97% (n=250) of the phase III trials performed in Sweden in 2005, concluded that significant resources are wasted in the name of higher data quality: “The bureaucracy that the Good Clinical Practice (GCP) system generates, due to industry over-interpretation of documentation requirements, clinical monitoring, data verification etc. is substantial. Approximately 50% of the total budget for a phase III study was reported to be GCP-related. 50% of the GCP-related cost was related to Source Data Verification (SDV). A vast majority (71%) of respondents did not support the notion that these GCP-related activities increase the scientific reliability of clinical trials.” This confusion between DQ and “100% accuracy” adds substantially to the costs of clinical research and the subsequent costs to the public.
13 “Assessing data accuracy, primarily with regard to information loss and degradation, involves comparisons, either of individual values (as is commonly done in clinical trials and registries) or of aggregate or distributional statistics… In the absence of a source of truth, comprehensive accuracy assessment of multisite studies includes use of individual value, aggregate, and distributional measures.” (Zozus et al., 2015)
Evolution of DQ Definition and Emergence of Risk-based (RB) Approach to DQ
Prior to 1999, there was no commonly accepted formal definition of DQ. As a result,
everyone interpreted the term “high-quality” data differently and, in most cases, these
interpretations were very conservative. The overwhelming majority of researchers of that time
believed (and many continue to believe) that all errors are equally bad. This misguided belief has costly consequences: because of this conservative interpretation of DQ, hundreds of thousands (if not millions) of man-hours have been spent attempting to ensure the accuracy of every minute detail.
“High-quality” data was first formally defined in 1999 at the IOM/FDA workshop “…as
data strong enough to support conclusions and interpretations equivalent to those derived from
error-free data.” This definition could be illustrated by a simple algorithm (Figure 4).
Figure 4. Data Quality Assessment Algorithm
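The figure itself is not reproduced here. As an illustration only (a minimal sketch of the decision rule implied by the IOM definition, not a reproduction of Figure 4; the function name and inputs are hypothetical), the assessment can be expressed in a few lines of Python:

from scipy import stats

def data_quality_sufficient(error_free_arm, observed_arm, comparator_arm, alpha=0.05):
    # Conclusion that would be drawn from the error-free data
    _, p_error_free = stats.ttest_ind(error_free_arm, comparator_arm)
    # Conclusion drawn from the observed data, which may contain errors
    _, p_observed = stats.ttest_ind(observed_arm, comparator_arm)
    # IOM (1999): data are of high quality if both analyses lead to the same conclusion,
    # i.e., both p-values fall on the same side of the significance threshold
    return (p_error_free < alpha) == (p_observed < alpha)

Under such a rule, quality is judged by equivalence of conclusions rather than by the number of residual errors.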
This workshop also introduced important concepts, such as “Greater Emphasis on Building
Quality into the Process,” “Data Simplification,” “Hierarchy of Errors,” and “Targeted Strategies
[to DQ].” Certain data points are more important to interpreting the outcome of a study than
others, and these should receive the greatest effort and focus. Implementation of this definition
would require agreement on data standards.
Below is a summary of the main DQ concepts reflected in the literature between 1998 and 2010 (Table 2 is extracted from Tantsyura et al., 2015):
Table 2. Risk-Based Approach to DQ – Principles and Implications
Fundamental principle: Data fit for use (IOM, 1999).
Practical and operational implications: The Institute of Medicine defines quality data as “data that support conclusions and interpretations equivalent to those derived from error-free data.” (IOM, 1999) “…the arbitrarily set standards ranging from a 0 to 0.5% error rate for data processing may be unnecessary and masked by other, less quantified errors.” (GCDMP, v4, 2005)
Fundamental principle: Hierarchy of errors (IOM, 1999; CTTI, 2009).
Practical and operational implications: “Because a clinical trial often generates millions of data points, ensuring 100 percent completeness, for example, is often not possible; however, it is also not necessary.” (IOM, 1999) Different data points carry different weights for analysis and, thus, require different levels of scrutiny to ensure quality. “It is not practical, necessary, or efficient to design a quality check for every possible error, or to perform a 100% manual review of all data... There will always be errors that are not addressed by quality checks or reviews, and errors that slip through the quality check process undetected” (GCDMP, 2008)
Fundamental principle: Focus on critical variables (ICH E9, 1998; CTTI, 2009).
Practical and operational implications: A multi-tiered approach to monitoring (and to data cleaning in general) is recommended (Khosla, Verma, Kapur, & Khosla, 2000; GCDMP v4, 2005; Tantsyura et al., 2010).
Fundamental principle: Advantages of early error detection (ICH E9, 1998; CTTI, 2009).
Practical and operational implications: “The identification of priorities and potential risks should commence at a very early stage in the preparation of a trial, as part of the basic design process with the collaboration of expert functions…” (EMA, 2013)
A decade later, two FDA officials (Ball & Meeker-O’Connell, 2011) reiterated that “Clinical research is an inherently human activity, and hence subject to a higher degree of variation than in the manufacture of a product; consequently, the goal is not an error-free dataset...” Recently, the Clinical Trials Transformation Initiative (CTTI), a public-private partnership to identify and promote practices that will increase the quality and efficiency of
clinical trials, has introduced a more practical version of the definition, namely “the absence of
errors that matter.” Furthermore, CTTI (2012) stresses that “the likelihood of a successful,
quality trial can be dramatically improved through prospective attention to preventing important
errors that could undermine the ability to obtain meaningful information from a trial.” Finally, in
2013, in its Risk-Based Monitoring Guidance, the FDA re-stated its commitments, as follows:
“…there is a growing consensus that risk-based approaches to monitoring, focused on risks to
the most critical data elements and processes is necessary to achieve study objectives...” (FDA,
2013) Similarly, and more generally, a risk-based approach to DQ requires focus on “the most
critical data elements and processes.”
A risk-based approach to DQ is convincing and understandable as a theoretical concept. The difficulty comes when one attempts to implement it in practice and faces the substantial variability among clinical trials. The next generation of the DQ discussion should focus
on segregating different types of clinical trials and standard end-points, identifying acceptable
DQ thresholds and other commonalities in achieving DQ for each category.
Data Quality and Cost
If it is established that DQ is a “matter of degree, not an absolute,” then the next question
is where to draw the fine line between “good”/acceptable and “poor”/unacceptable quality. D.
Hoyle (1998, p. 28) writes in his “ISO 9000 Quality Systems Development Handbook”: “When a
product or service satisfies our needs we are likely to say it is of good quality and likewise when
we are dissatisfied we say the product or service is of poor quality. When the product or service
exceeds our needs we will probably say it is of high quality and likewise if it falls well below our
expectations we say it is of low quality. These measures of quality are all subjective. What is
good to one may be poor to another…”
Categorizing “good enough” and “poor” quality is a particularly crucial question for
clinical drug developers, because the answer carries serious cost implications. The objectively
defined acceptable levels of variation and error have not yet been established. Despite significant efforts to eliminate inefficiencies in data collection and cleaning, substantial resources are still devoted to low-value variables that have minimal to no impact on the critical analyses. Focusing data cleaning efforts on “critical data” and establishing industry-wide DQ thresholds are two of the main areas of interest; when implemented, they would eliminate a major source of waste in the current clinical development system.
A new area of research called the “Cost of Quality” suggests viewing costs associated
with quality as a combination of two drivers. According to this model, the presence of additional
errors carries negative consequences and cost (reflected in the monotonically decreasing “Cost of poor DQ” line in Figure 5). Thus, the first driver reflects risk and the cost increase due to a higher error rate/lower DQ. The second driver is the cost of data cleaning. Because the elimination of data errors requires significant resources, it increases the costs, as manifested by the monotonically increasing “Cost of DQ Maintenance” line in Figure 5. It is important to note that the function is non-linear and asymptotic, i.e., the closer one comes to the “error-free” state
(100% accuracy), the more expensive each increment of DQ becomes. The “overall cost” is a
sum of both components, as depicted by the convex line in Figure 5. Consequently, there always
exists an optimal (lowest cost) point below the point that provides 100% accuracy. The “DQ
Cost Model” appears to be applicable and useful in pharmaceutical clinical development.
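As a purely numerical illustration (the functional forms and dollar figures below are assumed for the sketch and are not taken from Riain and Helfert), the two cost drivers and their sum can be tabulated to show that the minimum of the overall cost lies at a residual error rate above zero:

import numpy as np

# Assumed, illustrative cost functions over the residual error rate (0 = error-free)
error_rate = np.linspace(0.0005, 0.05, 500)        # residual error rate after cleaning
cost_of_poor_dq = 2_000_000 * error_rate           # consequence/risk cost grows with errors
cost_of_dq_maintenance = 200 / error_rate          # cleaning cost rises sharply near "error-free"
overall_cost = cost_of_poor_dq + cost_of_dq_maintenance

optimal_rate = error_rate[np.argmin(overall_cost)]
print(f"Lowest overall cost at a residual error rate of roughly {optimal_rate:.2%}, not at 0%")

The exact location of the optimum depends entirely on the assumed cost functions; the point of the sketch is only that the convex sum reaches its minimum short of 100% accuracy.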
Figure 5. Data Quality Cost Model (Riain and Helfert, 2005)
Pharmaceutical clinical trials are extremely expensive. On the surface, the clinical
development process appears to be fairly straight-forward – recruit a patient, collect safety and
efficacy data, and submit the data to the regulators for approval. In reality, only nine out of a hundred compounds entering pharmaceutical human clinical trials ultimately get approved, and only two out of ten marketed drugs return revenues that match or exceed research and
development (R&D) costs (Vernon, Golec, & DiMasi, 2009). “Strictly speaking, a product’s
fixed development costs are not relevant to how it is priced because they are sunk (already
incurred and not recoverable) before the product reaches the market. But a company incurs R&D
costs in expectation of a product’s likely price, and on average, it must cover those fixed costs if
it is to continue to develop new products” (CBO, 2006).
According to various estimates, “the industry’s real (inflation-adjusted) spending on drug
R&D has grown between threefold and sixfold over the past 25 years—and that rise has been
closely matched by growth in drug sales.” (CBO, 2006) The most often quoted cost estimate, by DiMasi, Hansen and Grabowski (2003), states that drug development costs more than $800 million per approved drug14, up from $100 million in 1975 and $403 million in 2000 (in 2000 dollars). Some experts believe the real cost, as measured by the average cost to develop one drug, to be closer to $4 billion (Miseta, 2013). However, this (high) estimate is criticized by some experts for including the cost of drug failures but not the R&D tax credit afforded to the pharmaceutical companies. For some drug development companies the cost is even higher: 14 companies spend in excess of $5 billion per new drug according to Forbes (Harper, 2013), as referenced in Table 3.
Table 3. R&D cost per drug (Liu, Constantinides, & Li, 2013)
14 “A recent, widely circulated estimate put the average cost of developing an innovative new drug at more than $800 million, including expenditures on failed projects and the value of forgone alternative investments. Although that average cost suggests that new-drug discovery and development can be very expensive, it reflects the research strategies and drug-development choices that companies make on the basis of their expectations about future revenue. If companies expected to earn less from future drug sales, they would alter their research strategies to lower their average R&D spending per drug. Moreover, that estimate represents only NMEs developed by a sample of large pharmaceutical firms. Other types of drugs often cost much less to develop (although NMEs have been the source of most of the major therapeutic advances in pharmaceuticals)” (CBO, 2006).
In fact, the exponential growth of the clinical trial cost impedes progress in medicine and,
ultimately, may put the national public health at risk. CBO (2006) reported that the federal
government spent more than $25 billion on health-related R&D in 2005. Fee (2007), based on
the CMSInfo analysis, reported that national spending on clinical trials (including new
indications and marketing studies) in the United States was nearly $24 billion in 2005. The
research institute expected this number to rise to $25.6 billion in 2006 and $32.1 billion in
2011—growing at an average rate of 4.6% per year. The reality exceeded the initial projections.
It is believed that the nation spends over $50 billion per year on pharmaceutical clinical trials
(CBO, 2006; Kaitin, 2008). Research companies need to find a way to reduce these costs if the industry is to sustain growth and fund all necessary clinical trials.
Some authors question the effectiveness and efficiency of the processes employed by the
industry. Ezekiel (2003) pointed out more than a decade ago that a large proportion of clinical trial costs has been devoted to non-treatment trial activities. More recently, the Tufts Center for the Study of Drug Development conducted an extensive study among a working group
of 15 pharmaceutical companies in which a total of 25,103 individual protocol procedures were
evaluated and classified using clinical study reports and analysis plans. This study uncovered
significant waste in clinical trial conduct across the industry. More specifically, the results
demonstrate that
…the typical later-stage protocol had an average of 7 objectives and 13 end points of
which 53.8% are supplementary. One (24.7%) of every 4 procedures performed per
phase-III protocol and 17.7% of all phase-II procedures per protocol were classified as
"Noncore" in that they supported supplemental secondary, tertiary, and exploratory end
points. For phase-III protocols, 23.6% of all procedures supported regulatory compliance
requirements and 15.9% supported those for phase-II protocols. The study also found that
on average, $1.7 million (18.5% of the total) is spent in direct costs to administer
Noncore procedures per phase-III protocol and $0.3 million (13.1% of the total) in direct
costs are spent on Noncore procedures for each phase-II protocol. Based on the results of
this study, the total direct cost to perform Noncore procedures for all active annual phase-
II and phase-III protocols is conservatively estimated at $3.7 billion annually. (Getz et al.,
2013)
A major portion of the clinical trial operation cost is devoted to “data cleaning” and other
data handling activities that are intended to increase the likelihood of drug approval. It appears
that the pharmaceutical industry (and society in general) continues to pay the price of not using a
clear-cut definition of DQ, in the hopes that 100% accuracy will make their approval less
painful. The title of Lörstad’s publication (2004) “Data Quality of the Clinical Trial Process –
Costly Regulatory Compliance at the Expense of Scientific Proficiency” summarizes the
concerns of the scientific community as they pertain to unintelligent utilization of resources. A vital question - what proportion of these resources could be saved, and how? - has been frequently posed in the scientific literature over the past decade. Some authors estimate that a significant portion of the funds allocated for clinical research may be wasted due to issues related to data quality (DQ) (Ezekiel & Fuchs, 2008; Funning, 2009; Tantsyura et al., 2015).
Many recent publications focus on the reduction of barely efficient manual data cleaning
techniques, such as SDV. Typically, such over-utilization of resources is a direct result of the
conservative interpretation of the regulatory requirements for quality. Lörstad (2004) calls this
belief in the need for perfection of clinical data a “carefully nursed myth” and “nothing but
embarrassing to its scientifically trained promoters.”
The fundamental uniqueness of QA in clinical development relative to other industries (manufacturing being the most extreme example) is the practical difficulty of using the classic quality assurance approach, in which a sample of “gadgets” is inspected, the probability of failure is estimated, and conclusions about quality are drawn. Such an approach to QA is not only impractical in pharmaceutical clinical trials because of the prohibitively high cost, but may
also be unnecessary. To further complicate the situation, the efficacy end-points and data
collection instruments and mechanisms are not yet standardized across the industry. This fact
makes almost every trial (and DQ context for this trial) unique, and thus requires intelligent
variation in the approaches to DQ from one trial to another in order to eliminate waste.
The industry does not appear to be fully prepared for such a cerebral and variable approach to quality. Very often, from the perspective of the senior management allocating the data cleaning resources, it is much safer, easier and surprise-free to stick to uniformly conservative interpretations of the regulatory requirements than to allow a study team to determine the quality acceptance criteria on a case-by-case basis. Study teams are not eager to change either, because of insufficient training and a lack of motivation. Generalizing and applying DQ standards from one trial to another leads to imprecision in defining study-specific levels of “good” and “poor” quality and, subsequently, to overspending in the name of high DQ. Thus, the definition of DQ is at the heart of the inefficient utilization of resources. An anticipated dramatic reduction in reliance on manual SDV, and its replacement with statistically powered computerized algorithms, will lead to DQ improvements while reducing cost.
One can consider on-site monitoring, the most expensive and widely used element of the
process to assure DQ, as a vivid example of such inefficiency. The FDA has made it clear that
extensive on-site monitoring is no longer essential. In April 2011, the FDA withdrew a 1988
guidance on clinical monitoring (FDA, 1988) that emphasized on-site monitoring visits by the
sponsor or CRO personnel “because the 1988 guidance no longer reflected the current FDA
opinion” (Lindblad et al., 2014). A clinical monitoring guidance (FDA, 2013) encourages sponsors to use a variety of approaches to meet their trial oversight responsibilities and to optimize the utilization of their resources.
In my recent work (Tantsyura et al., 2015), I estimated that over nine billion dollars could be saved annually by the pharmaceutical industry if new standards for the DQ process in monitoring clinical trials were established and implemented by clinical trial operations across the industry. Tantsyura et al. (2010) and Nielsen, Hyder, and Deng (2013) compared the cost implications of several SDV models and concluded that a “mixed approach” (characterized by a minimal amount of SDV relative to other alternatives) appears to be the most efficient (Nielsen et al., 2013). In addition to site monitoring, new DQ standards would impact all other functions, including data management, statistical programming, regulatory, quality, and pharmacovigilance departments, resulting in an even greater total potential savings.
Magnitude of Data Errors in Clinical Research
Error rates have long been an important consideration in the analysis of clinical trials. As discussed earlier, each data processing step in clinical trials may potentially introduce errors and can be characterized by a corresponding “error rate.” For example, the transcription step (from the “source document” to the CRF) is associated with a transcription error rate. Similarly, each data entry step is associated with a data entry error rate. Consequently, the overall (or “source-to-database”) error rate is the sum of the error rates introduced by each data processing step.
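For illustration (the per-step rates below are hypothetical, and the additivity is an approximation that holds when per-step error rates are small), the source-to-database error rate of a paper-based process might be composed as follows:

# Hypothetical per-step error rates, expressed as errors per field
transcription_error_rate = 0.0030   # source document -> paper CRF
data_entry_error_rate = 0.0015      # paper CRF -> database

# Overall "source-to-database" error rate as the sum of the step rates
overall_error_rate = transcription_error_rate + data_entry_error_rate
print(f"{overall_error_rate * 10_000:.0f} errors per 10,000 fields")   # 45 errors per 10,000 fields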
The historical context (EDC eliminated the need for the paper CRF and thus the errors associated with data entry; more recently, eSource technologies such as EMR and DDE began eliminating the paper source, thus eradicating transcription errors and further improving DQ) is important for understanding the error rates reported in the literature. These evolutionary changes also affected the approaches to QA
audits. “Historically, only partial assessments of data quality have been performed in clinical
trials, for which the most common method measuring database error rates has been to compare
the case report forms (CRF) to database entries and count discrepancies” (Nahm et al., 2008). In
the contemporary data collection environment, even limited-in-scope single-step audits are rarely
used.
While actual error rates vary moderately from study to study due to multiple contextual
factors and are not assessed habitually, some reports on error rates are found in the literature. The
average source-to-database error rate in electronic data capture (EDC) – 14.3 errors per 10,000
fields or 0.143% – is significantly lower than the average published error rate (976 errors per
10,000 or 9.76%) calculated as a result of a literature review and analysis of 42 articles in the
period from 1981 to 2005, a majority of which were paper registries (Nahm, 2008).
Mitchel, Kim, Choi, Park, Schloss Markowitz and Cappi (2010) analyzed the proportion
of data corrections using a smaller sample for a paper CRF study and also found a very small
proportion (approx. 6%) of CRFs modified. In the most recent study by Mitchel and colleagues,
which used the innovative DDE system, the reported rate of error corrections is 1.44% (Mitchel, Gittleman, Park et al., 2014). Also, Medidata (Yong, 2013) analyzed the magnitude of data modifications across a sample of more than 10,000 clinical trials comprising millions of data points. Many were surprised that the proportion of data points modified after the original data entry was minimal – just under 3%. Grieve (2012) estimated data modification rates based on a
sample of 1,234 individual patient visits from 24 studies. His estimates were consistent with the
rate reported by Medidata. “The pooled estimate of the error rate across all projects was
approximately 3% with an associated 95% credible interval ranging from 2% to 4%.” [Note: First, this imprecise use of the term “error rate” is observed in several publications and is likely a reflection of a modern trend. Second, the methodologies in the Mitchel et al. (2010) and
Medidata (Yong, 2013) studies were different – data-point-level vs. CRF level error correction.
As a result of these methodological differences, the numerator, denominator, and, ultimately, the
reported rates, were categorically different.]
The most recent retrospective multi-study data analysis by TransCelerate (Scheetz et al., 2014) revealed that “only 3.7% of eCRF data are corrected following initial entry by site personnel.” The difference between the reported rates of 3% (Yong, 2013) and 3.7% (Scheetz et al., 2014) is primarily attributed to the fact that “the later estimates included cases of initially missing data that were discovered and added later, usually as a result of monitoring activities by CRA, which were not included as a separate category in the previous analysis” (Tantsyura et al., 2015). Overall, it has been established that the EDC environment with built-in real-time edit
checks has been reducing the error rate (and thus improving DQ) by a magnitude of 40-95%
(Bakobaki et al., 2012).
Now that EDC is the dominant data acquisition tool in pharmaceutical clinical development, data are collected not only much faster but also with higher quality (lower error rates), and the first regulatory approval using DDE technology is on the horizon. The important question is: how much more data cleaning is necessary? What additional data cleaning
methods should or should not be used? Is investment in (expensive but still popular) human
source data verification effective and efficient, and does it produce a return commensurate with
its expense? Which processes truly improve the probability of a correct study conclusion and
drug approval, while reducing resource utilization at the same time?
Source Data Verification, Its Minimal Impact on DQ, and the Emergence of Central/Remote
Review of Data
Many authors look at SDV as an example of a labor-intensive and costly data-correction
method and present evidence on the effectiveness of SDV as it pertains to query generation and
data correction rates. All reviewed studies consistently demonstrated a minimal effect of SDV-related data corrections with respect to overall DQ.
More specifically, based on a sample of studies, TransCelerate (2013) calculated that SDV-generated queries constituted, on average, 7.8% of the total number of queries generated, and that the average percentage of SDV queries generated exclusively on “critical” data was 2.4%.
A study by Cancer Research UK Liverpool Cancer Trials Unit assessed the value of SDV for
oncology studies and found it to be minimal (Smith et al., 2012). In this study, data discrepancies
and comparative treatment effects obtained following 100% SDV were compared to those based
on data without SDV. In the sample of 533 subjects, baseline data discrepancies identified via
SDV varied from 0.6% (Gender), 0.8% (Eligibility Criteria), 1.3% (Ethnicity), 2.3% (Date of
Birth) to 3.0% (WHO PS), 3.2% (Disease Stage) and 9.9% (Date of Diagnosis). All discrepancies
were equally distributed across treatment groups and across sites, and no systematic patterns
were identified. The authors concluded that “in this empirical comparison, SDV was expensive
and identified random errors that made little impact on results and clinical conclusions of the
trial. Central monitoring using an external data source was a more efficient approach for the
primary outcome of overall survival. For the subjective outcome objective response, an
independent blinded review committee and tracking system to monitor missing scan data could
be more efficient than SDV.” Similarly to Mitchel et al. (2011), Smith and colleagues suggested that an alternative to SDV “to safeguard against the effect of random errors might be to inflate
the target sample size…” This recommendation will be further explored in the “Practical
Implications” section of the discussion.
Mitchel, Kim, Hamrell et al. (2014) analyzed the impact of SDV and queries issued by
CRAs in a study with 180 subjects. In this study, a total of 5,581 paper source records were reviewed at the site and compared with the clinical trial database. This experiment showed that only 3.9% of
forms were queried by CRAs and only 1.4% of forms had database changes as a result of queries
generated by CRAs (37% query effectiveness rate). Also, the “error rate” associated with SDV
alone was 0.86%.
The review by Bakobaki and colleagues “determined that centralized [as opposed to
manual/on-site] monitoring activities could have identified more than 90% of the findings
identified during on-site monitoring visits” [Bakobaki et al., 2012, as quoted by FDA RBM
guidance 2013]. This study leads to the conclusion that the manual data cleaning step is not only
too expensive, but also often unnecessary when and if a rigorous computerized data validation
process is employed.
Lindblad et al. (2013) attempted to “determine whether a central review by statisticians
using data submitted to the FDA… can identify problem sites and trials that failed FDA site
inspections.” The authors concluded that “systematic central monitoring of clinical trial data can
identify problems at the same trials and sites identified during FDA site inspections.”
In summary, SDV is the most expensive, but widely utilized, manual step in the data
cleaning process. However, multiple publications emphasize that while the SDV component was essential for identifying and addressing issues under the old 100% SDV paradigm, with modern technology this step of the process is largely wasted effort, because computer-enabled algorithms identify data issues much more efficiently.
Effect of Data Errors on Study Results
A recent study by Mitchel, Kim, Choi et al. (2011) uncovered the impact of data error
and error correction on study results by analyzing study data before and after data cleaning in a
study with 492 randomized subjects in a real pharmaceutical clinical development setting. In
this case, data cleaning did not change the study conclusions. Also, this study raised a legitimate
question: does EDC, with its built-in on-line and off-line edit checks, produce “good enough” data such that no further cleaning is necessary? More specifically, Mitchel and colleagues observed
the following three phenomena: (a) nearly identical means before and after data cleaning, (b) a
slight reduction in SD (by approximately 3%), and, most importantly, (c) the direct (inverse)
impact of the sample size on the “benefit” of data cleaning. [Further study of this last
phenomenon is the main focus of this dissertation thesis.]
Thus, regardless of the systematic nature of data errors15, in the presence of built-in edit checks, some evidence demonstrates the very limited impact of data errors and data cleaning on study results.
Study Size Effect and Rationale for the Study
What are the factors that influence the validity, reliability and integrity of study
conclusions (i.e. DQ)? There are many. For example, the FDA emphasizes “trial-specific factors
(e.g., design, complexity, size, and type of study outcome measures)” (FDA, 1998). Recent
literature points out the inverse relationship between the study size and the impact of errors on
study conclusions (Mitchel et al., 2011; Tantsyura et al., 2015), as well as the diminishing (with
study size) return on investment (ROI) of DQ-related activities (Tantsyura et al. 2010, Tantsyura
15 With the exception of fraud, which is “rare and isolated in scope” (Helfgott, 2014).
et al., 2015). Mitchel, Kim, Choi and colleagues (2011) point out “the impact on detectable
differences [between the “original” and the “clean” data]…, is a direct function of the sample
size.”
However, the impact of study size on DQ has not been examined systematically in the
literature to date. This study is intended to fill that knowledge gap. Monte-Carlo simulations will be used as the mechanism to generate data for analysis, as has been suggested in the literature (Pipino & Kopcso, 2004). Statistical, regulatory, economic and policy aspects of the issue will be examined and discussed. The specific research questions this study focuses on are: (1) “What is the statistical and economic impact of the study size on study
conclusions (i.e. data quality)?” and (2) “Are there any opportunities for policy changes to
improve the effectiveness and efficiency of the decision-making process?”
III. Analytical Section/Methods of Analysis.
Objectives and End-Points
Primary Objective: Estimate the statistical impact of the study size on study conclusions in the presence of data errors.
Primary End-Points:
 Probability of the correct decision (i.e., a match between the t-test on the “error-free” data and the t-test on the “error-induced” data) – Pcorrect
 Probability of the false-negative decision – Pfalse-neg
 Probability of the false-positive decision – Pfalse-pos
Secondary Objective 1: Estimate the data cleaning “cut-off point” for studies with different sample sizes.
Secondary End-Points:
 Minimal sample size (n95%) to achieve a 95% probability of a correct decision without data cleaning (Pcorrect ≥ 95%)
 Minimal sample size (n98%) to achieve a 98% probability of a correct decision without data cleaning (Pcorrect ≥ 98%)
 Minimal sample size (n99%) to achieve a 99% probability of a correct decision without data cleaning (Pcorrect ≥ 99%)
Secondary Objective 2: Estimate the economic impact of the reduction in data cleaning activities associated with the introduction of a “data cleaning cut-off” policy.
Secondary End-Point: Estimated cost savings (% cost reduction) associated with reduced data cleaning.
Secondary Objective 3: Policy recommendations.
Secondary End-Point: A list of recommendations for (potential) process modifications and regulatory policy changes to improve the effectiveness and efficiency of the DQ-related decision-making process.
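Once Pcorrect has been simulated across a grid of sample sizes, the secondary end-points n95%, n98% and n99% reduce to a simple look-up: the smallest sample size whose simulated probability of a correct decision meets the threshold. A minimal sketch in Python (the probabilities below are placeholders, not results of the simulation):

# Placeholder simulated probabilities of a correct decision, keyed by sample size per arm
p_correct_by_n = {25: 0.91, 50: 0.94, 100: 0.96, 200: 0.985, 500: 0.995, 1000: 0.999}

def minimal_n(p_correct_by_n, threshold):
    # Smallest sample size per arm whose simulated Pcorrect meets the threshold
    qualifying = [n for n, p in sorted(p_correct_by_n.items()) if p >= threshold]
    return qualifying[0] if qualifying else None

for threshold in (0.95, 0.98, 0.99):
    print(f"n{threshold:.0%} = {minimal_n(p_correct_by_n, threshold)}")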
Critical Concepts and Definitions:
 “Accuracy” dimension of DQ is under investigation. Other dimensions are not
considered in this study.
 Conceptual definition of DQ: “High-quality” data is formally defined “…as data strong
enough to support conclusions and interpretations equivalent to those derived from
error-free data.” (IOM, 1999)
 Operational definition and measure of DQ: the probability of correct study conclusion
relative to “error-free data”. This definition is logically followed by the definitions of
(a) “increase in DQ” that is measured by an increase in probability of correct study
conclusion and (b) “reduction in DQ” that is measured by a decrease in probability of
correct study conclusion.
Synopsis:
Determine the impact of data errors on DQ and study conclusions for 147 hypothetical scenarios
while varying (1) the number of subjects per arm (n), (2) the effect size or difference between
means of active arm and comparator (delta), and (3) the number of errors per arm (e). Each trial
scenario is repeated 200 times and, thus, the results and probabilities calculated in this experiment are based on 147 x 200 = 29,400 trials. The input and output variables used in this
study are listed in the box below:
Input variables:
n – number of observations (pseudo-subjects) per arm
Co-variates:
e – number of induced errors per arm
d (delta) – effect size/difference between means of hypothetical “Pbo” and 7 different
“active” arms
Primary outcome variables of interest:
Pcorrect – probability of correct decision (i.e. match between t-test result from “error-
free” data and t-test from “error-induced” datasets)
Secondary outcome variables of interest:
Pfalse-neg – probability of false-negative decision
Pfalse-pos – probability of false-positive decision
Main Assumptions:
 All 147 scenarios are superiority trials
 Only one variable of analysis is considered for each simulated scenario
 The distributions of values in ‘active’ and ‘Pbo’ arms are assumed normal. Also
o Variability for Pbo arm and seven “active” arms are assumed identical and fixed
(SD=1) for all 147 scenarios
 Errors are distributed as Gaussian “white noise”
 Out-of-range edit checks (EC), with a pre-specified range of 4 SD, are assumed to be 100% effective (i.e., they catch 100% of out-of-range errors). Overall, it is presumed that an adequate process exists for detecting important issues.
 The number of errors in the active and Pbo arms is assumed equal in all scenarios (0 vs. 0, 1 vs. 1, 2 vs. 2, 5 vs. 5)
 Number of errors (e) and number of observations per arm (n) are considered independent
variables.
Design of the Experiment:
The idea and design of this experiment come directly from the IOM (1999) definition of high quality data, described “as data strong enough to support conclusions and interpretations equivalent to those derived from error-free data.” In a hypothetical trial, this definition implies that the data are of “good enough” quality (no matter how many errors are in the datasets) as long as the statistical significance of the analysis remains on the same side of the significance threshold (typically 0.05) as it would with error-free data. Furthermore, this definition suggests a formal comparison between an error-free dataset and the same dataset with induced data errors (i.e., several data points removed and replaced with “erroneous” values), which is implemented in this study.
Figure 6. High-level Design of the Experiment
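A minimal sketch of one such comparison, written in Python under the stated assumptions (normal data with SD = 1, errors induced as Gaussian “white noise,” and erroneous values beyond 4 SD caught by the edit checks); the magnitude of the induced noise is an illustrative assumption, not the parameter used in the actual simulations:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)

def induce_errors(arm, center, n_errors, noise_sd=2.0, ec_limit=4.0):
    # Replace n_errors randomly chosen values with "white noise" errors; erroneous values
    # falling outside the +/- 4 SD edit-check range are assumed caught and not introduced
    arm = arm.copy()
    for i in rng.choice(len(arm), size=n_errors, replace=False):
        erroneous = arm[i] + rng.normal(0.0, noise_sd)
        if abs(erroneous - center) <= ec_limit:
            arm[i] = erroneous
    return arm

def simulate_trial(n, delta, n_errors):
    # One hypothetical superiority trial: t-test on error-free data vs. error-induced data
    pbo = rng.normal(0.0, 1.0, n)          # placebo arm, SD fixed at 1
    active = rng.normal(delta, 1.0, n)     # active arm shifted by the effect size delta
    _, p_error_free = stats.ttest_ind(active, pbo)
    _, p_erroneous = stats.ttest_ind(induce_errors(active, delta, n_errors),
                                     induce_errors(pbo, 0.0, n_errors))
    return p_error_free, p_erroneous

p_true, p_err = simulate_trial(n=100, delta=0.2, n_errors=2)

The resulting pair of p-values is then coded as shown in Table 4 below.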
As a result of the experiment, the comparison is repeated 200 times for each of the 147 scenarios, producing the equivalent of 29,400 hypothetical trials, from which the probabilities of “correct” (as well as “false-negative” and “false-positive”) decisions are calculated. [Note: The “correct” study conclusion, for the
purpose of this experiment, is defined as correct hypothesis acceptance or rejection (when the
“statistical significance” dictated by p-values in both erroneous and error-free datasets are the
same; denoted as code “0” in the table below). More specifically, this probability is calculated as the proportion of “matches” between p-values calculated from the “error-free” and error-induced datasets, with a denominator of 200, which reflects the number of Monte-Carlo iterations for each scenario.] Similarly, the probabilities of incorrect (“false-negative” and “false-positive”) decisions are calculated as well (codes “-1” and “+1” respectively). Table 4 is a visual demonstration of the event coding for “hits” and “misses.”
Table 4. Coding for “Hits” and “Misses”

                                      True probability P(d) < 0.05      True probability P(d) ≥ 0.05
Erroneous probability Pd_er < 0.05    0 (correct)                       +1 (false-positive)
Erroneous probability Pd_er ≥ 0.05    -1 (false-negative)               0 (correct)

(where d = 0, 0.05, 0.1, 0.2, 0.5, 1, 2)
Legend:
 Correct decision (i.e. the statistical significance is not changed by induced error(s)):
o code “0”
 Incorrect decision (i.e. the statistical significance is changed by induced error(s)):
o False positive decision: code “+1”
o False negative decision: code “-1”
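The event coding in Table 4 and the resulting probabilities can be expressed compactly; a minimal, self-contained sketch (the p-value pairs below are illustrative; in the experiment each scenario contributes 200 such pairs, one per Monte-Carlo iteration):

def code_outcome(p_error_free, p_erroneous, alpha=0.05):
    # Table 4 coding: 0 = correct, +1 = false-positive, -1 = false-negative
    significant_true = p_error_free < alpha
    significant_erroneous = p_erroneous < alpha
    if significant_true == significant_erroneous:
        return 0
    return +1 if significant_erroneous else -1

# Illustrative (error-free, error-induced) p-value pairs
pairs = [(0.021, 0.034), (0.047, 0.061), (0.30, 0.28), (0.12, 0.04)]
codes = [code_outcome(p_true, p_err) for p_true, p_err in pairs]

p_correct = codes.count(0) / len(codes)        # Pcorrect
p_false_neg = codes.count(-1) / len(codes)     # Pfalse-neg
p_false_pos = codes.count(+1) / len(codes)     # Pfalse-pos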

Weitere ähnliche Inhalte

Was ist angesagt?

Research methodology khushbu
Research methodology khushbuResearch methodology khushbu
Research methodology khushbukhushbu mishra
 
Lemeshow samplesize
Lemeshow samplesizeLemeshow samplesize
Lemeshow samplesize1joanenab
 
statistics in pharmaceutical sciences
statistics in pharmaceutical sciencesstatistics in pharmaceutical sciences
statistics in pharmaceutical sciencesTechmasi
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatisticsWeam Banjar
 
Ct lecture 17. introduction to logistic regression
Ct lecture 17. introduction to logistic regressionCt lecture 17. introduction to logistic regression
Ct lecture 17. introduction to logistic regressionHau Pham
 
Health information & Basic statistics forparamedicals
Health information & Basic statistics forparamedicalsHealth information & Basic statistics forparamedicals
Health information & Basic statistics forparamedicalsPrithvipal Singh
 
scope and need of biostatics
scope and need of  biostaticsscope and need of  biostatics
scope and need of biostaticsdr_sharmajyoti01
 
Introduction to biostatistic
Introduction to biostatisticIntroduction to biostatistic
Introduction to biostatisticJoshua Anish
 
Epidemiolgy and biostatistics notes
Epidemiolgy and biostatistics notesEpidemiolgy and biostatistics notes
Epidemiolgy and biostatistics notesCharles Ntwale
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatisticsshivamdixit57
 
Sample size calculation in medical research
Sample size calculation in medical researchSample size calculation in medical research
Sample size calculation in medical researchKannan Iyanar
 
1. Introduction to biostatistics
1. Introduction to biostatistics1. Introduction to biostatistics
1. Introduction to biostatisticsRazif Shahril
 
BIOSTATISTICS + EXERCISES
BIOSTATISTICS + EXERCISESBIOSTATISTICS + EXERCISES
BIOSTATISTICS + EXERCISESMINANI Theobald
 

Was ist angesagt? (20)

Sample size calculation
Sample size calculationSample size calculation
Sample size calculation
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
Research methodology khushbu
Research methodology khushbuResearch methodology khushbu
Research methodology khushbu
 
Lemeshow samplesize
Lemeshow samplesizeLemeshow samplesize
Lemeshow samplesize
 
Biostatistics ug
Biostatistics  ug Biostatistics  ug
Biostatistics ug
 
statistics in pharmaceutical sciences
statistics in pharmaceutical sciencesstatistics in pharmaceutical sciences
statistics in pharmaceutical sciences
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatistics
 
Ct lecture 17. introduction to logistic regression
Ct lecture 17. introduction to logistic regressionCt lecture 17. introduction to logistic regression
Ct lecture 17. introduction to logistic regression
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
Health information & Basic statistics forparamedicals
Health information & Basic statistics forparamedicalsHealth information & Basic statistics forparamedicals
Health information & Basic statistics forparamedicals
 
scope and need of biostatics
scope and need of  biostaticsscope and need of  biostatics
scope and need of biostatics
 
P-values in crisis
P-values in crisisP-values in crisis
P-values in crisis
 
Introduction to biostatistic
Introduction to biostatisticIntroduction to biostatistic
Introduction to biostatistic
 
Epidemiolgy and biostatistics notes
Epidemiolgy and biostatistics notesEpidemiolgy and biostatistics notes
Epidemiolgy and biostatistics notes
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatistics
 
Sample size calculation in medical research
Sample size calculation in medical researchSample size calculation in medical research
Sample size calculation in medical research
 
Bioststistic mbbs-1 f30may
Bioststistic  mbbs-1 f30mayBioststistic  mbbs-1 f30may
Bioststistic mbbs-1 f30may
 
1. Introduction to biostatistics
1. Introduction to biostatistics1. Introduction to biostatistics
1. Introduction to biostatistics
 
BIOSTATISTICS + EXERCISES
BIOSTATISTICS + EXERCISESBIOSTATISTICS + EXERCISES
BIOSTATISTICS + EXERCISES
 

Ähnlich wie 2015_05_12 Dissertation Binder_VTantsyura

How To Write Good Reports In English - Ackland Writing
How To Write Good Reports In English - Ackland WritingHow To Write Good Reports In English - Ackland Writing
How To Write Good Reports In English - Ackland WritingLori Moore
 
Synopsis_V1_9-18-2023.docxddddddddddddddddddddddddddd
Synopsis_V1_9-18-2023.docxdddddddddddddddddddddddddddSynopsis_V1_9-18-2023.docxddddddddddddddddddddddddddd
Synopsis_V1_9-18-2023.docxdddddddddddddddddddddddddddsohailg3e6565
 
Practice and the Quadruple.docx
Practice and the Quadruple.docxPractice and the Quadruple.docx
Practice and the Quadruple.docxstudywriters
 
dissertation 0316 final AW
dissertation 0316 final AWdissertation 0316 final AW
dissertation 0316 final AWAnneWojtak
 
Hài lòng bệnh nhân tại bênh viện Nhi Hải Dương
Hài lòng bệnh nhân tại bênh viện Nhi Hải DươngHài lòng bệnh nhân tại bênh viện Nhi Hải Dương
Hài lòng bệnh nhân tại bênh viện Nhi Hải DươngNghiên Cứu Định Lượng
 
Examples Of University Application Essay
Examples Of University Application EssayExamples Of University Application Essay
Examples Of University Application EssaySue Jones
 
Dichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatisticianDichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatisticianLaure Wynants
 
Factors That Impacted Effective Diabetes Management Within...
Factors That Impacted Effective Diabetes Management Within...Factors That Impacted Effective Diabetes Management Within...
Factors That Impacted Effective Diabetes Management Within...Susan Tullis
 
Nursing Shortageby Monica CastelaoSubmission dat e 01-.docx
Nursing Shortageby Monica CastelaoSubmission dat e  01-.docxNursing Shortageby Monica CastelaoSubmission dat e  01-.docx
Nursing Shortageby Monica CastelaoSubmission dat e 01-.docxcherishwinsland
 

Ähnlich wie 2015_05_12 Dissertation Binder_VTantsyura (12)

How To Write Good Reports In English - Ackland Writing
How To Write Good Reports In English - Ackland WritingHow To Write Good Reports In English - Ackland Writing
How To Write Good Reports In English - Ackland Writing
 
PR2_Wk1-2.pptx
PR2_Wk1-2.pptxPR2_Wk1-2.pptx
PR2_Wk1-2.pptx
 
Synopsis_V1_9-18-2023.docxddddddddddddddddddddddddddd
Synopsis_V1_9-18-2023.docxdddddddddddddddddddddddddddSynopsis_V1_9-18-2023.docxddddddddddddddddddddddddddd
Synopsis_V1_9-18-2023.docxddddddddddddddddddddddddddd
 
Statistics Exericse 29
Statistics Exericse 29Statistics Exericse 29
Statistics Exericse 29
 
Practice and the Quadruple.docx
Practice and the Quadruple.docxPractice and the Quadruple.docx
Practice and the Quadruple.docx
 
dissertation 0316 final AW
dissertation 0316 final AWdissertation 0316 final AW
dissertation 0316 final AW
 
Hài lòng bệnh nhân tại bênh viện Nhi Hải Dương
Hài lòng bệnh nhân tại bênh viện Nhi Hải DươngHài lòng bệnh nhân tại bênh viện Nhi Hải Dương
Hài lòng bệnh nhân tại bênh viện Nhi Hải Dương
 
Sample Of Research Essay
Sample Of Research EssaySample Of Research Essay
Sample Of Research Essay
 
Examples Of University Application Essay
Examples Of University Application EssayExamples Of University Application Essay
Examples Of University Application Essay
 
Dichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatisticianDichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatistician
 
Factors That Impacted Effective Diabetes Management Within...
Factors That Impacted Effective Diabetes Management Within...Factors That Impacted Effective Diabetes Management Within...
Factors That Impacted Effective Diabetes Management Within...
 
Nursing Shortageby Monica CastelaoSubmission dat e 01-.docx
Nursing Shortageby Monica CastelaoSubmission dat e  01-.docxNursing Shortageby Monica CastelaoSubmission dat e  01-.docx
Nursing Shortageby Monica CastelaoSubmission dat e 01-.docx
 

2015_05_12 Dissertation Binder_VTantsyura

  • 1. 28 April, 2015 pg. 1 © 2015, Vadim Tantsyura, ALL RIGHTS RESERVED. Impact of Study Size on Data Quality in Regulated Clinical Research: Analysis of Probabilities of Erroneous Regulatory Decisions in the Presence of Data Errors Vadim Tantsyura A Doctoral Dissertation in the Program in Health Policy and Management Submitted to the Faculty of the Graduate School of Health Sciences and Practice In Partial Fulfillment of the Requirements for the Degree of Doctor of Public Health at New York Medical College 2015
  • 2.
  • 3.
  • 4. 28 April, 2015 pg. 3 © 2015, Vadim Tantsyura, ALL RIGHTS RESERVED. Acknowledgements I would like to thank my colleagues, friends and family who have provided support throughout this journey. Thank you, Dr. Kenneth Knapp, for keeping this research focused. With your guidance and foresight, goals became reachable. Dr. Imogene McCanless Dunn, you stepped into my life many years ago and introduced me to the true meaning of scientific inquiry. You were the first one who planted the seeds more than a decade ago that led to this research. Your wisdom and ethics have helped me pull the pieces of this project together. I thank you, Kaye Fendt, for introducing this research area to me, for ideas, and for bringing this important issue forward so many people may benefit from. Your gentle touch changed the direction of my thinking multiple times over the past ten years. I would like to express my deepest gratitude to you, Dr. Jules Mitchel, for supporting me from my earlier years in the industry, for shaping my views and writing style, and for your long-time mentorship and regular encouragement. Your frame of reference allowed me to finalize my own thoughts, and your impact on my work is much greater than you might think. Dr. Rick Ittenbach, I greatly appreciate the time you took to listen, and I aspire to be someday as fair and wise as you are. I would also like to thank Joel Waters and Amy Liu for “turning things around” for me. I would especially like to thank my wonderful family, who allowed me to step away to complete my journey. I will make up the lost time, I promise. I hope my children, Eva, Daniel and Joseph, understand their education will go on for the rest of their lives. I hope they enjoy their journeys as much as I have mine. I thank my parents, Volodymyr and Lyuba, for their commitment to being role models for lifelong learning. Last, I would like to thank my wife, Nadia. She has been a constant source of support through this process and fourteen years of my other brainstorms. Thanks for understanding why I needed to do this.
  • 5. 28 April, 2015 pg. 4 © 2015, Vadim Tantsyura, ALL RIGHTS RESERVED. Abstract. Background: Data errors are undesirable in clinical research because they tend to increase the chance of false-negative and false-positive study conclusions. While recent literature points out the inverse relationship between the study size and the impact of errors on data quality and study conclusions, no systematic assessment of this phenomenon has been conducted. The IOM (1999) definition of high quality data, described as “…data strong enough to support conclusions and interpretations equivalent to those derived from error-free data”, is used for this study. Purpose: To assess the statistical impact of the study size on data quality, and identify the areas for potential policy changes. Methods: A formal comparison between an error-free dataset and the same dataset with induced data errors via replacement of several data points with the “erroneous” values was implemented in this study. The data are simulated for one hundred and forty-seven hypothetical scenarios using the Monte-Carlo method. Each scenario was repeated 200 times, resulting in an equivalent of 29,400 hypothetical clinical trials. The probabilities of correct, false-positive and false- negative conclusions were calculated. Subsequently, the trend analysis of the simulated probabilities was conducted and the best fit regression lines were identified. Results: The data demonstrate that the monotonic, logarithmic-like asymptotic increase towards 100% is associated with the sample size increase. The strength of association between study size and probabilities is high (R2 = 0.84-0.93). Median probability of the correct study conclusion is equal to or exceeds 99% for all larger size studies – with 200 observations per arm or more. Also, marginal effects of additional errors on the study conclusions has been demonstrated. For
the smaller studies (up to 200 subjects per arm), the variability of changes is high, with essentially no trend. The range of fluctuations of changes in the median probability of the correct study conclusion is within 3% (ΔPcorrect = [-3%, 0%]). For the larger studies (n ≥ 200 per arm), on the other hand, the variability of ΔPcorrect and the negative impact of errors on ΔPcorrect are minimal (within a 0.5% range).

Limitations: The number of errors and the study size are treated as independent variables in this experiment. However, a non-zero correlation between sample size and the number of errors can be found in real trials.

Conclusions: (1) The "sample size effect," i.e., the neutralizing effect of the sample size on the noise from data errors, was consistently observed in the case of a single data error as well as in the case of an incremental increase in the number of errors. The data cleaning threshold methodology suggested by this manuscript can lead to a reduction in resource utilization. (2) Error rates have been considered the gold standard of DQ assessment but have not been widely utilized in practice due to their prohibitively high cost. The proposed method suggests estimating DQ using the simulated probability of correct/false-negative/false-positive study conclusions as an outcome variable and rough estimates of error rates as input variables.
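The simulation logic summarized in the Methods can be sketched in a few lines. The sketch below is illustrative only and is not the dissertation's actual simulation program (which is described in Section III and Appendices 2-3): it assumes a two-arm superiority design analyzed with a t-test at alpha = 0.05, and the parameter names (n_per_arm, effect_size, n_errors, error_sd) are hypothetical.

```python
# Minimal sketch (not the dissertation's SAS/VBA programs) of the simulation idea:
# compare the conclusion drawn from an error-free dataset with the conclusion drawn
# from the same dataset after a few values are replaced with "erroneous" ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)

def one_trial(n_per_arm, effect_size, n_errors, error_sd=3.0, alpha=0.05):
    placebo = rng.normal(0.0, 1.0, n_per_arm)
    active = rng.normal(effect_size, 1.0, n_per_arm)
    clean_sig = stats.ttest_ind(active, placebo).pvalue < alpha

    # Induce errors: replace randomly chosen data points with noisy "erroneous" values.
    noisy = active.copy()
    idx = rng.choice(n_per_arm, size=n_errors, replace=False)
    noisy[idx] += rng.normal(0.0, error_sd, n_errors)
    dirty_sig = stats.ttest_ind(noisy, placebo).pvalue < alpha

    # "Correct" means the conclusion matches the error-free conclusion (IOM sense).
    if clean_sig == dirty_sig:
        return "correct"
    return "false_positive" if dirty_sig else "false_negative"

def simulate(n_per_arm, effect_size, n_errors, reps=200):
    outcomes = [one_trial(n_per_arm, effect_size, n_errors) for _ in range(reps)]
    return {k: outcomes.count(k) / reps
            for k in ("correct", "false_positive", "false_negative")}

# Example: probability of reaching the same conclusion as the error-free data.
print(simulate(n_per_arm=200, effect_size=0.3, n_errors=5))
```

Repeating such a loop over a grid of sample sizes and error counts is what produces the probability trends analyzed in the Results.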
TABLE OF CONTENTS:

Section I. Introduction ..... 9
Section II. Background and Literature Review ..... 13
   Basic Concepts and Terminology ..... 13
   Measuring Data Quality ..... 24
   DQ in Regulatory and Health Policy Decision-Making ..... 28
   Error-Free-Data Myth and the Multi-Dimensional Nature of DQ ..... 31
   Evolution of DQ Definition and Emergence of Risk-Based Approach to DQ ..... 32
   Data Quality and Cost ..... 34
   Magnitude of Data Errors in Clinical Research ..... 41
   Source Data Verification, Its Minimal Impact on DQ and the Emergence of Central/Remote Review of Data ..... 44
   Effect of Data Errors on Study Results ..... 46
   Study Size, Effect and Rationale for the Study ..... 46
Section III. Methods of Analysis ..... 48
   Objectives and End-Points ..... 48
   Synopsis ..... 49
   Main Assumptions ..... 50
   Design of the Experiment ..... 51
   Data Generation ..... 53
   Data Verification ..... 54
   Analysis Methods ..... 55
Section IV. Results and Findings ..... 57
Section V. Discussion ..... 66
   Practical Implications ..... 67
   Economic Impact ..... 72
   Policy Recommendations ..... 77
   Study Limitations and Suggestions for Future Research ..... 80
   Conclusions ..... 82
References ..... 84

LIST OF APPENDIXES.

Appendix 1. Dimensions of DQ ..... 93
Appendix 2. Simulation Algorithm ..... 94
Appendix 3. Excel VBA Code for Verification Program ..... 99
Appendix 4. Verification Program Output ..... 105
Appendix 5. Simulation Results (SAS Output) ..... 108
Appendix 6. Additional Verification Programming Results ..... 110
Appendix 7. Comparing the Mean in Two Samples ..... 113

LIST OF TABLES.

Table 1. Hypothesis Testing in the Presence of Errors ..... 15
Table 2. Risk-Based Approach to DQ – Principles and Implications ..... 33
Table 3. R&D Cost per Drug ..... 37
Table 4. Coding for "Hits" and "Misses" ..... 51
Table 5. Input Variables and Covariates ..... 52
Table 6. Summary of Data Generation ..... 52
Table 7. Analysis Methods ..... 56
Table 8. Descriptive Statistics ..... 58
Table 9. Descriptive Statistics by Sample Size per Study Arm ..... 58
Table 10. Example of Adjustment in Type I and Type II Errors ..... 68
Table 11. Sample Size Increase Associated with Reduction in Alpha ..... 69
Table 12. Data Cleaning Cut-Off Estimates ..... 71
Table 13. Proposed Source Data Verification Approach ..... 73
Table 14. Estimated Cost Savings ..... 76
I. Introduction.

Because clinical research, public health, regulatory and business decisions are largely determined by the quality of the data these decisions are based on, individuals, businesses, academic researchers, and policy makers all suffer when data quality (DQ) is poor. Many cases published in the news media and scientific literature exemplify the magnitude of the DQ problems that health care organizations and regulators face every day.[1] Gartner estimates that data quality problems cost the U.S. economy $600 billion a year. With regard to the pharmaceutical industry, the issues surrounding DQ carry significant economic implications and contribute to the large cost of pharmaceutical drug and device development. In fact, the cost of clinical trials for medical product development has become prohibitive and presents a significant problem for future pharmaceutical research, which, in turn, may pose a threat to public health. This is why the National Institutes of Health (NIH) and the Patient-Centered Outcomes Research Institute (PCORI) are investing in DQ research (Kahn et al., 2013; Zozus et al., 2015).

[1] For example, the medical provider may change the diagnosis to a more serious one (especially when doctors/hospitals feel that insurance companies do not give fair value for a treatment). Fisher, Lauria, Chengalur-Smith and Wang (2006) cite a study where "40% of physicians reported that they exaggerated the severity of patient condition, changed billing diagnoses, and reported non-existent symptoms to help patients recover medical expenses. The reimbursement for bacterial pneumonia averages $2500 more than for viral pneumonia. Multiplying this by the thousands of cases in hundreds of hospitals will give you an idea of how big the financial errors in the public health assessment field could be". A new edition of the book (Fisher et al., 2012) presents multiple examples of poor quality: "In industry, error rates as high as 75% are often reported, while error rates of up to 30% are typical. Of the data in mission-critical databases, 1% to 10% may be inaccurate. More than 60% of surveyed firms had problems with DQ. In one survey, 70% of the respondents reported their jobs had been interrupted at least once by poor-quality data, 32% experienced inaccurate data entry, 25% reported incomplete data entry, 69% described the overall quality of their data as unacceptable, and 44% had no system in place to check the quality of their data."

Data are collected during clinical trials on the safety and efficacy of new pharmaceutical products, and the regulatory decisions that follow are based on these data. Ultimately, reproducibility, the credibility of research, and regulatory decisions are only as good as the underlying data. Moreover, decision makers and regulators often depend on the investigator's demonstration that the data on which conclusions are based are of sufficient quality to support
them (Zozus et al., 2015). However, with the exception of the newly emerging NIH Collaboratory DQ assessment standards, no clear operational definition of DQ has been adopted to date. As a direct consequence of the lack of agreement on such a fundamental component of the quality system, data cleaning processes vary, conservative approaches dominate, and too many resources are devoted to meeting unnecessarily stringent quality thresholds.

According to DQ expert Kaye Fendt (2004), the considerable financial impact of DQ-related issues on the U.S. health care system is due to the following: (1) "Our medical-scientific-regulatory system is built upon the validity of clinical observation," and (2) "The clinical trials and drug regulatory processes must be trusted by our society". It would therefore be reasonable to conclude that the importance of DQ in modern healthcare, health research, and health policy will continue to grow.

Several experts in the field have been emphasizing the importance of DQ research for more than a decade. For instance, K. Fendt stated at the West Coast Annual DIA (2004): "We need a definition (Target) of data quality that allows the Industry to know when the data are clean enough." Dr. J. Woodcock, FDA Deputy Commissioner for Operations and Chief Operating Officer, echoed this sentiment at the Annual DIA (2006): "…we [the industry] need consensus on the definition of high quality data." As a result of combined efforts over the past 20 years, the definition of high-quality data has evolved from simple "error-free data" to the much more sophisticated "absence of errors that matter" and "fitness of the data for purpose" (CTTI, 2012).

The industry, however, is slow in adopting this new definition of DQ. Experts agree that the industry is so big that, for numerous reasons, it is always slow to change. Some researchers "in the trenches" are simply not aware of the new definition. Others are not yet ready to embrace it due to organizational inertia and the rigidity of established processes, as well as because of insufficient knowledge, misunderstanding, and misinterpretation caused by the complexity of the
new paradigm. In addition, because vendors and CROs make a lot of money on site monitoring, monitoring has long been treated as the "holy grail" of QC and of the regulatory requirements for site operations, regardless of its labor-intensive nature and low effectiveness. The fact that monitoring findings have "no denominator" is another reason: each individual monitoring finding is given great attention and is not put in perspective with the hundreds of other forms and data points that are good. Also, the prevailing industry belief in training and certification of monitors might need reexamination; perhaps it is time to shift resources toward training and certifying coordinators and investigators. Finally, the industry has not agreed on which errors can be spotted by smart programs and which errors require review by monitors. In fact, only one publication to date has addressed this topic. The review by Bakobaki and colleagues (2012) "determined that centralized [as opposed to manual / on-site] monitoring activities could have identified more than 90% of the findings identified during on-site monitoring visits."

Consequently, the industry continues to spend a major portion of the resources allocated for clinical research on processes and procedures related to the quality of secondary data – which often have little direct impact on the study conclusions – while allowing some other critical quality components (e.g., quality planning, triple-checking of critical variables, data standardization, and documentation) to fall through the cracks. In my fifteen years in the industry, I have heard numerous anecdotal accounts of similar situations and have witnessed several examples firsthand. In a landmark new study, TransCelerate revealed what many clinical research professionals have suspected for a long time: that a relatively low proportion (less than one-third) of data clarification queries are related to "critical data" (Scheetz et al., 2014).

How much do the current processes that arise from the imprecision in defining and interpreting DQ cost pharmaceutical and device companies? Some publications encourage
policy changes (Ezekiel, 2003; Lorstad, 2004), and others estimate the potential savings in the neighborhood of $4-9 billion annually in the U.S. (Ezekiel & Fuchs, 2008; Funning, Grahnén, Eriksson, & Kettis-Lindblad, 2009; Getz et al., 2013; Tantsyura et al., 2015). Some cost savings are likely to come from leveraging innovative computerized data validation checks and a reduction in manual efforts, such as source data verification (SDV). In previous decades, when paper case report forms (CRFs) were used, the manual review component was essential in identifying and addressing DQ issues. But with the availability of modern technology, this step of the process is largely a wasted effort because computer-enabled algorithms are able to identify data issues much more efficiently.

Undoubtedly, an in-depth discussion of the definition of DQ is essential and timely. Study size needs to be a key component of any DQ discussion. Recent literature points out the inverse relationship between the study size and the impact of errors on the quality of the decisions based on the data. However, the magnitude of this impact has not been systematically evaluated. The intent of the proposed study is to fill in this knowledge gap. The Monte-Carlo method was used to generate the data for the analysis. Because DQ is a severely under-researched area, a thorough analysis of this important concept could lead to new economic and policy breakthroughs. More specifically, if DQ thresholds can be established, or a minimal impact of data errors under certain conditions uncovered, then these advancements might potentially simplify the data cleaning processes, free resources, and ultimately reduce the cost of clinical research operations. The primary focus of the current research is sufficient assurance of data quality in clinical trials for regulatory decision-making. Statistical, economic, regulatory, and policy aspects of the issue will be examined and discussed.
II. Background and Literature Review.

Basic Concepts and Terminology

A data error occurs when "a data point inaccurately represents a true value…" (GCDMP, v4, 2005, p. 77). This definition is intentionally broad and includes errors with root causes of misunderstanding, mistakes, mismanagement, negligence, and fraud. Similar to the GCDMP, the NIH Collaboratory white paper (Zozus et al., 2015) uses the term "error" to denote "any deviation from accuracy regardless of the cause." Not every inaccuracy or deviation from the plan or from a regulatory standard in a clinical trial constitutes a "data error." Noncompliance of procedures with regulations or incomplete adherence of practices to written documentation cannot be considered examples of data errors. Frequently, for the monitoring step of Quality Control (QC), clinical researchers "conveniently" choose to define a data error as a mismatch between the database and the source record (the document from which the data were originally captured). In the majority of cases – perhaps 99% or more – the source does represent the "true value," and that prompts many clinical researchers to mistakenly assume that the "source document" always represents the true value. One should keep in mind that this is not always the case, as can be clearly demonstrated in a situation where the recorded value is incompatible with life. This erroneous definition, and the conventional belief that it reinforces, are not only misleading but also bear costs for society.

From the statistical point of view, data errors are undesirable because they introduce variability[2] into the analysis, and this variability makes it more difficult to establish statistical
significance and to observe a "treatment effect," if one exists, because it reduces the standardized treatment effect size (the difference in means divided by the standard deviation).

[2] Generally speaking, variability is viewed from three different angles: science, statistics and experimental design. More specifically, science is concerned with understanding variability in nature, statistics is concerned with making decisions about nature in the presence of variability, and experimental design is concerned with reducing and controlling variability in ways which make statistical theory applicable to decisions about nature (Winer, 1971).

When superiority trials are affected by data errors, it becomes more difficult to demonstrate the superiority of one treatment over another. Without loss of generality, to describe the fundamental framework for hypothesis testing, it is assumed that there are two treatments and that the goal of the clinical study is to declare that an active drug is superior to a placebo. In clinical trial research, the null hypothesis generally expresses the hypothesis that the two treatments are equivalent (or equal), and the alternative hypothesis (sometimes called the motivating hypothesis) expresses that the two treatments are not equivalent (or equal). Assuming an Aristotelian, two-valued logic system, in the true, unknown state of nature one of the hypotheses is true and the other one is false. The goal of the clinical trial is to make a decision, based on statistical testing, that one hypothesis is true and the other is false. The 2-by-2 configuration of the possible outcomes has two correct decisions and two incorrect decisions. Table 1 displays the possible outcomes.

Table 1: Hypothesis Testing in the Presence of Errors

  True state of nature: treatments equivalent
    - Test declares no difference: correct decision
    - Test declares a difference: Type I error (false positive) – increased due to errors
  True state of nature: treatments not equivalent
    - Test declares no difference: Type II error (false negative) – increased due to errors
    - Test declares a difference: correct decision

In statistical terms, the presence of data errors increases variability for the study drug arm and the comparator, while increasing the proportion of false-positive results (Type I errors) and
false-negative results (Type II errors), thus reducing the power of the statistical test and the probability of the right conclusion (as shown in Table 1). Because the definitions of the Type I and Type II errors are reversed in non-inferiority trials relative to superiority trials, the treatment of interest in non-inferiority trials becomes artificially more similar to the comparator in the presence of data errors.

DQ experts state that "DQ is a matter of degree, not an absolute" (Fendt, 2004) and suggest reaching an agreement with regulators on acceptable DQ "cut-off" levels – similar to the commonly established alpha and beta levels, 0.05 and 0.20 respectively. This thesis offers a methodology to facilitate such a discussion. Each additional data error gradually increases the probability of a wrong study conclusion (relative to the conclusion derived from the error-free dataset), but the results of this effect are inconsistent. For one study, under one set of study parameters, a single data error could lead to a reduction in the chance of the correct study conclusion from 100% to 99%. For another study of a much smaller sample size, for example, the same error could lead to a reduction from 100% to 93.5%. For this reason DQ can be viewed as a continuous variable characterized by the probability of the correct study conclusion in the presence of errors. Obviously, such an approach to DQ is in direct contradiction with the common-sense belief that DQ is a dichotomous variable – "good"/"poor" quality. This alternative (continuous and probabilistic) view of DQ reduces the subjectivity of "good/poor" quality assessment, quantifies the effect of data errors on the study conclusions, and leads to establishing the probability cut-off level (X%) that distinguishes between acceptable and unacceptable levels of the correct/false-positive/false-negative study conclusions and satisfies regulators and all stakeholders.
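A small numeric sketch makes the variance-inflation argument concrete. It is not part of the dissertation's experiment: it assumes independent, zero-mean Gaussian data errors affecting a fraction of observations and uses a normal approximation to the power of a two-sample comparison; all parameter names and values are hypothetical.

```python
# Illustration: random data errors inflate the observed SD, which attenuates the
# standardized effect size (difference in means / SD) and reduces power.
import math
from scipy.stats import norm

def attenuated_power(n_per_arm, effect_size, error_rate, error_sd, alpha=0.05):
    # Observed variance = true variance (1.0) + contribution of the error noise.
    observed_sd = math.sqrt(1.0 + error_rate * error_sd**2)
    d_observed = effect_size / observed_sd          # attenuated standardized effect
    z_alpha = norm.ppf(1 - alpha / 2)
    z_effect = d_observed * math.sqrt(n_per_arm / 2)
    return norm.cdf(z_effect - z_alpha)             # approximate power

# Error-free data vs. 2% of values carrying errors with SD three times the natural spread.
for n in (50, 200, 800):
    clean = attenuated_power(n, effect_size=0.3, error_rate=0.00, error_sd=3.0)
    dirty = attenuated_power(n, effect_size=0.3, error_rate=0.02, error_sd=3.0)
    print(n, round(clean, 3), round(dirty, 3))
```

Under these assumptions the loss of power caused by the same contamination shrinks as the number of subjects per arm grows, which is consistent with the "sample size effect" examined in this thesis.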
Numerous sources and origins of data errors exist. There are errors in source documents, such as an ambiguous question resulting in unintended responses, data collection outside of the required time window, or the lack of inter-rater reliability (which occurs in cases where subjectivity is inherently present or more than one individual is assessing the data). Transcription errors occur in the process of extraction from a source document to the CRF or the electronic case report form (eCRF). Additional errors occur during data processing, such as keying errors and errors related to data integration issues. There are also database errors, which include the data being stored in the wrong place.

Multi-step data processing introduces even more data errors. Because the clinical research process can be complex, involving many possible process steps, each step at which the data are transcribed, transferred, or otherwise manipulated has an error rate associated with it. Each subsequent data processing step can create or correct errors, as demonstrated in Figure 1.

Figure 1. Sources of Errors in Data Processing (GCDMP, v4, 2005, p. 78)
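The point that each handling step can both introduce and correct errors can be illustrated with a toy calculation. The sketch below is not taken from the GCDMP or from this manuscript; the step names and per-step probabilities are hypothetical and serve only to show how an error rate evolves across a multi-step process.

```python
# Toy model: track the probability that a given field is erroneous after each
# processing step, where a step may introduce a new error or correct an existing one.
def error_rate_after_steps(steps, initial_rate=0.0):
    rate = initial_rate
    for name, p_introduce, p_correct in steps:
        # Errors that survive correction plus previously clean fields newly corrupted.
        rate = rate * (1 - p_correct) + (1 - rate) * p_introduce
        print(f"after {name:<22} error rate = {rate:.4f}")
    return rate

# Hypothetical per-step probabilities for a paper-based flow.
error_rate_after_steps([
    ("source recording",       0.010, 0.00),
    ("transcription to CRF",   0.005, 0.00),
    ("data entry",             0.010, 0.00),
    ("edit checks / cleaning", 0.001, 0.70),   # cleaning fixes ~70% of existing errors
])
```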
Data errors can be introduced even after a study is completed and the database is "locked." Data errors and loss of information often occur at the data integration step of data preparation, in cases of meta-analysis, or during the compilation of the integrated summary of efficacy (ISE)/integrated safety summary (ISS) for a regulatory submission. This could be a direct result of the lack of specific data standards. For example, if the code list for a variable has two choices in one study and five choices in the other, considerable data quality reduction would occur during the data integration step.

How data are collected plays an important role not only in what is collected but also in "with what quality." It is known that if adverse events are elicited via checklists, there is an over-reporting of minor events. If no checklist and no elicitation are used, it is known that adverse events could be under-reported. Similarly, it is known that important medical events get reported. The FDA guidance on safety reviews expresses views consistent with this known result. For example, a patient may not recall episodes of insomnia or headaches that were not particularly bothersome, but may recall them when probed. Important, bothersome events, however, tend to be remembered and reported as a result of open-ended questions. Checklists can also create a mind-set: if questions relate to the common ailments of headache, nausea, and vomiting, then signs and symptoms related to energy, sleep patterns, mood, etc., may be lost due to the focus on physical findings.

Variability in the sources of data errors is accompanied by substantial variation in data error detection techniques. Some errors, such as discrepancies between the source documents and the CRF, are easily detected at the SDV step of the data cleaning process. Other errors, such as misunderstandings, non-compliance, protocol violations, and fraud, are more difficult to identify. The most frequently missed data error types include data recorded under
the wrong subject number and data captured incorrectly from source records. Thus, data error detection in clinical research is not as trivial as it may appear on the surface. Computerized edit checks help identify illogical values (such as a visit 2 date that precedes the visit 1 date), missing values, or values outside the expected range much more efficiently than manual data review.

Variability in the sources of data errors and the general complexity of this topic have created substantial impediments to reaching a consensus as to what constitutes "quality" in clinical research. ISO 8000 (2012) focuses on data quality and defines "quality" as the "…degree to which a set of inherent characteristics fulfills requirements." Three distinct approaches to defining data quality are found in the literature and in practice.

"Error-free" or "100% accurate" is the first commonly used definition and the one that has dominated clinical research. According to this definition, the data are deemed to be of high quality only if they correctly represent the real-world construct to which they refer. There are many weaknesses to this approach: the lack of objective evidence supporting it and the prohibitively high cost of achieving "100% accuracy" are the two major ones. Additionally, the tedium associated with source-document verification and the general quality control procedures involved in trying to identify and remove all errors is actually a distraction from a focus on important issues. Humans are not very good at fending off tedium, and this reality results in errors going undetected in vast stretches of otherwise error-free data.

The "fit for use" definition, introduced by J.M. Juran (1986), is considered the gold standard in many industries. It states that the data are of high quality "if they are fit for their intended uses in operations, decision making and planning". This definition has been interpreted by many clinical researchers as "the degree to which a set of inherent characteristics of the data fulfills requirements for the data" (Zozus et al., 2015) or, simply, as "meeting protocol-specified
parameters." Ambiguity and implementation challenges are the main impediments to a wider acceptance of this definition. Unlike service or manufacturing environments, clinical trials vary dramatically and are consequently more difficult to standardize. These inherent limitations have resulted in the practice, by many clinical research companies over the past three decades, of using a third definition – namely, arbitrarily acceptable levels of variation per explicit protocol specification (GCDMP, 2005).

The development of objective data quality standards is more important today than ever before. The fundamental question of "how should adequate data quality be defined?" will require taking into consideration the study-specific scientific, statistical, economic, and technological data collection and cleaning context, as well as the DQ indicators.[3]

[3] Multiple terms are used in the literature to describe DQ indicators – "quality criteria," "attributes," and "dimensions."

As mentioned above, data cleaning and the elimination of data errors reduce variability and help detect the "treatment effect." On the other hand, the data cleaning process is not entirely without flaws or unintended consequences. Not only does it add considerable cost to the conduct of a clinical trial, it may potentially introduce bias and shift study conclusions. There are several possible scenarios of bias being introduced via data cleaning. Systematically focusing on extreme values, when as many errors are likely to exist in the expected range (as shown in Figure 2), is one such scenario. Selectively prompting to modify or add non-numeric data to make the (edited) data appear "correct," even when the cleaning is fully blinded and well-intended, is another example. Before selective cleaning, the data may be flawed, but the data errors are not systematically concentrated and thus introduce no bias. However, when the data cleaning is non-random for any reason, it leads to a reduction of variance and increases the Type I error rate and the risk of making incorrect inferences, i.e., finding statistically significant differences
due to chance alone, or failing to find differences in a non-inferiority comparison (declaring that a drug works when it does not work). Finally, new (processing) data errors are often introduced into the data during cleaning.

Figure 2. Statistical Impact of "Out-of-Range" Edit Checks

In the experiment described later in the manuscript, data cleaning is abridged to "out-of-range" checks that are symmetrical. This eliminates all out-of-range errors while introducing no bias. The data errors themselves are assumed to be distributed as Gaussian white noise.

Data cleaning is an important component of data management in regulated clinical development. It is typically handled by clinical data managers and involves many different scenarios. Impossible (e.g., incompatible with life) values are typically detected and corrected. Medical inconsistencies are usually resolved; for example, if the recorded diastolic blood pressure (DBP) appears to be higher than the systolic blood pressure (SBP), the values might be reversed (if confirmed by the investigator). Missing values are obtained or confirmed missing, and extreme values are confirmed or replaced with more acceptable values. And finally, unexpected data are removed, modified, or explained away. For example, concomitant medications, lab tests, and adverse events (AEs) reported with start dates later than 30 days post-treatment are generally
queried and removed. Subjects with AEs are probed for previously unreported medical history details, which are then added to the database. Sites modify subjective assessments when the direction of a scale is misinterpreted (e.g., one is selected instead of nine on a zero-to-ten scale). All data corrections require the investigator's endorsement and are subject to an audit trail; a minimal sketch of such checks, expressed as simple programmable rules, follows this passage.

Data cleaning is a time-consuming, monotonous, repetitive and uninspiring process. The mind-numbing nature of this work makes it difficult for the data management professionals in the trenches to maintain a broader perspective and refrain from over-analyzing trivial details. Data managers frequently ask the following question: if a site makes so many obvious transcription errors in the case of outliers, why would one think that fewer errors are made within the expected ranges? They use this as a justification to keep digging further, finding more and more errors, irrespective of the relevance or the significance of those findings. The desire to "fix" things that appear wrong is part of human nature, especially when one is prompted to look for inconsistencies. Each new "finding" and data correction makes data managers proud of their detective abilities, but it costs their employers and society billions of dollars. Does the correction of a respiratory rate from twenty-two to twenty-one make a substantial difference? More often than not, the answer is no. As new research demonstrates, over two-thirds of data corrections are clinically insignificant and do not add any value to study results (Mitchel, Kim, Choi, Park, Cappi, & Horn, 2011; TransCelerate, 2013; Scheetz et al., 2014).

Unlike a generation ago, many researchers and practitioners in the modern economic environment are facing resource constraints. Therefore, they legitimately ask these types of questions in an attempt to gain a better understanding of the true value of data cleaning, and to improve effectiveness and efficiency in clinical trial operations.
The process of data collection has markedly transformed over the past 30 years. Manual processes dominated the 1980s and 1990s, accompanied by heavy reliance on the verification steps of the process. As an example, John Cavalito (in his unpublished work at Burroughs Wellcome in approximately 1985) found that a good data entry operator could key 99 of 100 items without error. The conclusion was that the data entry operator error rate was 1%. With two data entry operators creating independent files, that would yield an error rate of 1/10,000 (0.01 multiplied by 0.01 yields 0.0001, or 1/10,000). Similarly, Mei-Mei Ma's dissertation (1986) evaluated the impact of each step in the paper process and discovered that having two people work on the same file produced a higher error rate than having the same person create two different files (on different days); the rule for independence was unexpected – it was more useful to separate the files than to use different people.

The introduction of electronic data capture (EDC) technology eliminated the need for paper CRFs. Additionally, EDC made obsolete some types of extra work that were associated with the traditional paper process.[4]

[4] Examples include: (a) "None" was checked for "Any AEs?" on the AE CRF, but AEs were listed, so the "None" is removed; (b) data recorded on page 7a that belonged on page 4 are moved; (c) sequence numbers are updated; (d) effort is spent cleaning data that will have little or no impact on conclusions; and (e) comments recorded on the margin have no place in the database.

More recently, the introduction of eSource technologies such as electronic medical records (EMRs) and direct data entry (DDE) has started the process of eliminating paper source records, eradicating transcription errors and, as many believe, further improving DQ. These events have also led to a growing reliance on computer-enabled data cleaning as opposed to manual processes such as SDV. While in the previous "paper" generation the SDV component was essential in identifying and addressing issues, modern technology has made this step of the process largely a redundant effort because computer-enabled algorithms
can identify over 90% of data issues much more efficiently (Bakobaki et al., 2012). Nevertheless, some research professionals have expressed the opinion that the new technologies such as EDC, EMR, and DDE, and their corresponding data cleaning processes, may result in lower DQ compared to the traditional paper process, thus justifying their resistance to change. One argument against the elimination of paper is that lay typists are not as accurate as trained data entry operators (Nahm, Pieper, & Cunningham, 2008). The second argument – that EDC and DDE use logical checks to "find & fix" additional "typing errors" by selectively challenging item values (e.g., focusing on values outside of expected ranges) at the time of entry – also has some merit. Clinical site personnel performing data entry with one of the new technologies are vulnerable to the suggestion that a value may not be "right" and may unintentionally reject a true value. In the final analysis, these skeptical arguments will not reverse the visible trend in the evolution of clinical research that is characterized by greater reliance on technology. The growing body of evidence confirms that, in spite of its shortcomings, DDE leads to higher overall data quality.

The DQ debate is not new. It attracts attention through discussions led by the Institute of Medicine (IOM),[5] the Food and Drug Administration (FDA), the DQRI,[6] and MIT TDQM,[7] but consensus in the debate over a definition of DQ has not been reached, nor have the practical problems standing in the way of implementing that definition been solved.

[5] "The Institute of Medicine serves as adviser to the nation to improve health. Established in 1970 under the charter of the National Academy of Sciences, the Institute of Medicine provides independent, objective, evidence-based advice to policymakers, health professionals, the private sector, and the public." (http://www.iom.edu/; accessed in November, 2007)

[6] The Data Quality Research Institute (DQRI) is a non-profit organization that existed in the early 2000s and provided an international scientific forum for academia, healthcare providers, industry, government, and other stakeholders to research and develop the science of quality as it applies to clinical research data. The Institute also had been assigned an FDA liaison, Steve Wilson. (http://www.dqri.org/; accessed in November, 2007)

[7] MIT Total Data Quality Management (TDQM) Program: "A joint effort among members of the TDQM Program, MIT Information Quality Program, CITM at UC Berkeley, government agencies such as the U.S. Navy, and industry partners." (See http://web.mit.edu/tdqm/ for more details.)

According to K.
Fendt, a former Director of DQRI, the following aspects of DQ are fundamentally important: (1) DQ is "a matter of degree; not an absolute," (2) "There is a cost for each increment," (3) there is a need to determine "what is acceptable," (4) "It is important to provide a measure of confidence in DQ," and (5) "Trust in the process" must be established. And the consequences of mistrust are: (1) "poor medical care based on non-valid data," (2) "poor enrollment in clinical trials," (3) "public distrust" (of scientists/regulators), and (4) lawsuits (Fendt, 2004).

Measuring Data Quality

The purpose of measuring data quality is to identify, quantify, and interpret data errors, so that quality checks can be added or deleted and the desired level of data quality can be maintained (GCDMP, v4, 2005, p. 80). The "error rate" is considered the gold standard metric for measuring DQ, and for easier comparison across studies it is usually expressed as the number of errors per 10,000 fields:

    Error Rate = (Number of Errors Found) / (Number of Fields Inspected)

For the purpose of this thesis, it is important to note that an increase in the denominator (i.e., the sample size) reduces the error rate and dilutes the impact of data errors.

"Measuring" DQ should not be confused with "assuring" DQ or "quality assurance" (QA), which focuses on the infrastructures and practices used to assure data quality. According to the International Organization for Standardization (ISO), QA is the part of quality management focused on providing confidence that the quality requirements will be met (ISO 9000:2000, 3.2.11).[8]

[8] ICH GCP interprets this requirement as "all those planned and systematic actions that are established to ensure that the trial is performed and the data are generated, documented (recorded), and reported in compliance with Good Clinical Practice (GCP) and the applicable regulatory requirement(s)." (ICH GCP 1.16). Thus, in order for ICH to be compatible with current ISO concepts, the word "ensure" should be replaced with "assure". Hence the aim of Clinical Quality Assurance should be to give assurance to management (and ultimately to the Regulatory Authorities) that the processes are reliable and that no major system failures are expected to occur that would expose patients to unnecessary risks, violate their legal or ethical rights, or result in unreliable data.

As an
example, a QA plan for registries "should address: 1) structured training tools for data abstractors; 2) use of data quality checks for ranges and logical consistency for key exposure and outcome variables and covariates; and 3) data review and verification procedures, including source data verification plans and validation statistics focused on the key exposure and outcome variables and covariates for which sites may be especially challenged. A risk-based approach to quality assurance is advisable, focused on variables of greatest importance" (PCORI, 2013).

Also, measuring DQ should not be confused with "quality control" (QC). According to ISO, QC is the part of quality management focused on fulfilling requirements (ISO 9000:2000, 3.2.10).[9]

[9] ICH GCP interprets these requirements as "the operational techniques and activities undertaken within the quality assurance systems to verify that the requirements for quality of the trial have been fulfilled." (ICH GCP 1.47) In order for ICH language to be compatible with current ISO concepts, the word "assurance" should be replaced with "management."

Auditing is a crucial component of Quality Assurance. An audit is defined as "a systematic, independent and documented process for obtaining audit evidence objectively to determine the extent to which audit criteria are fulfilled" (ISO 19011:2002, 3.1).[10] FDA audits are called "inspections." The scope of an audit may vary from a "quality system" audit to a "process" audit to a "product" audit. Similarly, audits of a clinical research database may vary from the assessment of a one-step process (e.g., data entry) to a multi-step (e.g., source-to-database) audit.

[10] ICH GCP interprets this requirement as "A systematic and independent examination of trial related activities and documents to determine whether the evaluated trial related activities were conducted, and the data were recorded, analyzed and accurately reported according to the protocol, the sponsor's Standard Operating Procedures (SOPs), Good Clinical Practice (GCP), and the applicable regulatory requirement(s)." (ICH GCP 1.6)

The popularity of one-step audits comes from their low cost. However, the results of such one-step audits are often over-interpreted. For example, if a data entry step audit produces an acceptably low error rate, it is sometimes erroneously concluded that the database is of "good quality." The reality is that a low error rate and high quality at one step of the data handling
process do not necessarily equate to a high quality/low error rate at the other steps of data handling.

Audits are popular in many industries, such as manufacturing or services, for their ability to provide tangible statistical evidence about the quality and reliability of products. However, the utility of audits in clinical research is hindered by the substantial costs associated with them, especially in the case of the most informative "source-to-database" audits. This accounts for the reduction in error rate reporting in the literature in the past seven to ten years, since the elimination of paper CRFs and the domination of EDC in the data collection market. At the same time, a surrogate measure of DQ, the rate of data correction, has been growing in popularity and is reported in all recent landmark DQ-related studies (Mitchel et al., 2011; Yong, 2013; TransCelerate, 2013; Scheetz et al., 2014). Data corrections are captured in all modern research databases (as required by the "audit-trail" stipulation of the FDA), making the rate of data correction an easily automatable and inexpensive benchmark for DQ research. The rate of data corrections was originally coined by statisticians as a measure of the "stability" of a variable, which is outside the scope of this discussion.

Theoretically speaking, error rates calculated at the end of a trial and the rates of data corrections are not substitutes for one another, but rather complementary to each other. Ideally, when all errors are captured via the data cleaning techniques, the error rate from a source-to-database audit should be zero and the rate of corrections (data changes) should be equivalent to the pre-data-cleaning error rate. In a less-than-ideal scenario, when only a small proportion of the total errors is eliminated via data cleaning, the rate of data correction is less informative, leaving the observer wondering how many errors remain in the database. The first (ideal) scenario is closer to the real world of pharmaceutical clinical trials, where extensive (and extremely
expensive) data cleaning efforts lead to supposedly "error-free" databases. The second (less-than-ideal) scenario is more typical of academic clinical research, where the amount of data cleaning is limited for financial and other reasons. Thus, the rate of data changes is primarily a measure of the reduction of the error rate (or DQ improvement) as a result of data cleaning, and a measure of the effectiveness of the data cleaning in a particular study. Low cost is the main benefit of data correction rates over error rates, while the obvious incongruence of data correction rates with the error rates calculated via audits is the main flaw of this fashionable metric.

DQ in Regulatory and Health Policy Decision-Making

The amount of clinical research data has been growing exponentially over the past half century. ClinicalTrials.gov currently lists 188,494 studies with locations in all 50 states and in 190 countries. BCC Research predicted a growth rate of 5.8%, roughly doubling the number of clinical trials every ten years (Fee, 2007).

Figure 3. The number of published trials, 1950 to 2007
This unprecedented growth, especially in the past decade, coupled with rising economic constraints, places enormous pressure on regulators such as the FDA and the EMA. Regulators ask and address a number of important questions in order to claim that the quality of decisions made by policy makers and clinicians meets an acceptable level. There are numerous examples of regulatory documents that clarify and provide guidance on many aspects of the quality system and DQ (ICH Q8, 2009; ICH Q9, 2006; ICH Q10, 2009; FDA, 2013; EMA, 2013).

Fisher and colleagues reference Kingma's finding that "even a suspicion of poor-quality data influences decision making." A lack of consumer confidence in data quality potentially leads to a lack of trust in a study's conclusions (Kingma, 1996). Thus, one of the most important questions policy-makers and regulators ask is to what extent the data limitations prohibit one from having confidence in the study conclusions. Some of these questions are: What methods were used to collect the data? Are the sources of data relevant and credible? Are the data reliable, valid and accurate? One hundred percent data accuracy is too costly, if ever achievable – would 99% accurate data be acceptable? Why or why not? Are there any other important "dimensions" of quality beyond "accuracy," "validity," and "reliability"? (Fisher, 2006; Fink, 2005).

The DQ discussion (often spearheaded by regulators and regulatory experts) has to date made progress on several fronts. There is a general consensus on the "hierarchy of errors" (i.e., some errors are more important than others) and on the impossibility of achieving a complete elimination of data errors. There will always be some errors that are not addressed by quality checks, as well as those that slip through the quality check process undetected (IOM, 1999). Therefore, the clinical research practitioners' primary goal should be to correct the errors
that will have an impact on study results (GCDMP, v4, 2005, p. 76). Quality checks performed as part of data processing, such as data validation or edit checks, should target the fields that (1) are critical to the analysis, (2) are where errors frequently occur, and (3) are where a high percentage of error resolution can reasonably be expected. "Because a clinical trial often generates millions of data points, ensuring 100% completeness, for example, is often not possible; however it is not necessary" (IOM, 1999) and "trial quality ultimately rests on having a well-articulated investigational plan with clearly defined objectives and associated outcome measures" (CTTI, 2012) are the key messages from the regulatory experts to the industry.

Furthermore, the growing importance of pragmatic clinical trials (PCTs),[11] the difficulty in estimating error rates using traditional "audits," and a deeper understanding of the multi-dimensional nature of DQ (discussed next) have led to the development, over the last couple of years, of novel approaches to DQ assessment.

[11] PCT is defined as "a prospective comparison of a community, clinical, or system-level intervention and a relevant comparator in participants who are similar to those affected by the condition(s) under study and in settings that are similar to those in which the condition is typically treated." (Saltz, 2014)

An NIH Collaboratory white paper, which was supported by a cooperative agreement (U54 AT007748) from the NIH Common Fund for the NIH Health Care Systems Research Collaboratory (Zozus et al., 2015), is the best example of such an innovative methodology, one that stresses the multi-dimensional nature of DQ assessment and the determination of "fitness for use of research data." The following statement by the authors summarizes the rationale and approach taken by this nationally recognized group of thought leaders:

Pragmatic clinical trials in healthcare settings rely upon data generated during routine patient care to support the identification of individual research subjects or cohorts as well as outcomes. Knowing whether data are accurate depends on some comparison, e.g., comparison to a source of "truth" or to an independent source of data. Estimating an error or discrepancy rate, of course, requires a representative sample for the comparison. Assessing variability in the error or discrepancy rates between multiple clinical research
sites likewise requires a sufficient sample from each site. In cases where the data used for the comparison are available electronically, the cost of data quality assessment is largely based on time required for programming and statistical analysis. However, when labor-intensive methods such as manual review of patient charts are used, the cost is considerably higher. The cost of rigorous data quality assessment may in some cases present a barrier to conducting PCTs. For this reason, we seek to highlight the need for more cost-effective methods for assessing data quality… Thus, the objective of this document is to provide guidance, based on the best available evidence and practice, for assessing data quality in PCTs conducted through the National Institutes of Health (NIH) Health Care Systems Research Collaboratory. (Zozus et al., 2015)

Multiple recommendations provided by this white paper (conditional upon the advances in data standards or "common data elements") lead in a new direction in the evolution of data quality assessment to support important public health policy decisions.

Error-Free-Data Myth and the Multi-Dimensional Nature of DQ

In clinical research practice, DQ is often confused with "100% accuracy." However, "accuracy" is only one of very many attributes of DQ. For example, Wang and Strong (1996) identified 16 dimensions and 156 attributes of DQ. Accuracy and precision, reliability, timeliness and currency, completeness, and consistency are among the most commonly cited dimensions of DQ in the literature (Wand & Wang, 1996; Kahn, Strong, & Wang, 2002; see Appendix 1 for more details). The Code of Federal Regulations (21 CFR Part 11, initially published in 1997, revised in 2003 and 2013) emphasized the importance of "accuracy, reliability, integrity, availability, and authenticity"[12] (FDA, 1997; FDA, 2003).

[12] The original version of the document listed different attributes – Attributable, Legible, Contemporaneous, Original, Accurate – often called the ALCOA standards for data quality and integrity; eSource and "direct data entry" are exponentially increasing their share and threaten to eliminate paper source documents, making the "Legible" dimension of DQ obsolete.

Additionally, some experts and regulators emphasize "trustworthiness" (Wilson, 2006), "reputation" and "believability" (Fendt, 2004) as key dimensions of DQ. More recently, Weiskopf and Weng (2013) identified five dimensions that are pertinent to electronic health record (EHR) data used for research:
completeness, correctness, concordance, plausibility and currency, while the NIH Collaboratory stressed completeness, accuracy and consistency as the key DQ dimensions determining fitness for use of research data, where completeness is defined as "presence of the necessary data", accuracy as "closeness of agreement between a data value and the true value," and consistency as "relevant uniformity in data across clinical investigation sites, facilities, departments, units within a facility, providers, or other assessors" (Zozus et al., 2015).

Even with its multi-dimensional nature well established, DQ is still often interpreted exclusively as data accuracy.[13] High data quality for many clinical research professionals simply means 100% accuracy, which is costly, unnecessary, and frequently unattainable. Funning et al. (2009), based on their survey of 97% (n=250) of the phase III trials performed in Sweden in 2005, concluded that significant resources are wasted in the name of higher data quality: "The bureaucracy that the Good Clinical Practice (GCP) system generates, due to industry over-interpretation of documentation requirements, clinical monitoring, data verification etc. is substantial. Approximately 50% of the total budget for a phase III study was reported to be GCP-related. 50% of the GCP-related cost was related to Source Data Verification (SDV). A vast majority (71%) of respondents did not support the notion that these GCP-related activities increase the scientific reliability of clinical trials." This confusion between DQ and "100% accuracy" contributes a substantial amount to the costs of clinical research and subsequent costs to the public.

[13] "Assessing data accuracy, primarily with regard to information loss and degradation, involves comparisons, either of individual values (as is commonly done in clinical trials and registries) or of aggregate or distributional statistics… In the absence of a source of truth, comprehensive accuracy assessment of multisite studies includes use of individual value, aggregate, and distributional measures." (Zozus et al., 2015)
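The three Collaboratory dimensions quoted above (completeness, accuracy, consistency) can each be expressed as a simple computable check. The toy sketch below is not part of the NIH Collaboratory guidance or of this dissertation; the field names, the audited reference values, and the site-uniformity measure are assumptions chosen only for demonstration.

```python
# Toy illustration of three DQ dimensions on a small record set:
# completeness = share of required fields present, accuracy = agreement with an
# audited reference sample, consistency = uniformity of a convention across sites.
records = [
    {"site": "A", "subject": 1, "sbp": 120, "dbp": 80,   "weight_unit": "kg"},
    {"site": "A", "subject": 2, "sbp": 135, "dbp": None, "weight_unit": "kg"},
    {"site": "B", "subject": 3, "sbp": 118, "dbp": 76,   "weight_unit": "lb"},
]
reference = {1: {"sbp": 120}, 2: {"sbp": 130}, 3: {"sbp": 118}}   # audited source values

required = ("sbp", "dbp", "weight_unit")
completeness = sum(r[f] is not None for r in records for f in required) / (
    len(records) * len(required))

checked = [(r["subject"], r["sbp"]) for r in records]
accuracy = sum(reference[s]["sbp"] == v for s, v in checked) / len(checked)

sites = {r["site"] for r in records}
nonstandard_sites = {r["site"] for r in records if r["weight_unit"] != "kg"}
consistency = 1 - len(nonstandard_sites) / len(sites)

print(f"completeness={completeness:.2f} accuracy={accuracy:.2f} consistency={consistency:.2f}")
```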
Evolution of the DQ Definition and Emergence of the Risk-based (RB) Approach to DQ

Prior to 1999, there was no commonly accepted formal definition of DQ. As a result, everyone interpreted the term "high-quality" data differently and, in most cases, these interpretations were very conservative. The overwhelming majority of researchers at that time believed (and many continue to believe) that all errors are equally bad. This misguided belief leads to costly consequences. As a result of this conservative interpretation of DQ, hundreds of thousands (if not millions) of man-hours have been spent attempting to ensure the accuracy of every minute detail. "High-quality" data was first formally defined in 1999 at the IOM/FDA workshop "…as data strong enough to support conclusions and interpretations equivalent to those derived from error-free data." This definition can be illustrated by a simple algorithm (Figure 4).

Figure 4. Data Quality Assessment Algorithm

This workshop also introduced important concepts, such as "Greater Emphasis on Building Quality into the Process," "Data Simplification," "Hierarchy of Errors," and "Targeted Strategies [to DQ]." Certain data points are more important to interpreting the outcome of a study than
others, and these should receive the greatest effort and focus. Implementation of this definition would require agreement on data standards.

Below is a summary of the main DQ concepts reflected in the literature between 1998 and 2010 (Table 2 is extracted from Tantsyura et al., 2015):

Table 2. Risk-based Approach to DQ: Principles and Implications

Fundamental principle: Data fit for use (IOM, 1999).
Practical and operational implications: The Institute of Medicine defines quality data as "data that support conclusions and interpretations equivalent to those derived from error-free data." (IOM, 1999) "…the arbitrarily set standards ranging from a 0 to 0.5% error rate for data processing may be unnecessary and masked by other, less quantified errors." (GCDMP, v4, 2005)

Fundamental principle: Hierarchy of errors (IOM, 1999; CTTI, 2009).
Practical and operational implications: "Because a clinical trial often generates millions of data points, ensuring 100 percent completeness, for example, is often not possible; however, it is also not necessary." (IOM, 1999) Different data points carry different weights for analysis and, thus, require different levels of scrutiny to ensure quality. "It is not practical, necessary, or efficient to design a quality check for every possible error, or to perform a 100% manual review of all data... There will always be errors that are not addressed by quality checks or reviews, and errors that slip through the quality check process undetected" (GCDMP, 2008)

Fundamental principle: Focus on critical variables (ICH E9, 1998; CTTI, 2009).
Practical and operational implications: A multi-tiered approach to monitoring (and data cleaning in general) is recommended. (Khosla, Verma, Kapur, & Khosla, 2000; GCDMP v4, 2005; Tantsyura et al., 2010)

Fundamental principle: Advantages of early error detection (ICH E9, 1998; CTTI, 2009).
Practical and operational implications: "The identification of priorities and potential risks should commence at a very early stage in the preparation of a trial, as part of the basic design process with the collaboration of expert functions…" (EMA, 2013)

A decade later, two FDA officials (Ball & Meeker-O'Connell, 2011) reiterated that "Clinical research is an inherently human activity, and hence subject to a higher degree of variation than in the manufacture of a product; consequently, the goal is not an error-free dataset..." Recently, the Clinical Trials Transformation Initiative (CTTI), a public-private partnership to identify and promote practices that will increase the quality and efficiency of
clinical trials, has introduced a more practical version of the definition, namely "the absence of errors that matter." Furthermore, CTTI (2012) stresses that "the likelihood of a successful, quality trial can be dramatically improved through prospective attention to preventing important errors that could undermine the ability to obtain meaningful information from a trial." Finally, in 2013, in its Risk-Based Monitoring guidance, the FDA restated its commitment as follows: "…there is a growing consensus that risk-based approaches to monitoring, focused on risks to the most critical data elements and processes is necessary to achieve study objectives..." (FDA, 2013) Similarly, and more generally, a risk-based approach to DQ requires focus on "the most critical data elements and processes."

A risk-based approach to DQ is very convincing and understandable as a theoretical concept. The difficulty comes when one attempts to implement it in practice and faces the substantial variability among clinical trials. The next generation of the DQ discussion should focus on segregating different types of clinical trials and standard end-points, and on identifying acceptable DQ thresholds and other commonalities in achieving DQ for each category.

Data Quality and Cost

If it is established that DQ is a "matter of degree, not an absolute," then the next question is where to draw the fine line between "good"/acceptable and "poor"/unacceptable quality. D. Hoyle (1998, p. 28) writes in his "ISO 9000 Quality Systems Development Handbook": "When a product or service satisfies our needs we are likely to say it is of good quality and likewise when we are dissatisfied we say the product or service is of poor quality. When the product or service exceeds our needs we will probably say it is of high quality and likewise if it falls well below our
expectations we say it is of low quality. These measures of quality are all subjective. What is good to one may be poor to another…"

Categorizing "good enough" and "poor" quality is a particularly crucial question for clinical drug developers, because the answer carries serious cost implications. Objectively defined acceptable levels of variation and errors have not yet been established. Despite significant efforts to eliminate inefficiencies in data collection and cleaning, substantial resources are still devoted to low-value variables that have minimal to no impact on the critical analyses. Focusing data cleaning efforts on "critical data" and establishing industry-wide DQ thresholds are two of the main areas of interest; once implemented, they would eliminate a major source of waste in the current clinical development system.

A new area of research called the "Cost of Quality" suggests viewing the costs associated with quality as a combination of two drivers. According to this model, the presence of additional errors carries negative consequences and cost (reflected in the monotonically decreasing "Cost of poor DQ" line in Figure 5). Thus, the first driver is a reflection of risk and the cost increase due to a higher error rate/lower DQ. The second driver is the cost of data cleaning. Because the elimination of data errors requires significant resources, it increases the costs, as manifested by the monotonically increasing "Cost of DQ Maintenance" line in Figure 5. It is important to note that this function is non-linear and asymptotic: the closer one comes to the "error-free" state (100% accuracy), the more expensive each increment of DQ becomes. The "overall cost" is the sum of both components, as depicted by the convex line in Figure 5. Consequently, there always exists an optimal (lowest cost) point below the point that provides 100% accuracy. The "DQ Cost Model" appears to be applicable and useful in pharmaceutical clinical development.
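The convexity argument can be made concrete with a toy calculation. The Python sketch below uses hypothetical cost functions chosen purely for illustration (they are not taken from Riain and Helfert, 2005, or from this dissertation); it sums a decreasing "cost of poor DQ" curve and an asymptotically increasing "cost of DQ maintenance" curve and shows that the minimum of the total cost falls strictly below 100% accuracy.

# Illustrative sketch of the DQ Cost Model (hypothetical, assumed cost functions;
# the published model is conceptual and does not supply these parameters).
# Total cost = cost of poor DQ + cost of DQ maintenance.
import numpy as np

q = np.linspace(0.50, 0.999, 500)        # data quality level (fraction of accurate fields)
cost_of_poor_dq = 100.0 * (1.0 - q)      # decreases as quality improves (assumed linear)
cost_of_maintenance = 5.0 / (1.0 - q)    # grows without bound as q approaches 100% (asymptotic)
total_cost = cost_of_poor_dq + cost_of_maintenance

q_opt = q[np.argmin(total_cost)]
print(f"Lowest total cost is reached at about {q_opt:.1%} accuracy, i.e. below 100%")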
Figure 5. Data Quality Cost Model (Riain & Helfert, 2005)

Pharmaceutical clinical trials are extremely expensive. On the surface, the clinical development process appears fairly straightforward: recruit patients, collect safety and efficacy data, and submit the data to the regulators for approval. In reality, only nine out of a hundred compounds entering human clinical trials ultimately get approved, and only two out of ten marketed drugs return revenues that match or exceed research and development (R&D) costs (Vernon, Golec, & DiMasi, 2009). "Strictly speaking, a product's fixed development costs are not relevant to how it is priced because they are sunk (already incurred and not recoverable) before the product reaches the market. But a company incurs R&D costs in expectation of a product's likely price, and on average, it must cover those fixed costs if it is to continue to develop new products" (CBO, 2006).
According to various estimates, "the industry's real (inflation-adjusted) spending on drug R&D has grown between threefold and sixfold over the past 25 years—and that rise has been closely matched by growth in drug sales." (CBO, 2006) The most frequently quoted cost estimate, by DiMasi, Hansen and Grabowski (2003), puts the cost of developing one approved drug at more than $800 million [14], up from $100 million in 1975 and $403 million in 2000 (in year-2000 dollars). Some experts believe the real cost, as measured by the average cost to develop one drug, to be closer to $4 billion (Miseta, 2013). However, this (high) estimate is criticized by some experts for including the cost of drug failures but not the R&D tax credit afforded to pharmaceutical companies. For some drug development companies the cost is even higher: according to Forbes, 14 companies spend in excess of $5 billion per new drug (Harper, 2013), as referenced in Table 3.

Table 3. R&D cost per drug (Liu, Constantinides, & Li, 2013)

[14] "A recent, widely circulated estimate put the average cost of developing an innovative new drug at more than $800 million, including expenditures on failed projects and the value of forgone alternative investments. Although that average cost suggests that new-drug discovery and development can be very expensive, it reflects the research strategies and drug-development choices that companies make on the basis of their expectations about future revenue. If companies expected to earn less from future drug sales, they would alter their research strategies to lower their average R&D spending per drug. Moreover, that estimate represents only NMEs developed by a sample of large pharmaceutical firms. Other types of drugs often cost much less to develop (although NMEs have been the source of most of the major therapeutic advances in pharmaceuticals)" (CBO, 2006).
In fact, the exponential growth of clinical trial costs impedes progress in medicine and, ultimately, may put the nation's public health at risk. The CBO (2006) reported that the federal government spent more than $25 billion on health-related R&D in 2005. Fee (2007), based on the CMSInfo analysis, reported that national spending on clinical trials (including new indications and marketing studies) in the United States was nearly $24 billion in 2005. The research institute expected this number to rise to $25.6 billion in 2006 and $32.1 billion in 2011, growing at an average rate of 4.6% per year. The reality exceeded the initial projections: it is believed that the nation spends over $50 billion per year on pharmaceutical clinical trials (CBO, 2006; Kaitin, 2008). Research companies need to find a way to reduce these costs if the industry is to sustain growth and fund all necessary clinical trials.

Some authors question the effectiveness and efficiency of the processes employed by the industry. Ezekiel (2003) pointed out more than a decade ago that a large proportion of clinical trial costs has been devoted to non-treatment trial activities. More recently, the Tufts Center for the Study of Drug Development conducted an extensive study among a working group of 15 pharmaceutical companies in which a total of 25,103 individual protocol procedures were evaluated and classified using clinical study reports and analysis plans. This study uncovered significant waste in clinical trial conduct across the industry. More specifically, the results demonstrate that

…the typical later-stage protocol had an average of 7 objectives and 13 end points of which 53.8% are supplementary. One (24.7%) of every 4 procedures performed per phase-III protocol and 17.7% of all phase-II procedures per protocol were classified as "Noncore" in that they supported supplemental secondary, tertiary, and exploratory end points. For phase-III protocols, 23.6% of all procedures supported regulatory compliance requirements and 15.9% supported those for phase-II protocols. The study also found that on average, $1.7 million (18.5% of the total) is spent in direct costs to administer Noncore procedures per phase-III protocol and $0.3 million (13.1% of the total) in direct costs are spent on Noncore procedures for each phase-II protocol. Based on the results of
this study, the total direct cost to perform Noncore procedures for all active annual phase-II and phase-III protocols is conservatively estimated at $3.7 billion annually. (Getz et al., 2013)

A major portion of the clinical trial operations cost is devoted to "data cleaning" and other data handling activities that are intended to increase the likelihood of drug approval. It appears that the pharmaceutical industry (and society in general) continues to pay the price of not using a clear-cut definition of DQ, in the hope that 100% accuracy will make approval less painful. The title of Lörstad's publication (2004), "Data Quality of the Clinical Trial Process – Costly Regulatory Compliance at the Expense of Scientific Proficiency," summarizes the concerns of the scientific community as they pertain to the unintelligent utilization of resources. A vital question, namely what proportion of these resources could be saved and how, has been posed frequently in the scientific literature over the past decade. Some authors estimate that a significant portion of the funds allocated for clinical research may be wasted due to issues related to data quality (DQ) (Ezekiel & Fuchs, 2008; Funning, 2009; Tantsyura et al., 2015). Many recent publications focus on reducing inefficient manual data cleaning techniques, such as SDV. Typically, such over-utilization of resources is a direct result of a conservative interpretation of the regulatory requirements for quality. Lörstad (2004) calls this belief in the need for perfection of clinical data a "carefully nursed myth" and "nothing but embarrassing to its scientifically trained promoters."

The fundamental uniqueness of QA in clinical development relative to other industries (manufacturing being the most extreme example) is the practical difficulty of using the classic quality assurance approach, in which a sample of "gadgets" is inspected, the probability of failure is estimated, and conclusions about quality are drawn. Such an approach to QA is not only impractical in pharmaceutical clinical trials because of the prohibitively high cost, but may
also be unnecessary. To further complicate the situation, efficacy end-points and data collection instruments and mechanisms are not yet standardized across the industry. This makes almost every trial (and the DQ context for that trial) unique, and it therefore requires intelligent variation in the approach to DQ from one trial to another in order to eliminate waste. The industry does not appear to be fully prepared for such a nuanced, trial-specific approach to quality. Very often, from the perspective of the senior managers who allocate data cleaning resources, it is much safer, easier, and surprise-free to stick to uniformly conservative interpretations of the regulatory requirements than to allow a study team to determine the quality acceptance criteria on a case-by-case basis. Study teams are not eager to change either, because of insufficient training and a lack of motivation. Generalization and application of DQ standards from one trial to another lead to imprecision in defining study-specific levels of "good" and "poor" quality and, subsequently, to overspending in the name of high DQ. Thus, the definition of DQ is at the heart of the inefficient utilization of resources. An anticipated dramatic reduction in reliance on manual SDV and its replacement with statistically powered computerized algorithms will lead to DQ improvements while reducing cost.

One can consider on-site monitoring, the most expensive and most widely used element of the process to assure DQ, as a vivid example of such inefficiency. The FDA has made it clear that extensive on-site monitoring is no longer essential. In April 2011, the FDA withdrew a 1988 guidance on clinical monitoring (FDA, 1988) that emphasized on-site monitoring visits by sponsor or CRO personnel, "because the 1988 guidance no longer reflected the current FDA opinion" (Lindblad et al., 2014). The clinical monitoring guidance (FDA, 2013) encourages sponsors to use a variety of approaches to meet their trial oversight responsibilities and to optimize the utilization of their resources.
In my recent work (Tantsyura et al., 2015), I estimated that over nine billion dollars could be saved annually by the pharmaceutical industry if new standards for the DQ process in clinical trial monitoring were established and implemented by clinical trial operations across the industry. Tantsyura et al. (2010) and Nielsen, Hyder, and Deng (2013) compared the cost implications of several SDV models and concluded that a "mixed approach" (characterized by a minimal amount of SDV relative to the other alternatives) appears to be the most efficient (Nielsen et al., 2013). In addition to site monitoring, new DQ standards would affect all other functions, including data management, statistical programming, regulatory, quality, and pharmacovigilance departments, resulting in an even greater total potential savings.

Magnitude of Data Errors in Clinical Research

Error rates have long been an important consideration in the analysis of clinical trials. As discussed earlier, each data processing step in clinical trials may potentially introduce errors and can be characterized by a corresponding "error rate." For example, the transcription step (from the "source document" to the CRF) is associated with a transcription error rate. Similarly, each data entry step is associated with a data entry error rate. Consequently, the overall (or "source-to-database") error rate is the sum of the error rates introduced by each data processing step. The historical contextual changes (when EDC eliminated the need for paper CRFs and thus eliminated the errors associated with data entry and, more recently, when eSource technologies such as EMR and DDE began eliminating the paper source, thus eradicating transcription errors and further improving DQ) are important for understanding the error rates reported in the literature. These evolutionary changes also affected the approaches to QA audits. "Historically, only partial assessments of data quality have been performed in clinical
trials, for which the most common method measuring database error rates has been to compare the case report forms (CRF) to database entries and count discrepancies" (Nahm et al., 2008). In the contemporary data collection environment, even limited-in-scope, single-step audits are rarely used. While actual error rates vary moderately from study to study due to multiple contextual factors and are not assessed habitually, some reports on error rates can be found in the literature. The average source-to-database error rate in electronic data capture (EDC), 14.3 errors per 10,000 fields or 0.143%, is significantly lower than the average published error rate (976 errors per 10,000, or 9.76%) calculated from a literature review and analysis of 42 articles published between 1981 and 2005, a majority of which concerned paper registries (Nahm, 2008). Mitchel, Kim, Choi, Park, Schloss Markowitz and Cappi (2010) analyzed the proportion of data corrections using a smaller sample from a paper CRF study and also found a very small proportion (approximately 6%) of CRFs modified. In the most recent study by Mitchel and colleagues, which used an innovative DDE system, the reported rate of error corrections is 1.44% (Mitchel, Gittleman, Park et al., 2014). Medidata (Yong, 2013) also analyzed the magnitude of data modifications across a sample of more than 10,000 clinical trials and millions of data points. Many were surprised by how small the proportion of data points modified after the original entry turned out to be: just under 3%. Grieve (2012) estimated data modification rates based on a sample of 1,234 individual patient visits from 24 studies. His estimates were consistent with the rate reported by Medidata: "The pooled estimate of the error rate across all projects was approximately 3% with an associated 95% credible interval ranging from 2% to 4%." [Note: First, this imprecise use of the term "error rate" is observed in several publications and is likely a reflection of a modern trend. Second, the methodologies in the Mitchel et al. (2010) and
Medidata (Yong, 2013) studies were different (data-point-level vs. CRF-level error correction). As a result of these methodological differences, the numerator, the denominator and, ultimately, the reported rates were categorically different.] The most recent retrospective multi-study data analysis by TransCelerate (Scheetz et al., 2014) revealed that "only 3.7% of eCRF data are corrected following initial entry by site personnel." The difference between the reported rates of 3% (Yong, 2013) and 3.7% (Scheetz et al., 2014) is primarily attributed to the fact that "the latter estimates included cases of initially missing data that were discovered and added later, usually as a result of monitoring activities by CRA, which were not included as a separate category in the previous analysis" (Tantsyura et al., 2015). Overall, it has been established that the EDC environment with built-in real-time edit checks reduces the error rate (and thus improves DQ) by 40-95% (Bakobaki et al., 2012).

Now that EDC is the dominant data acquisition tool in pharmaceutical clinical development, data are collected not only much faster but also with higher quality (lower error rates), and the first regulatory approval using DDE technology is on the horizon, several important questions arise. How much more data cleaning is necessary? What additional data cleaning methods should or should not be used? Is the investment in (expensive but still popular) human source data verification effective and efficient, and does it produce a return commensurate with its expense? Which processes truly improve the probability of a correct study conclusion and drug approval, while reducing resource utilization at the same time?
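As a back-of-the-envelope illustration of the additive source-to-database error model described above, the short Python sketch below composes an overall error rate from per-step rates and expresses it per 10,000 fields. The per-step rates are hypothetical values chosen for illustration; they are not the rates reported by Nahm (2008) or any other study cited here.

# Hypothetical illustration of the additive "source-to-database" error model:
# the overall rate is approximately the sum of the rates each processing step adds.
transcription_rate = 0.0030   # source document -> CRF (assumed value)
data_entry_rate = 0.0008      # CRF -> database (assumed value)

overall_rate = transcription_rate + data_entry_rate
print(f"Paper-based process: {overall_rate:.2%} "
      f"({overall_rate * 10_000:.0f} errors per 10,000 fields)")

# Removing a step (as EDC and eSource/DDE do) removes its contribution entirely:
edc_only_rate = data_entry_rate
print(f"EDC-style process:   {edc_only_rate:.2%} "
      f"({edc_only_rate * 10_000:.0f} errors per 10,000 fields)")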
Source Data Verification, Its Minimal Impact on DQ, and the Emergence of Central/Remote Review of Data

Many authors view SDV as an example of a labor-intensive and costly data-correction method and present evidence on the effectiveness of SDV as it pertains to query generation and data correction rates. All reviewed studies consistently demonstrated a minimal effect of SDV-related data corrections on overall DQ. More specifically, based on a sample of studies, TransCelerate (2013) calculated that queries generated by SDV averaged 7.8% of the total number of queries generated. The average percentage of SDV queries generated on "critical" data exclusively was 2.4%. A study by the Cancer Research UK Liverpool Cancer Trials Unit assessed the value of SDV for oncology studies and found it to be minimal (Smith et al., 2012). In this study, data discrepancies and comparative treatment effects obtained following 100% SDV were compared to those based on data without SDV. In the sample of 533 subjects, baseline data discrepancies identified via SDV varied from 0.6% (Gender), 0.8% (Eligibility Criteria), 1.3% (Ethnicity), and 2.3% (Date of Birth) to 3.0% (WHO PS), 3.2% (Disease Stage) and 9.9% (Date of Diagnosis). All discrepancies were equally distributed across treatment groups and across sites, and no systematic patterns were identified. The authors concluded that "in this empirical comparison, SDV was expensive and identified random errors that made little impact on results and clinical conclusions of the trial. Central monitoring using an external data source was a more efficient approach for the primary outcome of overall survival. For the subjective outcome objective response, an independent blinded review committee and tracking system to monitor missing scan data could be more efficient than SDV." Similarly to Mitchel et al. (2011), Smith and colleagues suggested (as an alternative to SDV) "to safeguard against the effect of random errors might be to inflate
the target sample size…" This recommendation will be further explored in the "Practical Implications" section of the discussion.

Mitchel, Kim, Hamrell et al. (2014) analyzed the impact of SDV and of queries issued by CRAs in a study with 180 subjects. In this study, a total of 5,581 paper source records were reviewed at the site and compared with the clinical trial database. This experiment showed that only 3.9% of forms were queried by CRAs, and only 1.4% of forms had database changes as a result of queries generated by CRAs (a 37% query effectiveness rate). Also, the "error rate" associated with SDV alone was 0.86%. The review by Bakobaki and colleagues "determined that centralized [as opposed to manual/on-site] monitoring activities could have identified more than 90% of the findings identified during on-site monitoring visits" [Bakobaki et al., 2012, as quoted by the FDA RBM guidance, 2013]. This study leads to the conclusion that the manual data cleaning step is not only too expensive, but also often unnecessary when a rigorous computerized data validation process is employed. Lindblad et al. (2013) attempted to "determine whether a central review by statisticians using data submitted to the FDA… can identify problem sites and trials that failed FDA site inspections." The authors concluded that "systematic central monitoring of clinical trial data can identify problems at the same trials and sites identified during FDA site inspections."

In summary, SDV is the most expensive, yet widely utilized, manual step in the data cleaning process. However, multiple publications emphasize that while the SDV component was essential for identifying and addressing issues under the old 100% SDV paradigm, with modern technology this step of the process is largely wasted effort, because computer-enabled algorithms identify data issues much more efficiently.
Effect of Data Errors on Study Results

A recent study by Mitchel, Kim, Choi et al. (2011) examined the impact of data errors and error correction on study results by analyzing study data before and after data cleaning in a study with 492 randomized subjects in a real pharmaceutical clinical development setting. In this case, data cleaning did not change the study conclusions. This study also raised a legitimate question: does EDC, with its built-in on-line and off-line edit checks, produce "good enough" data such that no further cleaning is necessary? More specifically, Mitchel and colleagues observed the following three phenomena: (a) nearly identical means before and after data cleaning, (b) a slight reduction in SD (by approximately 3%), and, most importantly, (c) an inverse relationship between the sample size and the "benefit" of data cleaning. [Further study of this last phenomenon is the main focus of this dissertation thesis.] Thus, regardless of the nature of data errors [15], in the presence of built-in edit checks, some evidence demonstrates a very limited impact of data errors and data cleaning on study results.

[15] With the exception of fraud, which is "rare and isolated in scope" (Helfgott, 2014).

Study Size Effect and Rationale for the Study

What are the factors that influence the validity, reliability and integrity of study conclusions (i.e., DQ)? There are many. For example, the FDA emphasizes "trial-specific factors (e.g., design, complexity, size, and type of study outcome measures)" (FDA, 1998). Recent literature points out the inverse relationship between the study size and the impact of errors on study conclusions (Mitchel et al., 2011; Tantsyura et al., 2015), as well as the diminishing (with study size) return on investment (ROI) of DQ-related activities (Tantsyura et al., 2010; Tantsyura
et al., 2015). Mitchel, Kim, Choi and colleagues (2011) point out that "the impact on detectable differences [between the "original" and the "clean" data]…, is a direct function of the sample size." However, the impact of study size on DQ has not been examined systematically in the literature to date. The proposed study is intended to fill this knowledge gap. Monte-Carlo simulations will be used as the mechanism to generate data for analysis, as has been suggested in the literature (Pipino & Kopcso, 2004). Statistical, regulatory, economic and policy aspects of the issue will be examined and discussed. The specific research questions this study focuses on are: (1) "What is the statistical and economic impact of the study size on study conclusions (i.e., data quality)?" and (2) "Are there any opportunities for policy changes to improve the effectiveness and efficiency of the decision-making process?"
III. Analytical Section/Methods of Analysis

Objectives and End-Points

Primary Objective: Estimate the statistical impact of the study size on study conclusions in the presence of data errors.
Primary End-Points:
- Probability of the correct decision (i.e., a match between the t-test on "error-free" data and the t-test on "error-induced" data): Pcorrect;
- Probability of a false-negative decision: Pfalse-neg;
- Probability of a false-positive decision: Pfalse-pos.

Secondary Objectives and Endpoints:
- Estimate the data cleaning "cut-off point" for studies with different sample sizes:
  - Minimal sample size (n95%) to achieve a 95% probability of a correct decision without data cleaning (Pcorrect ≥ 95%);
  - Minimal sample size (n98%) to achieve a 98% probability of a correct decision without data cleaning (Pcorrect ≥ 98%);
  - Minimal sample size (n99%) to achieve a 99% probability of a correct decision without data cleaning (Pcorrect ≥ 99%).
- Estimate the economic impact of the reduction in data cleaning activities associated with the introduction of a "data cleaning cut-off" policy: estimated cost savings (% cost reduction) associated with reduced data cleaning.
- Policy recommendations: generate a list of recommendations for (potential) process modifications and regulatory policy changes to improve the effectiveness and efficiency of the DQ-related decision-making process.

Critical Concepts and Definitions:
- The "accuracy" dimension of DQ is under investigation; other dimensions are not considered in this study.
- Conceptual definition of DQ: "high-quality" data is formally defined "…as data strong enough to support conclusions and interpretations equivalent to those derived from error-free data." (IOM, 1999)
- Operational definition and measure of DQ: the probability of a correct study conclusion relative to "error-free" data. This definition is logically followed by the definitions of (a) an "increase in DQ," measured by an increase in the probability of a correct study conclusion, and (b) a "reduction in DQ," measured by a decrease in the probability of a correct study conclusion.
Synopsis: Determine the impact of data errors on DQ and study conclusions for 147 hypothetical scenarios while varying (1) the number of subjects per arm (n), (2) the effect size, i.e., the difference between the means of the active arm and the comparator (delta), and (3) the number of errors per arm (e). Each trial scenario is repeated 200 times; thus, the results and probabilities calculated in this experiment are based on 147 x 200 = 29,400 trials. The input and output variables used in this study are listed below.

Input variables:
- n: number of observations (pseudo-subjects) per arm
Co-variates:
- e: number of induced errors per arm
- d (delta): effect size, the difference between the means of the hypothetical "Pbo" arm and 7 different "active" arms
Primary outcome variable of interest:
- Pcorrect: probability of a correct decision (i.e., a match between the t-test result from "error-free" data and the t-test from "error-induced" datasets)
Secondary outcome variables of interest:
- Pfalse-neg: probability of a false-negative decision
- Pfalse-pos: probability of a false-positive decision

Main Assumptions:
- All 147 scenarios are superiority trials.
- Only one variable of analysis is considered for each simulated scenario.
- The distributions of values in the "active" and "Pbo" arms are assumed normal. Also:
  - Variability for the Pbo arm and the seven "active" arms is assumed identical and fixed (SD=1) for all 147 scenarios.
- Errors are distributed as Gaussian "white noise."
- Out-of-range edit checks (EC) are assumed to be 100% effective (i.e., they catch 100% of errors) relative to the pre-specified range (4 SD). Overall, it is presumed that an adequate process exists for detecting important issues.
- The number of errors in the active and Pbo arms is assumed equal in all scenarios (0 vs. 0, 1 vs. 1, 2 vs. 2, 5 vs. 5).
- The number of errors (e) and the number of observations per arm (n) are considered independent variables.

Design of the Experiment:
The idea and design of this experiment come directly from the IOM (1999) definition of high-quality data, described "as data strong enough to support conclusions and interpretations equivalent to those derived from error-free data." In a hypothetical trial, this definition implies that the data are of "good enough" quality (no matter how many errors are in the datasets) as long as the statistical significance of the test remains on the same side of the significance threshold (typically 0.05). Furthermore, this definition suggests a formal comparison between an error-free data set and the same data set with induced data errors (i.e., several data points removed and replaced with "erroneous" values), which is implemented in this study. A minimal code sketch of one such comparison follows Figure 6.

Figure 6. High-level Design of the Experiment
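To make the design concrete, the sketch below implements one such error-free vs. error-induced comparison in Python (NumPy/SciPy), following the stated assumptions: normal arms with SD = 1, Gaussian "white noise" errors, and a 4 SD edit-check window. The function name, the noise magnitude, and the exact error-induction mechanics are illustrative assumptions of mine; the dissertation's actual simulation code is not reproduced here.

# Minimal sketch of one simulated trial under the stated design (illustrative;
# the dissertation's actual implementation may differ). Errors are induced as
# Gaussian "white noise" added to randomly chosen values; induced values that
# fall beyond 4 SD are assumed to be caught by the edit checks and are therefore
# reverted to the original (error-free) value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def simulate_once(n, delta, e, sd=1.0, noise_sd=1.0, alpha=0.05):
    """Return (significant_on_error_free, significant_on_error_induced) for one trial."""
    pbo = rng.normal(0.0, sd, n)            # placebo arm
    act = rng.normal(delta, sd, n)          # active arm, true effect = delta

    _, p_clean = stats.ttest_ind(act, pbo)  # t-test on error-free data

    # Induce e errors per arm: perturb randomly chosen values with Gaussian noise.
    act_err, pbo_err = act.copy(), pbo.copy()
    for arm, arm_err, mean in ((act, act_err, delta), (pbo, pbo_err, 0.0)):
        idx = rng.choice(n, size=e, replace=False)
        noisy = arm[idx] + rng.normal(0.0, noise_sd, e)
        within_range = np.abs(noisy - mean) <= 4 * sd   # edit-check window: 4 SD
        arm_err[idx] = np.where(within_range, noisy, arm[idx])

    _, p_err = stats.ttest_ind(act_err, pbo_err)        # t-test on error-induced data
    return p_clean < alpha, p_err < alpha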
As a result of the experiment, the probabilities of "correct" (as well as "false-negative" and "false-positive") decisions are calculated for each of the 147 scenarios from 200 iterations, producing the equivalent of 29,400 hypothetical trials. [Note: The "correct" study conclusion, for the purpose of this experiment, is defined as correct hypothesis acceptance or rejection, i.e., when the "statistical significance" dictated by the p-values in the erroneous and error-free datasets is the same (denoted by code "0" in the table below). More specifically, this probability is calculated as the proportion of "matches" between the p-values calculated from the "error-free" and the error-induced data sets, with a denominator of 200, which reflects the number of Monte-Carlo iterations for each scenario.] Similarly, the probabilities of incorrect ("false-negative" and "false-positive") decisions are calculated as well (codes "-1" and "+1", respectively). Table 4 shows the event coding for "hits" and "misses."

Table 4. Coding for "Hits" and "Misses"
(Rows: erroneous probability, Pd_er; columns: true probability, P(d), where d = 0, 0.05, 0.1, 0.2, 0.5, 1, 2)

                     | P(d) < 0.05          | P(d) ≥ 0.05
Pd_er < 0.05         | 0 (correct)          | +1 (false-positive)
Pd_er ≥ 0.05         | -1 (false-negative)  | 0 (correct)

Legend:
- Correct decision (i.e., the statistical significance is not changed by the induced error(s)): code "0"
- Incorrect decision (i.e., the statistical significance is changed by the induced error(s)):
  - False-positive decision: code "+1"
  - False-negative decision: code "-1"
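Building on the simulate_once() sketch above, the snippet below illustrates how the Table 4 coding translates into the probability estimates for a single scenario over 200 Monte-Carlo iterations. The parameter values in the example call are illustrative only and are not taken from the dissertation's actual scenario grid.

# Sketch of the Table 4 hit/miss coding and the probability estimates for one of
# the 147 scenarios (200 Monte-Carlo iterations); reuses simulate_once() from the
# previous sketch.
def run_scenario(n, delta, e, iterations=200):
    codes = []
    for _ in range(iterations):
        sig_clean, sig_err = simulate_once(n, delta, e)
        if sig_clean == sig_err:
            codes.append(0)        # correct decision: same side of the threshold
        elif sig_err and not sig_clean:
            codes.append(+1)       # false positive: errors created significance
        else:
            codes.append(-1)       # false negative: errors removed significance
    p_correct = codes.count(0) / iterations
    p_false_neg = codes.count(-1) / iterations
    p_false_pos = codes.count(+1) / iterations
    return p_correct, p_false_neg, p_false_pos

# Example call with hypothetical inputs: 50 subjects per arm, delta = 0.5, 2 errors per arm.
print(run_scenario(n=50, delta=0.5, e=2))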