2. We must all accept that science is data and
that data are science, and thus provide
for, and justify the need for the support
of, much improved data curation.
Brooks Hanson, Andrew Sugden, Bruce Alberts (Science editorial, February 11th 2011)
3. Data Munging?
• Manipulating raw data to achieve a final
form
• Parsing or filtering data, or the many steps
required for data recognition.
• Cleaning the raw data using algorithms
(e.g. sorting) or parsing the data into
predefined data structures.
4. Clinical Data Munging?
• Following clinical research ethics to
manipulate clinical data to achieve an
acceptable form
– Respect of Persons (Autonomy)
– Data Security and Storage
– Data Integrity / Data Quality
– Privacy and Confidentiality
5. Why Clinical Data Munging?
• Analysts devote up to 85% of their total time to
data cleaning and preparation.
• Health science is driven more by data than by
computation
• Identify missing data
6. Why data munging? Cont.
• Extreme Scores - Data value falling
outside the expected range
• Identify erroneous dates
• Confounders
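The checks listed on the last two slides (missing data, extreme scores, erroneous dates) can be sketched as a small screening pass. This is only an illustration: the records, the field names (`id`, `age`, `admitted`, `discharged`) and the expected age range of 0 to 120 are assumptions made for the example, not values from the slides.

```python
from datetime import date

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "admitted": date(2011, 3, 1), "discharged": date(2011, 3, 5)},
    {"id": 2, "age": None, "admitted": date(2011, 3, 2), "discharged": date(2011, 3, 4)},
    {"id": 3, "age": 212, "admitted": date(2011, 3, 3), "discharged": date(2011, 3, 9)},
    {"id": 4, "age": 58, "admitted": date(2011, 3, 10), "discharged": date(2011, 3, 7)},
]

AGE_RANGE = (0, 120)  # assumed expected range, adjust per study protocol


def screen(record):
    """Return a list of problem labels found in one record."""
    problems = []
    age = record["age"]
    if age is None:
        problems.append("missing age")
    elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        problems.append("age out of range")
    # A discharge earlier than the admission is an erroneous date pair.
    if record["discharged"] < record["admitted"]:
        problems.append("discharge before admission")
    return problems


# Collect only the records that have at least one problem.
flagged = {}
for record in records:
    problems = screen(record)
    if problems:
        flagged[record["id"]] = problems

for record_id, problems in flagged.items():
    print(record_id, problems)
```

The pass only flags records. What to do with a flagged value (correct, delete, or leave unchanged) is a separate decision.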
7. Phases in Clinical Data Munging
• Screening Phase:
– lack or excess of data;
– inconsistencies;
– strange patterns in distributions;
– unexpected analysis results and other types of inferences and abstractions
8. Phases in Clinical Data Munging
• Diagnostic Phase: The purpose is to clarify the true nature of the worrisome data points, patterns, and statistics.
– Documentation should start at this point.
• Treatment Phase: What to do with problematic observations. The options are limited to correcting, deleting, or leaving unchanged.
10. Data screening?
• Understand the clinical data and the
different clinical data variables
• Categorise the data into groups/cores
• Determine the unique identifier
• Check data normality using frequency
distributions, skewness and kurtosis,
summary statistics and cross-tabulations
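As a rough sketch of the normality checks named above, the moment-based skewness and excess kurtosis can be computed with the Python standard library, and a cross-tabulation can be built with a counter. The function name, the blood-pressure readings and the sex/outcome pairs are hypothetical. Statistical packages may apply small-sample corrections that this sketch does not.

```python
from collections import Counter
from statistics import mean


def skewness_and_excess_kurtosis(values):
    """Population (moment-based) skewness and excess kurtosis.

    Both are close to 0 for normally distributed data.
    """
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Hypothetical systolic blood pressure readings with one extreme value.
systolic = [110, 118, 120, 122, 125, 130, 190]
skew, kurt = skewness_and_excess_kurtosis(systolic)
print(f"mean={mean(systolic):.1f} skewness={skew:.2f} excess kurtosis={kurt:.2f}")

# Cross-tabulation of two categorical variables (hypothetical sex/outcome pairs).
pairs = [("F", "recovered"), ("F", "recovered"), ("M", "recovered"), ("M", "died")]
crosstab = Counter(pairs)
for (sex, outcome), count in sorted(crosstab.items()):
    print(sex, outcome, count)
```

A clearly positive skewness, as the single extreme reading produces here, is a prompt to look at the distribution more closely. It does not by itself show that the value is wrong.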
12. Missing values
• Occur when respondents refuse to answer,
tools malfunction, or subjects withdraw
from studies
• Missing values are categorized as
– MAR (missing at random), MCAR (missing completely at random) or MNAR (missing not at random)
• Most modern stat packages require
complete data
13. Dealing with Missing Values
• Use analyses that can deal with incomplete
data (hierarchical linear modelling, survival
analysis)
• Adjusting the denominator – remove the
unmarried from married
• Delete cases with missing data – leads to
misestimation of the population and thus lowers the
power
• Mean substitution – reduces the variance
• Imputation via multiple regression
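The remark that mean substitution reduces the variance can be seen in a tiny worked example. The numbers below are invented for illustration, and `None` stands in for a missing measurement.

```python
from statistics import mean, variance

# Hypothetical measurements; None marks a missing value.
observed = [4.0, 6.0, None, 8.0, None, 10.0]

# Complete-case data: simply drop the missing entries.
complete = [x for x in observed if x is not None]

# Mean substitution: replace every missing entry with the observed mean.
fill = mean(complete)
imputed = [fill if x is None else x for x in observed]

var_complete = variance(complete)  # sample variance of the observed values
var_imputed = variance(imputed)    # sample variance after mean substitution

print(f"fill value:               {fill}")
print(f"variance (complete case): {var_complete:.3f}")
print(f"variance (mean-imputed):  {var_imputed:.3f}")
```

The imputed values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the sample size. This is why the variance shrinks.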
16. Other Data Errors
• Duplications – keep the first admission, ordered by
time
• Biologically impossible results
– Robust estimation: Estimation of statistical
parameters, using methods that are less
sensitive to the effect of outliers than more
conventional methods
• Questionable values
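One way to read the duplication rule above is to keep, for each patient, the earliest admission by timestamp. The sketch below assumes hypothetical field names (`patient_id`, `admitted`) and invented records. It also assumes that the patient identifier is what defines a duplicate, which the slide does not state.

```python
from datetime import datetime

# Hypothetical admission records; P1 appears twice.
admissions = [
    {"patient_id": "P1", "admitted": datetime(2011, 1, 5, 9, 0)},
    {"patient_id": "P2", "admitted": datetime(2011, 1, 6, 14, 30)},
    {"patient_id": "P1", "admitted": datetime(2011, 1, 3, 8, 15)},
]


def first_admissions(rows):
    """Keep only the earliest admission for each patient_id."""
    first = {}
    for row in sorted(rows, key=lambda r: r["admitted"]):
        # setdefault stores a row only the first time an id is seen,
        # which after sorting is the earliest admission.
        first.setdefault(row["patient_id"], row)
    return list(first.values())


deduplicated = first_admissions(admissions)
for row in deduplicated:
    print(row["patient_id"], row["admitted"].isoformat())
```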
17. Given the rapid growth of the internet such
techniques will become increasingly
important in the organization of the growing
amounts of data available.
Large Synoptic Survey Telescope: 40 TB of
data per day calls for a different
approach … 100+ PB of data in 10 years
18. Tools for a Clinical Data Munger
Features | Stata | R | SPSS | SAS
Learning Curve | Steep/Gradual | Pretty Steep | Gradual/Flat | Pretty Steep
User Interface | Code/PnC | Code | Mostly PnC | Very Strong
Data Manipulation | Very Strong | Very Strong | Moderate | Very Strong
Data Analysis | Versatile | Versatile | Powerful | Powerful/Versatile
Graphics | Very good | Excellent | Very good | Good
Cost | Affordable (renewal on upgrade) | Open Source | Expensive | Expensive (yearly renewal)
19. Other Important Tools
• Python – getting real-time data from social
networks
• NVivo – for qualitative data
• Perl
It is convenient to distinguish the following areas: lack or excess of data; outliers, including inconsistencies; strange patterns in (joint) distributions; and unexpected analysis results and other types of inferences and abstractions.
During the diagnostic phase, the data munger may have to reconsider prior expectations and/or review quality assurance procedures.
data sink for storage, modelling or future use.
Graphical exploration of distributions: box plots, histograms, and scatter plots. Plots of repeated measurements on the same individual, e.g., growth curves. Statistical outlier detection.
In statistics, hierarchical linear modeling (HLM), also known as multi-level analysis, is a more advanced form of simple linear regressi...
- (transform, truncate) - Robust estimation: estimation of statistical parameters using methods that are less sensitive to the effect of outliers than more conventional methods. Accommodate and reduce errors (LSE, trimmed mean, Winsorized mean – the mean computed after replacing extreme values with the closest remaining values).
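The trimmed and Winsorized means mentioned in the note above can be sketched in a few lines. The function names, the parameter `k` (how many values to treat as extreme at each end) and the sample values are assumptions for illustration only.

```python
from statistics import mean


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for the number of values")
    return mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, k):
    """Mean after replacing the k most extreme values at each end
    with the closest remaining value."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for the number of values")
    low, high = ordered[k], ordered[-k - 1]
    clipped = [min(max(x, low), high) for x in ordered]
    return mean(clipped)


# Hypothetical lab values with one implausibly large result.
data = [2, 3, 3, 4, 4, 5, 5, 6, 6, 60]

print("ordinary mean:  ", mean(data))
print("trimmed mean:   ", trimmed_mean(data, 1))
print("winsorized mean:", winsorized_mean(data, 1))
```

Both robust estimates stay near the bulk of the data, while the ordinary mean is pulled upward by the single extreme value.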