Overview of how to estimate data quality and validate normalization approaches to remove analytical variance.
See here for animations used in the presentation:
http://imdevsoftware.wordpress.com/2014/06/04/using-repeated-measures-to-remove-artifacts-from-longitudinal-data/
2. Analytical Variance
Variation in sample measurements stemming from sample handling, data acquisition, processing, etc.
• Can modify or mask true biological variability
• Calculated based on variance in replicated measurements
• Can be accounted for using data normalization approaches
Goal: minimize analytical variance using data normalization
Drift in >400 replicated measurements across >100 batches
3. Need for Normalization
To remove non-biological (e.g. analytical) drift/variance/artifacts in measurements
Figure: panels by acquisition order and by processing/acquisition batch; legend: samples, quality controls (QCs)
4. Quantifying Data Quality (precision)
Calculate median inter- and intra-batch %RSD (for replicated measurements)
• Inter-batch: analyte-specific performance across the whole study
• Intra-batch: within-batch performance
5. Visualizing Performance
Intra-batch (within) precision for normalization methods
Inter-batch (across) precision for normalization methods
RSD = relative standard deviation = standard deviation/mean
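The precision metrics above can be sketched in a few lines. This is a minimal illustration rather than the presentation's own code: the function names and QC values are hypothetical, %RSD is taken as 100 × standard deviation / mean using the sample standard deviation, and the study-wide value is shown for a single analyte (the median across analytes would be taken afterwards).

```python
from collections import defaultdict
from statistics import mean, median, stdev


def percent_rsd(values):
    """%RSD = 100 * sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


def median_within_batch_rsd(qc_values, batches):
    """Median of the per-batch %RSDs for one analyte's QC measurements."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, batches):
        by_batch[batch].append(value)
    # A batch needs at least two QCs to have a standard deviation.
    rsds = [percent_rsd(v) for v in by_batch.values() if len(v) > 1]
    return median(rsds)


# Hypothetical QC intensities for one analyte measured in two batches.
qc = [100.0, 102.0, 98.0, 120.0, 123.0, 117.0]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]

within = median_within_batch_rsd(qc, batch)  # intra-batch precision
across = percent_rsd(qc)                     # study-wide precision, one analyte
print(round(within, 2), round(across, 2))
```

In this toy data the batch offset inflates the study-wide %RSD well above the within-batch value, which is the pattern a batch-level normalization is meant to remove.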
7. Common Normalization Approaches
Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.
Internal standard (ISTD)
• Ratio response (metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of an optimal combination of ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross-contribution to ISTDs)
Quality control (QC) or reference sample
• Batch ratio (mean, median)
• LOESS (doi:10.1038/nprot.2011.335; locally estimated scatterplot smoothing)
• Hierarchical mixed effects (Jauhiainen et al., 2014)
• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)
Variance based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)
• Variance stabilizing normalization (Huber et al., 2002)
8. Evaluation of Normalizations
Use QC to define:
• Median within batch %RSD
• Median analyte study wide %RSD
• All normalization specific parameters
• Split QCs into training and test set
• Optimize tuning parameters using leave-one-out cross-validation
• Assess performance on test set
Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r
9. Scalar Normalization
Calculate a sample-specific scalar so that each sample's summary signal (sum, mean, median, etc.) is equivalent
• Sum signal normalization (sum norm) assumes equivalent total metabolite signal per sample
• Can correct for batch effects when valid
BMC Bioinformatics 2007, 8:93 doi:10.1186/1471-2105-8-93
These normalizations may hide true biological trends or create false ones
After sum norm, phospholipids seem lower in ob/ob when in reality they are the same as in wt samples
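A small sketch of how sum normalization can create a false trend. The numbers and the triglyceride analyte are hypothetical, chosen only so that one class of metabolite differs between groups while the phospholipid signal is identical.

```python
def sum_normalize(sample):
    """Scale one sample so that its analyte signals sum to 1."""
    total = sum(sample.values())
    return {name: value / total for name, value in sample.items()}


# Hypothetical raw signals: phospholipid is identical in both samples,
# only the (assumed) triglyceride signal differs.
wt = {"phospholipid": 50.0, "triglyceride": 50.0}
ob = {"phospholipid": 50.0, "triglyceride": 150.0}

wt_norm = sum_normalize(wt)
ob_norm = sum_normalize(ob)

# After sum norm the phospholipid share looks lower in ob/ob
# even though the raw signal is the same in both samples.
print(wt_norm["phospholipid"], ob_norm["phospholipid"])
```

The total-signal assumption is violated here, so the scalar correction moves an unchanged analyte.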
10. Batch Ratio (BR) Normalization
Use QCs to calculate:
1. Batch/analyte-specific correction factor = (batch median / global median)
2. Apply ratio to samples
• simple
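The two steps can be sketched for a single analyte as follows. Function names and values are hypothetical, and dividing each sample by its batch factor is an assumption about how the ratio is applied; the slide only states that the ratio is applied to samples.

```python
from collections import defaultdict
from statistics import median


def batch_ratio_factors(qc_values, qc_batches):
    """Per-batch factor = QC batch median / QC global median (one analyte)."""
    global_median = median(qc_values)
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, qc_batches):
        by_batch[batch].append(value)
    return {b: median(v) / global_median for b, v in by_batch.items()}


def apply_batch_ratio(values, batches, factors):
    """Divide each sample by the factor of its batch (assumed direction)."""
    return [value / factors[batch] for value, batch in zip(values, batches)]


# Hypothetical QC intensities: batch b2 reads about twice as high as b1.
qc = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
qc_batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
factors = batch_ratio_factors(qc, qc_batch)

# Two samples with the same underlying level, measured in different batches.
corrected = apply_batch_ratio([50.0, 100.0], ["b1", "b2"], factors)
print(corrected)
```

Because each factor rests on a handful of QCs per batch, a single aberrant QC can shift the correction for every sample in that batch.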
11. LOESS Normalization (local smoothing)
For each analyte use QCs to:
• Tune LOESS model (span or degree of smoothing)
• Apply the LOESS model to remove analytical variance from samples
Figure panels: raw, LOESS normalized
12. LOESS Normalization
LOESS span has a large effect on model fit
Span (α) defines the degree of smoothing and is critical for controlling overfitting
13. LOESS Normalization
raw samples (red) normalized based on QCs (black)
model is trained on QCs and applied to samples
Figure panels by span: too high; just right?
Cannot assume convergence of training and test performance because test data has analytical + biological variance
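A rough sketch of QC-trained LOESS normalization with leave-one-out tuning of the span. It uses a simplified local linear fit with tricube weights written in plain Python, not any particular library's LOESS. All function names and data are hypothetical, and dividing each sample by the fitted QC trend and rescaling to the QC median is an assumed form of the correction.

```python
from statistics import median


def loess_predict(x_train, y_train, x_new, span):
    """Tricube-weighted local linear fit at x_new.

    span is the fraction of training points used in each local fit.
    """
    n = len(x_train)
    k = max(3, min(n, round(span * n)))
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x_new))[:k]
    h = max(abs(x - x_new) for x, _ in nearest) or 1.0
    w = [(1 - (abs(x - x_new) / h) ** 3) ** 3 for x, _ in nearest]
    sw = sum(w)
    if sw == 0:  # all neighbours equally distant: fall back to their mean
        return sum(y for _, y in nearest) / len(nearest)
    mx = sum(wi * x for wi, (x, _) in zip(w, nearest)) / sw
    my = sum(wi * y for wi, (_, y) in zip(w, nearest)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, (x, _) in zip(w, nearest))
    sxy = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x_new - mx)


def loo_cv_error(x, y, span):
    """Mean squared leave-one-out prediction error on the QCs."""
    errors = []
    for i in range(len(x)):
        pred = loess_predict(x[:i] + x[i + 1:], y[:i] + y[i + 1:], x[i], span)
        errors.append((y[i] - pred) ** 2)
    return sum(errors) / len(errors)


def tune_span(x, y, spans):
    """Pick the candidate span with the lowest leave-one-out error."""
    return min(spans, key=lambda s: loo_cv_error(x, y, s))


def loess_normalize(qc_x, qc_y, sample_x, sample_y, span):
    """Divide samples by the QC trend at their acquisition position and
    rescale to the QC median (assumed form of the correction)."""
    center = median(qc_y)
    return [y * center / loess_predict(qc_x, qc_y, x, span)
            for x, y in zip(sample_x, sample_y)]


# Hypothetical data: QCs drift linearly with acquisition order.
qc_x = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
qc_y = [100.0 + 0.5 * x for x in qc_x]
# Samples share one true level but carry the same multiplicative drift.
sample_x = [5.0, 45.0, 85.0]
sample_y = [200.0 * (100.0 + 0.5 * x) / 100.0 for x in sample_x]

span = tune_span(qc_x, qc_y, spans=[0.3, 0.5, 0.8])
normalized = loess_normalize(qc_x, qc_y, sample_x, sample_y, span)
print(span, normalized)
```

The model only ever sees QCs during tuning, so a low leave-one-out error says nothing about biological variance in the samples; that is why performance on a held-out QC test set is checked separately.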
16. Metabolomic Data Case Study I
GC-TOF
• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No Internal Standards (ISTDs)
Normalizations Implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization
23. BR Normalization Limitations
• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match
27. LOESS Normalization Limitations
Figure panels: raw, normalized
LOESS normalization can inflate variance when:
• overtrained
• training examples do not match test set
28. Sum mTIC Normalization (GC-TOF)
Improved performance over raw and BR, but converts the data from absolute magnitudes to compositional (relative) values
29. Sum mTIC Normalization (GC-TOF)
Poor removal of trends due to acquisition time, but limits the magnitude of outlier samples compared to other approaches
Figure panels: raw, mTIC normalized; x-axis: time
30. Metabolomic Data Case Study II
LC-Q-TOF
• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)
Normalizations Implemented
• NOMIS (IS = ISTD)
• qcISTD
31. Internal Standards Normalization
Internal standards (ISTD)
• qcISTD (QC-optimized metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of an optimal combination of ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross-contribution to ISTDs)
Figure (NOMIS): axes are analyte and retention time
32. ISTD Based Normalizations (LC/Q-TOF)
• NOMIS (linear combination of optimal ISTDs; Sysi-Aho et al., 2007)
• qcISTD (QC-optimized ISTD strategy)
Figure: PC 38:6, poor performance with NOMIS
33. qcISTD Normalization
Use QC samples to:
1. Evaluate analyte %RSD before and after corrections using all ISTDs
2. Select analyte/ISTD combinations with %RSD improvement over raw data at some threshold (e.g. 10%)
3. Correct sample analytes with the QC-defined ISTD if ISTD recovery is above some minimal threshold (e.g. > 20% of median)
• Subject to overfitting
191 of 326 (60%) are ISTD corrected
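The three qcISTD steps can be sketched roughly as below. This is not the devium implementation: function and ISTD names are hypothetical, the 10% threshold is read as a relative %RSD improvement over raw, recovery is compared against the ISTD's QC median, and rescaling the ratio by that median is an added assumption.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """%RSD = 100 * sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


def select_istd(analyte_qc, istd_qc, min_improvement=10.0):
    """Return the ISTD whose ratio lowers QC %RSD the most, or None if no
    ISTD improves on raw by at least min_improvement percent (relative)."""
    raw = percent_rsd(analyte_qc)
    best_name, best_rsd = None, raw
    for name, istd in istd_qc.items():
        ratio = [a / i for a, i in zip(analyte_qc, istd)]
        rsd = percent_rsd(ratio)
        if rsd < best_rsd and 100.0 * (raw - rsd) / raw >= min_improvement:
            best_name, best_rsd = name, rsd
    return best_name


def correct_sample(analyte_value, istd_value, istd_qc_median, min_recovery=0.2):
    """Apply the ISTD ratio only when the ISTD was adequately recovered."""
    if istd_value < min_recovery * istd_qc_median:
        return analyte_value  # poor ISTD recovery: leave raw (assumption)
    return analyte_value / istd_value * istd_qc_median


# Hypothetical QC data: istd_a tracks the analyte, istd_b does not.
analyte_qc = [100.0, 120.0, 80.0, 110.0]
istd_qc = {
    "istd_a": [10.0, 12.0, 8.0, 11.0],
    "istd_b": [10.0, 10.0, 10.0, 10.0],
}

chosen = select_istd(analyte_qc, istd_qc)
ok = correct_sample(50.0, 5.0, istd_qc_median=10.0)       # ISTD recovered
skipped = correct_sample(50.0, 1.0, istd_qc_median=10.0)  # below 20% of median
print(chosen, ok, skipped)
```

Selecting the ISTD on the same QCs that are later used to report %RSD is where the overfitting risk comes from, so the selection would ideally be made on a training subset of QCs.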
36. Normalizations (LC-Q-TOF)
LOESS performs very poorly for two metabolites
• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision
37. PCA (LC/Q-TOF)
PCA panels (%RSD): raw (13), qcISTD (9), LOESS (12), qcISTD + LOESS (8)
Only normalizations that include LOESS effectively remove analytical batch effects
38. Conclusion
• Comparison of common data normalization approaches suggests that, in addition to ISTD corrections, LOESS (analyte-specific, non-linear adjustment based on QC performance at various data acquisition times) is superior to batch-based corrections.
• Further validations need to be completed to confirm the effects of normalizations on samples’ variance
• These findings suggest that inclusion of “batch” as a covariate in statistical models will not fully account for analytical variance
R code for all normalization functions can be found at:
https://github.com/dgrapov/devium/blob/master/R/Devium%20Normalization.r