Data Normalization Approaches for Large-scale Biological Studies

Data Normalization Approaches for
Large-Scale Metabolomic Studies
Dmitry Grapov, PhD

Analytical Variance
Variation in sample measurements stemming from sample handling, data acquisition, processing, etc.
• Can modify or mask true biological variability
• Calculated based on variance in replicated measurements
• Can be accounted for using data normalization approaches
Goal: minimize analytical variance using data normalization
Figure: drift in >400 replicated measurements across >100 batches

Need for Normalization
To remove non-biological (e.g. analytical) drift, variance, and artifacts in measurements
Figure panels: acquisition order; processing/acquisition batches (legend: samples, quality controls (QCs))

Quantifying Data Quality (precision)
Calculate median inter- and intra-batch %RSD (for replicated measurements)
• Analyte-specific performance across the whole study
• Within-batch performance
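The two precision summaries above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names, the analyte names, and the QC values are made up, and the within-batch summary is assumed to be the median over analytes of each batch's %RSD.

```python
from statistics import mean, median, stdev

def pct_rsd(values):
    """Percent relative standard deviation: 100 * standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)

def analyte_rsd(qc):
    """Study-wide %RSD per analyte, using every QC injection."""
    return {analyte: pct_rsd(vals) for analyte, vals in qc.items()}

def batch_rsd(qc, batches):
    """Per batch: median over analytes of the within-batch %RSD (assumed definition)."""
    out = {}
    for b in sorted(set(batches)):
        idx = [i for i, label in enumerate(batches) if label == b]
        out[b] = median([pct_rsd([vals[i] for i in idx]) for vals in qc.values()])
    return out

# Hypothetical QC data: one list of values per analyte, one batch label per injection.
qc = {
    "glucose": [100.0, 110.0, 90.0, 200.0, 220.0, 180.0],  # shifts between batches
    "alanine": [50.0, 55.0, 45.0, 50.0, 55.0, 45.0],       # stable between batches
}
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]

per_analyte = analyte_rsd(qc)
per_batch = batch_rsd(qc, batches)
print(per_analyte)
print(per_batch)
```

In this toy data both batches have the same within-batch precision, while the study-wide %RSD of the first analyte is inflated by the shift between batches, which is the kind of gap that normalization is meant to close.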

Visualizing Performance
Figure: intra-batch (within) precision for normalization methods
Figure: inter-batch (across) precision for normalization methods
RSD = relative standard deviation = standard deviation / mean

Visualizing Metabolite Performance
Figure panels: univariate (acquisition time, batch); multivariate (PCA)

Common Normalization Approaches
Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.
Internal standard (ISTD)
• Ratio response (metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)
Quality control (QC) or reference sample
• Batch ratio (mean, median)
• Loess (doi:10.1038/nprot.2011.335; locally estimated scatterplot smoothing)
• Hierarchical mixed effects (Jauhiainen et al. 2014)
• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)
Variance Based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)
• Variance stabilizing normalization (Huber et al. 2002)

Evaluation of Normalizations
Use QC to define:
• Median within batch %RSD
• Median analyte study wide %RSD
• All normalization specific parameters
• Split QCs into training and test set
• Optimize tuning parameters using leave-one-out cross-validation
• Assess performance on test set
Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r
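The evaluation scheme above (split QCs, tune by leave-one-out cross-validation, score on the held-out QCs) can be sketched as follows. This is an illustration under stated assumptions: a k-nearest-neighbour mean over acquisition order stands in for whichever normalization model is being tuned, the QC values are simulated, and every name here is hypothetical.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Predict the QC signal at acquisition order x as the mean of the k nearest training QCs."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean([v for _, v in nearest])

def loocv_rmse(train, k):
    """Leave-one-out cross-validated RMSE for tuning parameter k."""
    squared = []
    for i, (x, y) in enumerate(train):
        rest = train[:i] + train[i + 1:]          # drop the held-out QC
        squared.append((knn_predict(rest, x, k) - y) ** 2)
    return mean(squared) ** 0.5

def rmse(train, test, k):
    """Error on the held-out test QCs using a model built from the training QCs."""
    return mean([(knn_predict(train, x, k) - y) ** 2 for x, y in test]) ** 0.5

# Simulated QCs: (acquisition order, signal) with linear drift plus noise.
rng = random.Random(0)
qcs = [(t, 100 + 0.5 * t + rng.gauss(0, 3)) for t in range(30)]

test = qcs[::3]                                    # every third QC is held out
train = [p for i, p in enumerate(qcs) if i % 3 != 0]

candidates = [1, 3, 5, 9]
best_k = min(candidates, key=lambda k: loocv_rmse(train, k))
test_rmse = rmse(train, test, best_k)
print(best_k, round(test_rmse, 2))
```

Keeping the test QCs out of the tuning loop is what makes the final score an honest estimate of how the tuned model behaves on data it has not seen.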

Scalar Normalization
Calculate a sample-specific scalar to ensure each sample’s (sum, mean, median, etc.) signal is equivalent
• Sum signal normalization (sum norm) assumes equivalent total metabolite signal per sample
• Can correct for batch effects when this assumption is valid
BMC Bioinformatics 2007, 8:93 doi:10.1186/1471-2105-8-93
These normalizations may hide true biological trends or create false ones
Example: after sum norm, phospholipids seem lower in ob/ob samples when in reality they are the same as in wt samples
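A toy calculation shows how sum normalization can create the false trend described above. The intensities are invented for illustration, and the triglyceride name is a hypothetical stand-in for any metabolite class that is elevated in one group.

```python
def sum_normalize(sample):
    """Divide every metabolite by the sample's total signal (sum norm)."""
    total = sum(sample.values())
    return {name: value / total for name, value in sample.items()}

# Invented intensities: phospholipids (PC) are identical, only the triglyceride differs.
wt = {"PC 38:6": 100.0, "PC 36:2": 100.0, "TG 52:2": 200.0}
ob = {"PC 38:6": 100.0, "PC 36:2": 100.0, "TG 52:2": 600.0}

wt_norm = sum_normalize(wt)
ob_norm = sum_normalize(ob)

# Raw PC 38:6 is equal in both samples, but the normalized value is halved in ob.
print(wt_norm["PC 38:6"], ob_norm["PC 38:6"])
```

Because the total signal differs between the two samples, the equal-total assumption fails and unchanged metabolites appear to decrease.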

Batch Ratio (BR) Normalization
Use QCs to calculate:
1. Batch- and analyte-specific correction factor = (batch median / global median)
2. Apply ratio to samples
• Simple
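The two steps above can be sketched for a single analyte. This is a minimal illustration with made-up values; it assumes that "apply ratio" means dividing each measurement by its batch's correction factor, and the function and argument names are hypothetical.

```python
from statistics import median

def batch_ratio_normalize(values, batches, is_qc):
    """Batch ratio normalization for one analyte.

    factor[batch] = median of QCs in that batch / median of all QCs;
    each measurement is divided by its batch factor (assumed convention).
    """
    global_median = median([v for v, q in zip(values, is_qc) if q])
    factors = {}
    for b in set(batches):
        batch_qcs = [v for v, lab, q in zip(values, batches, is_qc) if lab == b and q]
        factors[b] = median(batch_qcs) / global_median
    return [v / factors[b] for v, b in zip(values, batches)]

# Invented data: batch 2 reads twice as high as batch 1 for the same material.
values  = [100.0, 80.0, 120.0, 100.0, 200.0, 160.0, 240.0, 200.0]
batches = [1, 1, 1, 1, 2, 2, 2, 2]
is_qc   = [True, False, False, True, True, False, False, True]

normalized = batch_ratio_normalize(values, batches, is_qc)
print([round(v, 1) for v in normalized])
```

After correction the QCs in both batches sit at the global QC median and the matching samples line up across batches.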

LOESS Normalization (local smoothing)
For each analyte use QCs to:
• Tune LOESS model (span or degree of smoothing)
• Apply LOESS model to remove analytical variance from samples
Figure panels: raw; LOESS normalized
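A simplified version of QC-based LOESS correction for one analyte is sketched below. It is not a full LOESS implementation: it fits a local linear model with tricube weights and no robustness iterations. Dividing by the fitted QC trend and rescaling to the QC median is one common convention and is an assumption here, as are all names and values.

```python
import math
from statistics import median

def local_linear(x, y, x0, span):
    """Tricube-weighted local linear fit at x0 using the nearest ceil(span * n) points."""
    n = len(x)
    k = max(2, min(n, math.ceil(span * n)))
    # Slightly inflate the bandwidth so the k-th neighbour keeps a nonzero weight.
    h = 1.001 * sorted(abs(xi - x0) for xi in x)[k - 1] or 1.0
    w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def loess_normalize(qc_order, qc_values, order, values, span=0.75):
    """Divide each sample by the QC trend at its acquisition order, rescaled to the QC median."""
    target = median(qc_values)
    return [v / local_linear(qc_order, qc_values, t, span) * target
            for t, v in zip(order, values)]

# Invented data: QC signal drifts upward with acquisition order.
qc_order = [0, 5, 10, 15, 20]
qc_values = [100.0, 110.0, 120.0, 130.0, 140.0]
# Two samples with the same true level, measured at different points of the drift.
sample_order = [2, 12]
sample_values = [52.0, 62.0]

corrected = loess_normalize(qc_order, qc_values, sample_order, sample_values)
print([round(v, 2) for v in corrected])
```

The model is trained only on QCs and then evaluated at each sample's acquisition order, so the two samples end up at the same corrected level once the drift is removed.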

LOESS Normalization
LOESS span has a large effect on model fit
Span (α) defines the degree of smoothing and is critical for controlling overfitting

LOESS Normalization
Raw samples (red) normalized based on QCs (black)
Model is trained on QCs and applied to samples
Figure panels: span too high; span just right?
Cannot assume convergence of training and test performance because test data have analytical + biological variance

LOESS Normalization
Avoiding overfitting is critical when using LOESS normalization

Example LOESS Normalization
Figure panels: raw; span = 0.75; span = 0.005

Metabolomic Data Case Study I
GC-TOF
• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No Internal Standards (ISTDs)
Normalizations Implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization

Batch Performance (GC-TOF Raw)
Within batch
• Median: 26
• Min: 19
• Max: 69
Median RSD   Count   Cumulative %
10-20        3       2
20-30        98      76
30-40        26      96
40-50        3       98
50-60        1       99
60-70        1       100

Analyte Performance (GC-TOF Raw)
Within Batch
• Median: 24
• Min: 7
• Max: 79
Median RSD   Count   Cumulative %
0-10         10      3
10-20        83      30
20-30        100     62
30-40        69      84
40-50        32      94
50-60        6       96
60-70        3       97
70-80        5       98
80-90        1       99
90-100       1       100

Batch Performance (GC-TOF BR)
Within batches
• Median: 23
• Min: 17
• Max: 69
Median RSD   Count   Cumulative %
10-20        25      23
20-30        67      85
30-40        15      99
40-50        1       100
60-70        1       101

Analyte Performance (GC-TOF BR)
Across batches
• Median: 24
• Min: 7
• Max: 79
Median RSD   Count   Cumulative %
0-10         17      6
10-20        103     39
20-30        112     75
30-40        57      93
40-50        12      97
50-60        5       99
60-70        3       100
70-80        1       100

BR Normalization Limitations
• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match

Batch Performance (GC-TOF LOESS)
Within batches
• Median: 19
• Min: 11
• Max: 58
Median RSD   Count   Cumulative %
10-20        75      57
20-30        51      96
30-40        4       99
40-50        1       99
50-60        1       100

Analyte Performance (GC-TOF LOESS)
Across batches
• Median: 19
• Min: 2.9
• Max: 66
Median RSD   Count   Cumulative %
0-10         17      6
10-20        103     39
20-30        112     75
30-40        57      93
40-50        12      97
50-60        5       99
60-70        3       100
70-80        1       100

LOESS Normalization Limitations
LOESS normalization can inflate variance when:
• overtrained
• training examples do not match test set
Figure panels: raw; normalized

Sum mTIC Normalization (GC-TOF)
Improved performance over raw and BR, but alters data from magnitudinal to compositional

Sum mTIC Normalization (GC-TOF)
Poor removal of trends due to acquisition time, but limits the magnitude of outlier samples compared to other approaches
Figure panels: raw; mTIC normalized (x-axis: time)

Metabolomic Data Case Study II
LC-Q-TOF
• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)
• NOMIS (IS = ISTD)
• qcISTD

Internal Standards Normalization
Internal standards (ISTD)
• qcISTD (QC-optimized metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)
Figure: analyte by retention time (panel: NOMIS)

ISTD Based Normalizations (LC/Q-TOF)
• NOMIS (linear combination of optimal ISTDs; Sysi-Aho et al., 2007)
• qcISTD (QC-optimized ISTD strategy)
Figure annotation: PC 38:6, poor performance with NOMIS

qcISTD Normalization
Use QC samples to:
1. Evaluate analyte %RSD before and after corrections using all ISTDs
2. Select analyte/ISTD combinations with %RSD improvement over raw data at some threshold (e.g. 10%)
3. Correct sample analytes with the QC-defined ISTD if ISTD recovery is above some minimal threshold (e.g. > 20% of median)
• Subject to overfitting
191 of 326 (60%) analytes are ISTD corrected
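The three qcISTD steps can be sketched for one analyte as follows. This is an illustration, not the author's code: it assumes the improvement threshold is relative to the raw %RSD, that recovery is judged against the ISTD's median in the QCs, and that corrected values are rescaled by that median so they stay on the raw scale. All names and numbers are hypothetical.

```python
from statistics import mean, median, stdev

def pct_rsd(values):
    """Percent relative standard deviation."""
    return 100.0 * stdev(values) / mean(values)

def select_istd(qc_analyte, qc_istds, min_gain=0.10):
    """Steps 1-2: return the ISTD giving the lowest QC %RSD, or None if no ISTD
    improves on the raw %RSD by at least min_gain (relative improvement, assumed)."""
    raw = pct_rsd(qc_analyte)
    best_name, best_rsd = None, raw
    for name, istd in qc_istds.items():
        rsd = pct_rsd([a / i for a, i in zip(qc_analyte, istd)])
        if rsd < best_rsd and (raw - rsd) / raw >= min_gain:
            best_name, best_rsd = name, rsd
    return best_name

def correct(sample_analyte, sample_istd, istd_qc_median, min_recovery=0.20):
    """Step 3: apply the ratio only when ISTD recovery exceeds the minimal threshold."""
    if sample_istd <= min_recovery * istd_qc_median:
        return sample_analyte                      # poor recovery: leave the raw value
    return sample_analyte / sample_istd * istd_qc_median

# Invented QC data: ISTD_A tracks the analyte's drift, ISTD_B does not.
qc_analyte = [100.0, 120.0, 80.0, 110.0]
qc_istds = {
    "ISTD_A": [10.0, 12.0, 8.0, 11.0],
    "ISTD_B": [10.0, 8.0, 12.0, 9.0],
}

chosen = select_istd(qc_analyte, qc_istds)
istd_median = median(qc_istds[chosen])
print(chosen, correct(90.0, 9.0, istd_median), correct(90.0, 1.0, istd_median))
```

Because the analyte/ISTD pairing is chosen on the same QCs that are later used to report precision, a held-out set of QCs is needed to check that the improvement is not an artifact of the selection.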

qcISTD Normalization
Figure panels: ISTD used by retention time (Rt); total number of analytes corrected by ISTD

Normalizations (LC-Q-TOF)
• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision
Figure annotation: LOESS performs very poorly for two metabolites

PCA (LC/Q-TOF)
Figure panels (%RSD): raw (13); qcISTD (9); LOESS (12); qcISTD + LOESS (8)
Only normalizations that include LOESS effectively remove analytical batch effects

Conclusion
• Comparison of common data normalization approaches
suggests that in addition to ISTD corrections, LOESS
(analyte-specific, non-linear adjustment based on QC
performance at various data acquisition times) is superior
to batch based corrections.
• Further validations need to be completed to confirm the
effects of normalizations on samples’ variance.
• These findings suggest that inclusion of “batch” as a
covariate in statistical models will not fully account for
analytical variance.
R code for all normalization functions can be found at:
https://github.com/dgrapov/devium/blob/master/R/Devium%20Normalization.r

dgrapov@ucdavis.edu
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154
