2. Introduction
Important
• This is an introduction to a series of 8 tutorials for metabolomic data analysis
• Download all the required files and software here: https://sourceforge.net/projects/teachingdemos/files/Winter%202014%20LC-MS%20and%20Statistics%20Course/
• Then follow the directions in software/startup.R to launch all accompanying software
8. Data Analysis Goals
Exploration
• Are there any trends in my data?
– analytical sources
– meta data/covariates
• Useful Methods
– matrix decomposition (PCA, ICA, NMF)
– cluster analysis
Classification
• Differences/similarities between groups?
– discrimination, classification, significant changes
• Useful Methods
– analysis of variance (ANOVA), mixed effects models
– partial least squares discriminant analysis (O-/PLS-DA)
– Others: random forest, CART, SVM, ANN
Prediction
• What is related or predictive of my variable(s) of interest?
– Regression, correlation
• Useful Methods
– correlation
– partial least squares (O-/PLS)
12. Univariate Analyses
• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)
[Figure: data shapes labeled wide, long, and n-of-one]
13. False Discovery Rate (FDR)
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = 1 - (1 - p.value)^m
  m = number of variables tested
FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
Bioinformatics (2008) 24 (12):1461-1462
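The Type I risk formula and the idea of p-value adjustment can be sketched in a few lines. This is a minimal illustration, assuming independent tests and using the Benjamini-Hochberg step-up procedure as one common adjustment; it is not necessarily the estimator used in the cited paper or in the course labs. The function names and the example p-values are made up for illustration.

```python
def family_wise_risk(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Visit p-values from largest to smallest so a running minimum
    # keeps the adjusted values monotone.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Risk of at least one false positive when testing 100 variables at 0.05.
risk = family_wise_risk(0.05, 100)

# Hypothetical raw p-values for five variables.
p_example = [0.001, 0.008, 0.039, 0.041, 0.6]
q_example = benjamini_hochberg(p_example)
print(round(risk, 3), q_example)
```

With 100 variables the unadjusted risk is already close to 1, which is why some form of adjustment is applied before reporting "significant" metabolites.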
14. Achieving “significance” is a function of:
significance level (α) and power (1-β )
effect size (standardized difference in means)
sample size (n)
*finish lab 1-statistical analysis
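How these quantities trade off can be seen with a rough sample size calculation. The sketch below assumes a two-sided, two-sample comparison of means and uses the normal approximation n = 2((z_(1-α/2) + z_(1-β)) / d)^2 per group, where d is the standardized effect size; an exact t-based calculation gives slightly larger numbers. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test of means."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_large_effect = samples_per_group(0.8)  # large standardized difference
n_small_effect = samples_per_group(0.2)  # small standardized difference
print(n_large_effect, n_small_effect)
```

Under this approximation, a small effect needs many more samples per group than a large one at the same α and power.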
16. Cluster Analysis
Use the concept of similarity/dissimilarity to group a collection of samples or variables
Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self organizing maps (SOM)
[Figure panels: Linkage, Distribution, k-means, Density]
17. Hierarchical Cluster Analysis
• similarity/dissimilarity defines “nearness” or distance
[Figure: distance panels labeled Euclidean, Manhattan, Mahalanobis, and non-Euclidean, with points marked X and Y]
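To make the choice of distance concrete, here is a small sketch computing three of the named distances between the same pair of points. It is illustrative only: the points and covariance matrices are made up, and the Mahalanobis helper is restricted to two dimensions so the matrix inverse can be written out by hand.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance for 2-D points given a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Inverse of the 2x2 covariance matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    dx, dy = a[0] - b[0], a[1] - b[1]
    return sqrt(dx * (inv[0][0] * dx + inv[0][1] * dy)
                + dy * (inv[1][0] * dx + inv[1][1] * dy))


p, q = (0.0, 0.0), (3.0, 4.0)
identity = ((1.0, 0.0), (0.0, 1.0))    # unit variance, no correlation
stretched = ((9.0, 0.0), (0.0, 16.0))  # unequal variances per variable

d_euclid = euclidean(p, q)
d_manhattan = manhattan(p, q)
d_mahal_identity = mahalanobis_2d(p, q, identity)
d_mahal_stretched = mahalanobis_2d(p, q, stretched)
print(d_euclid, d_manhattan, d_mahal_identity, d_mahal_stretched)
```

With an identity covariance the Mahalanobis distance reduces to the Euclidean one. With unequal variances each difference is rescaled by the variable's spread, so the same two points can end up "nearer" or "farther" depending on the metric.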
21. Projection of Data
The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
James X. Li, 2009, VisuMap Tech.
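The difference between the two projections can be shown on a toy data set. The sketch below is illustrative and assumes centered data and a single response y. The first PCA direction is found by power iteration on X'X. The first PLS weight vector is taken as X'y scaled to unit length, which is the direction of maximum covariance with y for a single response. The data values and function names are made up.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def first_pca_direction(X, iterations=100):
    """Leading eigenvector of X'X by power iteration (X already centered)."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    w = normalize([1.0] * p)
    for _ in range(iterations):
        w = normalize([sum(xtx[i][j] * w[j] for j in range(p))
                       for i in range(p)])
    return w


def first_pls_weight(X, y):
    """Unit vector proportional to X'y (X and y already centered)."""
    p = len(X[0])
    return normalize([sum(row[j] * yi for row, yi in zip(X, y))
                      for j in range(p)])


# Toy data: variable 1 has large variance unrelated to y,
# variable 2 has small variance but tracks y exactly.
X = [[-4.0, -1.0], [4.0, -1.0], [-4.0, 1.0], [4.0, 1.0]]
y = [-1.0, -1.0, 1.0, 1.0]

pca_direction = first_pca_direction(X)
pls_weight = first_pls_weight(X, y)
print(pca_direction, pls_weight)
```

Here the PCA direction points along the high-variance variable, while the PLS weight points along the variable that covaries with y, even though it explains little of the variance in X.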
25. Use PLS to test a hypothesis
Partial Least Squares (PLS) is used to identify planes of maximum covariance between X measurements and Y (hypothesis)
[Figure: PLS and PCA projections of samples from time = 0 to 120 min.]
27. PLS Related Objects
Model
•dimensions, latent variables (LV)
•performance metrics (Q2, RMSEP, etc)
•validation (training/testing, permutation, cross-validation)
•orthogonal correction
Samples
•scores
•predicted values
•residuals
Variables
•Loadings
•Coefficients, summary of loadings based on all LVs
•VIP, variable importance in projection
•Feature selection
28. “goodness” of the model is all about the perspective
Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model
•permutation tests
•training/testing
*finish lab 4-Partial Least Squares and lab 5-Data Analysis Case Study
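The logic of comparing a model to a random one can be sketched with a simple permutation test. This is a simplified stand-in: the squared Pearson correlation between one predictor and y takes the place of a model fit statistic such as Q2. A full test would refit the PLS model on every permuted Y. The data, seed, and function names are hypothetical.

```python
import random
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def permutation_p_value(x, y, n_permutations=999, seed=0):
    """Fraction of label shuffles that fit at least as well as the real data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = pearson_r(x, y) ** 2
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # break the link between x and y
        if pearson_r(x, y_perm) ** 2 >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1, 9.0, 9.9]
p_value = permutation_p_value(x, y)
print(p_value)
```

A small p-value means that few shuffled data sets fit as well as the real one, which is the sense in which the model is better than random.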
29. Biological Interpretation
Projection or mapping of analysis results
into a biological context.
• Visualization
• Enrichment
• Networks
– biochemical
– structural
– spectral
– empirical
30. Identification of alterations in biochemical domains
Organism specific biochemical relationships and information
• Multiple organism DBs
– KEGG
– BioCyc
– Reactome
• Human
– HMDB
– SMPDB
*finish lab 6-Metabolite Enrichment Analysis
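One common way to score enrichment of a pathway is an over-representation (hypergeometric) test. It is shown here as an assumption about what such an analysis can look like, not as the method used in lab 6. The counts in the example are invented and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected metabolites are drawn at
    random from n_background, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 200 measured metabolites, 10 in the pathway,
# 20 significantly changed, 5 of those in the pathway.
p_enriched = enrichment_p_value(200, 10, 20, 5)
p_trivial = enrichment_p_value(200, 10, 20, 0)  # overlap >= 0 is certain
print(p_enriched, p_trivial)
```

The background should be the set of metabolites actually measured, not the whole database, otherwise the p-values are too optimistic.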
31. Network Mapping
1. Generate Connections
2. Calculate Mappings
3. Create Network
Grapov D., Fiehn O., Multivariate and network tools for analysis and visualization of metabolomic data, ASMS, June 08, 2013, Minneapolis, MN
32. Connections and Contexts
Biochemical (substrate/product)
• Database lookup
• Web query
Chemical (structural or spectral similarity)
• fingerprint generation
BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99
Empirical (dependency)
•correlation, partial-correlation
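The difference between correlation and partial correlation as empirical edges can be sketched with three variables. The example assumes a made-up situation in which metabolites x and y are both driven by a third variable z. The first-order partial correlation formula removes the shared dependence on z. All names and values are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / sqrt(saa * sbb)


def partial_r(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# z drives both x and y; the added noise terms are unrelated to each other.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise_x = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
noise_y = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
x = [zi + e for zi, e in zip(z, noise_x)]
y = [zi + e for zi, e in zip(z, noise_y)]

r_marginal = pearson_r(x, y)
r_partial = partial_r(x, y, z)
print(round(r_marginal, 3), round(r_partial, 3))
```

A plain correlation network would draw a strong x-y edge here. A partial correlation network would drop it, because the relationship is explained by z.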