This document describes the application of principal component analysis (PCA) based feature extraction to three bioinformatics studies without using class labels.
In the first study, PCA was used to select methylation sites that discriminated between healthy and disease samples in twins across three autoimmune diseases, finding many common sites. The second study applied PCA to select stable circulating microRNA biomarkers that classified 14 diseases from controls with high accuracy. The third study used PCA to extract prominent proteins in bacterial cells under different growth conditions without predefined classes.
The document discusses the advantages of unsupervised PCA-based feature extraction in providing stability and extracting biologically relevant features even without classification information. It questions when the approach works best and how to evaluate unsupervised feature selection.
Heuristic PCA for Unsupervised Bioinformatics Feature Extraction
1. Heuristic PCA Based Feature Extraction and Its Application to Bioinformatics
Y-h. Taguchi, Dept. Phys., Chuo Univ.
Y. Murakami, Grad. Sch. Med., Osaka City Univ.
M. Iwadate, Dept. Biol. Sci., Chuo Univ.
H. Umeyama, Dept. Biol. Sci., Chuo Univ.
A. Okamoto, Dept. Sch. Health Sci., Aichi Univ. Edu.
2. 0. Why PCA?
PCA = principal component analysis
Motivation:
Unsupervised Feature Selection
How PCA?
3. Synthetic data set: 100 features ✕ 20 samples
Columns: 20 samples (Class 1, then Class 2)
10 ordered features:
11111111110000000000
11111111110000000000
.
.
11111111110000000000
90 random features:
01000000110110011111
00011110000101011101
.
.
.
01000011000110101111
How to select the 10 ordered features
without classification information?
6. Applying a “weak” unitary transformation to
the space spanned by the 20 samples...
[Figure: two panels of the 100 features ✕ 20 samples data (Class 1 / Class 2), with the 10 ordered and 90 random features marked]
7. The same 2D embedding.
Thus we can select the 10 ordered features.
[Figure: 2D embedding with the 10 ordered features and the 90 random features marked]
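The selection outlined on the slides above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the random seed, the use of NumPy's SVD, and the rule "keep the 10 features with the largest absolute PC1 coordinate" are all choices made here for the example. The key point is that features, not samples, are embedded, so the 20 samples play the role of coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed, for reproducibility only

n_samples, n_ordered, n_random = 20, 10, 90

# Ordered features: 1 for the first 10 samples (class 1), 0 for the last 10 (class 2).
ordered = np.tile(np.r_[np.ones(10), np.zeros(10)], (n_ordered, 1))
# Random features: independent 0/1 values with no class structure.
random_part = rng.integers(0, 2, size=(n_random, n_samples)).astype(float)
X = np.vstack([ordered, random_part])  # shape: (100 features, 20 samples)

# Treat each feature as a point in the 20-dimensional sample space:
# centre every sample column, then obtain the principal axes by SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each feature on PC1 and PC2 (the 2D embedding of features).
embedding = Xc @ Vt[:2].T  # shape: (100, 2)
# Contribution of each sample to PC1; class labels are never used to compute it.
pc1_loading = Vt[0]        # shape: (20,)

# Assumption: "outlier" features are those with the largest |PC1 coordinate|.
scores = np.abs(embedding[:, 0])
selected = np.sort(np.argsort(scores)[-10:])

print("selected feature indices:", selected)
print("PC1 loading per sample:", np.round(pc1_loading, 2))
```

Indices 0 to 9 are the ordered features in this layout, so comparing `selected` with them shows how well the unsupervised choice recovers them, and the sign pattern of `pc1_loading` shows whether PC1 separates the two classes.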
8. PC1 “weakly” represents discrimination
between class 1 and class 2
[Figure: PC1 over the 20 samples, Class 1 and Class 2 marked]
9. Linear discriminant analysis
+ leave-one-out cross validation
using 10 ordered features ….
             True
class        1    2
Predict 1    8    2
Predict 2    2    8
Accuracy=Sensitivity=Specificity=80%
How about real examples?
10. 1. Real example 1: Disease-associated
aberrant promoter methylation
[Figure: methylation of a gene promoter]
Three autoimmune diseases: SLE, RA, DM
[MZ twins (healthy + sick) + 2 healthy controls] ✕ 5
= 20 samples → ✕ 3 diseases = 60 samples
vs ≈ 1000 potential methylation sites
11. Embedding of 〜1000 promoters within 20
RA samples into 2D with PCA (PC2 vs PC3)
[Figure: PC2 vs PC3 scatter plot; the outlier promoters are selected]
12. PC2: RA
[Figure: PC2 over the 20 samples, male and female]
◯: Sick Twin
△: Healthy Twin
+: Healthy Control 1
☓: Healthy Control 2
Twins: Healthy > Sick
Controls: No
The 4th set: No
→ The reason why unsupervised feature
selection is needed.
13. Scatter plots between healthy/RA twins.
Red dots = selected promoters
[Figure: five scatter plots, healthy twins vs RA twins]
P-values shown on the panels:
P < 2.2 ✕ 10^-16, P = 2.2 ✕ 10^-12, P = 3.7 ✕ 10^-12, P = 3.9 ✕ 10^-1, P < 2.2 ✕ 10^-16
Individual promoters are significantly aberrantly
methylated. Thus, feature selection is successful.
After repeating the same procedure for the two additional
diseases (SLE and DM)....
14. Among three autoimmune diseases,
selected promoters are mostly common.
No other methods can achieve such an excellent
coincidence between three autoimmune diseases.
15. Lessons to learn:
A predefined class definition (e.g., 'sick
twin' vs 'healthy twin + two healthy
controls') is not a good strategy for
extracting “important” features that can
exhibit much more complicated behavior
(e.g., upregulated in males while
downregulated in females).
16. Additional Remarks
Similar procedures were applied to
squamous cell carcinoma(*), and genes with
genotype-specific DNA methylation were
extracted. These genes were identified as
cancer-related genes using literature
searches, and in silico drug screening was
performed for these genes (BMC Syst. Biol.,
in press; to be presented at APBC2014).
(*) esophageal cancer
17. 2. Real example 2: Circulating biomarker
findings for liver diseases
Why “circulating biomarker”?
→ non-invasive, thus less stressful.
Circulating = blood, etc.
Target in this talk:
microRNAs in blood
→ microRNA is non-protein-coding
RNA that regulates other transcripts.
18. Data set: 14 diseases + healthy control
For example,
2D embeddings of 〜900 blood miRNAs using PCA
in 32 lung cancer + 70 healthy controls
[Figure: PC1 vs PC2 scatter plot; 10 outlier miRNAs]
However, PC1 no longer exhibits a clear
distinction between lung cancer and normal
control.... (not shown here)
19. Prediction
Control vs Lung Cancer
LDA with PCA, leave one out cross validation
(using 10 miRNAs, up to the 5th PC)
                      True
Predicted             control   lung cancer
control               56        8
lung cancer           14        24
Accuracy 0.784
Specificity 0.800
Sensitivity 0.750
Precision 0.632
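The four reported values can be checked directly from the confusion matrix. The sketch below assumes the reading in which the columns are the true classes and the rows the predicted classes, with lung cancer as the positive class; the function name `binary_metrics` is introduced here only for illustration. Under that reading the counts reproduce the numbers on the slide.

```python
def binary_metrics(tn, fn, fp, tp):
    """Standard metrics from the four cells of a 2x2 confusion matrix."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tp + tn) / total,     # all correct calls / all samples
        "specificity": tn / (tn + fp),     # controls correctly called control
        "sensitivity": tp / (tp + fn),     # lung cancers correctly called cancer
        "precision": tp / (tp + fp),       # cancer calls that are truly cancer
    }

# Assumed reading of the table (positive class = lung cancer):
#   true control,  predicted control      -> tn = 56
#   true cancer,   predicted control      -> fn = 8
#   true control,  predicted lung cancer  -> fp = 14
#   true cancer,   predicted lung cancer  -> tp = 24
metrics = binary_metrics(tn=56, fn=8, fp=14, tp=24)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The column totals under this reading are 70 controls and 32 lung cancer samples, which matches the data set described for this example.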
21. What is the advantage of PCA based
feature extraction? → stability
Cross validation test (10 folds) of stability of
feature extraction (100 trials):
14 diseases vs normal control ✕ 10 miRNAs
= 140 miRNAs selected.
Ideally 140 miRNAs are always selected over
100 trials.
As a result, 129 out of 140 miRNAs are
selected with 100% probability.
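A stability test of this kind can be sketched generically. The slides do not give the exact resampling protocol, so the code below assumes that each trial holds out one random tenth of the samples and reruns the selection on the rest. The selector `top_variance` is a deliberately simple placeholder, not the PCA-based method, and the toy data are invented for the illustration; any feature selection routine with the same call signature could be plugged in instead.

```python
import random
from statistics import pvariance

def stable_features(select, data, n_trials=100, n_folds=10, seed=0):
    """Return the features chosen in every one of n_trials subsampled runs.

    data:   list of feature rows, each a list of per-sample values.
    select: callable taking a data subset and returning selected feature indices.
    """
    rng = random.Random(seed)
    n_samples = len(data[0])
    held_out_size = max(1, n_samples // n_folds)
    always = None
    for _ in range(n_trials):
        # Assumption: each trial drops one random fold and keeps the rest.
        held_out = set(rng.sample(range(n_samples), held_out_size))
        kept = [j for j in range(n_samples) if j not in held_out]
        subset = [[row[j] for j in kept] for row in data]
        chosen = set(select(subset))
        always = chosen if always is None else always & chosen
    return sorted(always or [])

def top_variance(k):
    """Placeholder selector: the k features with the largest variance."""
    def select(subset):
        ranked = sorted(range(len(subset)),
                        key=lambda i: pvariance(subset[i]), reverse=True)
        return ranked[:k]
    return select

# Toy data: 3 strongly varying features followed by 7 weakly varying ones, 20 samples.
data = [[100 * (j % 2) for j in range(20)] for _ in range(3)]
data += [[(i + j) % 2 for j in range(20)] for i in range(7)]

print(stable_features(top_variance(3), data))  # → [0, 1, 2]
```

The length of the returned list is the analogue of the "129 out of 140" count: the number of features that survive every resampling.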
22. Comparison of stabilities with other feature
extraction methods
UFF(*) : 111 out of 140 miRNAs
t-test based : 40 out of 140 miRNAs
SAM : 30 out of 140 miRNAs
gsMMD : 5 and 1 out of 140 miRNAs
RFE : 1 out of 140 miRNAs
ensemble RFE : 0 out of 140 miRNAs
(*) the only other unsupervised FE method
23. Lessons to learn:
A predefined class definition (e.g., 'sick
twin' vs 'healthy twin + two healthy
controls') is not a good strategy for
extracting “stable” features. Relying too
heavily on classification information may
harm the stability of the selected features.
24. Additional remarks:
The sets of 10 miRNAs selected as biomarkers that
discriminate the 14 diseases from normal control
largely overlapped (every set of 10 miRNAs
was chosen from a common set of 12 miRNAs).
In addition, these 12 miRNAs
discriminate seven additional diseases from
healthy controls, even with different
measurement methodologies, samples, and studies
(submitted).
25. 3. Real example 3: Analysis of proteome
during bacterial incubation
Purpose:
Antibiotics are nothing but a disaster for bacteria.
They kill even non-toxic bacteria and thus cause
resistance to drugs. Drugs that target proteins
more specific to each bacterium would be much
better and more effective.
To achieve this, we first need to know how the
proteome changes in response to environmental changes.
26. Data set:
Two incubation conditions:
stable (normal) and shaking (oxidative stress)
Two fractions:
cellular and supernatant
Four time points:
From early to final through middle growth phase
Three biological replicates.
In total:
2 ✕ 2 ✕ 4 ✕ 3 = 48 samples are available
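The sample count follows from a full factorial design, which can be enumerated as below. This is just a bookkeeping sketch: the time point labels `t1` to `t4` are hypothetical, since the slides only say that the four time points run from the early to the final growth phase.

```python
from itertools import product

conditions = ["stable", "shaking"]        # normal vs oxidative stress
fractions = ["cellular", "supernatant"]
time_points = ["t1", "t2", "t3", "t4"]    # hypothetical labels, early to final
replicates = [1, 2, 3]                    # biological replicates

# Every combination of condition, fraction, time point and replicate is one sample.
samples = list(product(conditions, fractions, time_points, replicates))

print(len(samples))  # → 48
print(samples[0])    # → ('stable', 'cellular', 't1', 1)
```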
27. 2D embedding of 48 samples using PCA
[Figure: PC1 vs PC2 scatter plot; groups labeled cellular, early supernatant, late supernatant]
31. Lessons to learn:
Even if there is no criterion for what
kind of classification should be assumed,
unsupervised feature extraction can select
prominent features.
32. 4. Discussion
Real example 1:
Commonly methylated promoters between three
autoimmune diseases were found by
unsupervised feature extraction.
Real example 2:
Stable circulating biomarkers were selected for
14 diseases using unsupervised feature extraction.
Real example 3:
Successful extraction of prominent features with
unsupervised feature extraction.
33. Unsupervised feature extraction seems
to be the best method, however...
When does PCA based feature extraction work?
Is PCA based feature extraction the best?
Are there any better unsupervised feature
extraction methods?
How can we evaluate unsupervised feature
extraction?
Are there any variables to be maximized?
34. I believe that people here
should be experts on this topic.
Help me....