Heuristic PCA for Unsupervised Bioinformatics Feature Extraction

Heuristic PCA Based Feature Extraction
and
Its Application to Bioinformatics
Y-h. Taguchi, Dept. Phys., Chuo Univ.
Y. Murakami, Grad. Sch. Med., Osaka City Univ.

M. Iwadate, Dept. Biol. Sci., Chuo Univ.
H. Umeyama, Dept. Biol. Sci., Chuo Univ.
A. Okamoto, Dept. Sch. Health Sci., Aichi Univ. Edu.

0. Why PCA?
PCA = principal component analysis
Motivation:
Unsupervised Feature Selection
How PCA?

Toy data: 100 features ✕ 20 samples
(samples 1-10 = class 1, samples 11-20 = class 2)

10 ordered features (one row per feature):
11111111110000000000
11111111110000000000
...
11111111110000000000

90 random features (one row per feature):
01000000110110011111
00011110000101011101
...
01000011000110101111
How to select 10 ordered features,
without classification information?

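Embedding 100 features into 2D using PCA
[Figure: 2D PCA embedding of the features; labeled groups: 90 random features, 10 ordered features]

PC1 represents discrimination
between class 1 and class 2
[Figure: PC1 over the 20 samples, with class 1 and class 2 marked]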
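The toy example on these slides can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: the random seed, the centering convention (each sample is centered across features), and the selection rule (the ten features with the largest absolute PC1 score) are all choices made here for illustration.

```python
import numpy as np

def make_toy_data(n_ordered=10, n_random=90, n_per_class=10, seed=0):
    """Features x samples matrix: identical 'ordered' rows plus random 0/1 rows."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([1, 0], n_per_class)          # class 1 first, then class 2
    ordered = np.tile(labels, (n_ordered, 1))        # rows like 1111111111 0000000000
    noise = rng.integers(0, 2, size=(n_random, labels.size))
    return np.vstack([ordered, noise]).astype(float), labels

def embed_features(X, n_components=2):
    """PCA in which features are the embedded points and samples are the axes."""
    Xc = X - X.mean(axis=0, keepdims=True)           # assumption: center each sample
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # one point per feature
    loadings = Vt[:n_components].T                   # one value per sample
    return scores, loadings

def select_outliers(scores, component=0, n_select=10):
    """Assumed rule: keep the features farthest from the origin along one PC."""
    order = np.argsort(-np.abs(scores[:, component]))
    return np.sort(order[:n_select])

X, labels = make_toy_data()
scores, loadings = embed_features(X)
selected = select_outliers(scores)                   # no class labels were used
```

When the class signal dominates PC1, `selected` is expected to contain the indices of the ordered features (0 to 9 here), but this is not guaranteed for every seed. `loadings[:, 0]` is the per-sample quantity that the slide describes as discriminating class 1 from class 2.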

Applying “weak” unitary transformation to
the space spanned by 20 samples...
[Figure: the 100 features ✕ 20 samples data matrix before and after the transformation; rows labeled 10 ordered features and 90 random features, columns labeled class 1 and class 2]

The same 2D embedding.
Thus we can select 10 features.
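Why the feature embedding stays the same can be checked numerically. The sketch below assumes that the "weak" unitary transformation is a real orthogonal matrix close to the identity, built here with a Cayley transform; the slides do not say how it was constructed. `strength` and both seeds are hypothetical parameters, and each sample is centered across features.

```python
import numpy as np

# Toy data as on the earlier slides: 10 ordered + 90 random features, 20 samples.
rng = np.random.default_rng(0)
labels = np.repeat([1, 0], 10)
X = np.vstack([np.tile(labels, (10, 1)),
               rng.integers(0, 2, size=(90, 20))]).astype(float)

def weak_rotation(n, strength=0.3, seed=1):
    """Orthogonal n x n matrix near the identity (assumed form of the transformation)."""
    g = np.random.default_rng(seed).standard_normal((n, n))
    A = strength * (g - g.T) / (2.0 * np.sqrt(n))    # skew-symmetric generator
    I = np.eye(n)
    return np.linalg.solve(I - A, I + A)             # Cayley transform (I - A)^-1 (I + A)

def pca(X, k=2):
    """Return feature scores (rows = features) and sample loadings."""
    Xc = X - X.mean(axis=0, keepdims=True)           # center each sample across features
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k].T

Q = weak_rotation(X.shape[1])
scores_before, loadings_before = pca(X)
scores_after, loadings_after = pca(X @ Q)            # mix the 20 samples with Q
```

Because the centered matrix factors as U S Vᵀ and the transformed one as U S (Vᵀ Q), the feature scores agree up to the sign of each component, while the sample loadings are rotated. That matches the slides: the same ten features can be selected even though PC1 separates the classes only weakly afterwards.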

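[Figure: 2D PCA embedding of the features; labeled groups: 10 ordered features, 90 random features]

PC1 “weakly” represents discrimination
between class 1 and class 2
[Figure: PC1 over the 20 samples, with class 1 and class 2 marked]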

Linear discriminant analysis
+ leave one out cross validation
using 10 ordered features ….

               True
               class 1   class 2
Predicted 1       8         2
Predicted 2       2         8
Accuracy = Sensitivity = Specificity = 80%
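Linear discriminant analysis with leave-one-out cross validation can be sketched in plain NumPy as below. This is an illustration, not the authors' pipeline: the two-class Fisher discriminant with equal priors, the small `ridge` term, and the synthetic, well-separated data are all assumptions made here, so the resulting table does not reproduce the 8/2/2/8 result above.

```python
import numpy as np

def lda_fit(X, y, ridge=1e-3):
    """Two-class LDA on samples x features data; returns weights and offset."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    S = S / (len(X) - 2) + ridge * np.eye(X.shape[1])  # pooled covariance + ridge
    w = np.linalg.solve(S, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)                         # equal class priors assumed
    return w, b

def loocv_confusion(X, y):
    """Rows = predicted class, columns = true class (classes coded 0 and 1)."""
    conf = np.zeros((2, 2), dtype=int)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                  # hold out sample i
        w, b = lda_fit(X[keep], y[keep])
        pred = int(X[i] @ w + b > 0)
        conf[pred, y[i]] += 1
    return conf

# Hypothetical data: 20 samples, 4 selected features, strongly separated classes.
rng = np.random.default_rng(0)
y = np.repeat([1, 0], 10)
X_sel = rng.standard_normal((20, 4)) + 3.0 * (2 * y[:, None] - 1)
conf = loocv_confusion(X_sel, y)
accuracy = np.trace(conf) / conf.sum()
```

Each held-out sample is classified by a model that never saw it, which is the point of leave-one-out validation. Note that in the slides the features themselves are selected without class labels.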

How about real examples?

1. Real example 1: Disease associated
aberrant promoter methylation
[Figure: schematic of a gene and its promoter, with methylation marked]
three autoimmune diseases
SLE
RA
DM
[ MZ twins (healthy+sick) + 2 healthy controls] ✕ 5
= 20 samples → ✕3 diseases = 60 samples
vs ≈ 1000 potential methylation sites

Embedding of ~1000 promoters within 20
RA samples into 2D with PCA (PC2 vs PC3)
[Figure: scatter plot of promoters, axes PC2 and PC3; the outlier promoters are selected]

PC2: RA
[Figure: PC2 over the 20 samples, grouped into male and female]
◯: Sick Twin
△: Healthy Twin
+: Healthy Control 1
☓: Healthy Control 2
Twins: Healthy > Sick
Controls: no such difference
The 4th set: no such difference
→ This is why unsupervised feature
selection is needed.

Scatter plots between healthy/RA twins.
Red dots = selected promoters
[Figure: scatter plots, axes: healthy twins and RA twins, one P-value per panel]
P < 2.2 ✕ 10^-16
P = 2.2 ✕ 10^-12
P = 3.7 ✕ 10^-12
P = 3.9 ✕ 10^-1
P < 2.2 ✕ 10^-16

The selected promoters are individually and significantly
aberrantly methylated. Thus, the feature selection is successful.
After repeating the same procedure for the two additional
diseases (SLE and DM)....

The selected promoters are mostly common
to the three autoimmune diseases.

No other method achieved such good
agreement among the three autoimmune diseases.

Lessons to learn:
Relying on a predefined class definition
(e.g., 'sick twin' vs 'healthy twin + two
healthy controls') is not a good strategy for
extracting “important” features, which can
exhibit much more complicated behavior
(e.g., upregulated in males but
downregulated in females).

Additional Remarks
Similar procedures were applied to
squamous cell carcinoma(*), and genes with
genotype-specific DNA methylation were
extracted. Literature searches identified these
genes as cancer-related, and in silico drug
screening was performed for them (BMC Syst. Biol.,
in press; to be presented at APBC2014).
(*) esophageal cancer

2. Real example 2: Circulating biomarker
findings for liver diseases
Why “circulating biomarker”?
→ non-invasive, thus less stressful.
Circulating = in blood, etc.
Target in this talk:
microRNAs in blood
→ microRNA is non-protein-coding
RNA that regulates other transcripts.

Data set: 14 diseases + healthy control
For example,
2D embedding of ~900 blood miRNAs using PCA
in 32 lung cancer + 70 healthy controls
[Figure: scatter plot of miRNAs, axes PC1 and PC2; 10 outlier miRNAs marked]

However, PC1 no longer exhibits a clear
distinction between lung cancer and
normal control (not shown here).

Prediction: Control vs Lung Cancer
LDA with PCA, leave one out cross validation
(using 10 miRNAs, up to the 5th PC)

                         True
                         control   lung cancer
Predicted control           56          8
Predicted lung cancer       14         24

Accuracy     0.784
Specificity  0.800
Sensitivity  0.750
Precision    0.632
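The four figures follow directly from the confusion matrix, taking lung cancer as the positive class. A short check, where the variable names are chosen here for illustration:

```python
# Counts read from the table: rows = predicted class, columns = true class.
tn, fn = 56, 8     # predicted control:     true control, true lung cancer
fp, tp = 14, 24    # predicted lung cancer: true control, true lung cancer

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 80 / 102
specificity = tn / (tn + fp)                 # 56 / 70 true controls
sensitivity = tp / (tp + fn)                 # 24 / 32 true lung cancers
precision = tp / (tp + fp)                   # 24 / 38 predicted lung cancers

print(round(accuracy, 3), round(specificity, 3),
      round(sensitivity, 3), round(precision, 3))
```

The column totals (70 controls, 32 lung cancers) match the sample counts given for this data set.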

What is the advantage of PCA based
feature extraction? → stability
Cross validation test (10 folds) of stability of
feature extraction (100 trials):
14 diseases vs normal control ✕ 10 miRNAs
= 140 miRNAs selected.
Ideally, all 140 miRNAs are selected in every
one of the 100 trials.
As a result, 129 out of 140 miRNAs are
selected with 100% probability.
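One possible reading of this stability test is sketched below. It is an assumption-laden illustration: in each trial one fold (10% of the samples) is dropped at random, features are re-selected from the remaining samples, and the fraction of trials in which each feature is selected is recorded. The selection rule (largest absolute PC1 score) and the synthetic data are hypothetical, not the miRNA data.

```python
import numpy as np

def select_by_pc(X, n_select=10, component=0):
    """Assumed rule: features with the largest absolute score on one PC."""
    Xc = X - X.mean(axis=0, keepdims=True)            # center each sample
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    score = np.abs(U[:, component] * s[component])
    return np.argsort(-score)[:n_select]

def selection_frequency(X, n_select=10, n_folds=10, n_trials=100, seed=0):
    """Fraction of trials in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_trials):
        perm = rng.permutation(n_samples)
        held_out = perm[: max(1, n_samples // n_folds)]   # drop one fold
        keep = np.setdiff1d(np.arange(n_samples), held_out)
        counts[select_by_pc(X[:, keep], n_select)] += 1
    return counts / n_trials

# Hypothetical data: 200 features x 40 samples, 10 features carry a class signal.
rng = np.random.default_rng(1)
signal = np.repeat([1.0, -1.0], 20)
X_demo = rng.standard_normal((200, 40))
X_demo[:10] += 3.0 * signal
freq = selection_frequency(X_demo)
n_always = int(np.sum(freq == 1.0))   # features selected in every trial
```

`n_always` plays the role of the "129 out of 140" count: the number of features selected with 100% probability.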

Comparison of stability with other feature
extraction (FE) methods
UFF(*)        : 111 out of 140 miRNAs
t-test based  : 40 out of 140 miRNAs
SAM           : 30 out of 140 miRNAs
gsMMD         : 5 and 1 out of 140 miRNAs
RFE           : 1 out of 140 miRNAs
ensemble RFE  : 0 out of 140 miRNAs
(*) the only other unsupervised FE method

Lessons to learn:
Relying on a predefined class definition
(e.g., 'sick twin' vs 'healthy twin + two
healthy controls') is not a good strategy for
extracting “stable” features. Relying too
heavily on classification information may
harm the stability of the selected features.

Additional remarks:
The sets of 10 miRNAs selected as biomarkers that
discriminate each of the 14 diseases from normal
control largely overlapped (every set of 10 miRNAs
was chosen from 12 common miRNAs).
In addition, these 12 miRNAs discriminate
seven additional diseases from healthy controls,
even with different measurement methodologies,
samples, and studies (submitted).

3. Real example 3: Analysis of proteome
during bacterial incubation
Purpose :
For bacteria, antibiotics are nothing but a disaster.
They also kill non-toxic bacteria and thus cause
resistance to drugs. Drugs that target proteins
more specific to each bacterium would be much
better and more effective.
To do this, we first need to know how the
proteome changes in response to
environmental changes.

Data set:
Two incubation conditions:
stable (normal) and shaking (oxidative stress)
Two fractions:
cellular and supernatant
Four time points:
From early to final through middle growth phase
Three biological replicates.
In total:
2 ✕2 ✕4 ✕ 3 = 48 samples are available
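The sample count can be checked by enumerating the design. In this small sketch the time-point labels `t1` to `t4` and the replicate numbers are placeholders, since the slides do not name them.

```python
from itertools import product

conditions = ["stable", "shaking"]          # normal vs oxidative stress
fractions = ["cellular", "supernatant"]
time_points = ["t1", "t2", "t3", "t4"]      # hypothetical labels, early to final
replicates = [1, 2, 3]                      # three biological replicates

# One tuple per sample: (condition, fraction, time point, replicate).
design = list(product(conditions, fractions, time_points, replicates))
print(len(design))
```

Each tuple corresponds to one column of the proteome data matrix that is then embedded with PCA.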

2D embedding of 48 samples using PCA
[Figure: scatter plot of samples, axes PC1 and PC2; labeled groups: cellular, early supernatant, late supernatant]

PCA embeddings of proteins
23 proteins selected
(ribosomal proteins are underlined in the original slide)
[Figure: scatter plot of proteins, axes PC1 and PC2]

SPy1489:hlpA
SPy2039:speB
Spy1073:rplL
SPy2005
SPy2018:emm1
Spy0059:rpmC
Spy0611:tufA
Spy0274:plr
Spy0062:rplX
SPy2043:mf
Spy0613:tpi
Spy2079:AhpC
SPy1831:rpsF
Spy2160:rpmG
SPy1373:ptsH
SPy0731:eno
Spy1371:gapN
Spy1881:pgk
SPy0711:speC
Spy0071:rpmD
SPy2070:groEL
Spy0019
SPy0712:mf2

using 23 proteins extracted via PCA

[Figure: scatter plot, axes PC1 and PC2]

Lessons to learn:
Even when there is no criterion for which
classification to assume,
unsupervised feature extraction can select
prominent features.

4. Discussion
Real example 1:
Promoters commonly methylated across three
autoimmune diseases were found by
unsupervised feature extraction.
Real example 2:
Stable circulating biomarkers were selected for
14 diseases using unsupervised feature extraction.
Real example 3:
Prominent features were successfully extracted with
unsupervised feature extraction.

Unsupervised feature extraction seems
to be the best method, however...
When does PCA based feature extraction work?
Is PCA based feature extraction the best?
Are there any better unsupervised feature
extraction methods?
How can we evaluate unsupervised feature
extraction?
Is there any quantity to be maximized?

I believe that people here
should be experts on this topic.
Help me....
