Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities
1. Making protein function and subcellular localization predictions: challenges and opportunities
Fiona Brinkman
Department of Molecular Biology and Biochemistry (Associate, Faculty of Health Sciences and School of Computing Sciences)
Simon Fraser University, Greater Vancouver, BC, Canada
April 2014
2. What we MUST do to move AFP forward…
• Improving sequence similarity/orthology-based predictions: a keystone of many predictors
• Improving pathway/network-based analysis to identify protein functions
• Future challenges and opportunities (using protein localization as an example of what is to come)
3. One-to-one orthologs are, in particular, more functionally similar to each other, vs other orthologs and paralogs, when >80% sequence identity
Functional similarity measured by GO annotation similarity (13 species)
Altenhoff AM et al. PLoS Comput Biol. 2012
6. If true ortholog is missing… (gene loss, or incomplete genome)
RBBH (Reciprocal Best BLAST Hit): FAIL
[Figure: species tree and gene trees for Ingroup1, Ingroup2 and Outgroup. Panels: "Usual divergence" (gene tree: Ingroup1, Ingroup2, Outgroup) and "One of the orthologous genes diverges faster…" (gene tree: Ingroup1, Outgroup, Ingroup2). Labels: Paralog, RBBH.]
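To make the failure mode concrete, here is a minimal sketch of reciprocal best hit (RBBH) pairing. The input format (a dict of bit scores keyed by query/subject pair), the function names and the gene names are all assumptions for illustration; real pipelines parse BLAST output and apply additional filters.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is a dict {(query, subject): bit_score} (hypothetical format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy data: the true ortholog of geneX is missing from genome B,
# so the best remaining hit in both directions is the paralog geneY_par.
a_vs_b = {("geneX", "geneY_par"): 210.0, ("geneZ", "geneZ_b"): 480.0}
b_vs_a = {("geneY_par", "geneX"): 205.0, ("geneZ_b", "geneZ"): 475.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('geneX', 'geneY_par'), ('geneZ', 'geneZ_b')]
```

In this toy case RBBH reports the geneX/geneY_par pair as if it were an ortholog pair, because nothing in the reciprocal test can tell that the true ortholog is absent.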
7. Ortholuge
Uses phyletic ratios to differentiate Supporting Species Divergence (SSD) orthologs vs proteins more divergent than expected (non-SSD)
Ratio1 = distance{ingroup1-ingroup2} / distance{ingroup1-outgroup}
Ratio2 = distance{ingroup1-ingroup2} / distance{ingroup2-outgroup}
[Figure: trees of Ingroup1, Ingroup2 and Outgroup for each ratio; Ortholuge analysis comparing Burkholderia cepacia & B. cenocepacia (outgroup: B. pseudomallei), showing SSD and non-SSD.]
Whiteside et al 2013 PMID 23203876
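The two ratios above can be computed directly from pairwise distances. The sketch below is illustrative only: the fixed `cutoff` and the rule "either ratio above the cutoff means non-SSD" are assumptions made for the example, not the published Ortholuge procedure, and the distances are invented.

```python
def ortholuge_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Return (ratio1, ratio2) as defined on the slide."""
    ratio1 = d_in1_in2 / d_in1_out
    ratio2 = d_in1_in2 / d_in2_out
    return ratio1, ratio2


def classify(ratio1, ratio2, cutoff=1.0):
    """Hypothetical rule: non-SSD if either ratio exceeds the cutoff."""
    return "non-SSD" if max(ratio1, ratio2) > cutoff else "SSD"


# Invented distances: the ingroup genes are close to each other and far from
# the outgroup, as expected when the gene tree follows the species tree.
r1, r2 = ortholuge_ratios(d_in1_in2=0.10, d_in1_out=0.40, d_in2_out=0.42)
print(classify(r1, r2))

# Invented distances: the ingroup pair is more divergent than either
# ingroup-outgroup pair, which is the pattern a paralog would produce.
q1, q2 = ortholuge_ratios(d_in1_in2=0.60, d_in1_out=0.45, d_in2_out=0.42)
print(classify(q1, q2))
```

A low ratio means the two ingroup genes diverged less from each other than from the outgroup gene; ratios near or above 1 suggest the pair is more divergent than species divergence alone would explain.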
8. [Figure: proportion of predicted orthologs in 600 pairs of bacterial species, SSD ortholog vs non-SSD, for KEGG Orthology, Pfam Domains, Tigrfam Annotations and Subcellular Localizations (axis 0 to 1; * = p-value < 0.05).]
[Figure: proportion of SSD orthologs vs non-SSD with one or more homologs (based on BLAST hits) (axis 0 to 0.7; * = p-value < 0.05).]
Non-SSD "orthologs" more likely:
- Functionally dissimilar
- Have one or more homologs
9. A Database of Ortholuge Evaluations
OrtholugeDB (tinyurl.com/ortholugeDB)
• Provides pre-computed ortholog predictions for >1400 bacteria and archaea (update coming next month!), with further Ortholuge assessments
• Covers all genes in fully sequenced bacterial and archaeal genomes
• Facilitates visualization and evaluation of ortholog predictions
10. Similar issue with initial metagenomics sequence functional evaluation
1. Simulated reads from Pseudomonas aeruginosa PAO1
2. Created databases at different levels of clade exclusion
   • E.g. for species clade exclusion, removed all Pseudomonas aeruginosa genomes from the database
3. Used RAPSearch2 and MEGAN5 to assign functional categories to the simulated reads
4. Calculated proportion of reads assigned to each functional category relative to how many reads expected
   • E.g.:
Category | Expected # assigned | Actual # assigned | Relative Proportion
Membrane Transport | 567 | 583 | 1.02822
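Step 4 above is a simple ratio per category. A minimal sketch, assuming read counts are held in plain dicts keyed by category name (the function name and data layout are hypothetical), reproduces the Membrane Transport row:

```python
def relative_proportions(expected, assigned):
    """Ratio of actual to expected read counts for each functional category.

    Categories with an expected count of zero are skipped to avoid dividing
    by zero; categories with no assigned reads get a ratio of 0.
    """
    return {
        category: assigned.get(category, 0) / n_expected
        for category, n_expected in expected.items()
        if n_expected > 0
    }


expected = {"Membrane Transport": 567}
assigned = {"Membrane Transport": 583}

ratios = relative_proportions(expected, assigned)
print(round(ratios["Membrane Transport"], 5))  # 1.02822
```

A ratio near 1 means the category is predicted about as often as expected; ratios notably above 1 indicate overprediction.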
11. Most functional categories are predicted well but some are overpredicted (ratio notably >1)
[Figure: ratio of assigned relative to expected (axis 0 to 2.5), by level of clade exclusion: None, Species, Family, Class.]
E.g. Endocrine system: 3 problematic orthology groups, all with high numbers of proteins (one has 3538 when the median is 54!)
12. The relative proportions of functional categories stay relatively consistent as clade exclusion level increases
[Figure: proportion of reads assigned (0% to 100%) by clade exclusion level (None, Species, Family, Class). Legend: Xenobiotics Biodegradation and Metabolism; Transcription; Signal Transduction; Replication and Repair; Infectious Diseases; Nucleotide Metabolism; Neurodegenerative Diseases; Metabolism of Other Amino Acids; Metabolism of Cofactors and Vitamins; Membrane Transport; …]
13. Improving pathway-based analysis
Issue: Biomolecular pathway classifications can bias analyses of pathways found to be upregulated or downregulated by transcriptome (or other omics-level) analysis
What you identify depends on how everything is classified…
Need better "signatures" of pathways…
14. Dealing with PART of the issue…
[Figure: distribution of the number of associated pathways for human genes in KEGG (categories: 1, 2, 3, 4, 5, 6, 7-45).]
Membership of a gene in multiple pathways is the norm, not the exception…
Foroushani et al, 2014 PMCID: PMC3883547
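The distribution described on this slide can be tallied from any pathway-to-genes mapping. The sketch below uses a hypothetical three-pathway toy mapping, not KEGG data, and the function name is an assumption.

```python
from collections import Counter


def pathway_membership_distribution(pathways):
    """Count how many genes belong to exactly 1, 2, 3, ... pathways.

    `pathways` maps a pathway name to the set of its member genes.
    """
    pathways_per_gene = Counter(
        gene for genes in pathways.values() for gene in genes
    )
    return Counter(pathways_per_gene.values())


# Hypothetical toy mapping: gene "b" is a generalist found in all three pathways.
toy_pathways = {"P1": {"a", "b"}, "P2": {"b", "c"}, "P3": {"b"}}

distribution = pathway_membership_distribution(toy_pathways)
print(dict(distribution))
```

Here two genes belong to a single pathway and one gene belongs to three, so the generalist gene would contribute evidence to every pathway in a naive enrichment analysis.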
15. Not all genes are equal…
[Figure legend: Maroon: pathway member; White: no membership]
All genes are not equivalent signatures of a given pathway
Foroushani et al, 2014 PMCID: PMC3883547
16. Individual Gene ORA
Antigen processing and presentation
Graft-versus-host disease
Natural killer cell mediated cytotoxicity
Viral myocarditis
Allograft rejection
Cell adhesion molecules (CAMs)
Chemokine signaling pathway
Type I diabetes mellitus
Toll-like receptor signaling pathway
Cytokine-cytokine receptor interaction
Example: Treated vs Untreated Mouse Severe Inflammation, Gene Expression Dataset
Standard Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) treat all genes in a given pathway as equal indicators that that pathway is significant.
→ Emphasizes generalist genes/pathways
Foroushani et al, 2014 PMCID: PMC3883547
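For reference, standard ORA is commonly implemented as a one-sided hypergeometric test in which every pathway gene counts equally. The sketch below shows that textbook form under stated assumptions; the function name and the toy numbers are hypothetical and are not taken from the dataset on this slide.

```python
from math import comb


def ora_p_value(universe, pathway_size, list_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, pathway_size, list_size).

    universe: number of genes that could have been selected
    pathway_size: number of universe genes annotated to the pathway
    list_size: number of genes in the list of interest
    overlap: number of list genes that are in the pathway
    """
    total = comb(universe, list_size)
    upper = min(pathway_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 5 of 10 listed genes fall in a 10-gene pathway, out of 100 genes.
p = ora_p_value(universe=100, pathway_size=10, list_size=10, overlap=5)
print(p < 0.05)  # True
```

Because each overlapping gene adds the same evidence, a gene that belongs to many pathways raises the apparent significance of all of them, which is the bias the slide describes.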
17. Pathway Signatures using SIGORA: Identifying genes/gene pairs uniquely associated with a single pathway
SIGORA identifies statistically significant enrichment of
Pathway Signatures in a gene list of interest.
Foroushani et al, 2014 PMCID: PMC3883547
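The idea of a pathway signature can be illustrated with a small sketch that finds genes, and pairs of genes, occurring together in exactly one pathway. This is a simplified illustration under assumed inputs, not the SIGORA implementation; the function name and the toy pathways are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pathway_signatures(pathways):
    """Return (unique_genes, unique_pairs), each mapping a signature to its pathway.

    A gene is a signature if it belongs to exactly one pathway. A gene pair is
    a signature if both genes co-occur in exactly one pathway and neither gene
    is already a single-gene signature.
    """
    gene_owners = defaultdict(set)
    pair_owners = defaultdict(set)
    for name, genes in pathways.items():
        for gene in genes:
            gene_owners[gene].add(name)
        for pair in combinations(sorted(genes), 2):
            pair_owners[pair].add(name)

    unique_genes = {
        gene: next(iter(owners))
        for gene, owners in gene_owners.items()
        if len(owners) == 1
    }
    unique_pairs = {
        pair: next(iter(owners))
        for pair, owners in pair_owners.items()
        if len(owners) == 1
        and pair[0] not in unique_genes
        and pair[1] not in unique_genes
    }
    return unique_genes, unique_pairs


# Hypothetical toy pathways: b, c and d are each shared between two pathways.
toy = {"P1": {"a", "b", "c"}, "P2": {"b", "d"}, "P3": {"c", "d"}}
unique_genes, unique_pairs = pathway_signatures(toy)
print(unique_genes)
print(unique_pairs)
```

In the toy data, genes b and c are individually ambiguous, yet seeing both together points only to P1, which is why pairs can act as signatures when single genes cannot.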
18. Example: Treated vs Untreated Mouse Severe Inflammation –
Gene Expression Dataset
SIGORA avoids many biologically less plausible results seen by other methods that over-emphasize generalist genes/pathways.
For example, 6/8 up-regulated genes in “Type I diabetes mellitus”
pathway are also in the "Antigen processing and presentation" pathway.
Individual Gene ORA | SIGORA
Antigen processing and presentation | Antigen processing and presentation
Graft-versus-host disease | Natural killer cell mediated cytotoxicity
Natural killer cell mediated cytotoxicity | Complement and coagulation cascades
Viral myocarditis | Toll-like receptor signaling pathway
Allograft rejection | Cytokine-cytokine receptor interaction
Cell adhesion molecules (CAMs) | Leukocyte transendothelial migration
Chemokine signaling pathway | Cell adhesion molecules (CAMs)
Type I diabetes mellitus | Cytosolic DNA-sensing pathway
Toll-like receptor signaling pathway | Chemokine signaling pathway
Cytokine-cytokine receptor interaction |
19. Future challenges and opportunities (using bacterial protein localization as an example of what is to come)
(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)
20. Bacterial protein subcellular localization prediction
• Aids genome annotation and prediction of protein function
• Used to identify cell surface/secreted targets for drugs and diagnostics, as well as potential vaccine components
• Many pathogen-associated virulence factors predicted as secreted
(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)
21. PSORTb: bacterial protein subcellular localization (SCL) prediction software
Signal peptides: Non-cytoplasmic
Amino acid composition/patterns: All localizations
- Support Vector Machines trained with amino acid compositions or frequent subsequences
Transmembrane helices: Cytoplasmic membrane
- HMMTOP
PROSITE motifs with 100% precision: All localizations
Outer membrane motifs: Outer membrane
- Identified by association-rule mining
Homology to proteins of experimentally known localization: All localizations
- "SCL-BLAST" against proteins of known localization
- E=10e-10 and length restriction for precision
Integration with a Bayesian Network
Yu et al (2010) Bioinformatics 26:1608
22. PSORTb: version 3
Main localizations predicted
Bacteria and Archaea predictions
Sub-category localization predictions:
• Type III secretion apparatus
• Pili/fimbria
• Host-associated SCL
• Flagellum
• Spore
• Gas vesicle
24. Challenge: Organismal diversity
Classic Gram positive bacteria, monoderms: Thick peptidoglycan, no outer membrane
Classic Gram negative bacteria, diderms: Thin peptidoglycan + outer membrane
…but can have Gram negatives with no outer membrane (e.g. Mycoplasma) or a different outer membrane (Synergistetes, Sphingomonas), or Gram positive (thick peptidoglycan) with a different outer membrane (Deinococcus: 6 layers in cell envelope!), or "acid fast" with asymmetric lipid-containing thick cell wall (Mycobacteria)
Plus bacterial organelles and other substructures (e.g. magnetosome of Magnetospirillum)...
Solution:
- For whole genome (deduced-proteome) analysis, detect key protein markers of a particular cell type (e.g. Omp85 essential for classic Gram negative membrane)
- For single protein analysis, learn from above analysis, plus literature curation, the most likely cell type for a given phylum
…then make predictions assuming that cell "type"
Reproduced under Fair Use
25. Challenge: Temporal, contextual diversity
Proteins can be associated with multiple subcellular localizations, e.g. cell division proteins, autotransporters, "protein A dependent on protein B"
Solution: Note all possible localizations, since temporal, contextual predictions are non-trivial and there is not enough knowledge for most
Kjærgaard K et al. J. Bacteriol. 2000;182:4789-4796
26. Challenge: Metagenomics
High demand for PSORTb to be able to analyze metagenomic sequences… under development
Need taxonomy data to aid predictions (then enable appropriate cell type analysis)
27. Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.
What makes a great predictor?
28. Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.
What makes a great predictor? (besides it being right) ☺
29. Bioinformatics Predictor's Code of Conduct
- Never force predictions: always have a prediction option/category of "unknown"
Inspired by the classic "Data Provider's Code of Conduct" in Stein (2002) Nature 417, 119-120
30. Example of forced predictions: PSORT I prediction method
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991) Overall Accuracy = 69%
What's wrong here?
31. Example of forced predictions: PSORT I prediction method
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991) Overall Accuracy = 69%
No secreted/extracellular localization!
32. Bioinformatics Predictor's Code of Conduct
Inspired by the classic "Data Provider's Code of Conduct" in Stein (2002) Nature 417, 119-120
- Never force predictions: always have "unknown" option/category
- Ensure open source: enable viewing of prediction method details
- Predictor should easily be trainable with different datasets (if applicable; so others can robustly evaluate accuracy)
- Have ability to run locally or over web (with an API is preferred)
- Provide access to old versions (at minimum when transitioning to new version)
- Encourage continuing curation from the literature/lab experiments! Incorporate some curation efforts into predictor funding applications
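The first rule in the code of conduct, never forcing a prediction, can be sketched as a score threshold below which the predictor answers "unknown". The function name, the score format and the `min_score` default are assumptions for illustration; they do not describe how PSORTb or any other tool sets its cutoffs.

```python
def predict_with_unknown(scores, min_score=0.75):
    """Return the top-scoring label, or "unknown" if no label is convincing.

    `scores` maps each candidate localization to a score in [0, 1].
    `min_score` is a hypothetical confidence cutoff.
    """
    if not scores:
        return "unknown"
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= min_score else "unknown"


confident = {"cytoplasmic": 0.05, "outer membrane": 0.92, "extracellular": 0.03}
ambiguous = {"cytoplasmic": 0.40, "outer membrane": 0.35, "extracellular": 0.25}

print(predict_with_unknown(confident))  # outer membrane
print(predict_with_unknown(ambiguous))  # unknown
```

A forced predictor would report "cytoplasmic" for the second protein even though the evidence barely separates the three options.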
33. Bioinformatics Predictor's Code of Conduct: evaluation
- Evaluate precision and recall (and accuracy measure combos thereof) with x-fold cross validation and/or new datasets (like CAFA!)
- ID errors, biases and provide guidance to users re issues to watch for
  - bias in training and/or testing datasets ("homology reduction", "clade exclusion" may help)
  - errors in "gold standard" lab-based measure
  - contextual/temporal changes in proteins, impacting prediction (e.g. function changes when another protein/compound present)
What we MUST do: Guide users to not just blindly use a predictor and its default output.
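As a minimal sketch of the evaluation step, precision and recall for one localization class, plus a simple k-fold index split, can be written with the standard library alone. The function names, the contiguous (unshuffled) fold layout and the toy labels are assumptions; a real evaluation would also apply homology reduction or clade exclusion when forming the folds.

```python
def precision_recall(truth, predicted, positive):
    """Precision and recall for one class, treating all other labels as negative."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy labels: one secreted protein is left as "unknown" rather than forced.
truth = ["secreted", "secreted", "cytoplasmic", "cytoplasmic"]
predicted = ["secreted", "unknown", "secreted", "cytoplasmic"]

print(precision_recall(truth, predicted, positive="secreted"))  # (0.5, 0.5)
print(list(k_fold_splits(5, 2)))
```

Reporting both numbers matters because an "unknown" answer lowers recall without adding a false positive, so a predictor that declines to force predictions trades recall for precision.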
34. Bioinformatics Predictor's Code of Conduct
What we MUST do: Guide users to not just blindly use a predictor and its default output.
Curators, experimentalists, and automated function predictor developers must coordinate efforts more
• Experimentalists working on what they think best…
• Curators curating what they prioritize…
• Function predictors optimizing prediction using existing data…
Function predictors/bioinformaticists need to get in the driver's seat more for research
35. Brinkman Lab
Kayaking Trip, Summer 2013 (Next up, Archery Tag!)
Amir Foroushani
Matthew Laird
David Lynn
Raymond Lo
Mike Peabody
Thea Van Rossum
Matthew Whiteside
Nancy Yu