In this talk we demonstrate an ECGA and LCS pipeline for reducing protein alphabets from the standard 20 symbols to five or fewer without significant loss of information. The pipeline tailors the reduction to each problem, resulting in different optimal minimal alphabets.
1. Extended Compact Genetic Algorithms and Learning Classifier Systems for Dimensionality Reduction: a Protein Alphabet Reduction Study Case
Jaume Bacardit & Natalio Krasnogor
ASAP - Interdisciplinary Optimisation Laboratory
School of Computer Science
Centre for Integrative Systems Biology
School of Biology
Centre for Healthcare Associated Infections
Institute of Infection, Immunity & Inflammation
University of Nottingham
Ben-Gurion University of the Negev - June 23rd to July 5th 2009 - Distinguished Scientist Visitor Program - Beer Sheva, Israel 1 /73
Tuesday, 30 June 2009
2. Acknowledgements
(in no particular order)
Contributors to the talks I will give at BGU:
Peter Siepmann, Pawel Widera, James Smaldon, Azhar Ali Shah,
Jack Chaplin, Enrico Glaab, German Terrazas, Hongqing Cao,
Jamie Twycross, Jonathan Blake, Francisco Romero-Campero,
Maria Franco, Adam Sweetman, Linda Fiaschi
From the School of Physics and Astronomy, School of Chemistry,
School of Pharmacy, School of Biosciences, School of Mathematics,
School of Computer Science and the Centre for Biomolecular
Sciences, all the above at UoN
Thanks also go to:
Ben-Gurion University of the Negev's Distinguished Scientists
Visitor Program
Professor Dr. Moshe Sipper
Funding from: BBSRC, EPSRC, EU, ESF, UoN
3. Outline
Introduction to Learning Classifier Systems
and Extended Compact GA
Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work
4. Based on Various Papers
J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith, and N. Krasnogor. Automated
alphabet reduction for protein datasets. BMC Bioinformatics, 10(6), 2009.
J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated
alphabet reduction method with evolutionary algorithms for protein structure prediction.
In Proceedings of the 2007 Genetic and Evolutionary Computation Conference,
pages 346-353. ACM Press, 2007. This paper won the Bronze Medal in the 2007
"Humies" Awards for Human-Competitive Results Produced by Genetic and
Evolutionary Computation.
J. Bacardit and N. Krasnogor. Performance and efficiency of memetic Pittsburgh
learning classifier systems. Evolutionary Computation, 17(3), 2009 (to appear).
J. Bacardit, E.K. Burke, and N. Krasnogor. Improving the scalability of rule-based
evolutionary learning. Memetic Computing, 1(1), 2009 (to appear).
J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in
bioinformatics. Newsletter of ACM Special Interest Group on Genetic and Evolutionary
Computation: SIGEvolution, 3(1):2-10, 2008.
All papers available from: www.cs.nott.ac.uk/~nxk/publications.html
5. Learning Classifier Systems (LCS) are one of
the major families of techniques that apply
evolutionary computation to machine learning
tasks
Machine learning: How to construct
programs that automatically improve with
experience [Mitchell, 1997]
Classification task: Learning how to label
correctly new instances from a domain based
on a set of previously labeled instances
LCS are almost as old as GAs themselves;
Holland made one of the first proposals
Two of the first LCS proposals are [Holland &
Reitman, 78] and [Smith, 80]
6. Traditionally there have been two different
paradigms of LCS
The Pittsburgh approach [Smith, 80]
The Michigan approach [Holland & Reitman,
78]
More recently: The Iterative Rule Learning
approach [Venturini, 93]
Knowledge representations
All the initial approaches were rule-based
In recent years several knowledge
representations have been used in the LCS
field: decision trees, synthetic prototypes, etc.
7. Classification task
Classification task: Learning how to label correctly new
instances from a domain based on a set of previously labeled
instances
[Diagram: a Training Set feeds a Learning Algorithm, which produces a model
used by an Inference Engine to assign a Class to each New Instance]
8. Classification task
If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then class = 1
[Figure: the rule plotted on the X-Y unit square]
9. Paradigms of LCS
The Pittsburgh approach
Each individual is a complete solution to
the classification problem
Traditionally this means that each
individual is a variable-length set of rules
GABIL [De Jong & Spears, 93] is a well-
known representative of this approach
Fitness function is based on the rule set
accuracy on the training set (usually also
on complexity)
10. Paradigms of LCS
The Pittsburgh approach
Crossover operator: [Figure: parents' rule sets recombined into offspring]
Mutation operator: bit flipping
Individuals are interpreted as a decision list: an ordered rule set
Instance 1 matches rules 2, 3 and 7 → Rule 2 will be used
Instance 2 matches rules 1 and 8 → Rule 1 will be used
Instance 3 matches rule 8 → Rule 8 will be used
Instance 4 matches no rules → Instance 4 will not be classified
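The decision-list semantics above can be sketched in a few lines of Python (the rule encoding is illustrative, not GABIL's actual data structures):

```python
# Minimal sketch of decision-list inference: rules are tried top to bottom
# and the first matching rule classifies the instance.

def matches(rule, instance):
    """A rule's conditions map each attribute to its set of allowed values."""
    return all(instance[attr] in allowed
               for attr, allowed in rule["conditions"].items())

def classify(decision_list, instance, default=None):
    for rule in decision_list:          # ordered rule set
        if matches(rule, instance):
            return rule["cls"]
    return default                      # no rule matched

rules = [
    {"conditions": {"A1": {0}}, "cls": "yes"},
    {"conditions": {"A2": {1, 2}}, "cls": "no"},
]
print(classify(rules, {"A1": 1, "A2": 2}))  # second rule fires -> "no"
print(classify(rules, {"A1": 1, "A2": 0}))  # nothing fires -> None
```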
11. Paradigms of LCS
The Michigan approach
Each individual is a single rule
The whole population cooperates to solve
the classification problem
A reinforcement system is used to identify
the good rules
A GA is used to explore the search space
for more rules
XCS [Wilson, 95] is the most well-known
Michigan LCS
12. Paradigms of LCS
Working cycle
13. Paradigms of LCS
The Iterative Rule Learning approach
Each individual is a single rule
Individuals compete as in a standard GA
A single GA run generates one rule
The GA is run iteratively to learn all the rules
that solve the problem
Instances already covered by previous
rules are removed from the training set
of the next iteration
14. Paradigms of LCS
The Iterative Rule Learning approach
HIDER System [Aguilar, Riquelme & Toro, 03]
1. Input: Examples
2. RuleSet = Ø
3. While |Examples| > 0
1. Rule = Run GA with Examples
2. RuleSet = RuleSet U Rule
3. Examples = Examples \ Covered(Rule)
4. EndWhile
5. Output: RuleSet
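The HIDER pseudocode above translates almost directly into Python. The `run_ga` argument stands in for a full GA run that returns one rule; `toy_ga` is a hypothetical stand-in for illustration:

```python
# Sketch of the Iterative Rule Learning loop: one GA run per rule, removing
# covered examples before the next iteration, until no examples remain.

def iterative_rule_learning(examples, run_ga):
    rule_set = []
    while examples:
        rule = run_ga(examples)                          # one GA run -> one rule
        rule_set.append(rule)
        examples = [e for e in examples if not rule(e)]  # drop covered examples
    return rule_set

# Toy "GA": returns a rule covering all examples that share the first
# example's label. Examples are (value, label) pairs.
def toy_ga(examples):
    label = examples[0][1]
    return lambda e, lbl=label: e[1] == lbl

data = [(1, "a"), (2, "a"), (3, "b")]
print(len(iterative_rule_learning(data, toy_ga)))  # 2 rules: one per class
```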
15. Bioinformatics-oriented Hierarchical
Evolutionary Learning (BioHEL)
BioHEL [Bacardit et al., 07] is a recent learning
system that applies the Iterative Rule Learning
(IRL) approach to generate sets of rules
IRL was first used in EC by the SIA system
[Venturini, 93]
BioHEL is strongly inspired by GAssist
[Bacardit, 04], a Pittsburgh approach Learning
Classifier System
16. BioHEL learning paradigm
IRL has been used for many years in the ML
community under the name separate-and-conquer
17. BioHEL’s objective function
An objective function based on the Minimum-
Description-Length (MDL) (Rissanen,1978) principle
that tries to promote rules with
High accuracy: not making mistakes
High coverage: covering as many examples as possible
without sacrificing accuracy. Recall (TP/(TP+FN)) will be
used to define coverage
Low complexity: rules as simple and general as possible
The objective function is a linear combination of the three
objectives above
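A minimal sketch of such a linear combination, assuming illustrative weights rather than BioHEL's actual MDL-derived coefficients:

```python
# Hedged sketch of a BioHEL-style rule fitness: a linear combination of
# accuracy, coverage (recall, TP/(TP+FN) as on the slide) and a complexity
# penalty. The weights w_acc/w_cov/w_cmp are illustrative, not BioHEL's.

def rule_fitness(tp, fp, fn, n_conditions, w_acc=1.0, w_cov=1.0, w_cmp=0.1):
    matched = tp + fp
    accuracy = tp / matched if matched else 0.0    # correct among matched
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # coverage term
    complexity = n_conditions                      # fewer conditions = simpler
    return w_acc * accuracy + w_cov * recall - w_cmp * complexity

# A rule matching 90 examples, 80 of them correctly, missing 20 positives,
# and using 3 attribute conditions:
print(rule_fitness(80, 10, 20, 3))
```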
18. BioHEL’s objective function
Intuitively, we would like to have accurate rules
covering as many examples as possible.
However, in complex and inconsistent domains it is
rare to obtain such rules
In these cases, the easier path for the evolutionary search is to
maximize accuracy at the expense of coverage
Therefore, we need to enforce that the evolved rules
cover enough examples
19. Methods: BioHEL’s objective function
Three parameters define the shape of the function
The choice of the coverage break is crucial for the proper performance of the
system
Also, the coverage term penalizes rules that do not cover a minimum percentage of
examples or that cover too many
20. BioHEL’s other characteristics
Attribute list rule representation
Automatically identifying the relevant attributes for a given rule and
discarding all the other ones
The ILAS windowing scheme
Efficiency enhancement method, not all training points are used for each
fitness computation
An explicit default rule mechanism
Generating more compact rule sets
Iterative process terminates when it is impossible to evolve a rule where
the associated class is the majority class among the matched examples
At this point, all remaining training instances are assigned to the default
class
Ensembles for consensus prediction
Easy way of boosting robustness
21. Knowledge representations
Representation of XCS for binary problems:
ternary representation
Ternary alphabet {0,1,#}
If A1=0 and A2=1 and A3 is irrelevant class 0
01#|0
Representation of XCS for real-valued attributes:
real-valued interval.
XCSR [Wilson, 99]
The interval is encoded with two variables, center and spread:
[center-spread, center+spread]
UBR [Stone & Bull, 03]
The two bounds of the interval are encoded directly with two real-
valued variables. The variable with the lower value is the lower
bound; the variable with the higher value is the upper bound
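The two interval encodings can be contrasted in a short sketch (function names are illustrative, not from the XCSR/UBR papers):

```python
# Sketch of the two real-valued interval encodings described above.

def match_center_spread(center, spread, x):
    """XCSR-style: the interval is [center - spread, center + spread]."""
    return center - spread <= x <= center + spread

def match_unordered_bounds(a, b, x):
    """UBR-style: the smaller gene is the lower bound, the larger the upper."""
    lo, hi = min(a, b), max(a, b)
    return lo <= x <= hi

print(match_center_spread(0.5, 0.2, 0.65))    # 0.65 lies in [0.3, 0.7]
print(match_unordered_bounds(0.9, 0.1, 0.5))  # bounds reorder to [0.1, 0.9]
```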
22. Knowledge representations
Representation of GABIL for nominal attributes
Predicate → Class
Predicate: Conjunctive Normal Form (CNF):
(A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm)
Ai: ith attribute
Vij: jth value of the ith attribute
The rules can be mapped into a binary string, e.g., for 3
attributes with {3,5,2} values each respectively:
(A1=V11 ∨ A1=V13) ∧ (A2=V22 ∨ A2=V24 ∨ A2=V25) ∧ (A3=V31)
→ 101|01011|10
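A sketch of how such a bitstring predicate is matched against an instance (a hypothetical helper, not GABIL's implementation):

```python
# Sketch of matching a GABIL-style CNF predicate encoded as a binary string.
# For attributes with {3, 5, 2} values, the slide's rule maps to 101|01011|10.

def gabil_match(bitstring, sizes, instance):
    """bitstring: concatenated per-attribute masks; instance: 0-based value indices."""
    offset = 0
    for size, value in zip(sizes, instance):
        mask = bitstring[offset:offset + size]
        if mask[value] != "1":   # this attribute value is not allowed
            return False
        offset += size
    return True

rule = "101" + "01011" + "10"
print(gabil_match(rule, [3, 5, 2], [0, 3, 0]))  # A1=V1, A2=V4, A3=V1 -> True
print(gabil_match(rule, [3, 5, 2], [1, 3, 0]))  # A1=V2 not in {V1,V3} -> False
```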
23. Knowledge representations
Pittsburgh representations for real-valued attributes:
Rule-based: Adaptive Discretization Intervals (ADI)
representation [Bacardit, 04]
Intervals in ADI are built using as possible
bounds the cut-points proposed by a
discretization algorithm
Search bias promotes maximally general
intervals
Several discretization algorithms are used at the
same time to reduce bias
24. Knowledge representations
Pittsburgh representations for real-valued attributes:
Decision trees [Llorà, 02]
Nodes in the trees can use orthogonal or oblique criteria
25. Knowledge representations
Pittsburgh representations for real-valued attributes
Synthetic prototypes [Llorà, 02]
Each individual is a set of synthetic instances
These instances are used as the core of a nearest-neighbor
classifier
26. Extended Compact Genetic
Algorithm (ECGA)
ECGA belongs to a class of Evolutionary
Algorithms called Estimation of Distribution
Algorithms (EDA)
no crossover or mutation!
instead a probabilistic model of the
structure of the problem is kept
individuals are sampled from this probability
distribution model
27. Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact
Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara
Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.
Key Idea Behind Compact GA (CGA)
28. Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact
Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara
Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.
Gene interactions must be accounted for
ECGA approximates complex distributions by
Marginal Distribution Models (i.e., gene
partitions)
It selects amongst alternative models by
means of the MDL:
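The MDL score itself did not survive the slide extraction; the following is a hedged reconstruction from standard descriptions of ECGA (the Harik, Lobo & Sastry chapter cited on this slide gives the exact form). For a population of N binary strings and a model that partitions the genes into groups I of size S_I:

```latex
% Reconstructed ECGA MDL score (check Harik, Lobo & Sastry for exact notation):
\underbrace{\log_2(N+1)\sum_I \bigl(2^{S_I}-1\bigr)}_{\text{model complexity}}
\;+\;
\underbrace{N\sum_I H(M_I)}_{\text{compressed population complexity}}
```

where H(M_I) is the entropy of the marginal distribution over gene group I; ECGA greedily merges groups as long as this combined score decreases.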
29. Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact
Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara
Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.
30. Outline
Introduction to Learning Classifier Systems
and Extended Compact GA
Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work
31. Protein Structure Prediction (PSP) has the goal of predicting
the 3D structure of a protein based on its primary
sequence
[Figure: a primary sequence and its corresponding 3D structure]
32. PSP is a very costly process
As an example, one of the best PSP
methods in the last CASP meeting,
Rosetta@Home, used up to 10^4
computing years to predict a single
protein’s 3D structure
Ways to alleviate computational burden:
to simplify the problem
to simplify the representation used to
model the proteins
33. From Full PSP to CN prediction
Two residues of a chain are said to be in contact if
their distance is less than a certain threshold
[Figure: from the primary sequence to contacts in the native state]
CN of a residue: the number of contacts that a certain
residue has
In this specific case we predict, e.g., whether the
CN of a residue is smaller or higher than the
middle point of the CN domain
34. From Full PSP to SA prediction
Solvent Accessibility:
Amount of surface of
each residue that is
exposed to the solvent
(e.g. water)
Metric is normalised for
each AA type
Problem is to predict
whether SA is lower or
higher than 25%
36. The Primary Sequence of a protein (the amino acid
type of each element of a protein chain) is a common
target for such simplification
It is composed of a rather high-cardinality
alphabet of 20 symbols
One example of reduction widely used in the
community is the hydrophobic-polar (HP)
alphabet, reducing these 20 symbols to just two
The HP representation is usually too simple;
information is lost in the reduction process
M. Stout, et al. Prediction of residue exposure and contact number for simplified HP lattice model proteins using
learning classifier systems. In Proceedings of the 7th International FLINS Conference on Applied Artificial Intelligence,
pages 601-608. World Scientific, August 2006.
M. Stout, J. Bacardit, J.D. Hirst, N. Krasnogor, and J. Blazewicz. From HP lattice models to real proteins: coordination
number prediction using learning classifier systems. In 4th European Workshop on Evolutionary Computation and
Machine Learning in Bioinformatics, volume 3907 of Springer Lecture Notes in Computer Science, pages 208-220,
Budapest, Hungary, April 2006. Springer.
papers at: http://www.cs.nott.ac.uk/~nxk/publications.html
37. Research Question:
Are there “simplified” alphabets that
retain key information content while
simplifying interpretation, processing
time, etc.?
If yes, are these alphabets general for
any problem domain, or domain
specific?
Can we automatically generate these
alphabets and tailor them to the
specific domain we are predicting?
38. Outline
Introduction to Learning Classifier Systems
and Extended Compact GA
Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work
39. Use an (automated) information theory-driven pipeline to
reduce alphabet for PSP datasets
Use the Extended Compact Genetic Algorithm (ECGA) to find
a dimensionality reduction policy (guided by a fitness function
based on the Mutual Information (MI) metric)
Two PSP datasets will be used as testbed:
Coordination Number (CN) prediction
Relative Solvent Accessibility (SA) prediction
Verify the optimized reduction policies with BioHEL, an
evolutionary-computation based rule learning system
J.Bacardit, M.Stout, J.D. Hirst, A.Valencia, R.E.Smith, and N.Krasnogor. Automated alphabet reduction for
protein datasets. BMC Bioinformatics, 10(6), 2009.
J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in bioinformatics. Newsletter of
ACM Special Interest Group on Genetic and Evolutionary Computation: SIGEvolution, 3(1):2-10, 2008.
J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction
method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic
and Evolutionary Computation Conference, number ISBN 978-1-59593-697-4, pages 346-353. ACM Press,
2007.
All papers at: http://www.cs.nott.ac.uk/~nxk/publications.html
40. Protein dataset proposed by [Kinjo et al., 05]
1050 proteins
259768 residues
Proteins were selected from PDB-REPRDB using
these conditions:
Less than 30% sequence identity
More than 50 residues
Resolution better than 2Å
No membrane proteins, no chain breaks, no non-
standard residues
Crystallographic R-factor better than 20%
Dataset is partitioned into training/test sets using ten-
fold cross-validation
41. Instance Representation
AAi-5 AAi-4 AAi-3 AAi-2 AAi-1 AAi AAi+1 AAi+2 AAi+3 AAi+4 AAi+5
CNi-5 CNi-4 CNi-3 CNi-2 CNi-1 CNi CNi+1 CNi+2 CNi+3 CNi+4 CNi+5
AAi-1, AAi, AAi+1 → CNi
AAi, AAi+1, AAi+2 → CNi+1
AAi+1, AAi+2, AAi+3 → CNi+2
42. Taken from: J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor.
Automated alphabet reduction method with evolutionary algorithms for protein
structure prediction. In Proceedings of the 2007 Genetic and Evolutionary
Computation Conference, number ISBN 978-1-59593-697-4, pages 346-353. ACM
Press, 2007.
43. General Workflow of the Alphabet Reduction Pipeline
[Workflow: Dataset (|Σ|=20) → ECGA, guided by Mutual Information, finds a
reduction of size N → Dataset (|Σ|=N) → BioHEL → Ensemble of rule sets →
Accuracy on the test set]
44. Methods: alphabet reduction
strategies
Three strategies were evaluated
They represent progressive levels of
sophistication
Mutual Information (MI)
Robust Mutual Information (RMI)
Dual Robust Mutual Information (DualRMI)
Thus MI, RMI, DualRMI were used in
separate experiments as the “fitness”
function for the ECGA tournament phase.
45. Methods: MI strategy
There are 21 symbols (20 AA + end of chain) in the
alphabet
Each symbol will be assigned to a group in the
chromosome used by ECGA
46. Methods: MI strategy
Objective function for MI strategy: Mutual Information
Mutual Information is a measure that quantifies the
interrelationship that two discrete variables have
among each other
X is the reduced representation of the window of
residues around the target.
Y is the two-state definition of CN or SA
47. Methods: MI strategy
Steps of objective function computation for the
MI strategy
1. Reduction mappings are extracted from the
chromosome
2. Instances of the training set are transformed into the
lower cardinality alphabet
3. Mutual information between the class attribute and
the string formed by concatenating the input
attributes is computed
4. This MI is assigned as the result of the evaluation
function
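Steps 2-4 above can be sketched as follows (a simplified estimator over toy data; the `hp` mapping below is an illustrative hydrophobic/polar reduction, not one of the optimized alphabets):

```python
# Sketch of the MI objective: map residues through the reduction, concatenate
# the window into one string X, and compute I(X; Y) against the class Y.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def mi_objective(instances, mapping):
    """instances: list of (window, label); mapping: residue -> reduced letter."""
    xs = ["".join(mapping[aa] for aa in window) for window, _ in instances]
    ys = [label for _, label in instances]
    return mutual_information(xs, ys)

data = [("AV", 0), ("AV", 0), ("KD", 1), ("KE", 1)]
hp = {"A": "H", "V": "H", "K": "P", "D": "P", "E": "P"}
print(mi_objective(data, hp))  # the reduced X predicts Y exactly -> 1.0 bit
```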
48. Methods: MI strategy
Problem of MI strategy
Mutual Information needs redundancy in order to become a good
estimator
That is, each possible pattern in X and Y should be well represented in
the dataset
Patterns in Y are always well represented. What happens with patterns in
X in our dataset?
Our sample, despite having almost 260,000 residues, is too small
#letters   Represented patterns
2          100%
3          97.8%
4          57.6%
5          11.3%
20         3.1E-07
49. Methods: RMI strategy
In order to solve the sample size problem of the MI strategy, we use
a robust MI estimator proposed by [Cline et al., 02]
Pairs of (x, y) in the dataset are scrambled
That is, each x in the dataset is randomly joined to a y, but the
distributions of x and y remain equal
MI is computed for the scrambled dataset
This process is repeated N times, and the average scrambled MI
(MIs) is computed
Finally, the value of the objective function is MI − MIs
MIs is an estimate of the sampling bias in the data. By subtracting
it from the original MI metric we obtain a less biased objective
function
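A sketch of this scrambling procedure (self-contained, so it repeats a plain MI estimator; the number of rounds and RNG seed are illustrative):

```python
# Sketch of the robust MI estimator: subtract the average MI of label-scrambled
# copies of the data (the estimated sampling bias) from the raw MI.
import random
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def robust_mi(xs, ys, n_rounds=10, seed=0):
    rng = random.Random(seed)
    raw = mutual_information(xs, ys)
    scrambled = 0.0
    for _ in range(n_rounds):
        shuffled = ys[:]              # same marginal distribution of y
        rng.shuffle(shuffled)
        scrambled += mutual_information(xs, shuffled)
    return raw - scrambled / n_rounds  # MI minus estimated sampling bias

xs = ["HH", "HH", "PP", "PP"] * 5
ys = [0, 0, 1, 1] * 5
print(robust_mi(xs, ys) <= mutual_information(xs, ys))  # bias term is >= 0
```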
50. Methods: DualRMI strategy
The next strategy is based on some observations made
in previous work [Bacardit et al., 06]
Example of a rule set for predicting CN from primary
sequence
Predicate associated to the target residue (AA) is
very different from the predicates associated to the
other window positions
51. Methods: DualRMI strategy
Why not generate two reduced alphabets
at the same time?
One for the target residue
One for the other residues in the window
Objective function remains unchanged
52. Outline
Introduction to Learning Classifier Systems
and Extended Compact GA
Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work
53. Experimental design
For each problem (CN, SA)
For each reduction strategy (MI, RMI, DualRMI)
ECGA was run to generate alphabets of two, three,
four and five letters
Afterwards, BioHEL was trained over the reduced
datasets to determine the prediction accuracy that
could be obtained from each alphabet size
Comparisons are drawn
54. Reduced alphabets for CN
Amino acids that remain always in the same group
are marked with solid rectangles
55. Alphabets for CN
The two-letter alphabet divides the amino-acids between
hydrophobic and polar
RMI could not find a five-letter alphabet
DualRMI did, but only for the target residue
RMI and DualRMI have a much larger number of framed
residues, showing more robustness
For DualRMI we can observe small groups of hydrophobic
residues, while all polar ones are in the same group
We can also observe a strange group, GHTS, that mixes
different kinds of physico-chemical properties
This group is not explained by those properties but by the
inherent distribution of the datasets
56. A retrospective
analysis of the dataset
reveals why GHTS
are clustered together
We computed the
proportion of residues
for each amino acid
type with high CN
These four residues
have very similar
average behavior in
relation to CN
57. Accuracy of CN prediction Using Biohel
Accuracy difference
between the AA
representation and the
best reduced
alphabets is 0.7%
The difference is non-
significant according to
t-tests
RMI and DualRMI
perform similarly
58. Reduced alphabets for SA
59. Reduced alphabets for SA
Even though SA and CN are somewhat related
structural features, the resulting alphabets are
different
These alphabets contain more groups of polar
residues and fewer groups of hydrophobic ones (in
contrast with CN)
In DualRMI and 5 letters we can observe very small
groups
A, EK for the target alphabet
G,X for the other residues alphabet
Again, the GHTS group appears
60. Analysis of
average SA
behavior for
each AA type
The reduced
alphabet matched
the properties of
the SA features
perfectly
61. Accuracy of SA prediction with BioHel
Accuracy of reduced
alphabets for SA
prediction
Only DualRMI
managed to give a
performance
statistically similar to
that of the original AA
representation
Accuracy difference
is 0.4%
62. Comparison to Other Reduced Alphabets from the
Literature and Expert-Designed Alphabets Based on
Physico-Chemical Properties
Alphabet  Letters  CN acc.    SA acc.    Diff.     Ref.
AA        20       74.0±0.6   70.7±0.4   ---       ---
DualRMI   5        73.3±0.5   70.3±0.4   0.7/0.4   This work
Alphabets from the literature:
WW5       6        73.1±0.7   69.6±0.4   0.9/1.1   [Wang & Wang, 99]
SR5       6        73.1±0.7   69.6±0.4   0.9/1.1   [Solis & Rackovsky, 00]
MU4       5        72.6±0.7   69.4±0.4   1.4/1.3   [Murphy et al., 00]
MM5       6        73.1±0.6   69.3±0.3   0.9/1.4   [Melo & Marti-Renom, 06]
Expert-designed alphabets:
HD1       7        72.9±0.6   69.3±0.4   1.1/1.4   This work
HD2       9        73.0±0.6   69.3±0.4   1.0/1.4   This work
HD3       11       73.2±0.6   69.9±0.4   0.8/0.8   This work
63. Reduced Alphabets Comparison
Automatically reduced alphabets obtain better accuracy, but how different are the alphabets themselves?
We again applied the AA-wise high CN/SA analysis
Two metrics were computed
Transitions: how many times the group index changes along the list of AAs sorted by CN/SA
The fewer the changes, the more homogeneous the groups are
Average range: the range of a reduction group is the difference between the minimum and maximum CN/SA of the AAs belonging to that group
The smaller the average range, the more focused the reduction groups are with respect to that structural property
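The two metrics above can be sketched in a few lines of Python. The property values and group assignments below are toy placeholders, not the actual CN/SA data:

```python
def transitions(sorted_aas, group_of):
    """Count how often the group index changes while walking the
    amino acids sorted by their average structural-property value."""
    seq = [group_of[aa] for aa in sorted_aas]
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def average_range(values, group_of):
    """Average, over all groups, of (max - min) of the property
    values of the amino acids belonging to that group."""
    by_group = {}
    for aa, v in values.items():
        by_group.setdefault(group_of[aa], []).append(v)
    return sum(max(vs) - min(vs) for vs in by_group.values()) / len(by_group)

# Toy example: 4 amino acids, one property value each, two groups.
values = {"A": 0.1, "G": 0.2, "E": 0.8, "K": 0.9}
group_of = {"A": 0, "G": 0, "E": 1, "K": 1}
sorted_aas = sorted(values, key=values.get)  # A, G, E, K

print(transitions(sorted_aas, group_of))  # 1 group change along the list
print(average_range(values, group_of))    # ~0.1: both group ranges are 0.1
```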
64. Reduced Alphabets Comparison (CN)
65. Reduced Alphabets Comparison (SA)
66. Additional Results
Are the alphabets interchangeable across
problems?
Can these reduced alphabets be applied to
an evolutionary information-based
representation?
67. Results: Are the alphabets interchangeable?
We applied the alphabet optimized for CN to SA and vice versa
The SA alphabet is good for predicting CN, but the CN alphabet performs poorly on SA
Reduced alphabets must therefore always be tailored to the domain at hand
68. Results
Application of the reduced alphabets to an evolutionary information-based representation
So far we have used only the simple primary-sequence representation
Can this process be applied to much richer (and more complex) representations?
We computed the position-specific scoring matrix (PSSM) representation of our dataset using PSI-BLAST. Each instance (9 window positions) is represented by 180 continuous variables (rather than the 20+1 used originally)
Then we reduced this representation using our alphabets
The values of each PSSM profile corresponding to amino acids in the same reduction group are averaged
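The averaging step can be sketched as follows. This is a minimal sketch assuming the PSSM window is a positions-by-20 matrix; the group definitions and scores are placeholders, not the actual DualRMI groups or PSI-BLAST output:

```python
def reduce_pssm(pssm, groups):
    """Collapse each row of a (positions x 20) PSSM window into
    len(groups) values by averaging the scores of the amino acid
    columns that share a reduction group."""
    return [
        [sum(row[i] for i in idx) / len(idx) for idx in groups]
        for row in pssm
    ]

# Toy example: a 9-position window with 20 scores per position,
# reduced with 5 hypothetical groups of 4 columns each.
window = [[float(p * 20 + a) for a in range(20)] for p in range(9)]
groups = [list(range(i, i + 4)) for i in range(0, 20, 4)]
reduced = reduce_pssm(window, groups)

print(len(reduced), len(reduced[0]))  # 9 5 -> 45 attributes instead of 180
```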
69. Results
Application of the reduced alphabets to a PSSM representation
Thus, we reduced the representation from 180 attributes to 45
70. Results
Results of learning from the reduced PSSM representation
The accuracy difference is still less than 1%
The obtained rule sets are simpler and the training process is much faster
Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07]
71. Conclusions
We have proposed an automated alphabet reduction protocol for protein datasets
The protocol does not use any domain knowledge
It automatically tailors the reduced alphabets to the domain at hand
Our experiments show that it is possible to obtain heavily reduced alphabets (5 letters) with performance similar to the original AA alphabet
Our reduced alphabets are better at CN and SA prediction than other alphabets from the literature, as they are better suited to these tasks
The findings from the protocol can be used in state-of-the-art protein representations such as PSSM profiles
We found some unexpected reduction groups (GHTS), but the properties of the data showed us that this is not an artifact
72. Future work
Explore alternative objective evaluation functions
Other robust MI estimators
Explore slightly higher-cardinality alphabets
Is it possible to close the accuracy gap even further?
Apply this protocol to other kinds of datasets
E.g. protein mutations
Structural aspects defined as continuous variables, not just discrete ones
73. Acknowledgements
Contributors to the talks I will give at BGU (in no particular order):
Peter Siepmann, Pawel Widera, James Smaldon, Azhar Ali Shah, Jack Chaplin, Enrico Glaab, German Terrazas, Hongqing Cao, Jamie Twycross, Jonathan Blake, Francisco Romero-Campero, Maria Franco, Adam Sweetman, Linda Fiaschi
From the Schools of Physics and Astronomy, Chemistry, Pharmacy, Biosciences, Mathematics, Computer Science, and the Centre for Biomolecular Sciences (all at UoN)
Thanks also go to: Ben Gurion University of the Negev's Distinguished Scientists Visitor Program, and Professor Dr. Moshe Sipper
Funding from: BBSRC, EPSRC, EU, ESF, UoN