Extended Compact Genetic Algorithms and Learning
   Classifier Systems for Dimensionality Reduction: a
        Protein Alphabet Reduction Study Case

                            Jaume Bacardit & Natalio Krasnogor

           ASAP - Interdisciplinary Optimisation Laboratory
           School of Computer Science



           Centre for Integrative Systems Biology
           School of Biology



           Centre for Healthcare Associated Infections
           Institute of Infection, Immunity & Inflammation



                                                           University of Nottingham

           Ben-Gurion University of the Negev - June 23rd to July 5th 2009 - Distinguished Scientist Visitor Program - Beer Sheva, Israel   1 /73
Tuesday, 30 June 2009
Acknowledgements

Contributors to the talks I will give at BGU (in no particular order):
   Peter Siepmann, Pawel Widera, James Smaldon, Azhar Ali Shah, Jack Chaplin, Enrico Glaab,
   German Terrazas, Hongqing Cao, Jamie Twycross, Jonathan Blake, Francisco Romero-Campero,
   Maria Franco, Adam Sweetman, Linda Fiaschi

From (in no particular order):
   School of Physics and Astronomy, School of Chemistry, School of Pharmacy,
   School of Biosciences, School of Mathematics, School of Computer Science,
   Centre for Biomolecular Sciences (all the above at UoN)

Thanks also go to:
   Ben Gurion University of the Negev’s Distinguished Scientists Visitor Program
   Professor Dr. Moshe Sipper

Funding from: BBSRC, EPSRC, EU, ESF, UoN
Outline
      Introduction to Learning Classifier Systems
       and Extended Compact GA
      Problem Definition
      Methods (ECGA, LCS, Mutual Information)
      Results
      Conclusions and further work




Based on Various Papers
     J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith, and N. Krasnogor. Automated
      alphabet reduction for protein datasets. BMC Bioinformatics, 10(6), 2009.
     J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated
      alphabet reduction method with evolutionary algorithms for protein structure prediction.
      In Proceedings of the 2007 Genetic and Evolutionary Computation Conference,
      pages 346-353. ACM Press, 2007. This paper won the Bronze Medal in the 2007
      “Humies” Awards for Human-Competitive Results Produced by Genetic and
      Evolutionary Computation.
     J. Bacardit and N. Krasnogor. Performance and efficiency of memetic Pittsburgh
      learning classifier systems. Evolutionary Computation, 17(3) (to appear), 2009.
     J. Bacardit, E.K. Burke, and N. Krasnogor. Improving the scalability of rule-based
      evolutionary learning. Memetic Computing, 1(1) (to appear), 2009.
     J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in
      bioinformatics. Newsletter of ACM Special Interest Group on Genetic and Evolutionary
      Computation: SIGEvolution, 3(1):2-10, 2008.



         All papers available from: www.cs.nott.ac.uk/~nxk/publications.html


   Learning Classifier Systems (LCS) are one of
     the major families of techniques that apply
     evolutionary computation to machine learning
     tasks
       Machine learning: How to construct
        programs that automatically improve with
        experience [Mitchell, 1997]
       Classification task: learning how to correctly label
        new instances from a domain, based on a set of
        previously labeled instances
    LCS are almost as ancient as GAs, Holland
     made one of the first proposals
    Two of the first LCS proposals are [Holland &
     Reitman, 78] and [Smith, 80]

    Traditionally there have been two different
      paradigms of LCS
        The Pittsburgh approach [Smith, 80]
        The Michigan approach [Holland & Reitman,
         78]
     More recently: The Iterative Rule Learning
      approach [Venturini, 93]

     Knowledge representations
       All the initial approaches were rule-based
       In recent years several knowledge
        representations have been used in the LCS
        field: decision trees, synthetic prototypes, etc.

Classification task
         Classification task: learning how to correctly label new
          instances from a domain, based on a set of previously labeled
          instances

                                                                                                                         New Instance




                                                                      Learning                                          Inference
                           Training Set
                                                                     Algorithm                                            Engine



                                                                                                                            Class


Classification task

       If (X < 0.25 and Y > 0.75) or (X > 0.75 and Y < 0.25) then [class]

       [figure: labeled points scattered over the unit square, axes X and Y]
Paradigms of LCS
         The Pittsburgh approach
           Each individual is a complete solution to
            the classification problem
           Traditionally this means that each
            individual is a variable-length set of rules
           GABIL [De Jong & Spears, 93] is a well-
            known representative of this approach
           Fitness function is based on the rule set
            accuracy on the training set (usually also
            on complexity)
Paradigms of LCS
     The Pittsburgh approach
           Crossover operator
             [figure: two parent rule sets exchange segments to form two offspring]
           Mutation operator: bit flipping
           Individuals are interpreted as a decision list: an ordered rule set


                  [figure: an individual’s rules, ordered 1 to 8]
                  Instance 1 matches rules 2, 3 and 7 → Rule 2 will be used
                  Instance 2 matches rules 1 and 8 → Rule 1 will be used
                  Instance 3 matches rule 8 → Rule 8 will be used
                  Instance 4 matches no rules → Instance 4 will not be classified
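Decision-list inference as described above can be sketched in a few lines; the rule predicates and class labels here are hypothetical, not from any real rule set:

```python
def classify(instance, rules):
    """Decision-list inference: rules are tried in order and the first
    matching rule assigns the class."""
    for predicate, label in rules:
        if predicate(instance):
            return label
    return None  # no rule matched: the instance is left unclassified

# Hypothetical ordered rules over a dict-encoded instance
rules = [
    (lambda x: x["A1"] == 0, "c0"),  # rule 1
    (lambda x: x["A2"] == 1, "c1"),  # rule 2
    (lambda x: True, "c2"),          # rule 3: catch-all
]
print(classify({"A1": 0, "A2": 1}, rules))  # rule 1 matches first -> c0
```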




Paradigms of LCS
      The Michigan approach
        Each individual is a single rule
        The whole population cooperates to solve
         the classification problem
        A reinforcement system is used to identify
         the good rules
        A GA is used to explore the search space
         for more rules
        XCS [Wilson, 95] is the most well-known
         Michigan LCS

Paradigms of LCS
      Working cycle




Paradigms of LCS
         The Iterative Rule Learning approach
           Each individual is a single rule
           Individuals compete as in a standard
            GA → a single GA run generates one
            rule
           The GA is run iteratively to learn all rules
            that solve the problem
           Instances already covered by previous
            rules are removed from the training set
            of the next iteration

Paradigms of LCS
      The Iterative Rule Learning approach
        HIDER System [Aguilar, Riquelme & Toro, 03]

         1. Input: Examples

         2. RuleSet = Ø

         3. While |Examples| > 0

            1. Rule = Run GA with Examples

            2. RuleSet = RuleSet U Rule

            3. Examples = Examples \ Covered(Rule)

         4. EndWhile

         5. Output: RuleSet
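The pseudocode above can be turned into a runnable sketch; here the GA run is replaced by a trivial stand-in that returns a rule covering the first remaining example, so only the iterative separate-and-conquer loop itself is real:

```python
def run_ga_stub(examples):
    # Stand-in for "Rule = Run GA with Examples": a rule of the form
    # "if value == v then label" that covers the first remaining example.
    value, label = examples[0]
    return (value, label)

def iterative_rule_learning(examples):
    rule_set = []                            # 2. RuleSet = empty
    while examples:                          # 3. While |Examples| > 0
        rule = run_ga_stub(examples)         # 3.1 Rule = Run GA with Examples
        rule_set.append(rule)                # 3.2 RuleSet = RuleSet U {Rule}
        examples = [e for e in examples      # 3.3 Examples = Examples \ Covered(Rule)
                    if e[0] != rule[0]]
    return rule_set                          # 5. Output: RuleSet

print(iterative_rule_learning([(1, "a"), (1, "a"), (2, "b"), (3, "a")]))
# -> [(1, 'a'), (2, 'b'), (3, 'a')]
```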




Bioinformatics-oriented Hierarchical
            Evolutionary Learning (BioHel)

              BioHEL [Bacardit et al., 07] is a recent learning
               system that applies the Iterative Rule Learning
               (IRL) approach to generate sets of rules
              IRL was first used in EC by the SIA system
               [Venturini, 93]
              BioHEL is strongly inspired by GAssist
               [Bacardit, 04], a Pittsburgh approach Learning
               Classifier System




   BioHEL learning paradigm
              IRL has been used for many years in the ML
               community, with the name of separate-and-conquer




BioHEL’s objective function
        An objective function based on the Minimum Description
         Length (MDL) principle (Rissanen, 1978)
         that tries to promote rules with
              High accuracy: not making mistakes
              High coverage: covering as many examples as possible
               without sacrificing accuracy. Recall (TP/(TP+FN)) will be
               used to define coverage
              Low complexity: rules as simple and general as possible
              The objective function is a linear combination of the three
               objectives above
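As a hedged sketch of such a linear combination (the weights and the exact term definitions below are placeholders for illustration, not BioHEL’s actual formula):

```python
def rule_fitness(tp, fn, fp, n_expressed, n_atts, w_cov=0.5, w_comp=0.1):
    """Illustrative linear combination of the three objectives; the
    weights and term definitions are placeholders, not BioHEL's."""
    accuracy = tp / (tp + fp) if tp + fp else 0.0   # high accuracy: few mistakes
    coverage = tp / (tp + fn) if tp + fn else 0.0   # coverage defined as recall
    simplicity = 1.0 - n_expressed / n_atts         # fewer expressed attributes
    return accuracy + w_cov * coverage + w_comp * simplicity

# A rule matching 85 examples (80 of its class) out of 100 positives,
# expressing 3 of 10 attributes:
print(round(rule_fitness(tp=80, fn=20, fp=5, n_expressed=3, n_atts=10), 3))
# -> 1.411
```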



BioHEL’s objective function
        Intuitively, we would like to have accurate rules
         covering as many examples as possible.
        However, in complex and inconsistent domains it is
         rare to obtain such rules
        In these cases, the easier path for the evolutionary search is
         to maximize accuracy at the expense of coverage
        Therefore, we need to enforce that the evolved rules
         cover enough examples




Methods: BioHEL’s objective function




     Three parameters define the shape of the function
     The choice of the coverage break is crucial for the proper performance of the
      system
     Also, coverage term penalizes rules that do not cover a minimum percentage of
      examples or that cover too many
BioHEL’s other characteristics
        Attribute list rule representation
              Automatically identifying the relevant attributes for a given rule and
               discarding all the other ones
        The ILAS windowing scheme
              Efficiency enhancement method, not all training points are used for each
               fitness computation
        An explicit default rule mechanism
              Generating more compact rule sets
              Iterative process terminates when it is impossible to evolve a rule where
               the associated class is the majority class among the matched examples
              At this point, all remaining training instances are assigned to the default
               class
        Ensembles for consensus prediction
              Easy way of boosting robustness

Knowledge representations
     Representation of XCS for binary problems:
      ternary representation
          Ternary alphabet {0,1,#}
          If A1=0 and A2=1 and A3 is irrelevant → class 0
                          01#|0
     Representation of XCS for real-valued attributes:
      real-valued interval.
          XCSR [Wilson, 99]
                 Interval is codified with two variables: center & spread: [center-
                  spread, center+spread]
          UBR [Stone & Bull, 03]
                 The two bounds of the interval are codified directly with two real-
                  valued variables. The variable with the lower value is the lower
                  bound; the variable with the higher value is the upper bound
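The two interval encodings can be decoded as follows (a minimal sketch):

```python
def xcsr_interval(center, spread):
    # XCSR centre-spread encoding: [center - spread, center + spread]
    return (center - spread, center + spread)

def ubr_interval(a, b):
    # UBR unordered-bound encoding: the smaller variable is the lower bound
    return (min(a, b), max(a, b))

print(xcsr_interval(0.5, 0.25))   # (0.25, 0.75)
print(ubr_interval(0.875, 0.375)) # (0.375, 0.875)
```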
Knowledge representations
     Representation of GABIL for nominal attributes
          Predicate → Class
          Predicate: Conjunctive Normal Form (CNF):
           (A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm)
                 Ai: ith attribute
                 Vij: jth value of the ith attribute
          The rules can be mapped into a binary string, e.g., 3
           attributes with {3,5,2} values each respectively:
          (A1=V11 ∨ A1=V13) ∧ (A2=V22 ∨ A2=V24 ∨ A2=V25) ∧
           (A3=V31) → 101|01011|10
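Matching such a bitstring rule against an instance is a per-attribute bit lookup; a sketch using the slide’s {3,5,2} example (value indices are 0-based, so V11 is index 0):

```python
def gabil_match(rule_bits, instance, domain_sizes):
    """Match a GABIL rule (one bit per attribute value, 1 = allowed)
    against an instance given as 0-based value indices per attribute."""
    pos = 0
    for size, value in zip(domain_sizes, instance):
        if rule_bits[pos + value] != '1':  # this attribute's disjunct fails
            return False
        pos += size                        # advance to the next attribute's bits
    return True

# Rule 101|01011|10 from the slide, concatenated:
rule = "1010101110"
print(gabil_match(rule, (0, 1, 0), (3, 5, 2)))  # A1=V11, A2=V22, A3=V31 -> True
print(gabil_match(rule, (1, 1, 0), (3, 5, 2)))  # A1=V12 is not allowed -> False
```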



Knowledge representations
     Pittsburgh representations for real-valued attributes:
       Rule-based: Adaptive Discretization Intervals (ADI)
         representation [Bacardit, 04]
           Intervals in ADI are built using, as possible
            bounds, the cut-points proposed by a
            discretization algorithm
           Search bias promotes maximally general
            intervals
           Several discretization algorithms are used at the
            same time to reduce bias



Knowledge representations
         Pittsburgh representations for real-valued attributes:
                Decision trees [Llorà, 02]
                        Nodes in the trees can use orthogonal or oblique criteria




Knowledge representations
         Pittsburgh representations for real-valued attributes
                Synthetic prototypes [Llorà, 02]
                        Each individual is a set of synthetic instances
                        These instances are used as the core of a nearest-neighbor
                         classifier




                                                               ?




Extended Compact Genetic
                      Algorithm (ECGA)
      ECGA belongs to a class of Evolutionary
       Algorithms called Estimation of Distribution
       Algorithms (EDA)
      no crossover or mutation!
      instead a probabilistic model of the
       structure of the problem is kept
      individuals are sampled from this probability
       distribution model


Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact
         Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara
         Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.










                                                   Key Idea Behind Compact GA (CGA)




Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact
         Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara
         Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.

         Gene interactions must be accounted for
         Approximates complex distributions by
          Marginal Product Models (i.e. gene
          partitions)
         Selects amongst alternative models by
          means of an MDL score:
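The MDL score from the Harik et al. reference above combines model complexity with compressed population complexity, and lower is better; a sketch for binary genes:

```python
from collections import Counter
from math import log2

def mdl_score(population, partition):
    """ECGA model score = model complexity + compressed population
    complexity (binary genes); a sketch following Harik et al."""
    n = len(population)
    # Model complexity: log2(n+1) bits per free frequency in each group
    model = sum(log2(n + 1) * (2 ** len(g) - 1) for g in partition)
    # Compressed population complexity: n times the entropy of each group
    cpc = 0.0
    for group in partition:
        counts = Counter(tuple(ind[i] for i in group) for ind in population)
        cpc += -sum(c * log2(c / n) for c in counts.values())
    return model + cpc

pop = [(0, 0), (0, 0), (1, 1), (1, 1)]  # genes 0 and 1 are perfectly linked
# The linked model scores lower (better) than the independent one:
print(mdl_score(pop, [(0, 1)]) < mdl_score(pop, [(0,), (1,)]))  # True
```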




Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact
       Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara
       Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.




Outline
      Introduction to Learning Classifier Systems
       and Extended Compact GA
      Problem Definition
      Methods (ECGA, LCS, Mutual Information)
      Results
      Conclusions and further work




Protein Structure Prediction (PSP) aims to predict
        the 3D structure of a protein based on its primary
        sequence




                      Primary Sequence                                                 3D Structure

 PSP is a very costly process
      As an example, one of the best PSP
       methods in the last CASP meeting,
       Rosetta@Home, used up to 10^4
       computing years to predict a single
       protein’s 3D structure
      Ways to alleviate computational burden:
        to simplify the problem
        to simplify the representation used to
        model the proteins
From Full PSP to CN prediction
        Two residues of a chain are said to be in contact if
         their distance is less than a certain threshold

       Primary                                                                    Contact                       Native State
       Sequence




        CN of a residue : number of contacts that a certain
         residue has
        In this specific case we predict, e.g., whether the
         CN of a residue is smaller or higher than the
         middle point of the CN domain
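The contact definition above can be sketched directly; the 8 Å threshold and the minimum chain-separation filter below are illustrative values, not necessarily those used in this study:

```python
def contact_number(coords, i, threshold=8.0, min_sep=2):
    """CN of residue i: count of residues whose distance to i is below
    the threshold. Threshold and sequence-separation values are
    illustrative choices."""
    xi, yi, zi = coords[i]
    cn = 0
    for j, (x, y, z) in enumerate(coords):
        if abs(j - i) < min_sep:
            continue  # skip residue i itself and immediate chain neighbours
        if ((x - xi) ** 2 + (y - yi) ** 2 + (z - zi) ** 2) ** 0.5 < threshold:
            cn += 1
    return cn

# Three residues on a line, 5 and 1 A apart:
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
print(contact_number(coords, 0))  # residue 2 is 6 A away -> 1 contact
```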
From Full PSP to SA prediction
   Solvent Accessibility:
    Amount of surface of
    each residue that is
    exposed to the solvent
    (e.g. water)
   Metric is normalised for
    each AA type
   Problem is to predict
    whether SA is lower or
    higher than 25%




 PSP is a very costly process
      As an example, one of the best PSP
       methods in the last CASP meeting,
       Rosetta@Home, used up to 10^4
       computing years to predict a single
       protein’s 3D structure
      Ways to alleviate computational burden:
        to simplify the problem
        to simplify the representation used to
        model the proteins
   The primary sequence of a protein (the amino acid
    type of the elements of a protein chain) is a usual
    target for such simplification
     It is composed of a quite high-cardinality
      alphabet of 20 symbols
     One example of reduction widely used in the
      community is the hydrophobic-polar (HP)
      alphabet, which reduces these 20 symbols to just two
     The HP representation is usually too simple;
      information is lost in the reduction process
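One possible HP reduction map, sketched below; the assignment of borderline residues (e.g. G, Y, C) varies across the literature, so this particular split is illustrative:

```python
# A hydrophobic-polar split of the 20 standard amino acids (one of
# several variants used in the literature).
HP_MAP = {aa: "H" for aa in "AVLIMFWCP"}
HP_MAP.update({aa: "P" for aa in "GSTYNQDEKRH"})

def reduce_to_hp(sequence):
    """Map a primary sequence onto the two-letter HP alphabet."""
    return "".join(HP_MAP[aa] for aa in sequence)

print(reduce_to_hp("MKTAY"))  # -> HPPHP
```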
           M. Stout, et al. Prediction of residue exposure and contact number for simplified hp lattice model proteins using
            learning classifier systems. In Proceedings of the 7th International FLINS Conference on Applied Artificial Intelligence,
            pages 601-608. World Scientific, August 2006.
           M. Stout, J. Bacardit, J.D. Hirst, N. Krasnogor, and J. Blazewicz. From hp lattice models to real proteins: coordination
            number prediction using learning classifier systems. In 4th European Workshop on Evolutionary Computation and
            Machine Learning in Bioinformatics, volume 3907 of Springer Lecture Notes in Computer Science, pages 208-220,
            Budapest, Hungary, April 2006. Springer. ISBN 978-3-540-33237-4.
           papers at: http://www.cs.nott.ac.uk/~nxk/publications.html




 Research Questions:

      Are there “simplified” alphabets that
       retain key information content while
       simplifying interpretation, processing
       time, etc.?
      If yes, are these alphabets general
       across problem domains, or domain-
       specific?
      Can we automatically generate these
       alphabets and tailor them to the
       specific domain we are predicting?
Outline
      Introduction to Learning Classifier Systems
       and Extended Compact GA
      Problem Definition
      Methods (ECGA, LCS, Mutual Information)
      Results
      Conclusions and further work




   Use an (automated) information theory-driven pipeline to
     reduce alphabet for PSP datasets
    Use the Extended Compact Genetic Algorithm (ECGA) to find
     a dimensionality reduction policy (guided by a fitness function
     based on the Mutual Information (MI) metric)
    Two PSP datasets will be used as testbed:
      Coordination Number (CN) prediction
      Relative Solvent Accessibility (SA) prediction
    Verify the optimized reduction policies with BioHEL, an
     evolutionary-computation based rule learning system
     J.Bacardit, M.Stout, J.D. Hirst, A.Valencia, R.E.Smith, and N.Krasnogor. Automated alphabet reduction for
     protein datasets. BMC Bioinformatics, 10(6), 2009.
     J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in bioinformatics. Newsletter of
     ACM Special Interest Group on Genetic and Evolutionary Computation: SIGEvolution, 3(1):2-10, 2008.
     J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction
     method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic
     and Evolutionary Computation Conference, pages 346-353. ACM Press, 2007.


     All papers at: http://www.cs.nott.ac.uk/~nxk/publications.html
   Protein dataset proposed by [Kinjo et al., 05]
       1050 proteins
       259768 residues
     Proteins were selected from PDB-REPRDB using
      these conditions:
       Less than 30% sequence identity
       More than 50 residues
       Resolution better than 2Å
       No membrane proteins, no chain breaks, no non-
        standard residues
       Crystallographic R-factor better than 20%
     Dataset is partitioned into training/test sets using ten-
      fold cross-validation
Instance Representation


                    AAi-5      AAi-4      AAi-3      AAi-2     AAi-1       AAi       AAi+1      AAi+2      AAi+3      AAi+4      AAi+5
                    CNi-5      CNi-4      CNi-3      CNi-2     CNi-1       CNi       CNi+1      CNi+2      CNi+3      CNi+4      CNi+5




                                                                    AAi-1, AAi, AAi+1 → CNi
                                                                    AAi, AAi+1, AAi+2 → CNi+1
                                                                    AAi+1, AAi+2, AAi+3 → CNi+2
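Building such windowed instances from a chain can be sketched as below; a pad symbol plays the role of the end-of-chain marker, and the window size here is ±1 for brevity (the slide uses ±5):

```python
def window_instances(sequence, targets, w=1, pad="X"):
    """One instance per residue: the residues at positions i-w .. i+w
    (padded at the chain ends) paired with the target at position i."""
    padded = [pad] * w + list(sequence) + [pad] * w
    return [(tuple(padded[i:i + 2 * w + 1]), targets[i])
            for i in range(len(sequence))]

for inst in window_instances("MKT", [0, 1, 0], w=1):
    print(inst)  # first instance: (('X', 'M', 'K'), 0)
```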




Taken from: J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor.
       Automated alphabet reduction method with evolutionary algorithms for protein
       structure prediction. In Proceedings of the 2007 Genetic and Evolutionary
       Computation Conference, number ISBN 978-1-59593-697-4, pages 346-353. ACM
       Press, 2007.




General Workflow of the Alphabet Reduction Pipeline

   [figure] Dataset (|Σ| = 20) → ECGA, guided by Mutual Information
   (reduced alphabet size = N) → Dataset (|Σ| = N) → BioHEL →
   Ensemble of rule sets → Accuracy on the test set
Methods: alphabet reduction
                       strategies
      Three strategies were evaluated
      They represent progressive levels of
       sophistication
        Mutual Information (MI)
        Robust Mutual Information (RMI)
        Dual Robust Mutual Information (DualRMI)
      Thus MI, RMI and DualRMI were each used, in
       separate experiments, as the “fitness”
       function for the ECGA tournament phase.
Methods: MI strategy
        There are 21 symbols (20AA+end of chain) in the
         alphabet
        Each symbol will be assigned to a group in the
         chromosome used by ECGA




Methods: MI strategy
         Objective function for MI strategy: Mutual Information
               Mutual Information is a measure that quantifies the
                interrelationship between two discrete variables:

                    I(X;Y) = ∑_{x,y} p(x,y) · log₂ [ p(x,y) / (p(x) · p(y)) ]

      X is the reduced representation of the window of
       residues around the target.
      Y is the two-state definition of CN or SA




Methods: MI strategy
            Steps of objective function computation for the
             MI strategy
           1.       Reduction mappings are extracted from the
                    chromosome
           2.       Instances of the training set are transformed into the
                    lower cardinality alphabet
           3.       Mutual information between the class attribute and
                    the string formed by concatenating the input
                    attributes is computed
           4.       This MI is assigned as the result of the evaluation
                    function
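The four steps above can be sketched in Python (a minimal illustration of the objective; the function and variable names are ours, not the authors' actual implementation):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate: I(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mi_objective(chromosome, windows, labels):
    """chromosome[i]: group assigned to alphabet symbol i (step 1).
    windows: residue windows as lists of symbol indices.
    labels: two-state CN/SA class of each window."""
    # Step 2: transform each window into the lower-cardinality alphabet,
    # concatenating the reduced input attributes into one string
    reduced = [''.join(str(chromosome[s]) for s in w) for w in windows]
    # Steps 3-4: MI between the class attribute and that string is the fitness
    return mutual_information(reduced, labels)
```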

Methods: MI strategy
      Problem of the MI strategy
          Mutual Information needs redundancy in the sample in order to be a
           good estimator
          That is, each possible pattern in X and Y should be well represented in
           the dataset
          Patterns in Y are always well represented. What happens with patterns in
           X in our dataset?
          Our sample, despite containing almost 260,000 residues, is too small

                       #letters                                                    Represented patterns
                       2                                                           100%
                       3                                                           97.8%
                       4                                                           57.6%
                       5                                                           11.3%
                       20                                                          3.1E-07
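A back-of-the-envelope check of why the coverage collapses (our own arithmetic, assuming the 9-residue windows and roughly 260,000 samples used in this study): each sample contributes at most one distinct window pattern, so the coverable fraction is bounded by samples / |∑|^window.

```python
def max_coverage(n_letters, window=9, samples=260_000):
    """Upper bound on the fraction of possible window patterns present in
    the data: each sample contributes at most one distinct pattern."""
    return min(1.0, samples / n_letters ** window)

# With the full 21-symbol alphabet the bound is on the order of 3e-07,
# consistent in magnitude with the bottom row of the table above.
print(f"{max_coverage(21):.1e}")
```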
Methods: RMI strategy
   In order to solve the sample size problem of the MI strategy, we use
    a robust MI estimator proposed by [Cline et al., 02]
     Pairs of (x,y) in the dataset are scrambled
     That is, each x in the dataset is randomly joined to a y, but the
       distributions of x and y remain unchanged
     MI is computed for the scrambled dataset
     This process is repeated N times, and the average scrambled MI (MIs)
       is computed

       Finally, the value of the objective function is MI − MIs
       MIs is an estimate of the sampling bias in the data. Subtracting it
        from the original MI metric yields a less biased objective function
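The scrambling procedure can be sketched as follows (our minimal reading of the estimator described above, not the authors' code; `mutual_information` is the standard plug-in estimate and all names are illustrative):

```python
import random
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in MI estimate between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def robust_mi(xs, ys, n_scrambles=10, rng=None):
    """MI minus the average MI of n_scrambles scrambled datasets (MIs)."""
    rng = rng or random.Random(0)
    mi = mutual_information(xs, ys)
    total = 0.0
    for _ in range(n_scrambles):
        ys_perm = list(ys)      # keeps the marginal distribution of y...
        rng.shuffle(ys_perm)    # ...but breaks each (x, y) pairing
        total += mutual_information(xs, ys_perm)
    return mi - total / n_scrambles  # MI - MIs
```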


Methods: DualRMI strategy
         The next strategy is based on observations we
          made in previous work [Bacardit et al., 06]
         Example of a rule set for predicting CN from primary
          sequence




         The predicate associated with the target residue (AA) is
          very different from the predicates associated with the
          other window positions
Methods: DualRMI strategy
      Why not generate two reduced alphabets
       at the same time?
         One for the target residue
         One for the other residues in the window
      Objective function remains unchanged
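One possible encoding of this idea (a hypothetical sketch; the exact chromosome layout used in the paper may differ): two blocks of 21 genes, one mapping for the target residue and one for the rest of the window.

```python
N_SYMBOLS = 21  # 20 amino acids + end-of-chain symbol

def decode_dual(chromosome):
    """Split the chromosome into the two reduction mappings:
    first 21 genes for the target residue, next 21 for the context."""
    return chromosome[:N_SYMBOLS], chromosome[N_SYMBOLS:]

def reduce_window(window, target_pos, target_map, context_map):
    """Apply the target mapping at the target position and the
    context mapping everywhere else in the window."""
    return [target_map[s] if i == target_pos else context_map[s]
            for i, s in enumerate(window)]
```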




Outline
      Introduction to Learning Classifier Systems
       and Extended Compact GA
      Problem Definition
      Methods (ECGA, LCS, Mutual Information)
      Results
      Conclusions and further work




Experimental design
        For each problem (CN, SA)
        For each reduction strategy (MI, RMI, DualRMI)
        ECGA was run to generate alphabets of two, three,
         four and five letters
        Afterwards, BioHEL was trained over the reduced
         datasets to determine the prediction accuracy that
         could be obtained from each alphabet size
        Comparisons are drawn




Reduced alphabets for CN
     Amino acids that always remain in the same group
     are marked with solid rectangles




Alphabets for CN
      The two-letter alphabet divides the amino acids into
       hydrophobic and polar
      RMI could not find a five-letter alphabet
      DualRMI did, but only for the target residue
      RMI and DualRMI have a much larger number of framed
       residues, showing more robustness
      For DualRMI we can observe small groups of hydrophobic
       residues, while all polar ones are in the same group
      We can also observe a strange group, GHTS, that mixes
       different kinds of physico-chemical properties
         It is not explained by physico-chemical properties, but by the
          inherent distribution of the dataset

    A retrospective analysis of the dataset reveals why
     G, H, T and S are clustered together
    We computed the proportion of residues with high CN
     for each amino acid type
    These four residues have very similar average
     behavior in relation to CN




Accuracy of CN prediction using BioHEL
     The accuracy difference between the AA
      representation and the best reduced
      alphabets is 0.7%
     The difference is non-significant according to
      t-tests
     RMI and DualRMI perform similarly




Reduced alphabets for SA




Reduced alphabets for SA
         Even though SA and CN are somewhat related
          structural features, the resulting alphabets are
          different
         These alphabets contain more groups of polar
          residues and fewer groups of hydrophobic ones (in
          contrast with CN)
        In DualRMI and 5 letters we can observe very small
         groups
              A, EK for the target alphabet
              G,X for the other residues alphabet
        Again, the GHTS group appears
    Analysis of average SA behavior for
     each AA type
    The reduced alphabet matched the
     properties of the SA feature perfectly




Accuracy of SA prediction with BioHEL
    Accuracy of reduced alphabets for SA
     prediction
    Only DualRMI gave a performance
     statistically similar to the original AA
     representation
    The accuracy difference is 0.4%




Comparison to Other Reduced Alphabets from the
        Literature and Expert-Designed Alphabets Based on
                    Physico-Chemical Properties

     Alphabet    Letters    CN acc.     SA acc.     Diff.      Ref.
     AA          20         74.0±0.6    70.7±0.4    ---        ---
     DualRMI     5          73.3±0.5    70.3±0.4    0.7/0.4    This work
     WW5         6          73.1±0.7    69.6±0.4    0.9/1.1    [Wang & Wang, 99]
     SR5         6          73.1±0.7    69.6±0.4    0.9/1.1    [Solis & Rackovsky, 00]
     MU4         5          72.6±0.7    69.4±0.4    1.4/1.3    [Murphy et al., 00]
     MM5         6          73.1±0.6    69.3±0.3    0.9/1.4    [Melo & Marti-Renom, 06]
     HD1         7          72.9±0.6    69.3±0.4    1.1/1.4    This work
     HD2         9          73.0±0.6    69.3±0.4    1.0/1.4    This work
     HD3         11         73.2±0.6    69.9±0.4    0.8/0.8    This work

     WW5, SR5, MU4 and MM5 are reduced alphabets from the literature;
     HD1-HD3 are expert-designed alphabets.
Reduced Alphabets Comparison
      Automatically reduced alphabets obtain better accuracy,
       but how different are the alphabets themselves?
      We applied again the AA-wise high CN/SA analysis
      Two metrics were computed
          Transitions: how many times the group index changes
           along the list of AAs sorted by CN/SA value
              The fewer changes, the more homogeneous the groups are
          Average range: the range of a reduction group is the
           difference between the minimum and maximum CN/SA of the
           AAs belonging to that group
              The smaller the average range, the more focused the reduction
               groups are in relation to that structural property
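The two metrics can be sketched as follows (our reading of the definitions above, not the authors' code; `aa_values` holds the per-amino-acid high-CN/SA proportion and `groups` the reduction group of each amino acid, in the same order):

```python
def transitions(aa_values, groups):
    """Group-index changes along the amino acids sorted by CN/SA value;
    fewer changes means more homogeneous groups."""
    order = sorted(range(len(aa_values)), key=lambda i: aa_values[i])
    return sum(groups[a] != groups[b] for a, b in zip(order, order[1:]))

def average_range(aa_values, groups):
    """Mean (max - min) CN/SA value within each reduction group;
    smaller means the groups are more focused on that property."""
    by_group = {}
    for v, g in zip(aa_values, groups):
        by_group.setdefault(g, []).append(v)
    return sum(max(vs) - min(vs) for vs in by_group.values()) / len(by_group)
```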


Reduced Alphabets Comparison (CN)




Reduced Alphabets Comparison (SA)




Additional Results

      Are the alphabets interchangeable across
       problems?
      Can these reduced alphabets be applied to
       an evolutionary information-based
       representation?




Results: Are the alphabets interchangeable?
    We applied the alphabet optimized for CN to SA and vice
     versa

    The SA alphabet is good for predicting CN, but the CN
     alphabet obtains poor performance on SA
    Reduced alphabets must always be tailored to the
     domain at hand
Results
      Application of the reduced alphabets to an evolutionary
       information-based representation
          So far we have used only the simple primary sequence
           representation
          Can this process be applied to much richer (and complex)
           representations?
           We computed the position-specific scoring matrix (PSSM)
            representation of our dataset using PSI-BLAST. Each
            instance (9 window positions × 20 scores) is represented by 180
            continuous variables (rather than the 20+1 discrete ones used so far)
          Then, we reduced this representation using our alphabets
              The values of each PSSM profile corresponding to amino acids in
               the same reduction group are averaged
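The averaging step can be sketched with NumPy (shapes assumed from the slide: 9 window positions × 20 PSSM scores reduced to 9 × 5 under a 5-group alphabet; names are ours):

```python
import numpy as np

def reduce_pssm(window_pssm, groups):
    """window_pssm: (positions, 20) array of PSSM scores.
    groups: reduction group of each of the 20 amino acids.
    Columns of amino acids in the same group are averaged, so a
    9 x 20 = 180-attribute window becomes 9 x 5 = 45 attributes."""
    n_groups = max(groups) + 1
    out = np.zeros((window_pssm.shape[0], n_groups))
    for g in range(n_groups):
        cols = [i for i, gi in enumerate(groups) if gi == g]
        out[:, g] = window_pssm[:, cols].mean(axis=1)
    return out
```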


Results

    Application of the reduced alphabets
     to a PSSM representation
    In this way, we reduced the
     representation from 180 attributes
     to 45


Results
     Results of learning from the reduced PSSM
      representation




     The accuracy difference is still less than 1%
     The obtained rule sets are simpler and the training
      process is much faster
     Performance levels are similar to recent works in the
      literature [Kinjo et al., 05][Dor and Zhou, 07]
Conclusions
    We have proposed an automated alphabet reduction protocol
     for protein datasets
         The protocol does not use any domain knowledge
         It automatically tailors the reduced datasets to the domain at hand
    Our experiments show that it is possible to obtain quite
     reduced alphabets (5 letters) with performance similar to
     the original AA alphabet
    Our reduced alphabets are better at CN and SA prediction
     than other alphabets from the literature, as they are better
     suited for these tasks
    The findings from the protocol carry over to state-of-the-art
     protein representations such as PSSM profiles
    We found some unexpected reduction groups (GHTS), but the
     properties of the data showed that this is not an artifact


Future work
         Explore alternative objective evaluation functions
         Other robust MI estimators
         Explore slightly higher-cardinality alphabets
               Is it possible to close the accuracy gap even further?
         Apply this protocol to other kinds of datasets
               E.g. protein mutations
               Structural aspects defined as continuous variables, not
                just discrete ones




Acknowledgements

Contributors to the talks I will give at BGU (in no particular order):
Peter Siepmann, Pawel Widera, James Smaldon, Azhar Ali Shah, Jack Chaplin,
Enrico Glaab, German Terrazas, Hongqing Cao, Jamie Twycross, Jonathan Blake,
Francisco Romero-Campero, Maria Franco, Adam Sweetman, Linda Fiaschi

(in no particular order):
School of Physics and Astronomy, School of Chemistry, School of Pharmacy,
School of Biosciences, School of Mathematics, School of Computer Science,
Centre for Biomolecular Sciences — all the above at UoN

Thanks also go to:
Ben Gurion University of the Negev’s Distinguished Scientists Visitor Program
Professor Dr. Moshe Sipper

Funding from: BBSRC, EPSRC, EU, ESF, UoN

 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 

Kürzlich hochgeladen (20)

Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 

Extended Compact Genetic Algorithms and Learning Classifier Systems for Dimensionality Reduction: a Protein Alphabet Reduction Study Case

  • 1. Extended Compact Genetic Algorithms and Learning Classifier Systems for Dimensionality Reduction: a Protein Alphabet Reduction Study Case Jaume Bacardit & Natalio Krasnogor ASAP - Interdisciplinary Optimisation Laboratory School of Computer Science Centre for Integrative Systems Biology School of Biology Centre for Healthcare Associated Infections Institute of Infection, Immunity & Inflammation University of Nottingham Ben-Gurion University of the Negev - June 23rd to July 5th 2009 - Distinguished Scientist Visitor Program - Beer Sheva, Israel 1 /73 Tuesday, 30 June 2009
  • 2. Acknowledgements  Contributors to the talks I will give at BGU (in no particular order): Peter Siepmann, Pawel Widera, James Smaldon, Azhar Ali Shah, Jack Chaplin, Enrico Glaab, German Terrazas, Hongqing Cao, Jamie Twycross, Jonathan Blake, Francisco Romero-Campero, Maria Franco, Adam Sweetman, Linda Fiaschi  Collaborating schools (in no particular order), all at UoN: School of Physics and Astronomy, School of Chemistry, School of Pharmacy, School of Biosciences, School of Mathematics, School of Computer Science, Centre for Biomolecular Sciences  Thanks also go to: Ben-Gurion University of the Negev’s Distinguished Scientists Visitor Program, Professor Dr. Moshe Sipper  Funding from: BBSRC, EPSRC, EU, ESF, UoN
  • 3. Outline  Introduction to Learning Classifier Systems and Extended Compact GA  Problem Definition  Methods (ECGA, LCS, Mutual Information)  Results  Conclusions and further work
  • 4. Based on Various Papers  J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith, and N. Krasnogor. Automated alphabet reduction for protein datasets. BMC Bioinformatics, 10(6), 2009.  J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference, ISBN 978-1-59593-697-4, pages 346-353. ACM Press, 2007. This paper won the Bronze Medal in the 2007 “Humies” Awards for Human-Competitive Results Produced by Genetic and Evolutionary Computation.  J. Bacardit and N. Krasnogor. Performance and efficiency of memetic Pittsburgh learning classifier systems. Evolutionary Computation, 17(3), 2009 (to appear).  J. Bacardit, E.K. Burke, and N. Krasnogor. Improving the scalability of rule-based evolutionary learning. Memetic Computing, 1(1), 2009 (to appear).  J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in bioinformatics. Newsletter of ACM Special Interest Group on Genetic and Evolutionary Computation: SIGEvolution, 3(1):2-10, 2008.  All papers available from: www.cs.nott.ac.uk/~nxk/publications.html
  • 5. Learning Classifier Systems (LCS) are one of the major families of techniques that apply evolutionary computation to machine learning tasks  Machine learning: how to construct programs that automatically improve with experience [Mitchell, 1997]  Classification task: learning how to label new instances from a domain correctly, based on a set of previously labelled instances  LCS are almost as old as GAs; Holland made one of the first proposals  Two of the first LCS proposals are [Holland & Reitman, 78] and [Smith, 80]
  • 6. Traditionally there have been two different paradigms of LCS  The Pittsburgh approach [Smith, 80]  The Michigan approach [Holland & Reitman, 78]  More recently: the Iterative Rule Learning approach [Venturini, 93]  Knowledge representations  All the initial approaches were rule-based  In recent years several knowledge representations have been used in the LCS field: decision trees, synthetic prototypes, etc.
  • 7. Classification task  Classification task: learning how to label new instances from a domain correctly, based on a set of previously labelled instances  [Diagram: Training Set → Learning Algorithm → Inference Engine; New Instance → Inference Engine → Class]
  • 8. Classification task  If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then class = 1, else class = 0  [Plot: unit square with axes X and Y; the top-left and bottom-right corner regions are labelled 1, the rest 0]
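Written as code, the slide’s toy rule is a single predicate (the function name is ours, for illustration):

```python
def classify(x, y):
    """Toy two-attribute rule from the slide: the two opposite corners
    of the unit square are class 1, everything else is class 0."""
    if (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25):
        return 1
    return 0
```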
  • 9. Paradigms of LCS  The Pittsburgh approach  Each individual is a complete solution to the classification problem  Traditionally this means that each individual is a variable-length set of rules  GABIL [De Jong & Spears, 93] is a well-known representative of this approach  The fitness function is based on the rule set’s accuracy on the training set (and usually also on its complexity)
  • 10. Paradigms of LCS  The Pittsburgh approach  Crossover operator [diagram: parents → offspring]  Mutation operator: bit flipping  Individuals are interpreted as a decision list: an ordered rule set (rules 1–8)  Instance 1 matches rules 2, 3 and 7 → rule 2 will be used  Instance 2 matches rules 1 and 8 → rule 1 will be used  Instance 3 matches rule 8 → rule 8 will be used  Instance 4 matches no rules → instance 4 will not be classified
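Decision-list inference as described above can be sketched in a few lines (the predicates below are illustrative stand-ins, not the slide’s rules 1–8): the first matching rule decides, and an instance matching no rule is left unclassified.

```python
def predict(decision_list, instance):
    """Return the class of the first rule whose predicate matches;
    None means the instance is left unclassified."""
    for predicate, label in decision_list:
        if predicate(instance):
            return label
    return None

# Illustrative ordered rule set (a decision list)
rules = [
    (lambda i: i["x"] < 0.25, "A"),   # rule 1
    (lambda i: i["y"] > 0.75, "B"),   # rule 2
    (lambda i: True, "C"),            # rule 3 acts as a catch-all
]
```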
  • 11. Paradigms of LCS  The Michigan approach  Each individual is a single rule  The whole population cooperates to solve the classification problem  A reinforcement system is used to identify the good rules  A GA is used to explore the search space for more rules  XCS [Wilson, 95] is the most well-known Michigan LCS
  • 12. Paradigms of LCS  Working cycle [diagram: Michigan LCS working cycle]
  • 13. Paradigms of LCS  The Iterative Rule Learning approach  Each individual is a single rule  Individuals compete as in a standard GA  A single GA run generates one rule  The GA is run iteratively to learn all the rules that solve the problem  Instances already covered by previous rules are removed from the training set of the next iteration
  • 14. Paradigms of LCS  The Iterative Rule Learning approach  HIDER System [Aguilar, Riquelme & Toro, 03] 1. Input: Examples 2. RuleSet = Ø 3. While |Examples| > 0 3.1. Rule = Run GA with Examples 3.2. RuleSet = RuleSet ∪ Rule 3.3. Examples = Examples \ Covered(Rule) 4. EndWhile 5. Output: RuleSet
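The HIDER loop above can be sketched as runnable Python (a sketch: run_ga stands in for the GA run of step 3.1, and toy_ga is a hypothetical placeholder that “learns” one rule covering one example per call):

```python
def iterative_rule_learning(examples, run_ga):
    """Sketch of the IRL loop: run_ga(examples) returns a
    (rule, covered_examples) pair; learning stops when no examples remain."""
    rule_set = []
    remaining = set(examples)
    while remaining:
        rule, covered = run_ga(remaining)
        rule_set.append(rule)
        remaining -= covered          # Examples = Examples \ Covered(Rule)
    return rule_set

def toy_ga(examples):
    """Hypothetical stand-in for the GA run: cover a single example."""
    e = min(examples)
    return ("rule covering %r" % e, {e})
```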
  • 15. Bioinformatics-oriented Hierarchical Evolutionary Learning (BioHEL)  BioHEL [Bacardit et al., 07] is a recent learning system that applies the Iterative Rule Learning (IRL) approach to generate sets of rules  IRL was first used in EC by the SIA system [Venturini, 93]  BioHEL is strongly inspired by GAssist [Bacardit, 04], a Pittsburgh-approach Learning Classifier System
  • 16. BioHEL learning paradigm  IRL has been used for many years in the ML community under the name separate-and-conquer
  • 17. BioHEL’s objective function  An objective function based on the Minimum Description Length (MDL) principle (Rissanen, 1978) that tries to promote rules with  High accuracy: not making mistakes  High coverage: covering as many examples as possible without sacrificing accuracy. Recall (TP/(TP+FN)) will be used to define coverage  Low complexity: rules as simple and general as possible  The objective function is a linear combination of the three objectives above
  • 18. BioHEL’s objective function  Intuitively, we would like to have accurate rules covering as many examples as possible.  However, in complex and inconsistent domains it is rare to obtain such rules  In these cases, the easier path for the evolutionary search is to maximize accuracy at the expense of coverage  Therefore, we need to enforce that the evolved rules cover enough examples
  • 19. Methods: BioHEL’s objective function  Three parameters define the shape of the function  The choice of the coverage break is crucial for the proper performance of the system  Also, the coverage term penalizes rules that do not cover a minimum percentage of examples or that cover too many
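The linear combination of accuracy, coverage and complexity described on these slides can be sketched as follows. This is an illustrative sketch only, not BioHEL’s published formula: the function name, the weights, and the exact shape of the coverage-break term are all our assumptions.

```python
def rule_fitness(tp, fp, fn, n_expressed_attrs,
                 w_cov=0.5, w_cmp=0.01, coverage_break=0.1, total=1000):
    """Illustrative linear combination (NOT BioHEL's exact MDL formula):
    accuracy + rewarded coverage - complexity penalty."""
    accuracy = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # coverage = recall
    matched_frac = (tp + fp) / total
    # Below the coverage break, the coverage reward is scaled down,
    # discouraging over-specific rules that match too few examples.
    if matched_frac >= coverage_break:
        coverage = recall
    else:
        coverage = recall * matched_frac / coverage_break
    complexity = n_expressed_attrs                    # simpler rules win ties
    return accuracy + w_cov * coverage - w_cmp * complexity
```

A rule with the same accuracy but higher coverage, or the same coverage but fewer expressed attributes, scores higher under this sketch, mirroring the three objectives on the slide.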
  • 20. BioHEL’s other characteristics  Attribute list rule representation  Automatically identifying the relevant attributes for a given rule and discarding all the other ones  The ILAS windowing scheme  Efficiency enhancement method: not all training points are used for each fitness computation  An explicit default rule mechanism  Generating more compact rule sets  The iterative process terminates when it is impossible to evolve a rule whose associated class is the majority class among the matched examples  At this point, all remaining training instances are assigned to the default class  Ensembles for consensus prediction  An easy way of boosting robustness
  • 21. Knowledge representations  Representation of XCS for binary problems: ternary representation  Ternary alphabet {0,1,#}  If A1=0 and A2=1 and A3 is irrelevant → class 0  01#|0  Representation of XCS for real-valued attributes: real-valued interval  XCSR [Wilson, 99]  The interval is codified with two variables, center & spread: [center−spread, center+spread]  UBR [Stone & Bull, 03]  The two bounds of the interval are codified directly with two real-valued variables. The variable with the lowest value is the lower bound, the variable with the highest value is the upper bound
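The ternary matching just described is a one-liner: `#` is a wildcard, and a rule matches when every non-`#` position agrees with the example (names below are ours, for illustration):

```python
def matches(condition, example_bits):
    """XCS-style ternary matching: '#' means 'don't care'."""
    return all(c == "#" or c == b for c, b in zip(condition, example_bits))

# "If A1=0 and A2=1 and A3 is irrelevant -> class 0" encodes as "01#|0"
condition, action = "01#|0".split("|")
```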
  • 22. Knowledge representations  Representation of GABIL for nominal attributes  Predicate → Class  Predicate: Conjunctive Normal Form (CNF) (A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm)  Ai: ith attribute  Vij: jth value of the ith attribute  The rules can be mapped into a binary string, e.g., 3 attributes with {3,5,2} values each respectively:  (A1=V11 ∨ A1=V13) ∧ (A2=V22 ∨ A2=V24 ∨ A2=V25) ∧ (A3=V31) → 101|01011|10
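Matching an example against the flattened GABIL bitstring can be sketched as follows (function and variable names are ours; the example reuses the slide’s 3-attribute rule with {3,5,2} values, encoded "101"+"01011"+"10"):

```python
def gabil_matches(rule_bits, sizes, example):
    """GABIL CNF matching: rule_bits concatenates one bit per value of
    each nominal attribute; a conjunct is satisfied when the bit for the
    example's value index is set."""
    pos = 0
    for size, value_index in zip(sizes, example):
        if rule_bits[pos + value_index] != "1":
            return False              # this conjunct is not satisfied
        pos += size
    return True

# Slide example: (A1=V11 v A1=V13) ^ (A2=V22 v A2=V24 v A2=V25) ^ (A3=V31)
rule = "101" + "01011" + "10"
```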
  • 23. Knowledge representations  Pittsburgh representations for real-valued attributes:  Rule-based: Adaptive Discretization Intervals (ADI) representation [Bacardit, 04]  Intervals in ADI are built using as possible bounds the cut-points proposed by a discretization algorithm  The search bias promotes maximally general intervals  Several discretization algorithms are used at the same time to reduce bias
  • 24. Knowledge representations  Pittsburgh representations for real-valued attributes:  Decision trees [Llorà, 02]  Nodes in the trees can use orthogonal or oblique criteria
  • 25. Knowledge representations  Pittsburgh representations for real-valued attributes  Synthetic prototypes [Llorà, 02]  Each individual is a set of synthetic instances  These instances are used as the core of a nearest-neighbor classifier
  • 26. Extended Compact Genetic Algorithm (ECGA)  ECGA belongs to a class of Evolutionary Algorithms called Estimation of Distribution Algorithms (EDA)  no crossover or mutation!  instead, a probabilistic model of the structure of the problem is kept  individuals are sampled from this probability distribution model
  • 27. Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.  Key Idea Behind Compact GA (CGA)
  • 28. Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.  Gene interactions must be accounted for  Approximates complex distributions by Marginal Distribution Models (i.e. gene partitions)  Selects amongst alternative models by means of the MDL criterion
  • 29. Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.
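The MDL model selection referenced on these slides can be sketched as follows, following our reading of the Harik, Lobo & Sastry chapter for binary genes: the model complexity grows with the size of each gene group, while the compressed population complexity is the population size times the entropy of each group’s marginal distribution. The exact constants here are a sketch, not a verbatim transcription of the chapter’s formulas.

```python
from collections import Counter
from math import log2

def ecga_mdl_score(population, partition):
    """Combined MDL score of a gene partition over a binary population
    (lower is better): model complexity + compressed population complexity."""
    n = len(population)
    # Model complexity: each group of s binary genes needs 2^s - 1 frequencies.
    model = sum((2 ** len(group) - 1) for group in partition) * log2(n + 1)
    # Compressed population complexity: n * entropy of each group's marginal.
    compressed = 0.0
    for group in partition:
        counts = Counter(tuple(ind[i] for i in group) for ind in population)
        entropy = -sum(c / n * log2(c / n) for c in counts.values())
        compressed += n * entropy
    return model + compressed
```

With a population where genes 0 and 1 are perfectly correlated, the partition that groups them together compresses the population better and wins despite its larger model term.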
  • 30. Outline  Introduction to Learning Classifier Systems and Extended Compact GA  Problem Definition  Methods (ECGA, LCS, Mutual Information)  Results  Conclusions and further work
  • 31. Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein based on its primary sequence  [Diagram: Primary Sequence → 3D Structure]
  • 32.  PSP is a very costly process  As an example, one of the best PSP methods in the last CASP meeting, Rosetta@Home, used up to 10^4 computing years to predict a single protein’s 3D structure  Ways to alleviate the computational burden:  simplify the problem  simplify the representation used to model the proteins
  • 33. From Full PSP to CN prediction  Two residues of a chain are said to be in contact if their distance is less than a certain threshold  [Diagram: Primary Sequence → Native State, with a contact highlighted]  CN of a residue: the number of contacts that a certain residue has  In this specific case we predict, e.g., whether the CN of a residue is smaller or higher than the middle point of the CN domain
  • 34. From Full PSP to SA prediction  Solvent Accessibility: the amount of surface of each residue that is exposed to the solvent (e.g. water)  The metric is normalised for each AA type  The problem is to predict whether SA is lower or higher than 25%
  • 35.  PSP is a very costly process  As an example, one of the best PSP methods in the last CASP meeting, Rosetta@Home, used up to 10^4 computing years to predict a single protein’s 3D structure  Ways to alleviate the computational burden:  simplify the problem  simplify the representation used to model the proteins
  • 36. The Primary Sequence of a protein (the amino acid type of the elements of a protein chain) is a usual target for such simplification  It is composed of a quite high-cardinality alphabet of 20 symbols  One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, reducing these 20 symbols to just two  The HP representation is usually too simple; information is lost in the reduction process  M. Stout, et al. Prediction of residue exposure and contact number for simplified HP lattice model proteins using learning classifier systems. In Proceedings of the 7th International FLINS Conference on Applied Artificial Intelligence, pages 601-608. World Scientific, August 2006.  M. Stout, J. Bacardit, J.D. Hirst, N. Krasnogor, and J. Blazewicz. From HP lattice models to real proteins: coordination number prediction using learning classifier systems. In 4th European Workshop on Evolutionary Computation and Machine Learning in Bioinformatics, volume 3907 of Springer Lecture Notes in Computer Science, pages 208–220, Budapest, Hungary, April 2006. Springer. ISBN 978-3-540-33237-4.  Papers at: http://www.cs.nott.ac.uk/~nxk/publications.html
  • 37.  Research Questions:  Are there “simplified” alphabets that retain key information content while simplifying interpretation, processing time, etc.?  If yes, are these alphabets general for any problem domain, or domain specific?  Can we automatically generate these alphabets and tailor them to the specific domain we are predicting?
  • 38. Outline  Introduction to Learning Classifier Systems and Extended Compact GA  Problem Definition  Methods (ECGA, LCS, Mutual Information)  Results  Conclusions and further work
  • 39. Use an (automated) information theory-driven pipeline to reduce the alphabet for PSP datasets  Use the Extended Compact Genetic Algorithm (ECGA) to find a dimensionality reduction policy (guided by a fitness function based on the Mutual Information (MI) metric)  Two PSP datasets will be used as testbeds:  Coordination Number (CN) prediction  Relative Solvent Accessibility (SA) prediction  Verify the optimized reduction policies with BioHEL, an evolutionary-computation based rule learning system  J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith, and N. Krasnogor. Automated alphabet reduction for protein datasets. BMC Bioinformatics, 10(6), 2009.  J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in bioinformatics. Newsletter of ACM Special Interest Group on Genetic and Evolutionary Computation: SIGEvolution, 3(1):2-10, 2008.  J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference, ISBN 978-1-59593-697-4, pages 346-353. ACM Press, 2007.  All papers at: http://www.cs.nott.ac.uk/~nxk/publications.html
  • 40. Protein dataset proposed by [Kinjo et al., 05]  1050 proteins  259768 residues  Proteins were selected from PDB-REPRDB using these conditions:  Less than 30% sequence identity  More than 50 residues  Resolution better than 2Å  No membrane proteins, no chain breaks, no non-standard residues  Crystallographic R-factor better than 20%  The dataset is partitioned into training/test sets using ten-fold cross-validation
  • 41. Instance Representation  A sliding window of ±5 residues around the target: AAi-5 AAi-4 AAi-3 AAi-2 AAi-1 AAi AAi+1 AAi+2 AAi+3 AAi+4 AAi+5 → CNi  Consecutive instances (shown here with a ±1 window): AAi-1, AAi, AAi+1 → CNi; AAi, AAi+1, AAi+2 → CNi+1; AAi+1, AAi+2, AAi+3 → CNi+2
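The windowed instance generation above can be sketched as follows (a sketch: we use a ±1 window for brevity where the slides use ±5, and "X" as an assumed padding symbol for positions past the chain ends, in the spirit of the 21st end-of-chain symbol mentioned later):

```python
def windowed_instances(sequence, targets, w=1):
    """Build one instance per residue from a +/-w window of amino-acid
    types; positions outside the chain are padded with 'X'."""
    instances = []
    for i in range(len(sequence)):
        window = [sequence[j] if 0 <= j < len(sequence) else "X"
                  for j in range(i - w, i + w + 1)]
        instances.append((window, targets[i]))   # (attributes, class)
    return instances
```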
  • 42. Taken from: J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference, ISBN 978-1-59593-697-4, pages 346-353. ACM Press, 2007.
  • 43. General Workflow of the Alphabet Reduction Pipeline  [Diagram: Dataset (|Σ|=20) → ECGA (guided by Mutual Information, reduced size = N) → Dataset (|Σ|=N) → BioHEL → Ensemble of rule sets → Accuracy on the test set]
  • 44. Methods: alphabet reduction strategies  Three strategies were evaluated  They represent progressive levels of sophistication  Mutual Information (MI)  Robust Mutual Information (RMI)  Dual Robust Mutual Information (DualRMI)  Thus MI, RMI and DualRMI were used in separate experiments as the “fitness” function for the ECGA tournament phase.
  • 45. Methods: MI strategy  There are 21 symbols (20 AA + end of chain) in the alphabet  Each symbol will be assigned to a group in the chromosome used by ECGA
  • 46. Methods: MI strategy  Objective function for the MI strategy: Mutual Information  Mutual Information is a measure that quantifies the interrelationship between two discrete variables  X is the reduced representation of the window of residues around the target  Y is the two-state definition of CN or SA
  • 47. Methods: MI strategy  Steps of the objective function computation for the MI strategy 1. Reduction mappings are extracted from the chromosome 2. Instances of the training set are transformed into the lower-cardinality alphabet 3. The mutual information between the class attribute and the string formed by concatenating the input attributes is computed 4. This MI is assigned as the result of the evaluation function
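Steps 2–3 amount to computing MI between the concatenated reduced window (X) and the class (Y). A self-contained sketch of the MI computation itself (function name is ours; in the pipeline each x would be the concatenated reduced-alphabet window string):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """MI(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) p(y))),
    estimated from the paired samples xs, ys."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```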
• 48. Methods: MI strategy  Problem of the MI strategy  Mutual Information needs redundancy in order to be a good estimator  That is, each possible pattern in X and Y should be well represented in the dataset  Patterns in Y are always well represented. What happens with patterns in X in our dataset?  Our sample, despite having almost 260,000 residues, is too small (#letters → represented patterns: 2 → 100%, 3 → 97.8%, 4 → 57.6%, 5 → 11.3%, 20 → 3.1E-07)
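The coverage figures above can be estimated with a sketch like the following (hypothetical helper, assuming instances are tuples over the reduced alphabet):

```python
def pattern_coverage(instances, alphabet_size):
    """Fraction of all alphabet_size**window_width possible patterns
    of X that actually occur in the sample; MI only becomes a good
    estimator when this fraction is high."""
    width = len(instances[0])
    seen = set(map(tuple, instances))
    return len(seen) / alphabet_size ** width

# 3 of the 4 possible two-position binary patterns are present
print(pattern_coverage([(0, 0), (0, 1), (1, 0)], 2))  # → 0.75
```

Because the number of possible patterns grows exponentially with alphabet size, coverage collapses long before reaching the full 20-letter alphabet, which is what the table shows.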
• 49. Methods: RMI strategy  In order to solve the sample-size problem of the MI strategy, we use a robust MI estimator proposed by [Cline et al., 02]  Pairs of (x,y) in the dataset are scrambled  That is, each x in the dataset is randomly joined to a y, but the distributions of x and y remain equal  MI is computed for the scrambled dataset  This process is repeated N times, and the average scrambled MI (MIs) is computed  Finally, the value of the objective function is MI − MIs  MIs is an estimate of the sampling bias in the data. By subtracting it from the original MI metric we obtain a less biased objective function
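A minimal sketch of that estimator (function names hypothetical; the number of scrambles and the RNG seed are arbitrary choices here):

```python
import random
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def robust_mi(xs, ys, n_shuffles=20, seed=0):
    """MI minus the average MI of n_shuffles scrambled datasets:
    each scramble re-pairs the ys with the xs at random, keeping
    both marginal distributions intact, so the average scrambled
    MI (MIs) estimates the sampling bias."""
    rng = random.Random(seed)
    bias = 0.0
    for _ in range(n_shuffles):
        shuffled = list(ys)
        rng.shuffle(shuffled)
        bias += mutual_information(xs, shuffled)
    return mutual_information(xs, ys) - bias / n_shuffles
```

Because every scrambled MI is non-negative, robust_mi never exceeds the raw MI; on strongly dependent data the correction is small, while on undersampled data it removes most of the spurious signal.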
• 50. Methods: DualRMI strategy  The next strategy is based on observations we made in previous work [Bacardit et al., 06]  Example: in a rule set for predicting CN from primary sequence, the predicate associated with the target residue (AA) is very different from the predicates associated with the other window positions
• 51. Methods: DualRMI strategy  Why not generate two reduced alphabets at the same time?  One for the target residue  One for the other residues in the window  The objective function remains unchanged
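Only the encoding changes: each instance is reduced with two mappings instead of one. A minimal sketch (names hypothetical), assuming the target residue sits at the centre of the window:

```python
def dual_reduce(window, target_map, context_map):
    """Reduce one window of residues: the centre (target) residue uses
    its own reduced alphabet, every other position uses the second one."""
    centre = len(window) // 2
    return tuple(target_map[aa] if i == centre else context_map[aa]
                 for i, aa in enumerate(window))

# Target residue 'C' reduced with one mapping, flanking 'A's with another
print(dual_reduce("ACA", {"A": 0, "C": 1}, {"A": 2, "C": 3}))  # → (2, 1, 2)
```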
• 52. Outline  Introduction to Learning Classifier Systems and Extended Compact GA  Problem Definition  Methods (ECGA, LCS, Mutual Information)  Results  Conclusions and further work
• 53. Experimental design  For each problem (CN, SA) and each reduction strategy (MI, RMI, DualRMI), ECGA was run to generate alphabets of two, three, four and five letters  Afterwards, BioHEL was trained on the reduced datasets to determine the prediction accuracy that could be obtained with each alphabet size  Comparisons are drawn
• 54. Reduced alphabets for CN Amino acids that always remain in the same group are marked with solid rectangles
• 55. Alphabets for CN  The two-letter alphabet divides the amino acids into hydrophobic and polar  RMI could not find a five-letter alphabet  DualRMI did, but only for the target residue  RMI and DualRMI have a much larger number of framed residues, showing more robustness  For DualRMI we can observe small groups of hydrophobic residues, while all polar ones are in the same group  We can also observe a strange group, GHTS, that mixes different kinds of physico-chemical properties  Not explained by those properties but by the inherent distribution of the dataset
• 56. A retrospective analysis of the dataset reveals why G, H, T and S are clustered together  We computed, for each amino acid type, the proportion of residues with high CN  These four residues have very similar average behavior in relation to CN
• 57. Accuracy of CN prediction using BioHEL  The accuracy difference between the AA representation and the best reduced alphabets is 0.7%  The difference is non-significant according to t-tests  RMI and DualRMI perform similarly
• 58. Reduced alphabets for SA
• 59. Reduced alphabets for SA  Even though SA and CN are somewhat related structural features, the resulting alphabets are different  These alphabets contain more groups of polar residues and fewer groups of hydrophobic ones (in contrast with CN)  In DualRMI with 5 letters we can observe very small groups  A, EK for the target alphabet  G, X for the other-residues alphabet  Again, the GHTS group appears
• 60. Analysis of average SA behavior for each AA type  The reduced alphabet matched the properties of the SA feature perfectly
• 61. Accuracy of SA prediction with BioHEL  Accuracy of reduced alphabets for SA prediction  Only DualRMI managed to give performance statistically similar to the original AA representation  The accuracy difference is 0.4%
• 62. Comparison to Other Reduced Alphabets from the Literature and Expert-Designed Alphabets Based on Physico-Chemical Properties

  Alphabet  Letters  CN acc.   SA acc.   Diff. (CN/SA)  Ref.
  AA        20       74.0±0.6  70.7±0.4  ---            ---
  DualRMI   5        73.3±0.5  70.3±0.4  0.7/0.4        This work
  Alphabets from the literature:
  WW5       6        73.1±0.7  69.6±0.4  0.9/1.1        [Wang & Wang, 99]
  SR5       6        73.1±0.7  69.6±0.4  0.9/1.1        [Solis & Rackovsky, 00]
  MU4       5        72.6±0.7  69.4±0.4  1.4/1.3        [Murphy et al., 00]
  MM5       6        73.1±0.6  69.3±0.3  0.9/1.4        [Melo & Marti-Renom, 06]
  Expert-designed alphabets:
  HD1       7        72.9±0.6  69.3±0.4  1.1/1.4        This work
  HD2       9        73.0±0.6  69.3±0.4  1.0/1.4        This work
  HD3       11       73.2±0.6  69.9±0.4  0.8/0.8        This work
• 63. Reduced Alphabets Comparison  Automatically reduced alphabets obtain better accuracy, but how different are the alphabets themselves?  We applied again the AA-wise high CN/SA analysis  Two metrics were computed  Transitions: how many times the group index changes along the sorted list of AAs  The fewer the changes, the more homogeneous the groups are  Average range: the range of a reduction group is the difference between the minimum and maximum CN/SA of the AAs belonging to that group  The smaller the average range, the more focused the reduction groups are in relation to that structural property
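Both metrics are straightforward to compute. A sketch, assuming group_of maps each AA to its reduction group and value_of maps it to its average CN/SA value (all names hypothetical):

```python
def transitions(group_of, value_of):
    """Count group-index changes while walking the AAs in order of
    their CN/SA value; fewer changes = more homogeneous groups."""
    order = sorted(value_of, key=value_of.get)
    return sum(group_of[a] != group_of[b] for a, b in zip(order, order[1:]))

def average_range(group_of, value_of):
    """Mean (max - min) CN/SA value inside each reduction group;
    a smaller average range = groups more focused on the property."""
    by_group = {}
    for aa, g in group_of.items():
        by_group.setdefault(g, []).append(value_of[aa])
    return sum(max(v) - min(v) for v in by_group.values()) / len(by_group)
```

For example, grouping {A,B} and {C,D} with sorted values A < B < C < D gives a single transition, while the interleaved grouping {A,C} and {B,D} gives three.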
• 64. Reduced Alphabets Comparison (CN)
• 65. Reduced Alphabets Comparison (SA)
• 66. Additional Results  Are the alphabets interchangeable across problems?  Can these reduced alphabets be applied to an evolutionary information-based representation?
• 67. Results: Are the alphabets interchangeable?  We applied the alphabet optimized for CN to SA and vice versa  The SA alphabet is good for predicting CN, but the CN alphabet obtains poor performance on SA  Reduced alphabets must always be tailored to the domain at hand
• 68. Results  Application of the reduced alphabets to an evolutionary information-based representation  So far we have used only the simple primary-sequence representation  Can this process be applied to much richer (and more complex) representations?  We computed the position-specific scoring matrix (PSSM) representation of our dataset using PSI-BLAST. Each instance (9 window positions) is represented by 180 continuous variables (rather than 20+1 as originally done)  Then, we reduced this representation using our alphabets  The values of each PSSM profile corresponding to amino acids in the same reduction group are averaged
• 69. Results  Application of reduced alphabets to a PSSM representation  Thus, we reduced the representation from 180 attributes to 45
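A sketch of the reduction for one window position (names hypothetical; applied to all 9 positions this turns 9 × 20 = 180 attributes into 9 × 5 = 45):

```python
def reduce_pssm_position(scores, group_of):
    """Average, per reduction group, the PSSM scores of the amino
    acids in that group: 20 continuous values per window position
    become one value per reduction group."""
    sums, counts = {}, {}
    for aa, score in scores.items():
        g = group_of[aa]
        sums[g] = sums.get(g, 0.0) + score
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

# Three PSSM scores collapsed into two group averages
print(reduce_pssm_position({"A": 1.0, "C": 3.0, "D": 2.0},
                           {"A": 0, "C": 0, "D": 1}))  # → {0: 2.0, 1: 2.0}
```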
• 70. Results  Results of learning from the reduced PSSM representation  The accuracy difference is still less than 1%  The obtained rule sets are simpler and the training process is much faster  Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07]
• 71. Conclusions  We have proposed an automated alphabet reduction protocol for protein datasets  The protocol does not use any domain knowledge  It automatically tailors the reduced datasets to the domain at hand  Our experiments show that it is possible to obtain quite reduced alphabets (5 letters) with performance similar to the original AA alphabet  Our reduced alphabets are better at CN and SA prediction than other alphabets from the literature, as they are better suited to these tasks  The findings of the protocol can be used in state-of-the-art protein representations such as PSSM profiles  We found some unexpected reduction groups (GHTS), but the properties of the data showed that this is not an artifact
• 72. Future work  Explore alternative objective evaluation functions  Other robust MI estimators  Explore slightly higher-cardinality alphabets  Is it possible to close the accuracy gap even further?  Apply this protocol to other kinds of datasets  E.g. protein mutations  Structural aspects defined as continuous variables, not just discrete ones
• 73. Acknowledgements  Contributors to the talks I will give at BGU (in no particular order): Peter Siepmann, Pawel Widera, James Smaldon, Azhar Ali Shah, Jack Chaplin, Enrico Glaab, German Terrazas, Hongqing Cao, Jamie Twycross, Jonathan Blake, Francisco Romero-Campero, Maria Franco, Adam Sweetman, Linda Fiaschi  Thanks also go to (in no particular order): School of Physics and Astronomy, School of Chemistry, School of Pharmacy, School of Biosciences, School of Mathematics, School of Computer Science, Centre for Biomolecular Sciences (all the above at UoN); Ben-Gurion University of the Negev’s Distinguished Scientists Visitor Program; Professor Dr. Moshe Sipper  Funding from: BBSRC, EPSRC, EU, ESF, UoN