Saravanan Anandathiyagar                                                                   Project Background Paper
March 2002                                                                                  Supervisor: Simon Colton
                                           A Substructure Server

    Abstract

              Much of the reason for the high cost of medicines is rooted in the length and
              complexity of the development and approval process. At every possible stage of
              development, it is possible that a potential drug (leader) will fail to gain approval
              on the basis that it produces erratic results or harmful side effects.

              Predictive toxicology aims to reduce the money and time spent by identifying as
              early on in the drug development process as possible leaders that are likely to fail.
              Numerous machine learning techniques exist to identify such leaders. Here we
          present a possible solution based on the Find Maximally Specific Hypothesis (Find-S)
              algorithm. This algorithm, given a set of positive and negative examples of data,
              finds substructures that are statistically true of the majority of positive compounds,
              and statistically not true of the negative compounds.

              A discussion of the algorithm and its motivation is presented here.






                                                              Contents
Abstract
Contents
1. Introduction
   1.1. Motivation
   1.2. Summary of Report
2. Previous Research
   2.1. Structure-Activity Relationships
   2.2. Attribute-based representations
   2.3. Relational-based representations
   2.4. Inductive logic programming
3. The Find-S Technique
   3.1. Motivation
   3.2. General-to-specific ordering of hypotheses
   3.3. The Find-S algorithm
   3.4. Algorithm evaluation methods
   3.5. Issues with the Find-S technique
   3.6. Existing Prolog implementation
4. Implementation Considerations
   4.1. Representing structures
   4.2. Improvement of current implementation
   4.3. Extensions
5. References





1. Introduction
    1.1. Motivation
         Each year, drug companies release new and improved drugs, claiming that they produce better
         results with fewer side effects. However, the cost of such advances in the drug industry is not small.
         Developing a drug from the theoretical stage to its appearance on pharmacy shelves normally takes
         in the region of 10 to 15 years, at an average cost of over £500 million [1]. This outlay by the
         drug company must be covered by the consumer for the company to remain in profit, and evidence
         of this can be seen, for example, in the regular rise of NHS prescription charges.

         Much of the reason for the high cost of medicines is rooted in the length and complexity of the
         development and approval process. At every possible stage of development, it is possible that a
         potential drug (leader) will fail to gain approval on the basis that it produces erratic results or
         harmful side effects. Even after promising lab tests, further experiments on animal specimens often
         return ideas to the drawing board. It is estimated that for every one drug that reaches clinical
         (human) trial stage, another 1000 have failed earlier testing.

         Despite this, it is important to note that medicines still reduce overall medical care costs by reducing
         even more expensive hospitalisation, surgery or other treatments. Drugs are the primary way of
         controlling the outcomes of chronic illness. Therefore, the development of new drugs is important
         for both patient care and for the positive long-term financial implications.

         It is clear that reducing the number of drug leaders developed at an early stage will have a significant
         effect in limiting development costs. Determining at an early stage that a leader is unsuitable for
         further testing saves the investment that may otherwise have been spent on this drug, only for the
         same conclusion to be reached. For this reason, the field of predictive toxicology was born. It is an
         effort on the part of biotechnology companies to predict in advance whether or not a drug will be
         toxic, using various techniques learnt from the fields of statistics, artificial intelligence (AI), and
         machine learning.

         Negative effects of a drug can range from relatively minor problems such as headaches and stomach
         upsets, to potentially life-threatening organ damage. While many accepted drugs do produce some
         side effects for some patients, the value of the treatment is always said to outweigh the side effects.
         However, there are certain characteristics of chemical compounds that will limit their effectiveness
         as a drug. Predictive toxicology aims to find this drug toxicity while still in the planning stages. Ruling
         out a leader at this early stage saves it being synthesised and tested, and allows resources to be
         focused on more promising areas of research.

         Machine learning programs in a variety of different guises have been used to try and discover the
         reasons why certain chemicals are toxic and others are not. Essentially, they learn a concept that is
         true of the toxic drugs and false for other non-toxic drugs. These derived concepts are usually small





         (around five or six atoms) sub-structures of the larger drug molecule, where some of the atoms are
         fixed elements and others may vary.

         The task at hand is to effectively and efficiently identify such sub-structures using the Find
         Maximally Specific Hypothesis (Find-S) machine learning algorithm. An implementation of the
         algorithm has been written in PROLOG by S Colton; our work here is based on extending this
         implementation and producing a web-based server application.

         A molecule is said to be positive if it contains the sub-structure in question; conversely, it is said
         to be negative if it does not. Given positive and negative molecules, the application will return
         interesting substructures, where each substructure is true of statistically significantly more
         positives than negatives.



    1.2. Summary of Report
         This report is an overview of the research undertaken, with an outline of how implementation of a
         Substructure Server may proceed. Section 2 summarises the machine learning techniques used in the
          field of predictive toxicology, and introduces the concepts of attribute-based and relational-based
          structure-activity relationships.

         Section 3 is a comprehensive overview of the Find-S algorithm, with an emphasis on how it may
         perform in a predictive toxicology situation. A fictional example is presented and analysed which
         demonstrates the key methodologies of the technique. Evaluation techniques applicable to both the
         algorithm itself and to the results it produces are outlined, as well as various considerations that
         should be addressed on implementation. S Colton’s existing Prolog implementation of the algorithm
         is also discussed.

         Section 4 highlights some implementation considerations, suggesting a possible course of action
         towards building a substructure server available for public use.





2. Previous Research
    As mentioned above, machine learning algorithms for finding relevant sub-structures have been applied
    in the field of predictive toxicology. It is important to understand the approaches taken in previous
    work, using them as a basis for further study.

    The key features of the background study undertaken are summarised in this section.


    2.1. Structure-Activity Relationships
         A structure-activity relationship (SAR) models the relationship between activities and
         physicochemical properties of a set of compounds [2]. The goal of our work is essentially to form
         SARs from the given input molecules. The resultant SARs represent the substructures most likely
         to contribute to toxicity, as calculated by our algorithm.

         A SAR is derived from two components:
              •   The learning algorithm employed during derivation, and
              •   The choice of representation to describe the chemical structure of the compounds being
                  considered.

         The learning algorithm used will rule out possible choices of representation, as the latter has to be
         rich enough to support the algorithm’s procedure. SARs can store different information about
         compounds, and typically such information (attributes) could consist of any of the following
         chemical properties [5]:


               •   Partial atomic charges
               •   Surface area
               •   Volume
               •   H-bond donors/acceptors
               •   ClogP
               •   CMR
               •   pKa, pKb
               •   Hansch parameters π, σ, F
               •   Molecular grids
               •   Polarisability

         The exact nature or meaning of each attribute type need not be discussed here. It is however
         important to note that there are any number of ways of representing a compound, using any
         combination of the attributes given above (and more).



    2.2. Attribute-based representations
         A large variety of learning techniques are in use that derive SARs of different forms. The majority of
         these are based on examining the types of attributes listed above. A short summary of a few of these
         techniques is presented here.





           2.2.1. Linear and partial least-squares regression
                 Linear regression was the first learning algorithm employed in predictive toxicology, as
                 detailed by Hansch et al. [3]. “Training” the system involves providing suitable training
                 examples, which are simply saved to memory without being interpreted or compared in any
                 way. It is on this stored information (as explicitly provided by the user) that regression aims
                 to approximate its target function.

                 In the context of predictive toxicology, this would involve supplying examples of positive
                 compounds as training data. The procedure then run on a new compound would invoke a
                 set of similar compounds being retrieved from the stored values, and use this to classify the
                 new compound. The analysis of the compounds is based on chemical attributes as specified
                 by the algorithm; Hansch used global chemical properties of the molecule (LogP and π).

                 Least-squares regression is another learning technique involving the relationship between
                 chemical attributes. Visually it essentially entails forming a ‘line of best fit’ for a set of
                 training data plotted against two variables y and x, where x and y are two chemical attributes.
                 For any new compound encountered, a plot is made of the same two attributes; if the point
                 produced lies within a fixed bound of the line of best fit, then the new compound can be
                 deemed positive. The system can be extended to include multiple independent variables, and
                 to give each variable different weights – a measure of how important each attribute measure is
                 compared with each other.
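The one-variable case described above can be sketched as follows. The attribute names and training values below are invented purely for illustration; a real SAR would use measured chemical properties.

```python
# A minimal sketch of least-squares classification over two chemical
# attributes (hypothetical data; real SARs would use measured values).

def fit_line(points):
    """Closed-form least-squares fit y = a*x + b for (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def classify(point, line, bound):
    """Deem a compound positive if it lies within `bound` of the fit."""
    a, b = line
    x, y = point
    return abs(y - (a * x + b)) <= bound

# Training data: hypothetical (logP, activity) pairs for known positives.
training = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.1), (4.0, 8.0)]
line = fit_line(training)
print(classify((2.5, 5.0), line, bound=0.5))   # a point near the fit line
print(classify((2.5, 9.0), line, bound=0.5))   # a point far from it
```

Extending this to multiple weighted variables replaces the single slope with one coefficient per attribute, as noted above.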

                 It is important to note that both these techniques make no attempt to interpret the training
                 data as it is fed to them; all the processing of determining suitability criteria for new
                 compounds happens only once the new compound has been encountered.


           2.2.2. Decision trees
                 Decision trees classify the training data by considering each <attribute, value> pair (tuple)
                 for a given compound [4]. Each node in the tree specifies a test of a particular attribute, and
                 each branch descending from that node corresponds to a possible value for that attribute. A
                 compound is classified as positive or negative at the leaf nodes of the graph.

                 New compounds are classified by comparing their attribute values to ones stored from the
                 training data. An implementation of this algorithm needs to address the critical issue of which
                 attribute(s) to perform the test on. This decision could crucially alter the classification
                 schema, and is a problem inherent in trying to separate objects into discrete sets when their
                  behaviour or identity is given by a number of attributes. It is possible that any two attribute
                 values could contradict each other on a particular classification scheme, and it then becomes
                 necessary to impose some ordering or priority system over the attributes.
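A toy illustration of such a tree is given below, with invented attributes and classifications; a real tree would be induced from training data rather than written by hand.

```python
# A hand-built decision tree over <attribute, value> pairs. Each internal
# node tests one attribute, each branch is one of its values, and leaves
# carry the classification. Attribute names here are hypothetical.

tree = {
    "attribute": "charge",
    "branches": {
        "positive": {"attribute": "volume",
                     "branches": {"small": "toxic", "large": "non-toxic"}},
        "negative": "non-toxic",
    },
}

def classify(node, compound):
    """Walk the tree using the compound's attribute values until a leaf."""
    while isinstance(node, dict):
        value = compound[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(tree, {"charge": "positive", "volume": "small"}))   # toxic
print(classify(tree, {"charge": "negative"}))                      # non-toxic
```

Note how the choice of root attribute (here, "charge") fixes the whole classification schema, which is exactly the ordering problem raised above.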





           2.2.3. Neural networks
                 Artificial Neural Networks (ANNs) provide a general and practical method for learning
                 functions from examples [4], and have widespread use in AI applications. Predictive
                 toxicology lends itself to the use of ANNs because of how compound attributes can be
                  treated as <attribute, value> tuples, in a manner similar to that discussed in section 2.2.2
                 above. A compound can be represented by a list of such tuples covering the full range of
                 attributes.

                 The simplest form of ANN system is based on perceptrons, which will take the list of tuples
                 and calculates a ‘score’ for the compound. This score is calculated from a combination of the
                 input tuples, and a weight associated with each attribute. The algorithm can learn from the
                 training data by considering the attributes of positive compounds, and can then classify
                 unknown compounds as positive or negative, depending on the score calculated being higher
                 than a defined threshold.

                  Practical ANN systems usually implement the more advanced backpropagation algorithm,
                  which learns the weights for a network of neural nodes across multiple layers. However, the
                  principle is the same as in the perceptron algorithm, with the compound score being
                  calculated in a non-linear manner that takes more variables into account.
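The perceptron scoring scheme described above can be sketched as follows; the attribute names, weights and threshold are invented for illustration.

```python
# A perceptron-style scorer: each attribute value is weighted, and the
# compound is classified positive when the combined score exceeds a
# threshold. Weights would normally be learnt from the training data.

def score(attributes, weights):
    """Weighted sum of the compound's attribute values."""
    return sum(weights[name] * value for name, value in attributes.items())

def classify(attributes, weights, threshold):
    return score(attributes, weights) > threshold

# Hypothetical learnt weights and a hypothetical compound.
weights = {"logP": 0.8, "surface_area": 0.1, "polarisability": -0.5}
compound = {"logP": 2.0, "surface_area": 12.0, "polarisability": 1.0}
print(classify(compound, weights, threshold=2.0))   # True: score 2.3 > 2.0
```

Backpropagation replaces this single weighted sum with layered, non-linear combinations, but the classification principle is unchanged.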


    2.3. Relational-based representations
         The techniques mentioned above for deriving SARs all share one key concept: they are all based on
         attributes of the object (in our case, the chemical compound being examined). These attributes can be
          considered to be global properties of these molecules; e.g. the molecular grid attribute maps
          points in space, which are global properties of the coordinate system used. The tuple of attributes
         that has been used to represent the properties of the molecule is not an ideal format; it is difficult to
         efficiently map atoms and the bonds onto a linear list.

         A more general way to describe objects is to use relations. In a relational description the basic
         elements are substructures and their associations [2]. This allows the spatial representation of the
         atoms within the molecule to be represented more accurately, directly and efficiently.
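As an illustration, one possible relational encoding stores a table of atoms and a list of bonds, from which relations such as "bonded" can be derived directly. The atom labels and bond structure below are invented.

```python
# A simple relational encoding of a molecule: atom identifiers mapped to
# element labels, plus an undirected bond list (structure is hypothetical).

molecule = {
    "atoms": {1: "c", 2: "c", 3: "o", 4: "h"},   # atom id -> element
    "bonds": [(1, 2), (2, 3), (2, 4)],            # undirected bonds
}

def bonded(molecule, a, b):
    """True if atoms a and b share a bond."""
    return (a, b) in molecule["bonds"] or (b, a) in molecule["bonds"]

def neighbours(molecule, a):
    """All atoms directly bonded to atom a."""
    return [y for x, y in molecule["bonds"] if x == a] + \
           [x for x, y in molecule["bonds"] if y == a]

print(bonded(molecule, 1, 2))      # True
print(neighbours(molecule, 2))     # [3, 4, 1]
```

Unlike a flat attribute tuple, this representation preserves which atoms are connected to which, so substructures can be located by following bonds.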



    2.4. Inductive logic programming
         Fully relational descriptions were first used in SARs with the inductive logic programming (ILP)
         learning technique, as shown in [6]. ILP algorithms are designed to learn from training examples
         encoded as logical relations. ILP has been shown to significantly outperform the feature (attribute)
         based induction methods described above [7].

         ILP for SARs can be based on knowledge of atoms and their bond connectives within a molecule.
         Using this scheme has a number of benefits:




              •   Simple, powerful, and can be generally applied to any SAR
              •   Particularly well suited to forming SARs dependent on the relationship between the atoms
                  in space (shape)
              •   Chemists can easily understand and interpret the resultant SARs as they are familiar with
                  relating chemical properties to groups of atoms.

         The formal difference between the descriptive properties of attribute and relational SARs
         corresponds to the difference between propositional and first-order logic [2]. ILP involves learning a
         set of “if-then” rules for a training set, which can then be applied to unseen examples. Sets of first-
         order Horn clauses can be constructed to represent the given data rules, and these can be
         interpreted in the logic programming language PROLOG.

         ILP differs from the attribute based techniques in two key areas. ILP can learn first-order rules that
         contain variables, whereas the earlier algorithms can only accept finite ground terms for attribute
         values. Further, ILP sequentially examines the data set, learning one rule at a time to incrementally
         grow the final set of rules.

          We stated above that relational SARs can be described by first-order predicate logic. The PROGOL
          algorithm was developed [8] to allow the bottom-up induction of Horn clauses, and is implemented
          in PROLOG. PROGOL uses inverse entailment to generalise a set of positive examples (active compounds)
          with respect to some background knowledge – atom and bond structure data, given in the form of
          Prolog facts. PROGOL will construct a set of “if-then” rules which explain the positive (and negative)
          examples given.

         In the case of predictive toxicology, these rules generally specify a sub-molecular structure of around
         five or six atoms. These structures are those that have been calculated to contribute to toxicity,
         based on their presence in the set of positive training examples, and their non-presence in the set of
         negative training examples.

         These sub-structures can then be matched with components of unseen compounds in an attempt to
         predict toxicity.





3. The Find-S Technique
    3.1. Motivation
         As mentioned previously, the focus of this research topic is to use the Find-S algorithm as described
          below to identify the sub-structures discussed at the end of section 2.4. Within the scope of
          predictive toxicology, it may appear that both Find-S and ILP do the same thing; however, this is
          not the case. The Find-S technique differs from ILP in the motivation behind the process.
         ILP looks for concepts that are true for positive examples, and false for negative examples, and
         produces a sub-molecule structure as a result. The Find-S procedure, on the other hand, is given a
         template (by the user) to guide its search, and the program looks for all possibilities of the general
         shape in the positive inputs.



    3.2. General-to-specific ordering of hypotheses
         Any given problem has a predefined space of potential hypotheses [4], which we shall denote H.
         Consider a target concept T, whose truth value (1 or 0) depends upon the values of three attributes,
         a1, a2, and a3. Each attribute a1, a2, or a3 can take a range of discrete values, some combinations of
          which will make T true, others will make T false. We denote the value x of an attribute an as
          v(an) = x.

         We can let each hypothesis consist of a conjunction of constraints on the attributes, i.e. take the list
         of attribute values for that particular instance of the problem. This list of attributes (of length three
         in this case) can be held in a vector. For each attribute an, the value v(an) will take one of the
         following forms:
              •   ? - indicating that any value is acceptable for this attribute
              •   ∅ - indicating that no value is acceptable for this attribute
              •   a single required value for the attribute, e.g. for an attribute ‘day of week’, acceptable values
                  would be ‘Monday’, ‘Tuesday’ etc.

         With this notation, the most general hypothesis for T is

                                                         <?, ?, ?>

         which states that any assignment to any of the three attributes will result in the hypothesis being
         satisfied. Conversely, the most specific hypothesis for T is


                                                       <∅, ∅, ∅>

         which states that no assignment to any of the variables will ever satisfy the hypothesis.

         All hypotheses within H can be represented in this way, with the majority falling somewhere
         between the two above extremes of generality. Indeed, hypotheses can be ordered on their generality,



         from most general to most specific instances. For example, consider the following two possible
         hypotheses for T:

                                                        h1 = <x, ?, y>
                                                        h2 = <?, ?, y>

         Considering the two sets of instances that are classified positive by the two hypotheses, we can say
         that any instance classified positive by h1 will also be classified positive by h2, as h2 imposes fewer
         constraints. We say that h2 is more general than h1.

          Formally, for two hypotheses hj and hk, we define hj to be more general than or equal to hk
          (written hj ≥g hk) if and only if


                                         (∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]


          Further, we define hj to be (strictly) more general than hk (written hj >g hk) if and only if


                                                     (hj ≥g hk) ∧ ¬(hk ≥g hj)
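For the conjunctive hypotheses described above, this ordering can be checked syntactically rather than over the whole instance space. The sketch below writes the ? constraint as "?" and the ∅ constraint as None; these encodings are assumptions for illustration.

```python
# A syntactic check of the >=g ordering for conjunctive hypotheses.

ANY = "?"     # the ? constraint: any value is acceptable
EMPTY = None  # the empty constraint, written as the null symbol in the text

def constraint_more_general(cj, ck):
    """True if constraint cj admits every value that ck admits."""
    return cj == ANY or ck == EMPTY or cj == ck

def more_general_or_equal(hj, hk):
    """hj >=g hk: every instance satisfying hk also satisfies hj."""
    return all(constraint_more_general(cj, ck) for cj, ck in zip(hj, hk))

# The h1 and h2 from the example above: h1 = <x, ?, y>, h2 = <?, ?, y>.
h1 = ["x", ANY, "y"]
h2 = [ANY, ANY, "y"]
print(more_general_or_equal(h2, h1))   # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))   # False
```

Note that every hypothesis is more general than or equal to the most specific hypothesis, since the empty constraint admits no values at all.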

    3.3. The Find-S algorithm
         The Find-S technique orders hypotheses according to their generality as explained in the previous
         section. The algorithm then starts with the most specific hypothesis h possible within H. For each
          positive example it encounters in the training set, it generalises h (if needed) so that h correctly
         classifies the encountered example as positive. After considering all positive training examples, the
         resultant h is output. This is the most specific hypothesis in H consistent with the examined positive
         examples.

         The algorithm can be more formally defined as follows [4]:


              1. Initialise h to the most specific hypothesis in H.
              2. For each positive training instance x
                          For each v(ai) in h
                                •    If v(ai) is satisfied by x
                                              Then do nothing
                                •   Else replace ai in h by the next more general constraint that is
                                    satisfied by x.
              3. Output hypothesis h
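A minimal sketch of this procedure for attribute-vector hypotheses is given below. The attribute values are invented for illustration, and Python is used here purely to make the steps concrete; the existing implementation discussed later is in Prolog.

```python
# A minimal Find-S over attribute vectors: start from the most specific
# hypothesis and generalise each constraint just enough to cover every
# positive training example encountered.

ANY = "?"     # any value is acceptable
EMPTY = None  # no value is acceptable (the most specific constraint)

def find_s(positives, n_attributes):
    h = [EMPTY] * n_attributes          # step 1: most specific hypothesis
    for x in positives:                 # step 2: each positive instance
        for i, value in enumerate(x):
            if h[i] == EMPTY:           # first value seen for this attribute
                h[i] = value
            elif h[i] not in (ANY, value):
                h[i] = ANY              # next more general constraint
    return h                            # step 3: output hypothesis h

# Hypothetical positive examples over three attributes.
positives = [
    ("alpha", "beta", "mu"),
    ("alpha", "beta", "nu"),
    ("alpha", "beta", "mu"),
]
print(find_s(positives, 3))   # ['alpha', 'beta', '?']
```

The third attribute disagrees across the positives, so it is generalised to ?, while the first two remain as specific ground values.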


         The procedure is run with a different starting positive each time until all positives have been
         analysed. There is a question over how to measure how specific a particular hypothesis is. This is



         dependent on the representation scheme, but in first-order logic, for example, a more specific
         hypothesis will have more ground terms (fewer variables) in the logic sentence describing it than a
         less specific hypothesis.


           3.3.1. A simple example
                  An example illustrating how the algorithm could be used in predictive toxicology is
                  presented below. It has been adapted from [9], and is fabricated in that the derived
                  structure is not a real indicator of toxicity; the example simply illustrates the
                  algorithm’s process.

                 Training Data

                 Consider the training set of seven drugs, four of which are known positives, and the
                 remaining three known negatives. Diagrams of these molecules are given below, with
                 molecules P1, P2, P3 and P4 representing positive examples, and N1, N2 and N3
                  representing negative ones. The atom labels (α, β, µ and ν) are used in place of possible
                  real elements (e.g. N, C, H) to reinforce the notion that the example is purely fabricated.



                                     α
                               P1            β        µ        ν
                                     µ                                                         α
                                                                                                         β           ν   α       N
                                                                                               α
                                     α                         ν       α
                              P2             β        β
                                                               α                     α                           µ           α
                                     α
                                                                                              β        β
                                                                                                                         α       N
                                                                                     α                          α

                                     ν                         µ            α
                              P3             β        β
                                                                        α            α                           µ           α
                                     α                         β
                                                                                             ν         β
                                                                                                                         α       N


                                     µ                         µ
                              P4             β        β
                                     β                         β

                                                    Figure 1: Training set for Find-S example





                  At this stage, the chemist (user) must suggest a possible template on which to base the search
                  for toxicity-inducing substructures. It is thought that a substructure of the form

                                                 ATOM - ATOM - ATOM


                  (with - representing a bond) contributes to toxicity. It is now the task of the algorithm to
                  find sub-molecules matching the structure given above that exist in as many positives, and in
                  as few negatives, as possible.

                 The Algorithm Procedure

                 To solve the problem, we use the Find-S method with the aim of producing solutions of the
                 form
                                                        <A, B, C>

                 where A, B and C are taken from the set of chemical symbols present in the molecules, i.e.
                 {α, β, µ, ν}. However, we also need to look for general solutions where an atom in a
                 particular position is not fixed. We therefore append {?} to the previous set, giving {α, β, µ,
                 ν, ?}.

                 We start off with the most specific hypothesis possible. Any final concept learned will have
                 to be true of at least one positive example. We use this to produce our first set of triples:

                                                   <α, β, µ> and <β, µ, ν>

                 These are the two substructures that exist in P1 and match the template specified.

                 We now check whether each of these substructures is true in the next molecule (P2). If they
                 are not, then we generalise the substructure such that it becomes true in P2. This
                 generalisation is done by introducing as few variables as possible. In doing this, we find the
                 least general generalisations, which then guarantees that our final answers are as specific as
                 possible. This expanded set of substructures is then tested on P3, and following the same
                 procedure, on P4.
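This generalisation step can be sketched in code. The following is a minimal illustration only (written in Python rather than the Prolog of the actual implementation), assuming molecules are represented simply as lists of atom triples:

```python
WILD = "?"

def matches(hyp, triple):
    """True if the hypothesis covers the triple ('?' matches any atom)."""
    return all(h == WILD or h == t for h, t in zip(hyp, triple))

def lgg(hyp, triple):
    """Least general generalisation: keep agreeing atoms, vary the rest."""
    return tuple(h if h == t else WILD for h, t in zip(hyp, triple))

def find_s(positives):
    """positives: list of molecules, each given as a list of atom triples."""
    # Start with the most specific hypotheses: the triples of the first positive.
    hypotheses = set(positives[0])
    for molecule in positives[1:]:
        generalised = set()
        for hyp in hypotheses:
            if any(matches(hyp, t) for t in molecule):
                generalised.add(hyp)  # already true of this molecule
            else:
                # Introduce as few variables as possible.
                generalised.update(lgg(hyp, t) for t in molecule)
        hypotheses = generalised
    return hypotheses
```

Starting from the triples of the first positive, each subsequent positive either confirms a hypothesis or forces a minimal generalisation such as <α, β, ?>.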





                 A trace of the intermediate results produced is shown here:

                                                        Molecule being analysed
                                       P1                P2                P3                 P4
                                    <α, β, µ>         <α, β, µ>        <α, β, µ>           <α, β, µ>
                                    <β, µ, ν>         <β, µ, ν>        <β, µ, ν>           <β, µ, ν>
                                                      <α, β, ?>        <α, β, ?>           <α, β, ?>
                                                      <β, ?, ν>         <β, ?, ν>          <β, ?, ν>
                                                                        <?, β, µ>          <?, β, µ>
                                                                        <?, β, ?>          <?, β, ?>
                                                                        <α, ?, ?>          <α, ?, ?>
                                                                        <β, ?, ?>          <β, ?, ?>
                                                                        <?, ?, ν>          <?, ?, ν>

                  Each column of the trace repeats the substructures carried over from earlier molecules, with
                  newly derived generalisations listed beneath them. Note that no new substructures are
                  produced on analysis of P4 – all the substructures produced after analysis of P3 exactly
                  match components of P4, without the need for generalisation.

                 Evaluation of Results

                 So the algorithm has now returned nine possible hypotheses for substructures that determine
                 toxicity. These can now be scored, based on
                     •     How many positive molecules contain the substructure derived
                     •     How many negatives do not contain the substructure derived

                 A calculation of scores is given below:

                                                  Hypothesis     Accuracy

                                             1    <α, β, µ>        43%
                                             2    <β, µ, ν>        57%
                                             3    <α, β, ?>        57%
                                             4    <β, ?, ν>        86%
                                             5    <?, β, µ>        57%
                                             6    <?, β, ?>        57%
                                             7    <α, ?, ?>        43%
                                             8    <β, ?, ?>        57%
                                             9    <?, ?, ν>        57%

                  (Accuracy is the number of correctly classified positives (P1–P4) plus correctly classified
                  negatives (N1–N3), over the seven molecules; the per-molecule tick marks of the original
                  table do not survive in this text version.)


                  It can be seen that the most accurate hypothesis derived is number four: <β, ?, ν>. This is
                  statistically the most frequent substructure (of the form ATOM - ATOM - ATOM) that
                  occurs in the positives but not in the negatives. This structure can then be used to
                  predict the toxicity of unseen compounds; other compounds containing a match for
                  hypothesis four are statistically likely to be toxic.

                 For a complete implementation of the algorithm, the procedure should be repeated, but this
                 time with P2 as the initial positive, and generalising on the others. The same should be
                 applied for P3 and P4 as initial positives.


    1.10.Algorithm evaluation methods
         On obtaining a ‘result’ from the Find-S algorithm, i.e. a hypothesis (or set of hypotheses)
         representing a sub-molecule thought most likely to contribute to toxicity, it is desirable to have
         some certainty that the result obtained is indeed accurate. We want the promising results obtained
         with the training set to be extended to unseen examples. There is no way to guarantee the accuracy
          of a hypothesis; however, there are accepted methods and measures through which a user can
         become more confident in the results obtained.

         In our example above, the ‘best’ hypothesis had a (predicted) accuracy of 86%, calculated by
         considering the number of correctly classified positives and negatives, over the total number of
         compounds analysed. However, this figure is based purely on the examples that the hypothesis has
         already seen; it is not a strong indicator of accuracy for unseen examples.


          1.10.1.Cross validation
                 One possible way of addressing this situation is to reserve some examples from the training
                 set, and then subsequently use these reserved examples as tests on the derived hypothesis.
                 The results of the hypothesis applied to the reserved examples can then be compared to their
                 actual categorisation, which is known as they were provided as part of the training set. This
                 cross validation is a standard machine learning technique, and the splitting of initial example
                 data into a training set and test set can give the user more confidence that the derived
                 hypothesis will be accurate and of use. Clearly, it can have the opposite effect, with a user
                 finding out that the derived hypothesis in fact performs poorly on genuinely unseen
                 examples.




          1.10.2.K-fold cross validation



                  It is often important to measure the performance of the learning algorithm itself, and not
                  just that of a specific hypothesis. A technique to achieve this is k-fold cross
                 validation [4]. This involves partitioning the data into k disjoint subsets, each of equal size.
                 There are then k training and testing rounds, with each subset successively acting as a test
                 set, and the other k-1 sets as training sets. The average accuracy rate can then be calculated
                 from each independent test run. This technique is typically used when the number of data
                 objects is in the region of a few hundred, and the size of each subset is at least thirty. This
                 ensures that the tests provide reasonable results, as having too few test examples would
                 result in skewed accuracy figures.

                 As each round is performed independently, there is no guarantee that the hypothesis
                 generated on one training round will be the same as the hypothesis generated on another. It
                 is for this reason that the overall accuracy figures generated are representative of the
                 algorithm as a whole, not just one particular result.
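The procedure can be sketched as below; `train_fn` and `score_fn` are placeholders standing in for the learning algorithm and the accuracy measure respectively:

```python
def k_folds(n, k):
    """Partition indices 0..n-1 into k disjoint, near-equal folds."""
    return [list(range(i, n, k)) for i in range(k)]

def k_fold_cv(data, k, train_fn, score_fn):
    """Average test score over k rounds, each fold serving once as test set."""
    folds = k_folds(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        # All remaining folds form the training set for this round.
        train = [data[j] for fold in folds[:i] + folds[i + 1:] for j in fold]
        hypothesis = train_fn(train)
        scores.append(score_fn(hypothesis, test))
    return sum(scores) / k
```

Each round is independent, so the averaged score reflects the algorithm rather than any single hypothesis.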


    1.11.Issues with the Find-S technique
         As with all machine learning techniques, Find-S has some factors to encourage its use, and others
         that make it less favourable. Some of these considerations are discussed here.


          1.11.1.Guarantee of finding most specific hypothesis
                 As the name of the algorithm suggests, the process is guaranteed to find the most specific
                 hypothesis consistent with the positive training examples, within the hypothesis space. This
                 is because of the decisions made to select the least general generalisations when analysing
                 compounds.

                 This property can be viewed as being both advantageous and disadvantageous. It is
                 sometimes useful for users to know as much information about the substructure as possible,
                 and this may enable them to better understand the chemical reason for the molecule’s
                  toxicity. However, in the case of an example deriving multiple hypotheses consistent with the
                  training data, the algorithm would still return the most specific, even though the others have
                  the same statistical accuracy.

                 Further, it is possible that the process derives several maximally specific consistent
                 hypotheses [4]. To account for this possible case, we need to extend the algorithm to allow
                 backtracking at choice points for generalisation. This would find target concepts along a
                 different branch to that first explored.


          1.11.2.Overfitting
                 Overfitting is often thought of as the problem of an algorithm memorising answers rather than
                 deducing concepts and rules from them, and is inherent in many machine learning
                  techniques. A particular hypothesis is said to overfit the training examples when some other
                  hypothesis that fits the training examples less well actually performs better over the whole set
                 of instances (i.e. including non-training set instances).

                 Overfitting can occur when the number of training examples used is too small and does not
                 provide an illustrative sample of the true target function. It can also occur when there are
                 errors in the example data, known as noise. Noise has a particularly detrimental effect on the
                 Find-S algorithm, as explained below.


          1.11.3.Noisy data
                  Any non-trivial set of data taken from the real world is subject to a degree of error in its
                  representation. Mistakes can be made in analysing the data and categorising examples, in
                  translating information from one form to another, and in keeping repeated data consistent
                  with itself. In machine learning terms, such errors in the data are termed noise.

                 While certain algorithms are fairly robust to noise in data, the Find-S technique is inherently
                 not so. This is because the algorithm effectively ignores all negative examples in the training
                 examples. Generalisations are made to include as many positive examples as possible, but no
                 attempt is made to exclude negatives. This in itself is not a problem; if the data contains no
                 errors, then the current hypothesis can never require a revision in response to a negative
                 example [4]. However, the introduction of noise into the data changes this situation. It may
                  no longer be the case that the negative examples can simply be ignored. Find-S makes no
                  effort to accommodate these possible inconsistencies in the data.


          1.11.4.Parallelisability
                  The Find-S algorithm lends itself well to a parallel distributed implementation, which would
                  speed up computation. A parallel implementation could involve individual processors
                 being allocated different initial positives; recall above that the algorithm is only complete
                 when hypotheses have been derived using each possible start positive. The derivation of any
                 particular hypothesis from an initial positive can be run independently, and hence can be run
                 in parallel with other derivations.
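As a rough illustration of this idea (not part of the existing implementation), each choice of initial positive can be submitted as an independent task; a thread pool is used here for brevity, though a real system might use separate processes or machines, and `find_s` is passed in as a placeholder for the derivation itself:

```python
from concurrent.futures import ThreadPoolExecutor

def derive_from(start, positives, find_s):
    """Run a Find-S derivation with positives[start] as the initial positive."""
    ordering = positives[start:] + positives[:start]
    return find_s(ordering)

def parallel_find_s(positives, find_s):
    """One independent derivation per possible start positive, run in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(derive_from, i, positives, find_s)
                   for i in range(len(positives))]
        return [f.result() for f in futures]
```

Because no derivation depends on another's result, the tasks require no synchronisation beyond collecting the final hypotheses.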



    1.12.Existing Prolog implementation
         S. Colton has implemented an initial version of the Find-S algorithm in PROLOG. This relatively
         compact program (approximately 300 lines of code) identifies substructures from a sample data set
         as used by King et al [2]. The program is guided by substructure templates, of which a few have
         been hard coded. It has recreated some of the results produced by the ILP method and PROGOL on
         the sample data set considered. The program can take parameters to specify the minimum number
          of ground terms that must appear in a resultant hypothesis (i.e. to limit variables), and also specify the
          minimum number of molecules for which a hypothesis should return TRUE for a positive, and the
          maximum for which it can return FALSE for a negative.




          An important point for discussion here is the representation of the background and structural data.
          Information about the molecules is stored as a series of facts in a PROLOG database. The
          representation is identical to that suggested in the section on inductive logic programming, and
          involves storing information about atoms and the bonding between them. The data stored for even a
          single molecule is extensive; however, these PROLOG facts can be generated automatically, as
          mentioned in section 4.1.
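Such automatic generation might be sketched as follows; the predicate shapes (`atom/3`, `bond/4`) and the input format are illustrative assumptions here, not necessarily those of the actual program:

```python
def to_prolog_facts(mol_id, atoms, bonds):
    """Emit Prolog facts for one molecule.

    atoms: list of (atom_id, element_symbol)
    bonds: list of (atom_id_1, atom_id_2, bond_type)
    """
    # One atom/3 fact per atom, one bond/4 fact per bond.
    facts = [f"atom({mol_id}, a{i}, {el.lower()})." for i, el in atoms]
    facts += [f"bond({mol_id}, a{i}, a{j}, {t})." for i, j, t in bonds]
    return "\n".join(facts)
```

For example, a two-atom molecule yields one fact per atom plus one per bond, ready to be consulted as a PROLOG database.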





4. Implementation Considerations
    The Find-S algorithm has been discussed at length as it represents the core component of a system to
    identify substructures. However, the initial remit was to create a substructure server, whereby users would
    be able to identify potentially interesting substructures from their positive and negative examples. As
    such, other considerations need to be examined, and these are summarised here.

    1.13.Representing structures
         There exists a conflict between the natural user representation of chemical structures, and those that
         are useful to the implemented algorithm. In a sense, the users’ view of structures must be parsed
         into the computer view (first order logic) at some stage, either by the user manually, or by the
         implemented software as pre-processing to the Find-S algorithm. It is clearly more desirable from
         the users’ position that this conversion is done in an automated fashion. The feasibility of this is
         briefly discussed here.

         Chemists are often concerned with modelling compounds, and the industry standard modelling
         software is QUANTA [9]. King et al. in [2] used QUANTA editing tools to automatically map a visual
         representation of a molecule into first order logic. After some suitable pre-processing, this mapped
         representation could be read by their PROGOL program as a series of facts.

         Another molecular simulation program, CHARMM [10], stores as data files information about the
         molecule being simulated. These data files use standard naming and referencing techniques, as
         described in The Protein Data bank [11]. The structure of these flat text files is conducive to
         translations to other formats, on development of suitable schema.



    1.14.Improvement of current implementation
         S Colton’s current implementation of the Find-S algorithm can serve as a basis for further work.
         The algorithm could be recoded in a modern object oriented language, which would facilitate
         parallelising and packaging the algorithm as a web-based application.

         One key improvement that could be made is with the introduction of new search templates. These
         templates guide the algorithm, restricting its search to sub-molecules matching the specified
         template. Currently only a small number of templates are implemented; it is desirable that more be
         available to the user.



    1.15.Extensions
         As advanced work in this area, further extensions to those suggested above are possible.
         Implementing the algorithm in parallel is one such possible extension. This would speed up the
         potentially highly complex and time consuming derivations of hypotheses.





          There is also scope for the generated hypotheses to be represented in different formats. While an
          answer returned in first order logic may be strictly accurate, it is unlikely to be of much use to a user
          with little or no knowledge of computational logic techniques. Molecular visualisation software such
          as RASMOL and the later PROTEIN EXPLORER [12] exists that can take as input data in a similar format to
          that produced by QUANTA or CHARMM. It would be desirable for a user to view the resultant
          hypotheses, with the sub-molecule derived by the algorithm presented visually.





5. References
    [1] Ellis, L., Aetna InteliHealth Drug Resource Centre, From Laboratory To Pharmacy: How Drugs Are
        Developed, 2002.
        http://www.intelihealth.com/IH/ihtIH/WSIHW000/8124/31116/346361.html?d=dmtContent

    [2] King, Ross D., Muggleton, Stephen H., Srinivasan, A. & Sternberg, Michael J.E., Structure-activity
        relationships derived by machine learning: The use of atoms and their bond connectives to predict mutagenicity by
        inductive logic programming (1995) Proceedings of the National Academy of Sciences (USA) 93, 438-442

    [3] Hansch, C., Maloney, P. P., Fujita, T. & Muir, R. M., Correlation of Biological Activity of Phenoxyacetic
        Acids with Hammett Substituent Constants and Partition Coefficients (1962). Nature (London) 194, 178-180

    [4] Mitchell, T. M., Machine Learning, International Edition, 1997, McGraw-Hill

    [5] Glen, B., Molecular Modelling and Molecular Informatics, University of Cambridge – Centre for Molecular
        Informatics, www-ucc.ch.cam.ac.uk/colloquia/rcg-lectures/A4

    [6] Muggleton, S., Inductive Logic Programming (1991), New Generation Computing 8, 295-318

    [7] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E., King, R. D., Theories for mutagenicity: a study in first-
        order and feature-based induction (1996), Artificial Intelligence 85(1,2), 277-299

    [8] Muggleton, S., Inverse Entailment and Progol (1995), New Generation Computing 13, 245-286

    [9] Colton, S. G., Lecture 11 – Overview of Machine Learning, Imperial College London, 2003.
        http://www2.doc.ic.ac.uk/~sgc/teaching/341.html

    [9] Quanta software, http://www.accelrys.com/quanta/, Accelrys Inc.

    [10] Chemistry HARvard Molecular Mechanics (CHARMM),
         http://www.ch.embnet.org/MD_tutorial/pages/CHARMM.Part1.html

    [11] The Protein Data Bank, http://www.rcsb.org/pdb

    [12] Rasmol Home Page, http://www.umass.edu/microbio/rasmol/




 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Background Report (DOC)

Contents
1. Introduction
    1.1. Motivation
    1.2. Summary of Report
2. Previous Research
    2.1. Structure-Activity Relationships
    2.2. Attribute-based representations
    2.3. Relational-based representations
    2.4. Inductive logic programming
3. The Find-S Technique
    3.1. Motivation
    3.2. General-to-specific ordering of hypotheses
    3.3. The Find-S algorithm
    3.4. Algorithm evaluation methods
    3.5. Issues with the Find-S technique
    3.6. Existing Prolog implementation
4. Implementation Considerations
    4.1. Representing structures
    4.2. Improvement of current implementation
    4.3. Extensions
5. References
1. Introduction

1.1. Motivation

Each year, drug companies release new and improved drugs, claiming that they produce better results with fewer side effects. However, the cost of such advances in the drug industry is not small. Developing a drug from the theoretical stage to its appearance on pharmacy shelves normally takes in the region of 10 to 15 years, at an average cost of over £500 million [1]. This outlay must be covered by the consumer for the company to remain in profit; evidence of this can be seen, for example, in the regular rise of NHS prescription charges.

Much of the reason for the high cost of medicines is rooted in the length and complexity of the development and approval process. At every stage of development, it is possible that a potential drug (leader) will fail to gain approval on the basis that it produces erratic results or harmful side effects. Even after promising lab tests, further experiments on animal specimens often return ideas to the drawing board. It is estimated that for every drug that reaches the clinical (human) trial stage, another 1000 have failed earlier testing.

Despite this, it is important to note that medicines still reduce overall medical care costs by avoiding even more expensive hospitalisation, surgery or other treatments. Drugs are the primary way of controlling the outcomes of chronic illness. The development of new drugs is therefore important both for patient care and for its positive long-term financial implications.

It is clear that reducing the number of drug leaders developed at an early stage will have a significant effect in limiting development costs. Determining at an early stage that a leader is unsuitable for further testing saves the investment that might otherwise have been spent on it, only for the same conclusion to be reached.
For this reason, the field of predictive toxicology was born. It is an effort on the part of biotechnology companies to predict in advance whether or not a drug will be toxic, using various techniques learnt from the fields of statistics, artificial intelligence (AI) and machine learning.

Negative effects of a drug can range from relatively minor problems, such as headaches and stomach upsets, to potentially life-threatening organ damage. While many accepted drugs do produce some side effects for some patients, the value of the treatment is always judged to outweigh the side effects. However, there are certain characteristics of chemical compounds that will limit their effectiveness as a drug. Predictive toxicology aims to find this drug toxicity while the compound is still in the planning stages. Ruling out a leader at this early stage saves it being synthesised and tested, and allows resources to be focused on more promising areas of research.

Machine learning programs in a variety of guises have been used to try to discover the reasons why certain chemicals are toxic and others are not. Essentially, they learn a concept that is true of the toxic drugs and false of the non-toxic drugs. These derived concepts are usually small sub-structures (around five or six atoms) of the larger drug molecule, where some of the atoms are fixed elements and others may vary.

The task in hand is to effectively and efficiently identify such sub-structures using the Find Maximally Specific Hypothesis (Find-S) machine learning algorithm. An implementation of the algorithm has been written in PROLOG by S. Colton; our work here is based on extending this implementation and producing a web-based server application. A molecule is said to be positive if it contains the sub-structure in question; conversely, it is said to be negative if it does not. The application will return interesting substructures given positive and negative molecules, where each substructure is true of statistically significantly more positives than negatives.

1.2. Summary of Report

This report is an overview of the research undertaken, with an outline of how implementation of a Substructure Server may proceed.

Section 2 summarises the machine learning techniques used in the field of predictive toxicology, and introduces the concepts of attribute-based and relational-based structure-activity relationships.

Section 3 is a comprehensive overview of the Find-S algorithm, with an emphasis on how it may perform in a predictive toxicology setting. A fictional example is presented and analysed which demonstrates the key methodologies of the technique. Evaluation techniques applicable both to the algorithm itself and to the results it produces are outlined, as well as various considerations that should be addressed on implementation. S. Colton's existing Prolog implementation of the algorithm is also discussed.

Section 4 highlights some implementation considerations, suggesting a possible course of action towards building a substructure server available for public use.
2. Previous Research

As mentioned above, machine learning algorithms that find relevant sub-structures have been applied in the field of predictive toxicology. It is important to understand the approaches taken in previous work and to use them as a basis for further study. The key features of the background study undertaken are summarised in this section.

2.1. Structure-Activity Relationships

A structure-activity relationship (SAR) models the relationship between the activities and physicochemical properties of a set of compounds [2]. The goal of our work is essentially to form SARs from the given input molecules. These resultant SARs represent the structural features most likely to contribute to toxicity, as calculated by our algorithm.

A SAR is derived from two components:

• the learning algorithm employed during derivation, and
• the choice of representation used to describe the chemical structure of the compounds being considered.

The learning algorithm used will rule out possible choices of representation, as the latter has to be rich enough to support the algorithm's procedure.

SARs can store different information about compounds; typically such information (attributes) could consist of any of the following chemical properties [5]:

• Partial atomic charges
• CMR
• Surface area
• pKa, pKb
• Volume
• Hansch parameters π, σ, F
• H-bond donors/acceptors
• Molecular grids
• ClogP
• Polarisability

The exact nature or meaning of each attribute type need not be discussed here. It is, however, important to note that there are any number of ways of representing a compound, using any combination of the attributes given above (and more).

2.2. Attribute-based representations

A large variety of learning techniques are in use that derive SARs of different forms. The majority of these are based on examining the types of attributes listed above. A short summary of a few of these techniques is presented here.
2.2.1. Linear and partial least-squares regression

Linear regression was the first learning algorithm employed in predictive toxicology, as detailed by Hansch et al. [3]. "Training" the system involves providing suitable training examples, which are simply saved to memory without being interpreted or compared in any way. It is on this stored information (as explicitly provided by the user) that regression aims to approximate its target function. In the context of predictive toxicology, this would involve supplying examples of positive compounds as training data. The procedure, when run on a new compound, retrieves a set of similar compounds from the stored values and uses this to classify the new compound. The analysis of the compounds is based on chemical attributes as specified by the algorithm; Hansch used global chemical properties of the molecule (LogP and π).

Least-squares regression is another learning technique involving the relationship between chemical attributes. Visually, it essentially entails forming a 'line of best fit' for a set of training data plotted against two variables x and y, where x and y are two chemical attributes. For any new compound encountered, a plot is made of the same two attributes; if the point produced lies within a fixed bound of the line of best fit, then the new compound can be deemed positive. The system can be extended to include multiple independent variables, and to give each variable a different weight – a measure of how important each attribute is compared with the others.

It is important to note that both these techniques make no attempt to interpret the training data as it is fed to them; all the processing needed to determine suitability criteria for new compounds happens only once a new compound has been encountered.

2.2.2. Decision trees

Decision trees classify the training data by considering each <attribute, value> pair (tuple) for a given compound [4]. Each node in the tree specifies a test of a particular attribute, and each branch descending from that node corresponds to a possible value of that attribute. A compound is classified as positive or negative at the leaf nodes of the tree. New compounds are classified by comparing their attribute values to ones stored from the training data.

An implementation of this algorithm needs to address the critical issue of which attribute(s) to perform each test on. This decision could crucially alter the classification schema, and is a problem inherent in trying to separate objects into discrete sets when their behaviour or identity is given by a number of attributes. It is possible that any two attribute values could contradict each other under a particular classification scheme, and it then becomes necessary to impose some ordering or priority system over the attributes.
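The node-and-branch procedure above can be sketched as follows. This is an illustrative Python fragment, not a system from the literature; the attribute names, branch values and tree shape are all invented:

```python
# Hedged sketch of attribute-test classification with a hand-built decision
# tree. Internal nodes test one attribute; leaves carry the class label.

def classify(tree, compound):
    """Walk the tree: follow the branch matching the compound's attribute
    value at each internal node until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][compound[tree["attribute"]]]
    return tree

# Invented tree: test 'H_bond_donors' first, then 'logP_band' on one branch.
tree = {
    "attribute": "H_bond_donors",
    "branches": {
        "low": "negative",
        "high": {
            "attribute": "logP_band",
            "branches": {"a": "positive", "b": "negative"},
        },
    },
}
compound = {"H_bond_donors": "high", "logP_band": "a"}
print(classify(tree, compound))   # -> positive
```

A real implementation would induce the tree from the training tuples (choosing which attribute to test at each node); here the tree is hand-built purely to show the classification walk.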
2.2.3. Neural networks

Artificial neural networks (ANNs) provide a general and practical method for learning functions from examples [4], and are in widespread use in AI applications. Predictive toxicology lends itself to the use of ANNs because compound attributes can be treated as <attribute, value> tuples, in a manner similar to that discussed in section 2.2.2 above. A compound can be represented by a list of such tuples covering the full range of attributes.

The simplest form of ANN is based on perceptrons, which take the list of tuples and calculate a 'score' for the compound. This score is calculated from a combination of the input tuples and a weight associated with each attribute. The algorithm can learn the weights from the training data by considering the attributes of positive compounds, and can then classify unknown compounds as positive or negative depending on whether the calculated score is higher than a defined threshold.

Practical ANN systems usually implement the more advanced backpropagation algorithm, which learns the weights for a network of neural nodes arranged in multiple layers. However, the principle is the same as in the perceptron algorithm, with the compound score being calculated in a non-linear manner that takes more variables into account.

2.3. Relational-based representations

The techniques mentioned above for deriving SARs all share one key concept: they are all based on attributes of the object (in our case, the chemical compound being examined). These attributes can be considered global properties of the molecules; for example, the molecular grid attribute maps points in space, which are global properties of the coordinate system used.
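Returning to the perceptron scoring described above, a minimal sketch follows; the attribute names, weights and threshold are invented for illustration:

```python
# Perceptron-style scoring of a compound from <attribute, value> tuples: the
# score is a weighted sum, and the compound is classed positive when the
# score exceeds a threshold. All names and numbers here are invented.

def score(attributes, weights):
    """Weighted sum over the compound's <attribute, value> tuples."""
    return sum(weights[name] * value for name, value in attributes)

def classify(attributes, weights, threshold):
    """Positive iff the score exceeds the defined threshold."""
    return score(attributes, weights) > threshold

compound = [("logP", 2.3), ("polarisability", 0.8)]     # hypothetical tuples
weights = {"logP": 1.0, "polarisability": -0.5}          # hypothetical weights
print(classify(compound, weights, threshold=1.5))        # score 1.9 -> True
```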
The tuple of attributes used to represent the properties of the molecule is not an ideal format; it is difficult to efficiently map atoms and their bonds onto a linear list. A more general way to describe objects is to use relations. In a relational description the basic elements are substructures and their associations [2]. This allows the spatial arrangement of the atoms within the molecule to be represented more accurately, directly and efficiently.

2.4. Inductive logic programming

Fully relational descriptions were first used in SARs with the inductive logic programming (ILP) learning technique, as shown in [6]. ILP algorithms are designed to learn from training examples encoded as logical relations. ILP has been shown to significantly outperform the feature (attribute) based induction methods described above [7].

ILP for SARs can be based on knowledge of atoms and their bond connectivities within a molecule. Using this scheme has a number of benefits:

• it is simple, powerful, and can be applied generally to any SAR;
• it is particularly well suited to forming SARs dependent on the relationship between the atoms in space (shape);
• chemists can easily understand and interpret the resultant SARs, as they are familiar with relating chemical properties to groups of atoms.

The formal difference between the descriptive properties of attribute and relational SARs corresponds to the difference between propositional and first-order logic [2]. ILP involves learning a set of "if-then" rules for a training set, which can then be applied to unseen examples. Sets of first-order Horn clauses can be constructed to represent the given data rules, and these can be interpreted in the logic programming language PROLOG.

ILP differs from the attribute-based techniques in two key areas. ILP can learn first-order rules that contain variables, whereas the earlier algorithms can only accept finite ground terms for attribute values. Further, ILP examines the data set sequentially, learning one rule at a time to incrementally grow the final set of rules.

We stated above that relational SARs can be described by first-order predicate logic. The PROGOL algorithm was developed [8] to allow the bottom-up induction of Horn clauses, and is implemented in PROLOG. PROGOL uses inverse entailment to generalise a set of positive examples (active compounds) with respect to some background knowledge – atom and bond structure data, given in the form of Prolog facts. PROGOL constructs a set of "if-then" rules which explain the positive (and negative) examples given. In the case of predictive toxicology, these rules generally specify a sub-molecular structure of around five or six atoms.
These structures are those calculated to contribute to toxicity, based on their presence in the set of positive training examples and their absence from the set of negative training examples. These sub-structures can then be matched against components of unseen compounds in an attempt to predict toxicity.
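The matching of a learned substructure against an unseen compound can be sketched as follows, under the simplifying assumption (made purely for illustration) that a molecule is summarised by the set of bonded atom chains of the target length:

```python
# Illustrative sketch, not the paper's implementation: a molecule is reduced
# to the set of three-atom bonded chains it contains, and '?' in the learned
# substructure matches any atom at that position.

def matches(substructure, chain):
    """True if every position of the chain satisfies the substructure."""
    return all(s == "?" or s == atom for s, atom in zip(substructure, chain))

def predict_toxic(substructure, chains):
    """Predict toxicity if any chain in the molecule matches."""
    return any(matches(substructure, c) for c in chains)

# Hypothetical learned substructure and unseen molecule (labels invented).
substructure = ("N", "?", "O")
molecule_chains = {("C", "C", "C"), ("N", "C", "O")}
print(predict_toxic(substructure, molecule_chains))   # -> True
```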
3. The Find-S Technique

3.1. Motivation

As mentioned previously, the focus of this research topic is to use the Find-S algorithm, as described below, to identify the sub-structures discussed at the end of section 2.4. Within the scope of predictive toxicology it may appear that Find-S and ILP do the same thing; however, this is not the case. The Find-S technique differs from ILP in the motivation behind the process. ILP looks for concepts that are true for positive examples and false for negative examples, and produces a sub-molecular structure as a result. The Find-S procedure, on the other hand, is given a template (by the user) to guide its search, and the program looks for all possibilities of the general shape in the positive inputs.

3.2. General-to-specific ordering of hypotheses

Any given problem has a predefined space of potential hypotheses [4], which we shall denote H. Consider a target concept T, whose truth value (1 or 0) depends upon the values of three attributes, a1, a2 and a3. Each attribute can take a range of discrete values, some combinations of which will make T true and others false. We denote the value x of an attribute an by v(an) = x.

We let each hypothesis consist of a conjunction of constraints on the attributes, i.e. the list of attribute values for that particular instance of the problem. This list of attributes (of length three in this case) can be held in a vector. For each attribute an, the value v(an) will take one of the following forms:

• ? – indicating that any value is acceptable for this attribute
• ∅ – indicating that no value is acceptable for this attribute
• a single required value for the attribute; e.g. for an attribute 'day of week', acceptable values would be 'Monday', 'Tuesday', etc.
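These constraint forms can be modelled directly. The sketch below is an illustration, not the project's Prolog implementation; a hypothesis is a tuple whose entries are a fixed value, '?', or None standing in for ∅:

```python
# Hypothesis vectors over three attributes: each entry is a fixed value,
# "?" (any value acceptable), or None standing in for ∅ (nothing acceptable).

def covers(h, x):
    """True if hypothesis h classifies instance x as positive, i.e. every
    constraint in h is satisfied by the corresponding value in x."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

most_general = ("?", "?", "?")        # satisfied by every instance
most_specific = (None, None, None)    # satisfied by no instance

print(covers(most_general, ("Monday", "a", "b")))    # -> True
print(covers(most_specific, ("Monday", "a", "b")))   # -> False
```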
With this notation, the most general hypothesis for T is <?, ?, ?>, which states that any assignment to the three attributes will satisfy the hypothesis. Conversely, the most specific hypothesis for T is <∅, ∅, ∅>, which states that no assignment to the attributes will ever satisfy the hypothesis.

All hypotheses within H can be represented in this way, with the majority falling somewhere between these two extremes of generality. Indeed, hypotheses can be ordered on their generality, from the most general to the most specific instances. For example, consider the following two possible hypotheses for T:

h1 = <x, ?, y>
h2 = <?, ?, y>

Considering the two sets of instances that are classified positive by the two hypotheses, we can say that any instance classified positive by h1 will also be classified positive by h2, as h2 imposes fewer constraints. We say that h2 is more general than h1.

Formally, for two hypotheses hj and hk, we define hj to be more general than or equal to hk (written hj ≥g hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

Further, we define hj to be (strictly) more general than hk (written hj >g hk) if and only if

(hj ≥g hk) ∧ ¬(hk ≥g hj)

3.3. The Find-S algorithm

The Find-S technique orders hypotheses according to their generality, as explained in the previous section. The algorithm starts with the most specific hypothesis h possible within H. For each positive example it encounters in the training set, it generalises h (if needed) so that h correctly classifies the encountered example as positive. After considering all positive training examples, the resultant h is output. This is the most specific hypothesis in H consistent with the examined positive examples.

The algorithm can be defined more formally as follows [4]:

1. Initialise h to the most specific hypothesis in H.
2. For each positive training instance x:
   For each constraint v(ai) in h:
   • If v(ai) is satisfied by x, then do nothing.
   • Else replace v(ai) in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.

The procedure is run with a different starting positive each time until all positives have been analysed.

There is a question over how to measure how specific a particular hypothesis is. This is dependent on the representation scheme; in first-order logic, for example, a more specific hypothesis will have more ground terms (fewer variables) in the logic sentence describing it than a less specific one.

3.3.1. A simple example

An example illustrating how the algorithm could be used in predictive toxicology is presented below. It has been adapted from [9], and is fabricated in that the derived structure is not a real indicator of toxicity. The example simply illustrates the algorithm's process.

Training Data

Consider a training set of seven drugs, four of which are known positives and the remaining three known negatives. Diagrams of these molecules are given in Figure 1, with molecules P1, P2, P3 and P4 representing positive examples, and N1, N2 and N3 representing negative ones. The atom labels (α, β, µ and ν) are used in place of real elements (e.g. N, C, H) to reinforce the notion that the example is purely fabricated.

Figure 1: Training set for the Find-S example (molecule diagrams for P1–P4 and N1–N3 not reproduced here)
At this stage the chemist (user) must suggest a possible template on which to base the search for toxicity-inducing substructures. It is thought that a substructure of the form ATOM–ATOM–ATOM (with – representing a bond) contributes to toxicity. It is now the task of the algorithm to find sub-molecules matching this structure which exist in as many of the positives, and in as few of the negatives, as possible.

The Algorithm Procedure

To solve the problem, we use the Find-S method with the aim of producing solutions of the form <A, B, C>, where A, B and C are taken from the set of chemical symbols present in the molecules, i.e. {α, β, µ, ν}. However, we also need to look for general solutions where an atom in a particular position is not fixed. We therefore append {?} to the previous set, giving {α, β, µ, ν, ?}.

We start off with the most specific hypotheses possible. Any final concept learned will have to be true of at least one positive example. We use this to produce our first set of triples:

<α, β, µ> and <β, µ, ν>

These are the two substructures that exist in P1 and match the template specified. We now check whether each of these substructures holds in the next molecule (P2). If not, we generalise the substructure so that it becomes true in P2. This generalisation is done by introducing as few variables as possible. In doing so we find the least general generalisations, which guarantees that our final answers are as specific as possible. The expanded set of substructures is then tested on P3 and, following the same procedure, on P4.
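The generalisation step just described, introducing as few variables as possible, can be sketched as a least general generalisation of two atom triples. This is an illustrative Python fragment, not the existing Prolog implementation, and the second triple below is a hypothetical chain rather than one read from the figure:

```python
# Least general generalisation of two three-atom substructures: keep the
# positions that agree, and introduce '?' only where they differ.

def lgg(s1, s2):
    """Return the most specific triple covering both s1 and s2."""
    return tuple(a if a == b else "?" for a, b in zip(s1, s2))

# Generalising <α, β, µ> against a hypothetical chain <α, β, ν> introduces a
# single variable, yielding <α, β, ?>.
print(lgg(("α", "β", "µ"), ("α", "β", "ν")))   # -> ('α', 'β', '?')
```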
A trace of the intermediate results is shown below. Each stage lists the hypothesis set after the named molecule has been analysed; previously derived substructures are carried forward from stage to stage.

After P1: <α, β, µ>, <β, µ, ν>
After P2: the above, plus <α, β, ?>, <β, ?, ν>
After P3: the above, plus <?, β, µ>, <?, β, ?>, <α, ?, ?>, <β, ?, ?>, <?, ?, ν>
After P4: no change

Note that no new substructures are produced on analysis of P4: all the substructures produced after analysing P3 match components of P4 exactly, without the need for generalisation.

Evaluation of Results

The algorithm has now returned nine possible hypotheses for substructures that determine toxicity. These can now be scored based on:

• how many positive molecules contain the derived substructure;
• how many negative molecules do not contain it.

The accuracy of each hypothesis (the number of correctly classified positives and negatives, over all seven molecules) is given below:

Hypothesis      Accuracy
1  <α, β, µ>    43%
2  <β, µ, ν>    57%
3  <α, β, ?>    57%
4  <β, ?, ν>    86%
5  <?, β, µ>    57%
6  <?, β, ?>    57%
7  <α, ?, ?>    43%
8  <β, ?, ?>    57%
9  <?, ?, ν>    57%

It can be seen that the most accurate hypothesis derived is number four, <β, ?, ν>. This is statistically the most frequent substructure (of the form ATOM-ATOM-ATOM) that occurs in the positives but not in the negatives. The structure can then be used to predict the toxicity of unseen compounds: other compounds containing a match for hypothesis four are statistically likely to be toxic.

For a complete implementation of the algorithm, the procedure should be repeated with P2 as the initial positive, generalising on the others, and likewise with P3 and P4 as initial positives.

1.10. Algorithm evaluation methods

On obtaining a 'result' from the Find-S algorithm, i.e. a hypothesis (or set of hypotheses) representing a sub-molecule thought most likely to contribute to toxicity, it is desirable to have some certainty that the result is indeed accurate: we want the promising results obtained on the training set to extend to unseen examples. There is no way to guarantee the accuracy of a hypothesis; however, there are accepted methods and measures through which a user can become more confident in the results obtained.

In our example above, the 'best' hypothesis had a (predicted) accuracy of 86%, calculated as the number of correctly classified positives and negatives over the total number of compounds analysed. However, this figure is based purely on examples that the hypothesis has already seen; it is not a strong indicator of accuracy on unseen examples.

1.10.1. Cross validation

One possible way of addressing this is to reserve some examples from the training set, and subsequently use these reserved examples as tests on the derived hypothesis.
The results of the hypothesis applied to the reserved examples can then be compared with their actual categorisation, which is known because they were provided as part of the training set. This cross validation is a standard machine learning technique, and the splitting of initial example data into a training set and a test set can give the user more confidence that the derived hypothesis will be accurate and of use. Clearly, it can also have the opposite effect, with a user finding out that the derived hypothesis in fact performs poorly on genuinely unseen examples.

1.10.2. K-fold cross validation
It is often of interest to measure the performance of the learning algorithm itself, and not just a specific hypothesis. A technique to achieve this is k-fold cross validation [4]. This involves partitioning the data into k disjoint subsets of equal size. There are then k training and testing rounds, with each subset successively acting as the test set and the other k-1 subsets together forming the training set. The average accuracy can then be calculated over the k independent test runs.

This technique is typically used when the number of data objects is in the region of a few hundred, with each subset containing at least thirty examples. This ensures that the tests provide reasonable results, as having too few test examples would skew the accuracy figures.

As each round is performed independently, there is no guarantee that the hypothesis generated in one training round will be the same as the hypothesis generated in another. It is for this reason that the overall accuracy figures are representative of the algorithm as a whole, not just one particular result.

1.11. Issues with the Find-S technique

As with all machine learning techniques, Find-S has some factors that encourage its use, and others that make it less favourable. Some of these considerations are discussed here.

1.11.1. Guarantee of finding the most specific hypothesis

As the name of the algorithm suggests, the process is guaranteed to find the most specific hypothesis consistent with the positive training examples, within the hypothesis space. This is a consequence of selecting the least general generalisations when analysing compounds. This property can be viewed as being both advantageous and disadvantageous.
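Returning briefly to section 1.10.2, the k-fold round structure can be sketched as follows; `learn` and `evaluate` are hypothetical stand-ins for hypothesis derivation and accuracy measurement.

```python
# Sketch of k-fold cross validation: k disjoint folds, with each fold
# in turn acting as the test set against the remaining k-1 folds.

def k_fold_accuracy(examples, k, learn, evaluate):
    folds = [examples[i::k] for i in range(k)]   # k disjoint subsets
    scores = []
    for i in range(k):
        test_set = folds[i]
        train_set = [e for j, fold in enumerate(folds)
                     if j != i for e in fold]
        hypothesis = learn(train_set)            # independent round
        scores.append(evaluate(hypothesis, test_set))
    # The average over the k rounds measures the algorithm itself,
    # not any single derived hypothesis.
    return sum(scores) / k
```

Because each round derives its own hypothesis, the returned average characterises the learning algorithm rather than one particular result, as noted above.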
It is sometimes useful for users to know as much information about the substructure as possible, as this may enable them to better understand the chemical reason for the molecule's toxicity. However, where an example derives multiple hypotheses consistent with the training data, the algorithm will still return the most specific, even though the others have the same statistical accuracy. Further, it is possible that the process derives several maximally specific consistent hypotheses [4]. To account for this case, we would need to extend the algorithm to allow backtracking at the choice points for generalisation. This would find target concepts along a different branch to that first explored.

1.11.2. Overfitting

Overfitting is often thought of as the problem of an algorithm memorising answers rather than deducing concepts and rules from them, and is inherent in many machine learning techniques. A particular hypothesis is said to overfit the training examples when some other
hypothesis that fits the training examples less well actually performs better over the whole set of instances (i.e. including non-training-set instances).

Overfitting can occur when the number of training examples used is too small to provide an illustrative sample of the true target function. It can also occur when there are errors in the example data, known as noise. Noise has a particularly detrimental effect on the Find-S algorithm, as explained below.

1.11.3. Noisy data

Any non-trivial set of data taken from the real world is subject to a degree of error in its representation. Mistakes can be made in analysing the data and categorising examples, in translating information from one form to another, and through repeated data being inconsistent with itself. In machine learning terms, such errors in the data are termed noise.

While certain algorithms are fairly robust to noise, the Find-S technique is inherently not so. This is because the algorithm effectively ignores all negative examples in the training data. Generalisations are made to include as many positive examples as possible, but no attempt is made to exclude negatives. This in itself is not a problem: if the data contains no errors, then the current hypothesis can never require a revision in response to a negative example [4]. However, the introduction of noise changes this situation. It may no longer be the case that the negative examples can simply be ignored, and Find-S makes no effort to accommodate such possible inconsistencies in the data.

1.11.4. Parallelisability

The Find-S algorithm lends itself well to a parallel, distributed implementation, which would speed up computation.
A parallel implementation could involve individual processors being allocated different initial positives; recall that the algorithm is only complete when hypotheses have been derived using each possible starting positive. The derivation of hypotheses from any particular initial positive is independent of the others, and hence can be run in parallel with the other derivations.

1.12. Existing Prolog implementation

S. Colton has implemented an initial version of the Find-S algorithm in PROLOG. This relatively compact program (approximately 300 lines of code) identifies substructures from a sample data set as used by King et al. [2]. The program is guided by substructure templates, of which a few have been hard-coded. It has recreated some of the results produced by the ILP method and PROGOL on the sample data set considered. The program can take parameters to specify the minimum number of ground terms that must appear in a resultant hypothesis (i.e. to limit variables), the minimum number of positive molecules for which a hypothesis should return TRUE, and the maximum number of negative molecules for which it can return TRUE.
An important point for discussion here is the representation of the background and structural data. The molecules are represented as a series of facts in a PROLOG database. The representation is identical to that suggested in the section on inductive logic programming, and involves storing information about atoms and the bonding between them. The data stored for even a single molecule is extensive; however, these PROLOG facts can be generated automatically, as mentioned in section 4.1.
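To make this concrete, the sketch below transliterates the kind of atom and bond facts involved into Python tuples; the exact predicate names and fields used in the actual PROLOG database may differ.

```python
# Illustrative (hypothetical) encoding of the background facts: one
# entry per atom and one per bond, for an invented molecule 'd1'.

# atm(Molecule, AtomId, Element): one fact per atom
atoms = [
    ('d1', 'd1_1', 'c'),
    ('d1', 'd1_2', 'n'),
    ('d1', 'd1_3', 'o'),
]

# bond(Molecule, Atom1, Atom2, BondType): one fact per bond
bonds = [
    ('d1', 'd1_1', 'd1_2', 1),
    ('d1', 'd1_2', 'd1_3', 2),
]

def neighbours(atom_id, bonds):
    """Atoms directly bonded to `atom_id`: the basic query the
    template-matching search needs to walk the molecule."""
    out = []
    for _, a1, a2, _ in bonds:
        if a1 == atom_id:
            out.append(a2)
        elif a2 == atom_id:
            out.append(a1)
    return out

print(neighbours('d1_2', bonds))   # ['d1_1', 'd1_3']
```

Even this three-atom fragment needs five facts, which illustrates why the data for a full molecule is extensive and why automatic generation of the facts matters.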
4. Implementation Considerations

The Find-S algorithm has been discussed at length as it represents the core component of a system to identify substructures. However, the initial remit was to create a substructure server, whereby users would be able to identify potentially interesting substructures from their positive and negative examples. As such, other considerations need to be examined; these are summarised here.

1.13. Representing structures

There exists a conflict between the natural user representation of chemical structures and the representation useful to the implemented algorithm. In a sense, the users' view of structures must be parsed into the computer view (first-order logic) at some stage, either manually by the user, or by the implemented software as pre-processing for the Find-S algorithm. It is clearly more desirable from the users' position that this conversion is done in an automated fashion. The feasibility of this is briefly discussed here.

Chemists are often concerned with modelling compounds, and the industry-standard modelling software is QUANTA [9]. King et al. in [2] used QUANTA editing tools to automatically map a visual representation of a molecule into first-order logic. After some suitable pre-processing, this mapped representation could be read by their PROGOL program as a series of facts. Another molecular simulation program, CHARMM [10], stores information about the molecule being simulated in data files. These data files use standard naming and referencing techniques, as described in the Protein Data Bank [11]. The structure of these flat text files is conducive to translation to other formats, given the development of suitable schemas.

1.14. Improvement of current implementation

S. Colton's current implementation of the Find-S algorithm can serve as a basis for further work.
The algorithm could be recoded in a modern object-oriented language, which would facilitate parallelising the algorithm and packaging it as a web-based application.

One key improvement would be the introduction of new search templates. These templates guide the algorithm, restricting its search to sub-molecules matching the specified template. Currently only a small number of templates are implemented; it is desirable that more be made available to the user.

1.15. Extensions

As advanced work in this area, further extensions to those suggested above are possible. Implementing the algorithm in parallel is one such extension. This would speed up the potentially highly complex and time-consuming derivation of hypotheses.
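A sketch of how such a parallel extension might be organised: each initial positive seeds an independent derivation, so the runs can be mapped over a worker pool. A thread pool is used here for brevity; a process pool or distributed setup would suit the CPU-bound case, and `derive_from` is a hypothetical stand-in for a full Find-S run.

```python
# Each initial positive seeds an independent Find-S derivation, so the
# derivations can be executed concurrently and their results gathered.
from concurrent.futures import ThreadPoolExecutor

def derive_from(seed_positive):
    # Placeholder: a real implementation would run Find-S starting
    # from this positive, generalising over the remaining molecules,
    # and return the hypotheses derived.
    return (seed_positive, [])

def derive_all(positives, workers=4):
    """Farm one derivation per initial positive out to a worker pool;
    map preserves the input order of the seeds."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(derive_from, positives))

print(derive_all(['P1', 'P2', 'P3', 'P4']))
```

Because the per-seed runs share no state, no synchronisation is needed beyond collecting the results.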
There is also scope for the generated hypotheses to be presented in different formats. While an answer returned in first-order logic may be strictly accurate, it is unlikely to be of much use to a user with little or no knowledge of computational logic techniques. Molecular visualisation software exists, such as RASMOL and the later PROTEIN EXPLORER [12], that can take as input data in a similar format to that produced by QUANTA or CHARMM. It would be desirable for a user to be able to view the resultant hypotheses, with the sub-molecule derived by the algorithm presented visually.
5. References

[1] Ellis, L., Aetna InteliHealth Drug Resource Centre, From Laboratory To Pharmacy: How Drugs Are Developed, 2002. http://www.intelihealth.com/IH/ihtIH/WSIHW000/8124/31116/346361.html?d=dmtContent
[2] King, R. D., Muggleton, S. H., Srinivasan, A. & Sternberg, M. J. E., Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming (1995), Proceedings of the National Academy of Sciences (USA) 93, 438-442
[3] Hansch, C., Maloney, P. P., Fujita, T. & Muir, R. M., Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients (1962), Nature (London) 194, 178-180
[4] Mitchell, T. M., Machine Learning, International Edition, 1997, McGraw-Hill
[5] Glen, B., Molecular Modelling and Molecular Informatics, University of Cambridge Centre for Molecular Informatics. www-ucc.ch.cam.ac.uk/colloquia/rcg-lectures/A4
[6] Muggleton, S., Inductive Logic Programming (1991), New Generation Computing 8, 295-318
[7] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E. & King, R. D., Theories for mutagenicity: a study in first-order and feature-based induction (1996), Artificial Intelligence 85(1,2), 277-299
[8] Muggleton, S., Inverse Entailment and Progol (1995), New Generation Computing 13, 245-286
[9] Colton, S. G., Lecture 11: Overview of Machine Learning, Imperial College London. http://www2.doc.ic.ac.uk/~sgc/teaching/341.html
[9] QUANTA software, Accelrys Inc. http://www.accelrys.com/quanta/
[10] Chemistry HARvard Molecular Mechanics (CHARMM). http://www.ch.embnet.org/MD_tutorial/pages/CHARMM.Part1.html
[11] The Protein Data Bank. http://www.rcsb.org/pdb
[12] RasMol Home Page. http://www.umass.edu/microbio/rasmol/