On Learning Gene Regulatory Networks with Only Positive Examples
Luigi Cerulo, University of Sannio, Italy
Michele Ceccarelli, University of Sannio, Italy

Outline
• Supervised inference of gene regulatory networks
• The positive only problem
• Negative selection approaches
• Effect on prediction accuracy
• Conclusions and future directions

Gene Regulatory Network (GRN)
The network of transcription dependencies among the genes of an organism, known as transcription factors, and their binding sites.

[Figure: gene A encodes a transcription factor (TF) protein that binds gene B and regulates its expression]

Gene Regulatory Network (GRN)
• A gene regulatory network can be represented as a graph G = (Vertices, Edges)
• Vertices = Genes
• Edges = Interactions

[Figure: an example network with genes G1-G8 as vertices and regulatory interactions as edges]

Inference of Gene Regulatory Networks

[Figure: the G1-G8 network to be inferred; each gene is described by a vector Gi = {e1, e2, e3, ..., en}, e.g. its expression profile]

GRN unsupervised inference
• Correlation models (e.g., mutual information; see the sketch below)
• Bayesian networks
• Boolean networks
• ODE models
• ...

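As a rough illustration of the correlation-based family in the list above (not the authors' method), the sketch below scores every gene pair by the absolute Pearson correlation of their expression profiles; the expression matrix and its dimensions are made up for the example.

import numpy as np

def correlation_scores(expr: np.ndarray) -> np.ndarray:
    """Score every gene pair by the absolute Pearson correlation
    of their expression profiles.

    expr: (n_genes, n_samples) expression matrix (hypothetical data).
    Returns an (n_genes, n_genes) matrix of putative interaction scores.
    """
    scores = np.abs(np.corrcoef(expr))   # corrcoef treats each row as one variable
    np.fill_diagonal(scores, 0.0)        # ignore self-interactions
    return scores

# Toy usage: 5 genes measured in 40 samples
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 40))
print(correlation_scores(expr).round(2))
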
GRN supervised inference
• Part of the network is known in advance from public databases (e.g., RegulonDB)

[Figure: the G1-G8 network with a subset of the regulatory edges already known]

GRN supervised inference

[Figure: the known regulations of the G1-G8 network are added as labels, yielding the training set T below]

Gi = {e1, e2, e3, ..., en}
T = {(G1, G2), (G2, G3), (G6, G7), (G7, G8)}

Binary classifier (SVM, decision tree, neural networks, ...)

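The slide above casts network reconstruction as binary classification over gene pairs. Below is a minimal sketch of one common encoding, assuming each ordered pair is represented by the concatenation of the two expression profiles; the data, the negative pairs, and the kernel choice are hypothetical, not taken from the talk.

import numpy as np
from sklearn.svm import SVC

def pair_features(expr, i, j):
    """Feature vector for the ordered gene pair (i, j):
    concatenation of the two expression profiles."""
    return np.concatenate([expr[i], expr[j]])

# Hypothetical data: 8 genes x 40 samples, plus a few known regulations
rng = np.random.default_rng(1)
expr = rng.normal(size=(8, 40))
positives = [(0, 1), (1, 2), (5, 6), (6, 7)]   # known regulations, e.g. (G1, G2), ...
negatives = [(3, 0), (4, 2), (7, 1), (2, 5)]   # assumed non-interactions

X = np.array([pair_features(expr, i, j) for i, j in positives + negatives])
y = np.array([1] * len(positives) + [0] * len(negatives))

clf = SVC(kernel="rbf").fit(X, y)
# Rank an unseen candidate edge, e.g. (G3, G4): a higher decision value means more likely
print(clf.decision_function(pair_features(expr, 2, 3).reshape(1, -1)))
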
Related work




SIRENE approach
• Trains an SVM classifier for each gene and predicts which genes are regulated by that gene
• Combines all predicted regulations to obtain the full regulatory network (see the sketch below)

[Figure: the G1-G8 network repeated once per gene, illustrating the local per-gene models whose predictions are combined]

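A rough sketch of the per-regulator (local model) idea described above. This is an illustration under simplifying assumptions, not the original SIRENE code: for each regulator an SVM is trained on the expression profiles of its known targets versus assumed non-targets, and the per-regulator scores are stacked into a score matrix over all candidate regulations.

import numpy as np
from sklearn.svm import SVC

def local_models(expr, known_targets, assumed_nontargets):
    """One SVM per regulator (sketch of the local-model idea, not SIRENE itself).

    expr: (n_genes, n_samples) expression matrix.
    known_targets / assumed_nontargets: dict regulator index -> list of gene indices.
    Returns scores[r, g], a decision value for "r regulates g" (higher = more likely).
    """
    n_genes = expr.shape[0]
    scores = np.zeros((n_genes, n_genes))
    for r in known_targets:
        pos, neg = list(known_targets[r]), list(assumed_nontargets[r])
        X = expr[pos + neg]                      # candidate targets are described by their profiles
        y = [1] * len(pos) + [0] * len(neg)
        clf = SVC(kernel="rbf").fit(X, y)
        scores[r] = clf.decision_function(expr)  # score every gene as a target of r
    return scores

# Hypothetical usage: 8 genes, 40 samples, one regulator with a few known targets
rng = np.random.default_rng(2)
expr = rng.normal(size=(8, 40))
print(local_models(expr, {0: [1, 2, 6]}, {0: [3, 4, 5]})[0].round(2))
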
[Figure: ROC curves (ratio of true positives vs. ratio of false positives) and precision-recall curves for CLR, SIRENE, and SIRENE-Bias]

Method               Recall at 60% of Precision   Recall at 80% of Precision
SIRENE               44.5%                        17.6%
CLR                  7.5%                         5.5%
Relevance networks   4.7%                         3.3%
ARACNe               1%                           0%
Bayesian network     1%                           0%

Compared with unsupervised methods (Mordelet and Vert, 2008)

[Figure: number of true positives among the top N predictions (N up to 400) for the supervised method (SIRENE) and the unsupervised method (ARACNE)]

Prediction of new c-Myc regulations. True positives are validated with IPA (www.ingenuity.com).

Supervised learning

[Figure, animated over several slides: positive (+) and negative (-) training examples in feature space; a decision boundary f(x) is learned from them and then used to label new, unseen points (?)]

Supervised learning with unlabeled data

[Figure, animated over several slides: progressively more examples lose their labels (?); learning the boundary f(x) from positive and unlabeled examples only is known as PU-learning]

Supervised learning of gene regulatory networks

[Figure: the G1-G8 network with the known regulations labeled +1; for each remaining gene pair the question is: is this a negative example?]

Training set

[Figure: the training set is split into labeled examples (P, all positive) and unlabeled examples (Q ∪ N, where Q are unknown positives and N are negatives)]

% of known positives = |P| / |P ∪ Q|

[Figure: AUROC (0.5 to 1.0) as a function of the % of known positives (10% to 100%)]

Effect of PU-learning
E.coli dataset [J.J. Faith et al., 2007]

Reliable negative selection

[Figure, animated over several slides: starting from the PU-learning setting, a subset of the unlabeled examples is selected as reliable negatives, and a new boundary f'(x) is learned from the positives and these selected negatives]

Reliable negative selection in text mining
• B. Liu et al., "Building Text Classifiers Using Positive and Unlabeled Examples," in ICDM 2003
• Yu et al., "PEBL: Positive Example Based Learning for Web Page Classification Using SVM," in KDD 2002
• Denis et al., "Text Classification from Positive and Unlabeled Examples," in IPMU 2002

Methods based on reliable negative selection

[Figure: a negative selection heuristic turns the original training set (labeled positives P plus unlabeled examples Q ∪ N) into a new training set consisting of P and a set RN of reliable negatives]

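A minimal sketch of the two-step scheme in the diagram above, with a deliberately naive distance-based selection heuristic standing in for the topology-based heuristics discussed later; the data, the heuristic, and all parameters are hypothetical.

import numpy as np
from sklearn.svm import SVC

def two_step_pu(X, labeled_pos, unlabeled, select_negatives):
    """Two-step PU scheme: (1) pick reliable negatives RN from the
    unlabeled set with a heuristic, (2) train a classifier on P vs RN."""
    rn = select_negatives(X, labeled_pos, unlabeled)        # step 1
    idx = list(labeled_pos) + list(rn)
    y = [1] * len(labeled_pos) + [0] * len(rn)
    clf = SVC(kernel="rbf").fit(X[idx], y)                  # step 2
    return clf, rn

def farthest_from_positives(X, labeled_pos, unlabeled, k=10):
    """Naive stand-in heuristic (illustration only): take the k unlabeled
    examples farthest from the mean of the labeled positives."""
    center = X[list(labeled_pos)].mean(axis=0)
    dist = np.linalg.norm(X[list(unlabeled)] - center, axis=1)
    order = np.argsort(dist)[::-1][:k]
    return [list(unlabeled)[i] for i in order]

# Hypothetical usage
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
clf, rn = two_step_pu(X, labeled_pos=range(10), unlabeled=range(10, 100),
                      select_negatives=farthest_from_positives)
print(len(rn), clf.decision_function(X[:3]))
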
Quality of RN
• RN can be contaminated with positives embedded in the unlabeled data
• The fraction of positive contamination is the ratio between the number of positives in RN and the total number of unknown positives |Q|

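In a benchmark setting where the ground truth is known, the contamination measure defined above can be computed directly; a small sketch with made-up gene pairs:

def positive_contamination(rn, q):
    """Fraction of positive contamination: positives that ended up in RN
    divided by the total number of unknown positives |Q|."""
    rn, q = set(rn), set(q)
    return len(rn & q) / len(q)

# Hypothetical example: 3 of the 20 unknown positives were selected as "negatives"
rn = {("G1", "G5"), ("G2", "G7"), ("G3", "G4"), ("G8", "G2")}
q = {("G1", "G5"), ("G2", "G7"), ("G3", "G4")} | {("x%d" % i, "y%d" % i) for i in range(17)}
print(positive_contamination(rn, q))   # 3 / 20 = 0.15
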
[Figure: AUROC (0.5 to 1.0) vs. % of known positives (10% to 100%), with one curve for positive contamination = 0 and one for positive contamination = 1 (PU-learning)]

Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]

[Figure: F-measure (0.0 to 0.6) vs. % of known positives (10% to 100%), with one curve for positive contamination = 0 and one for positive contamination = 1 (PU-learning)]

Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]

Network topology based heuristics

Network motifs
Network motifs are small connected subnetworks that a network exhibits in significantly higher or lower numbers than would be expected by chance.

[Figure: two example motifs, one over genes A, B, C and one over genes A, B, C, D, E]

B. Goemann, E. Wingender, and A. P. Potapov, "An approach to evaluate the topological significance of motifs and other patterns in regulatory networks," BMC Systems Biology, vol. 3, no. 53, May 2009.

S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, "Network motifs in the transcriptional regulation network of Escherichia coli," Nature Genetics, vol. 31, no. 1, pp. 64-68, May 2002.

Network Motifs Heuristic
• For each three-gene subnetwork T:
• If T matches a network motif M, consider all connections not present in M as negatives (see the sketch below)

[Figure, animated over several slides: a three-gene subnetwork on A, B, C; the connections that do not belong to the matched motif are taken as negative examples]

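A minimal sketch of the heuristic stated above, under simplifying assumptions: the known part of the network is a set of directed edges, motifs are given as small edge patterns, and matching is plain isomorphism over three-gene subnetworks. This illustrates the idea only; it is not the authors' exact matching procedure.

from itertools import combinations, permutations

def matches_motif(sub, motifs):
    """True if the 3-node edge set is isomorphic to one of the motif patterns."""
    for perm in permutations(range(3)):
        if {(perm[i], perm[j]) for (i, j) in sub} in motifs:
            return True
    return False

def motif_negatives(genes, known_edges, motifs):
    """For every 3-gene subnetwork whose known edges match one of the given
    motif patterns, label every other ordered pair in the triple as negative.

    known_edges: set of (regulator, target) pairs.
    motifs: list of motif patterns given as sets of index pairs over (0, 1, 2),
            e.g. the feed-forward loop {(0, 1), (0, 2), (1, 2)}.
    """
    negatives = set()
    for triple in combinations(genes, 3):
        sub = {(i, j) for i in range(3) for j in range(3)
               if i != j and (triple[i], triple[j]) in known_edges}
        if sub and matches_motif(sub, motifs):
            for i, j in permutations(range(3), 2):
                if (i, j) not in sub:
                    negatives.add((triple[i], triple[j]))
    return negatives

# Hypothetical usage: A regulates B and C, B regulates C (a feed-forward loop)
genes = ["A", "B", "C", "D"]
known = {("A", "B"), ("A", "C"), ("B", "C")}
ffl = {(0, 1), (0, 2), (1, 2)}
print(motif_negatives(genes, known, [ffl]))
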
[Figure, animated over two slides: AUROC (0.5 to 1.0) vs. % of known positives (10% to 100%); the MOTIF heuristic is plotted against the positive contamination = 0 and positive contamination = 1 (PU-learning) curves]

MOTIF selection performance
E.coli dataset [J.J. Faith et al., 2007 and RegulonDB]

[Figure, animated over two slides: F-measure (0.0 to 0.6) vs. % of known positives (10% to 100%); the MOTIF heuristic is plotted against the positive contamination = 0 and positive contamination = 1 (PU-learning) curves]

Effect of positive contamination
E.coli dataset [J.J. Faith et al., 2007]

Scale free networks

Albert-László Barabási and Zoltán N. Oltvai, "Network biology: understanding the cell's functional organization," Nature Reviews Genetics 5, 101-113 (2004)

Hierarchical networks

Hong-Wu Ma, Jan Buer, and An-Ping Zeng, "Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach," BMC Bioinformatics 2004, 5:199

Experimental data
• 445 Affymetrix Antisense2 microarray expression profiles for 4345 genes of E.coli [J.J. Faith et al., 2007]
• Data were standardized (i.e., zero mean and unit standard deviation; see the sketch below)
• Regulations extracted from RegulonDB (v. 5) between 154 transcription factors and 1211 genes

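A minimal sketch of the standardization step listed above, assuming it is applied per gene (z-scoring each expression profile); the matrix here is a random stand-in, not the actual dataset.

import numpy as np

def standardize(expr):
    """Per-gene standardization (assumption: applied row-wise): each gene's
    profile is shifted to zero mean and scaled to unit standard deviation."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    return (expr - mean) / std

# Hypothetical 4345-gene x 445-sample matrix (random stand-in)
rng = np.random.default_rng(4)
expr = rng.normal(loc=5.0, scale=2.0, size=(4345, 445))
z = standardize(expr)
print(z.mean(axis=1)[:3].round(6), z.std(axis=1)[:3].round(6))
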
Summary and conclusions
• Learning gene regulations is affected by the problem of learning from positive only data
• At least for E.coli:
  • The study of positive contamination shows that there is room for new heuristics
  • Topology based heuristics (e.g., motifs) have shown promising results
• Open issues arise in higher-level organisms, where gene interactions are more complex

Re-weighting strategy (PosOnly)
Cerulo et al. BMC Bioinformatics 2010, 11:228
http://www.biomedcentral.com/1471-2105/11/228

Research article (Open Access): "Learning gene regulatory networks from only positive and unlabeled data"
Luigi Cerulo, Charles Elkan and Michele Ceccarelli

Abstract
Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact.
Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement.

PosOnly: How it works

[Figure: training examples split into labeled (s=1) and unlabeled (s=0); P is labeled and positive (y=1), Q are unlabeled positives, N are unlabeled negatives (y=0)]

Let x be a random example:
• s=1 iff x is labeled, s=0 iff x is unlabeled
• y=1 iff x is positive, y=0 iff x is negative
• If s=1 then y=1 (the converse is not always true!)
• p(s=1 | x, y=0) = 0

PosOnly: How it works
• The goal is to learn a classifier such that f(x) = p(y=1|x)
• It is easy to see that (Elkan and Noto, 2008):
  p(s=1|x) / p(s=1|y=1)
    = p(s=1 and y=1|x) / p(s=1|y=1)            (since s=1 implies y=1)
    = p(s=1|y=1,x) p(y=1|x) / p(s=1|y=1)
    = p(y=1|x) = f(x)                          (labeled examples are selected at random from the positives, so p(s=1|y=1,x) = p(s=1|y=1))

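The same argument written out as a displayed derivation, a restatement of the Elkan and Noto (2008) result above with the selected-completely-at-random assumption made explicit:

\begin{align*}
p(s=1 \mid x) &= p(s=1, y=1 \mid x) && \text{since } s=1 \Rightarrow y=1 \\
              &= p(s=1 \mid y=1, x)\, p(y=1 \mid x) \\
              &= p(s=1 \mid y=1)\, p(y=1 \mid x) && \text{labels drawn at random among positives} \\
\text{hence}\quad f(x) = p(y=1 \mid x) &= \frac{p(s=1 \mid x)}{p(s=1 \mid y=1)}.
\end{align*}
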
PosOnly: How it works

f(x) = p(s=1|x) / p(s=1|y=1)

• Numerator p(s=1|x): a binary classifier trained with labeled and unlabeled examples
• Denominator p(s=1|y=1): an unknown constant, estimated empirically in a number of ways (see the sketch below)

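A minimal sketch of this re-weighting idea, following Elkan and Noto (2008): train a probabilistic classifier g(x) that approximates p(s=1|x) on labeled-vs-unlabeled data, estimate the constant c = p(s=1|y=1) as the average of g(x) over held-out labeled positives, and divide. The data, the split, and the parameter choices below are hypothetical, not the paper's exact setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def posonly_fit(X, s):
    """PosOnly sketch in the style of Elkan & Noto (2008):
    s[i] = 1 if example i is labeled (hence positive), 0 if unlabeled.
    Returns (g, c) with f(x) = g(x) / c approximating p(y=1|x)."""
    X_tr, X_val, s_tr, s_val = train_test_split(X, s, test_size=0.2,
                                                stratify=s, random_state=0)
    g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)   # models p(s=1|x)
    # c = p(s=1|y=1) estimated as the mean score over held-out labeled positives
    c = g.predict_proba(X_val[s_val == 1])[:, 1].mean()
    return g, c

def posonly_predict(g, c, X):
    """Estimated p(y=1|x) = p(s=1|x) / p(s=1|y=1), clipped to [0, 1]."""
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)

# Hypothetical usage: 200 examples, 30 of them labeled positives
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
s = np.zeros(200, dtype=int)
s[:30] = 1
g, c = posonly_fit(X, s)
print(c, posonly_predict(g, c, X[:5]).round(3))
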
[Figure: mean F-measure vs. % of known positives]

Results: experimental data
