



Computational discovery of
composite motifs in DNA

Geir Kjetil Sandve, Osman Abul and Finn Drabløs


                                     Finn Drabløs [tare.medisin.ntnu.no]
Introduction



   Basic gene regulation
 • Proteins (transcription factors, TFs) recognise binding sites
   (sequence motifs) in gene regulatory regions
 • The transcription factors stabilise the transcription complex
 • Distal promoters (enhancers) interact through DNA looping

   [Illustration: Michael Lones]

Motivation



 De novo prediction of binding sites
 • Make a set of co-regulated genes
     – E.g. from microarray experiments, normally imperfect sets
 • Extract assumed regulatory regions
     – Normally a fixed region upstream from TSS of each gene
 • Search for overrepresented patterns in these regions
     – Use a model for what a motif should look like
         • Consensus sequence with mismatches
         • Position Weight Matrix (PWM) based on log odds scores for occurrences
     – Use a strategy to find (local) optima for this model
         • E.g. Gibbs sampling, expectation maximisation …

 • Problem: More than 100 different methods
     – Which methods are reliable?
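
 To make the PWM model above concrete, here is a minimal sketch of a log-odds
 PWM built from a handful of aligned sites and used to scan a sequence. The
 example sites, the uniform background and the pseudocount of 1 are
 illustrative assumptions, not taken from the talk or from any specific tool.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Return a log-odds PWM: one dict per column, base -> log2(p_obs / p_bg)."""
    background = background or {b: 0.25 for b in ALPHABET}  # assumed uniform
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b])
                    for b in ALPHABET})
    return pwm

def score(pwm, window):
    """Sum of per-column log-odds scores for a window of the same width."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, position) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned binding sites, for illustration only.
sites = ["TATAAT", "TATAAA", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGGTATAATCCC"))
```

 A de novo tool would additionally have to search for the sites themselves
 (e.g. by Gibbs sampling); this sketch only covers the scoring model.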



Motivation



   Benchmarking of de novo tools
   • Tompa et al., Nature Biotechnology 23, 137-144 (2005)
   • Tested 14 different tools for motif discovery
   • Used 52 data sets from fly (6), human (26), mouse (12)
     and yeast (8)
   • Used data sets with real (Transfac) binding sites in
     different sequence contexts
       – "real" – The actual promoter sequences
       – "generic" – Randomly chosen promoter sequences from same genome
       – "markov" – Sequences generated by Markov chain of order 3
   • Measured performance at nucleotide level
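
   As an illustration of the "markov" context, the sketch below trains an
   order-3 Markov chain on example sequences and samples a new sequence from
   it. The training sequence, the function names and the uniform fallback for
   unseen contexts are assumptions for illustration; this is not the exact
   procedure used in the benchmark.

```python
import random
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Count which base follows each context of length `order`."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    return counts

def generate(counts, length, order=3, rng=None):
    """Sample a sequence whose next base depends on the previous `order` bases."""
    rng = rng or random.Random(0)
    out = list(rng.choice(sorted(counts)))  # start from a random seen context
    while len(out) < length:
        followers = counts.get("".join(out[-order:]))
        if not followers:
            # Context never seen in training: fall back to a uniform draw.
            out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(followers.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

# Hypothetical training sequence, for illustration only.
model = train_markov(["ACGTACGTTAGCTAGCTAGGATCCATGACGTTAGC"])
example = generate(model, 50)
print(example)
```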




Motivation




 Average benchmark performance
   Method          TP      FP     FN       TN
   AlignAce       477    3789   8186   436048
   ANN-Spec       754    7799   7909   432038
   Consensus      178    1394   8485   438443
   GLAM           223    5619   8440   434218
   Improbizer     594    7942   8069   431895
   MEME           581    4836   8082   435001
   MEME3          673    6726   7990   433111
   MITRA          272    4092   8391   435745
   MotifSampler   520    4344   8143   435493
   Oligo/dyad     345    1891   8318   437946
   QuickScore     151    4856   8512   434981
   SeSiMCMC       530   13813   8133   426024
   Weeder         748    1748   7915   438089
   YMF            554    3492   8109   436345

   Average over the 14 methods, as a confusion matrix
                Pred_P    Pred_N
   Real_P          471      8192     (TP, FN)
   Real_N         5167    434670     (FP, TN)

   nCC = 0.053
   Performance is close to random! Too many FP, FN
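
   The nCC value can be checked from the averaged counts. The sketch below
   assumes nCC is the nucleotide-level correlation coefficient used in the
   Tompa et al. assessment, i.e. the Pearson (Matthews) correlation between
   predicted and real site positions; the function name is illustrative.

```python
import math

def ncc(tp, fp, fn, tn):
    """Nucleotide-level correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fn * fp
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Averaged counts from the table above; the slide reports nCC = 0.053.
average_ncc = ncc(tp=471, fp=5167, fn=8192, tn=434670)
print(round(average_ncc, 3))
```

   A value of 1 would mean perfect prediction and 0 is what random guessing is
   expected to give, which is why 0.053 is described as close to random.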




Motivation



   Can we improve performance?
 • Use better motif representations
     – Hidden Markov Models
 • Use better algorithms
     – More exhaustive searching (TODAY!)
     – Discriminative motif discovery
 • Use better background models
     – Real sequences (not Markov models) (TODAY!)
 • Filter out false positives
     – Identify “motif-like” solutions
     – Identify regulatory regions
     – Use co-occurrence of motifs (TODAY!)
         • Modules, composite motifs

Approach



 Composite motif discovery




• TFs act together as modules
• Modules are not completely unique

Algorithm



 Basic definitions
 • Frequent modules
     – Modules (and motifs) can be ranked by support
            • Fraction of sequences where the module (or motif) is found
     – Support is monotonic
            • Adding a motif to a module can never increase module support

 • Specific modules
     – Modules can be ranked by hit probability
            • Probability that a sequence supports the module
     – Hit probability is monotonic (as for support)
     – Specific modules have low hit probability in background sequences
 • Significant modules
     – Modules can be ranked by significance
            • Probability that the observed support differs from what is expected in background
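
 The monotonicity of support can be seen in a small sketch. It assumes a
 module is simply a set of motifs that must all occur in a sequence, and it
 ignores any distance constraints between motifs; the motif names and hit
 sets are hypothetical.

```python
def support_set(module, hits, all_sequences):
    """Sequences in which every motif of the module has at least one hit."""
    result = set(all_sequences)
    for motif in module:
        result &= hits[motif]
    return result

def support(module, hits, all_sequences):
    """Fraction of sequences that support the module."""
    return len(support_set(module, hits, all_sequences)) / len(all_sequences)

# Hypothetical data: which sequences contain a hit for each single motif.
all_sequences = {"s1", "s2", "s3", "s4"}
hits = {"A": {"s1", "s2", "s3"}, "B": {"s2", "s3", "s4"}, "C": {"s3"}}

s_a = support(["A"], hits, all_sequences)
s_ab = support(["A", "B"], hits, all_sequences)
s_abc = support(["A", "B", "C"], hits, all_sequences)
print(s_a, s_ab, s_abc)
```

 Each added motif intersects the support set with one more hit set, so the
 support can only stay the same or shrink.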



Algorithm



 Search tree
 • Discretized single motifs {1, 2, 3, …} organised as an
   implicit search tree
 • Support set H and hit probability P are computed
   iteratively (monotonicity)
     – Initially H is the full sequence set and P is 1
 • Search tree is efficiently pruned (indicated with X)
   based on H and P
 • Final output can be ranked by module significance
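
 A minimal sketch of such a search is given below. It is an assumption-laden
 illustration, not the Compo implementation: pruning uses only a minimum
 support threshold, P is tracked as a product of per-motif background
 probabilities, and modules are reported when P falls below a chosen maximum.
 How Compo itself uses P for pruning is not stated on the slide. All data and
 thresholds are hypothetical.

```python
def enumerate_modules(motifs, hits, bg_prob, all_sequences,
                      min_support, max_hit_prob):
    """Depth-first enumeration of motif combinations with support pruning."""
    results = []
    n = len(all_sequences)

    def extend(start, module, H, P):
        for idx in range(start, len(motifs)):
            motif = motifs[idx]
            new_H = H & hits[motif]          # support set can only shrink
            if len(new_H) / n < min_support:
                continue                     # prune: extensions cannot recover
            new_P = P * bg_prob[motif]       # hit probability can only shrink
            new_module = module + [motif]
            if new_P <= max_hit_prob:
                results.append((tuple(new_module), len(new_H) / n, new_P))
            extend(idx + 1, new_module, new_H, new_P)

    # Initially H is the full sequence set and P is 1.
    extend(0, [], set(all_sequences), 1.0)
    return results

# Hypothetical input, for illustration only.
all_sequences = {"s1", "s2", "s3", "s4"}
hits = {1: {"s1", "s2", "s3"}, 2: {"s1", "s2", "s4"}, 3: {"s4"}}
bg_prob = {1: 0.5, 2: 0.4, 3: 0.1}
results = enumerate_modules([1, 2, 3], hits, bg_prob, all_sequences,
                            min_support=0.5, max_hit_prob=0.25)
print(results)
```

 Because motifs are only ever added in increasing index order, each
 combination is visited at most once, which is what makes the tree implicit.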
Implementation



 Module significance
 • Position-level probability in background
     – Probability of single motif at specific location
     – Estimated from real DNA background sequences
 • Sequence-level probability in background
     – Probability of single motif at least once in given background sequence
     – Estimated as union of position-level probabilities
 • Hit-probability in background
     – Probability of composite motif at least once in background sequence
     – Estimated as product of individual motif components
 • Significance p-value of observed support
     – Probability of seeing at least observed support in background set
     – Estimated as right tail of binomial distribution
         • At least k out of n successes given hit-probability
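
 The chain of estimates above can be sketched as follows. The sketch assumes
 that positions are independent when taking the union and that motif
 components are independent when taking the product; both are simplifying
 assumptions made here for illustration, and the function names are not from
 the talk.

```python
import math

def sequence_level_prob(position_probs):
    """P(motif occurs at least once in a sequence), positions independent."""
    prob_none = 1.0
    for p in position_probs:
        prob_none *= 1.0 - p
    return 1.0 - prob_none

def hit_probability(sequence_level_probs):
    """P(all component motifs occur in a background sequence)."""
    return math.prod(sequence_level_probs)

def binomial_right_tail(k, n, p):
    """P(at least k of n sequences support the module) given hit probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i)
               for i in range(k, n + 1))

# Hypothetical numbers: two motifs, 200 candidate positions per sequence,
# and an observed support of 6 out of 20 sequences.
p_motif_1 = sequence_level_prob([0.001] * 200)
p_motif_2 = sequence_level_prob([0.0005] * 200)
p_hit = hit_probability([p_motif_1, p_motif_2])
p_value = binomial_right_tail(6, 20, p_hit)
print(p_hit, p_value)
```

 A small p-value means the observed support would be unlikely if the
 sequences behaved like background.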


Implementation



 Problem specification
 • Frequent and specific modules
     – Use thresholds on support and specificity
     – Complete solutions but multi-objective optimization
 • Top-ranking modules
     – Combine objectives into single measure, e.g. p-value
 • Pareto-optimal modules
     – Each objective is a separate dimension of optimality
     – Return Pareto front of composite motifs

   [Figure: http://en.wikipedia.org/wiki/Pareto_efficiency]
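
 A Pareto front can be extracted with a simple dominance check, sketched
 below. The sketch assumes every objective is expressed so that larger is
 better (for example support and -log10 of the hit probability); the
 candidate modules and their scores are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [(name, objectives) for name, objectives in candidates
            if not any(dominates(other, objectives)
                       for _, other in candidates)]

# Hypothetical modules scored as (support, -log10 hit probability).
candidates = [
    ("m1", (0.9, 1.0)),
    ("m2", (0.6, 3.0)),
    ("m3", (0.5, 2.0)),  # dominated by m2 on both objectives
    ("m4", (0.3, 4.0)),
]
front = pareto_front(candidates)
print([name for name, _ in front])
```

 Unlike a single combined score, the front keeps every trade-off between the
 objectives and leaves the final choice to the user.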



Implementation



 Motif prediction flowchart




Benchmarking



 Benchmark data set



 • Known composite motifs from the TransCompel database
 • Tests performance by adding “noise matrices” to input
    – Matrices for TFs assumed not to bind in sequence set
        • Will have random (false positive) hits
    – Selected at random from Transfac
        • Max noise level includes all Transfac matrices
    – Similar to actual usage
        • Searching for motifs consisting of unknown TFs


Benchmarking



 General performance (nCC)




 • Compo compared to several other tools
    – TransCompel benchmark set
 • Compo clearly performs best, in particular at
   realistic settings (high noise level)

Benchmarking



 Background and support
 • Compo gains performance from realistic background (real
   DNA) and support
    – Compared with random DNA based on a multinomial sequence model
 • Performance without real DNA background or support is
   comparable to other tools




Future development



 Pareto front
• Pareto front on support, max motif distance and
  significance (colour)
• Compo prediction not optimal
    – Compo predicted Ets and GATA
    – Annotated motif is AP1 and NFAT
• Explore alternative solutions
• Explore parameter interactions

  Figure legend: X – NFAT, O – AP1
Acknowledgements



  The research group

   BiGR
   Drabløs, Finn

   Postdocs / Researchers
   Sætrom, Pål
   Kusnierczyk, Wacek
   Rye, Morten
   Klein, Jörn
   Anderssen, Endre
   Wang, Xinhui (ERCIM)
   Capatana, Ana (ERCIM, starting 2009)

   PhDs
   Bratlie, Marit Skyrud
   Klepper, Kjetil
   Saito, Takaya
   Lundbæk, Marie
   Håndstad, Tony

   Programmers / Technicians
   Johansen, Jostein
   Thomas, Laurent
   Olsen, Lene C.

   Others
   Solbakken, Trude

   Master students
   Bolstad, Kjersti
   Muiser, Iwe
   Sponberg, Bjørn
   Brands, Stef
   Skaland, Even

   Former members
   Sandve, Geir Kjetil
   Abul, Osman
   Schwalie, Petra
   Lones, Michael
