R’MES
  Finding Exceptional Motifs in Sequences

                 S. Schbath


         INRA, Jouy-en-Josas, France



http://genome.jouy.inra.fr/ssb/rmes/




                                            BOSC, Stockholm, June 27-28, 2009 – p.1
Introduction:
motifs and statistics




DNA and motifs


 • DNA: Long molecule, sequence of
   nucleotides
 • Nucleotides: A(denine), C(ytosine),
   G(uanine), T(hymine).
 • Motif (= oligonucleotide): short
   sequence of nucleotides, e.g.
   CAGTAG
 • Functional motif: recognized by
   proteins or enzymes to initiate a
   biological process



TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA. . .
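
As a minimal illustration of what counting a motif means, the Python sketch below lists the (possibly overlapping) occurrences of CAGTAG in the example sequence above. The function name `find_occurrences` is an assumption made for this example and is not part of R'MES.

```python
def find_occurrences(seq, word):
    """Return the 0-based start positions of all occurrences of word in seq.

    Overlapping occurrences are reported, since the search restarts one
    position after each hit.
    """
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


# Example sequence from the slide (without the trailing ellipsis).
seq = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
print(find_occurrences(seq, "CAGTAG"))  # [16, 23, 30]
```

The observed count N obs (w) used throughout the talk is simply the length of this list.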



Some functional motifs

• Restriction sites: recognized by specific bacterial restriction enzymes ⇒
  double-strand DNA break.
  E.g. GAATTC recognized by EcoRI
  very rare along bacterial genomes
• Chi motif: recognized by an enzyme that moves along the DNA sequence
  and degrades it ⇒ the degradation activity of the enzyme is stopped and DNA
  repair is stimulated by recombination.
  E.g. GCTGGTGG recognized by RecBCD (E. coli)
  very frequent along E. coli genome
• parS: recognized by the Spo0J protein ⇒ organization of B. subtilis genome
  into macro-domains.
        t
  TGTTAACACGTGAAACA
   c    c   t           t
  very frequent in the Ori domain, rare elsewhere
• promoter: structured motif recognized by the RNA polymerase to initiate
  gene transcription.
  E.g. TTGAC -(16 to 18 nt)- TATAAT (E. coli).
  located mainly in front of genes
Prediction of functional motifs

Most functional motifs are unknown in the various species.
For instance,
  • What is the Chi motif of S. aureus? [Halpern et al. (08)]
  • Is there an equivalent of parS in E. coli? [Mercier et al. (08)]
Statistical approach: identify candidate motifs based on their statistical
properties.

The most over-represented 8-letter words under M1,
E. coli (sequence length $4.6 \times 10^{6}$):

    word         obs      exp    score
    gctggtgg     762     84.9     73.5
    ggcgctgg     828    125.9     62.6
    cgctggcg     870    150.8     58.6
    gctggcgg     723    125.9     53.3
    cgctggtg     619    101.7     51.3

The most over-represented families anbcdefg under M1,
H. influenzae (sequence length $1.8 \times 10^{6}$):

    motif        obs      exp    score
    gntggtgg     223     55.3    22.33
    anttcatc     469    180.3    21.59
    anatcgcc     288     87.8    21.38
    tnatcgcc     279     84.5    21.18
    gnagaaga     270     83.6    20.10
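
The "exp" column is the expected count of the word under the fitted model. A minimal sketch of how such a value can be obtained for model M1 is given below, assuming the usual plug-in estimator for an order-1 Markov chain (product of the observed dinucleotide counts along the word, divided by the observed counts of its interior letters). This is an illustration only, not R'MES's actual implementation, and the function names are hypothetical.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping k-letter words of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of word under an M1 model.

    E(w) ~ prod_i N(w_i w_{i+1}) / prod_{interior i} N(w_i)
    """
    if len(word) < 2:
        raise ValueError("word must have at least 2 letters under M1")
    pairs = count_words(seq, 2)
    singles = count_words(seq, 1)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pairs[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:  # interior letters only
        denominator *= singles[letter]
    return numerator / denominator if denominator else 0.0


toy = "ACGTACGTAC"
print(expected_count_m1(toy, "ACG"))  # N(AC) * N(CG) / N(C) = 3 * 2 / 3 = 2.0
```

By construction the estimate for a 2-letter word equals its observed count, which is what "M1 fits the composition in oligos of length 1 and 2" means in practice.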
Presentation of R’MES




Statistical questions addressed by R’MES

Questions related to the significance of the number of occurrences of motifs w
in sequences:

  • Is $N^{\mathrm{obs}}(w)$ significantly high?
  • Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?
     → If $w' = \overline{w}$ (the reverse complement of w): is w significantly skewed (strand bias)?
  • Is $N_1^{\mathrm{obs}}(w)$ significantly more unexpected than $N_2^{\mathrm{obs}}(w)$?
Several types of motifs w:

  • fixed words (e.g. gctggtgg),
  • degenerate patterns (e.g. gntggtgg),
  • sets of words (e.g. $\{w, \overline{w}\}$).




Is $N^{\mathrm{obs}}(w)$ significantly high?

• One needs to calculate the p-value $P(N(w) \ge N^{\mathrm{obs}}(w))$, where $N(w)$ is the
  count (a random variable) of w in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
  sequence composition in oligos of length 1 up to (m + 1).
  The phase in coding sequences can be taken into account (Mm_3).
• R’MES approximates the p-value by using
   • either a Gaussian approximation of $N(w)$ (when $E(N(w))$ is large)
      [Prum et al. (95)], [Schbath et al. (95)]
   • or a compound Poisson approximation of $N(w)$ (when $E(N(w))$ is small)
      [Schbath (95)], [Roquain and Schbath (07)]
  (see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)
• R’MES produces scores of exceptionality (probit transformation of the p-value).
  High positive (resp. negative) scores correspond to exceptionally frequent
  (resp. rare) motifs.

         rmes --gauss -s seqfile -m m -l wordlength -o outputfile
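
To make the probit transformation concrete, here is a minimal sketch that maps an upper-tail p-value to a score on the standard normal scale. It assumes the score is defined as minus the standard normal quantile of the p-value; the function name `probit_score` is hypothetical, and the exact convention used inside R'MES may differ.

```python
from statistics import NormalDist


def probit_score(p_upper):
    """Map an upper-tail p-value P(N(w) >= n_obs) to a probit score.

    Small p-values (over-represented motifs) give large positive scores,
    p-values close to 1 (under-represented motifs) give negative scores.
    The p-value must lie strictly between 0 and 1.
    """
    return -NormalDist().inv_cdf(p_upper)


print(round(probit_score(0.001), 2))  # about 3.09
print(round(probit_score(0.999), 2))  # about -3.09
```

A score near 0 therefore means that the observed count is close to what the model predicts.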


Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?

 • One needs to calculate the p-value
   $P\left( \dfrac{N(w)}{N(w')} \ge \dfrac{N^{\mathrm{obs}}(w)}{N^{\mathrm{obs}}(w')} \right)$,
   where $N(\cdot)$ is the count (a random variable) in random sequences (→ model).
 • R’MES considers Markov chain models of order m (Mm), which fit the
   sequence composition in oligos of length 1 up to (m + 1).
   The phase in coding sequences can be taken into account (Mm_3).
 • R’MES approximates the p-value by using
    • the 2-dimensional Gaussian approximation of $(N(w), N(w'))$ (when
       $E(N(w))$ and $E(N(w'))$ are large)
       [Prum et al. (95)], [Schbath et al. (95)]
 • R’MES produces scores of exceptional skew (probit transformation):
   High positive (resp. negative) scores correspond to motifs significantly more
   frequent (resp. rare) along the sequence than along the complementary one.

          rmes --skew --seq seqfile -m m -l wordlength -o outputfile




Is $N_1^{\mathrm{obs}}(w)$ significantly more exceptional than $N_2^{\mathrm{obs}}(w)$?

    • One wants to compare the exceptionality of a motif w in two different
      sequences (two observed counts $N_1^{\mathrm{obs}}(w)$ and $N_2^{\mathrm{obs}}(w)$).
    • R’MES computes a test statistic and its associated p-value to test
                 H0 : {w is equally exceptional in both sequences}
      against
                 H1 : {w is more exceptional in the first sequence}
      [Robin et al. (08)]
    • The test is performed by modeling the occurrence processes as Poisson
      processes whose intensities take into account the sequence compositions in
      oligos of length 1 up to (m + 1).
    • Option --seq2 soon available in R’MES.




RMESPlot interface




Prediction and identification
 of functional DNA motifs




Chi motifs in bacterial genomes

• Motif involved in the repair of double-strand DNA breaks.
  Chi needs to be frequent along bacterial genomes.
• Chi motifs have been identified for only a few bacterial species. They are not
  conserved across species.
• Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
• Moreover, Chi activity is strongly orientation-dependent (direction of DNA
  replication).
  Chi is preferentially present on the leading strands (high skew).




E. coli as a learning case

  • 8-letter word GCTGGTGG
  • 762 occurrences on the leading strands (sequence length $4.6 \times 10^{6}$)
  • Among the most over-represented 8-letter words (whatever the model Mm)
     ⇒ its frequency cannot be explained by the genome composition.
  • Its rank improves if one analyzes only the backbone genome (genome
     conserved in several strains of the species).
  • Its skew equals 3.20 (p-value of $3.3 \times 10^{-11}$).


The skew of a motif w is defined by $N^{\mathrm{obs}}(w) / N^{\mathrm{obs}}(\overline{w})$, where $\overline{w}$ is the reverse
complement of w.
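
The skew definition above can be sketched in a few lines of Python. The helper names (`reverse_complement`, `count_overlapping`, `skew`) and the toy sequence are assumptions made for this illustration; the sketch only computes the observed ratio and says nothing about its significance.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def count_overlapping(seq, word):
    """Count occurrences of word in seq, overlaps included."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq[i:i + len(word)] == word)


def skew(seq, word):
    """Observed skew: count of word divided by count of its reverse complement."""
    n_word = count_overlapping(seq, word)
    n_rc = count_overlapping(seq, reverse_complement(word))
    if n_rc == 0:
        return float("inf") if n_word else float("nan")
    return n_word / n_rc


print(reverse_complement("GCTGGTGG"))  # CCACCAGC
toy = "GCTGGTGGAAGCTGGTGGAACCACCAGC"  # made-up sequence: 2 Chi sites, 1 reverse complement
print(skew(toy, "GCTGGTGG"))  # 2.0
```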




Identification of Chi motif in S. aureus

                               Halpern et al. (07)
 • Analysis of the S. aureus backbone (sequence length $2.44 \times 10^{6}$).
 • 8-letter words: none of the most over-represented and skewed motifs were
    frequent enough.
 • 7-letter words:




A=gaaaatg (1067),      B=ggattag (266),   C=gaagcgg (272),   D=gaattag (614)
Organization of the Ter macrodomain in E. coli

The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)].
How is such a structure ensured?



Bacillus subtilis as a learning case:

  • In B. subtilis, the parS motif is responsible for structuring the
     chromosomal domain surrounding the origin of replication [Lin and
     Grossman (98)].
  • The parS motif is 16 nt long; its sequence is partially degenerate and largely
     palindromic.
                                    t
                             TGTTAACACGTGAAACA
                             c      c   t        t

  • It is recognized by Spo0J in both directions.
  • One of its 11-mers is the most exceptional 11-mer $(w, \overline{w})$ in the origin domain.



Identification of matS in E. coli

10 most over-represented 11-mers $(w, \overline{w})$ of the Ter domain (compound Poisson
approximation + family option):

                                                                      rank    ra…
 word          N1   N2     E1      E2   score1   score2   p-skew     R’MES    Ske…
 GACACTGTCAC    7    0   0.21    0.43     5.84     0.39   0.0004         1
 TGACACTGTCA    7    2   0.28    0.53     5.49     1.29   0.0101         2       4
 GACAGTGTCAC    6    0   0.20    0.43     5.24     0.38   0.0011         3       1
 GACGTTGTCAC    7    3   0.35    1.30     5.22     1.06   0.0012         4       1
 GACAACGTCAC    7    3   0.37    1.49     5.15     0.88   0.0008         5       1
 GACCCGAACGA    5    1   0.12    0.47     5.09     0.31   0.0017         6       2
 ATAGGGTAGAT    4    1   0.06    0.26     4.94     0.73   0.0041         7       3
 TAGTTACAACA    5    1   0.16    0.54     4.79     0.21   0.0032         8       2
 ATAAACGGCCC    6    3   0.31    1.68     4.76     0.71   0.0008         9       1
 TGACAACGTCA    7    5   0.51   1.786     4.72     1.81   0.0073        10       3
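
One simple way to compare two counts given their expectations is an exact binomial test conditional on the total count: if both occurrence processes are Poisson with intensities proportional to E1 and E2, then given N1 + N2 the first count is binomial with success probability E1 / (E1 + E2). The sketch below assumes this construction; it is not taken from the R'MES source code, and the function name is hypothetical. With the rounded values of the first table row it returns about 0.00041, which is of the same order as the tabulated p-skew of 0.0004.

```python
from math import comb


def binomial_upper_tail(n1, n2, e1, e2):
    """P(X >= n1) for X ~ Binomial(n1 + n2, e1 / (e1 + e2)).

    n1, n2: observed counts; e1, e2: expected counts under the model.
    """
    n = n1 + n2
    p = e1 / (e1 + e2)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n1, n + 1))


# First row of the table: N1 = 7, N2 = 0, E1 = 0.21, E2 = 0.43
print(binomial_upper_tail(7, 0, 0.21, 0.43))
```

Because the expectations in the table are rounded to two decimals, small differences from the tabulated p-values are to be expected.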




Identification of matS in E. coli

                             GACACTGTCAC
                            TGACACTGTCA
                             GACAGTGTCAC
                             GACGTTGTCAC
                             GACAACGTCAC
                            TGACAACGTCA

                          GTGACRNYGTCAC



matS is the 13 nt motif GTGACRNYGTCAC: it is recognized by the MatP protein, which
structures the Ter domain [Mercier et al. (08)].
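
The degenerate letters in GTGACRNYGTCAC follow the IUPAC nucleotide code (R = A or G, Y = C or T, N = any base). As a small consistency check, the sketch below converts an IUPAC pattern into a regular expression and reports at which offset of the 13 nt consensus each exceptional 11-mer fits. The helper names are assumptions made for this example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as GTGACRNYGTCAC into a regex string."""
    return "".join(IUPAC[letter] for letter in pattern.upper())


def window_offsets(word, consensus):
    """Offsets at which word matches a window of the degenerate consensus."""
    width = len(word)
    return [
        offset
        for offset in range(len(consensus) - width + 1)
        if re.fullmatch(iupac_to_regex(consensus[offset:offset + width]), word.upper())
    ]


consensus = "GTGACRNYGTCAC"
for word in ["GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
             "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA"]:
    print(word, window_offsets(word, consensus))
```

Each of the six 11-mers fits exactly one window of the consensus, at offset 1 or 2, which is how the alignment on the slide is built.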




Acknowledgment

       Françoise Gélis (R’MES 1.0)
        Annie Bouvier (R’MES 2.0)
       Mark Hoebeke (R’MES 3.0)


http://genome.jouy.inra.fr/ssb/rmes/




08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Kürzlich hochgeladen (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Schbath Rmes Bosc2009

  • 1. R’MES: Finding Exceptional Motifs in Sequences. S. Schbath, INRA, Jouy-en-Josas, France. http://genome.jouy.inra.fr/ssb/rmes/ (BOSC, Stockholm, June 27-28, 2009)
  • 2. Introduction: motifs and statistics
  • 3. DNA and motifs
      • DNA: long molecule, a sequence of nucleotides.
      • Nucleotides: A(denine), C(ytosine), G(uanine), T(hymine).
      • Motif (= oligonucleotide): short sequence of nucleotides, e.g. CAGTAG.
      • Functional motif: recognized by proteins or enzymes to initiate a biological process.
      Example sequence: TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA...
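As a small illustration of what "occurrences of a motif" means on the example sequence of slide 3, the sketch below lists every start position of CAGTAG, allowing overlaps. The function name `find_occurrences` is an illustrative choice and is not part of R’MES.

```python
def find_occurrences(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start positions of all occurrences of motif,
    including overlapping ones."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one position later so overlapping occurrences are kept.
        start = sequence.find(motif, start + 1)
    return positions


# Example sequence and motif taken from slide 3.
sequence = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
positions = find_occurrences(sequence, "CAGTAG")
print(positions)  # [16, 23, 30]
```

Counting overlapping occurrences matters for self-overlapping words (for example AA in AAAA occurs three times), which is one reason the count distribution is not a simple binomial.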
  • 5. Some functional motifs
      • Restriction sites: recognized by specific bacterial restriction enzymes ⇒ double-strand DNA break. E.g. GAATTC, recognized by EcoRI. Very rare along bacterial genomes.
      • Chi motif: recognized by an enzyme that proceeds along the DNA sequence and degrades it ⇒ the degradation activity is stopped and DNA repair is stimulated by recombination. E.g. GCTGGTGG, recognized by RecBCD (E. coli). Very frequent along the E. coli genome.
      • parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome into macro-domains. Sequence TGTTAACACGTGAAACA, with alternative bases (t, c, c, t, t) allowed at some positions. Very frequent in the Ori domain, rare elsewhere.
      • Promoter: structured motif recognized by the RNA polymerase to initiate gene transcription. E.g. TTGAC -(16;18)- TATAAT (E. coli), i.e. two boxes separated by a spacer of 16 to 18 nucleotides. Located mainly in front of genes.
  • 6. Prediction of functional motifs
      Most functional motifs are unknown in the different species. For instance:
      • What would be the Chi motif of S. aureus? [Halpern et al. (08)]
      • Is there an equivalent of parS in E. coli? [Mercier et al. (08)]
      Statistical approach: identify candidate motifs based on their statistical properties.

      The most over-represented 8-letter words under M1, E. coli (sequence length $4.6 \times 10^6$):
        word        obs    exp     score
        gctggtgg    762    84.9    73.5
        ggcgctgg    828    125.9   62.6
        cgctggcg    870    150.8   58.6
        gctggcgg    723    125.9   53.3
        cgctggtg    619    101.7   51.3

      The most over-represented families anbcdefg under M1, H. influenzae (sequence length $1.8 \times 10^6$):
        motif       obs    exp     score
        gntggtgg    223    55.3    22.33
        anttcatc    469    180.3   21.59
        anatcgcc    288    87.8    21.38
        tnatcgcc    279    84.5    21.18
        gnagaaga    270    83.6    20.10
  • 7. Presentation of R’MES
  • 8. Statistical questions addressed by R’MES
      Questions related to the significance of the number of occurrences of motifs $w$ in sequences:
      • Is $N^{\mathrm{obs}}(w)$ significantly high?
      • Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$? If $w' = \overline{w}$ (the reverse complement of $w$): is $w$ significantly skewed (strand bias)?
      • Is $N_1^{\mathrm{obs}}(w)$ significantly more unexpected than $N_2^{\mathrm{obs}}(w)$?
      Several types of motifs $w$:
      • fixed words (e.g. gctggtgg),
      • degenerate patterns (e.g. gntggtgg),
      • sets of words (e.g. $\{w, \overline{w}\}$).
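To make the notion of a degenerate pattern concrete, here is a minimal sketch that counts the occurrences of a pattern such as gntggtgg, where n stands for any nucleotide. It relies on a regular-expression lookahead so that overlapping matches are counted. The helper name `count_degenerate` and the small code table are assumptions for illustration. This is not how R’MES itself is implemented.

```python
import re

# Character classes for the few IUPAC codes used in this talk.
IUPAC_CLASSES = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "n": "[acgt]",  # any nucleotide
    "r": "[ag]",    # purine
    "y": "[ct]",    # pyrimidine
}


def count_degenerate(sequence: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of a degenerate pattern."""
    regex = "".join(IUPAC_CLASSES[ch] for ch in pattern.lower())
    # A zero-width lookahead matches at every start position without
    # consuming characters, so overlapping occurrences are all counted.
    return len(re.findall(f"(?={regex})", sequence.lower()))


toy = "gctggtggaagatggtgg"  # hypothetical toy sequence
print(count_degenerate(toy, "gntggtgg"))  # 2
print(count_degenerate(toy, "gctggtgg"))  # 1
```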
  • 11. Is $N^{\mathrm{obs}}(w)$ significantly high?
      • One needs to calculate the p-value $P(N(w) \ge N^{\mathrm{obs}}(w))$, where $N(w)$ is the count (a random variable) of $w$ in random sequences (→ model).
      • R’MES considers Markov chain models of order $m$ (Mm), which fit the sequence composition in oligonucleotides of length 1 up to $(m + 1)$. The phase in coding sequences can be taken into account (Mm_3).
      • R’MES approximates the p-value by using
          • either a Gaussian approximation of $N(w)$ (when $E(N(w))$ is large) [Prum et al. (95)], [Schbath et al. (95)],
          • or a compound Poisson distribution of $N(w)$ (when $E(N(w))$ is small) [Schbath (95)], [Roquain and Schbath (07)]
        (see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005).
      • R’MES produces scores of exceptionality (probit transformation). High positive (resp. negative) scores correspond to exceptionally frequent (resp. rare) motifs.
      rmes –gauss –s seqfile –m m –l wordlength –o outputfile
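The following sketch illustrates the three ingredients mentioned on this slide: an expected count under an order-$m$ Markov model, a p-value, and a probit score. It is a simplified illustration and not the R’MES implementation. The expected count uses the usual plug-in estimator (product of the counts of the $(m+1)$-letter subwords divided by the product of the counts of the interior $m$-letter subwords). The p-value uses a plain Poisson tail as a stand-in for the compound Poisson approximation, which is only reasonable for rare, non-self-overlapping words. The sign convention of the score is an assumption chosen so that over-represented words get positive scores. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist


def count_words(sequence: str, length: int) -> Counter:
    """Count all overlapping words of the given length."""
    return Counter(sequence[i:i + length]
                   for i in range(len(sequence) - length + 1))


def expected_count(sequence: str, word: str, m: int) -> float:
    """Plug-in estimate of E(N(word)) under a Markov model of order m >= 1."""
    if m < 1 or len(word) <= m:
        raise ValueError("requires m >= 1 and len(word) > m")
    long_counts = count_words(sequence, m + 1)
    short_counts = count_words(sequence, m)
    numerator = 1.0
    for i in range(len(word) - m):          # all (m+1)-letter subwords
        numerator *= long_counts[word[i:i + m + 1]]
    denominator = 1.0
    for i in range(1, len(word) - m):       # interior m-letter subwords
        denominator *= short_counts[word[i:i + m]]
    return numerator / denominator if denominator else 0.0


def poisson_upper_tail(observed: int, mean: float) -> float:
    """P(X >= observed) for X ~ Poisson(mean); simplification, see lead-in."""
    if observed <= 0:
        return 1.0
    term = math.exp(-mean)
    cdf = term
    for k in range(1, observed):
        term *= mean / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def probit_score(upper_tail_p: float) -> float:
    """Map an upper-tail p-value to a score (small p gives a large score)."""
    p = min(max(upper_tail_p, 1e-300), 1.0 - 1e-12)  # keep inv_cdf in range
    return -NormalDist().inv_cdf(p)


toy = "ACGTACGTACGT"  # hypothetical toy sequence
print(expected_count(toy, "ACG", m=1))  # 3.0
```

For words with large expected counts, the Gaussian approximation mentioned on the slide would be used instead; its variance has a more involved expression that is not reproduced here.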
  • 14. Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?
      • One needs to calculate the p-value $P\left(\frac{N(w)}{N(w')} \ge \frac{N^{\mathrm{obs}}(w)}{N^{\mathrm{obs}}(w')}\right)$, where $N(\cdot)$ is the count (a random variable) in random sequences (→ model).
      • R’MES considers Markov chain models of order $m$ (Mm), which fit the sequence composition in oligonucleotides of length 1 up to $(m + 1)$. The phase in coding sequences can be taken into account (Mm_3).
      • R’MES approximates the p-value by using the 2-dimensional Gaussian approximation of $(N(w), N(w'))$ (when $E(N(w))$ and $E(N(w'))$ are large) [Prum et al. (95)], [Schbath et al. (95)].
      • R’MES produces scores of exceptional skew (probit transformation). High positive (resp. negative) scores correspond to motifs significantly more frequent (resp. rare) along the sequence than along the complementary one.
      rmes –skew –seq seqfile –m m –l wordlength –o outputfile
  • 17. Is $N_1^{\mathrm{obs}}(w)$ significantly more exceptional than $N_2^{\mathrm{obs}}(w)$?
      • One wants to compare the exceptionality of a motif $w$ in two different sequences (two observed counts $N_1^{\mathrm{obs}}(w)$ and $N_2^{\mathrm{obs}}(w)$).
      • R’MES computes a test statistic and its associated p-value to test H0: {$w$ is equally exceptional in both sequences} against H1: {$w$ is more exceptional in the first sequence} [Robin et al. (08)].
      • The test is performed by treating the occurrence processes as Poisson processes whose intensities take into account the sequence compositions in oligonucleotides of length 1 up to $(m + 1)$.
      • Option –seq2 soon available in R’MES.
  • 18. RMESPlot interface
  • 19. Prediction and identification of functional DNA motifs
  • 20. Chi motifs in bacterial genomes
      • Motif involved in the repair of double-strand DNA breaks. Chi needs to be frequent along bacterial genomes.
      • Chi motifs have been identified for only a few bacterial species. They are not conserved across species.
      • Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
      • Moreover, Chi activity is strongly orientation-dependent (direction of DNA replication). Chi is preferentially present on the leading strands (high skew).
  • 21. E. coli as a learning case
      • 8-letter word GCTGGTGG.
      • 762 occurrences on the leading strands (sequence length $4.6 \times 10^6$).
      • Among the most over-represented 8-letter words (whatever the model Mm) ⇒ its frequency cannot be explained by the genome composition.
      • Its rank is improved if one analyzes only the backbone genome (genome conserved in several strains of the species).
      • Its skew equals 3.20 (p-value of $3.3 \times 10^{-11}$).
      The skew of a motif $w$ is defined by $N^{\mathrm{obs}}(w) / N^{\mathrm{obs}}(\overline{w})$, where $\overline{w}$ is the reverse complement of $w$.
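The skew defined on this slide can be computed directly from two counts. The sketch below is a minimal illustration on a made-up toy sequence. The helper names are hypothetical, and it only computes the ratio, not the p-value that R’MES derives from the 2-dimensional Gaussian approximation.

```python
import math

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(word: str) -> str:
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def count_overlapping(sequence: str, word: str) -> int:
    """Count occurrences of word in sequence, overlaps included."""
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + 1)
    return count


def skew(sequence: str, word: str) -> float:
    """Ratio N(w) / N(reverse complement of w) on the given strand."""
    n_word = count_overlapping(sequence, word)
    n_rc = count_overlapping(sequence, reverse_complement(word))
    if n_rc == 0:
        # The ratio is undefined or infinite when the reverse complement is absent.
        return math.inf if n_word > 0 else math.nan
    return n_word / n_rc


# Hypothetical toy sequence: two copies of Chi and one of its reverse complement.
toy = "GCTGGTGGAAGCTGGTGGAACCACCAGC"
print(reverse_complement("GCTGGTGG"))  # CCACCAGC
print(skew(toy, "GCTGGTGG"))           # 2.0
```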
  • 22. Identification of the Chi motif in S. aureus [Halpern et al. (07)]
      • Analysis of the S. aureus backbone (sequence length $2.44 \times 10^6$).
      • 8-letter words: none of the most over-represented and skewed motifs was frequent enough.
      • 7-letter words: A = gaaaatg (1067), B = ggattag (266), C = gaagcgg (272), D = gaattag (614).
  • 24. Organization of the Ter macrodomain in E. coli
      The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)]. How is such a structure ensured?
      Bacillus subtilis as a learning case:
      • In B. subtilis, the parS motif is responsible for structuring the chromosomal domain surrounding the origin of replication [Lin and Grossman (98)].
      • The parS motif is 16 nt long; its sequence is partially degenerate and rather palindromic: TGTTAACACGTGAAACA, with alternative bases (t, c, c, t, t) allowed at some positions.
      • It is recognized by Spo0J in both directions.
      • One of its 11-mers is the most exceptional 11-mer $(w, \overline{w})$ in the origin domain.
  • 25. Identification of matS in E. coli
      The 10 most over-represented 11-mers $(w, \overline{w})$ of the Ter domain (compound Poisson approximation + family option). The last column is cut off in the source.
        word          N1   N2   E1     E2      score1   score2   p-skew   rank R’MES   rank Ske…
        GACACTGTCAC   7    0    0.21   0.43    5.84     0.39     0.0004   1
        TGACACTGTCA   7    2    0.28   0.53    5.49     1.29     0.0101   2            4
        GACAGTGTCAC   6    0    0.20   0.43    5.24     0.38     0.0011   3            1
        GACGTTGTCAC   7    3    0.35   1.30    5.22     1.06     0.0012   4            1
        GACAACGTCAC   7    3    0.37   1.49    5.15     0.88     0.0008   5            1
        GACCCGAACGA   5    1    0.12   0.47    5.09     0.31     0.0017   6            2
        ATAGGGTAGAT   4    1    0.06   0.26    4.94     0.73     0.0041   7            3
        TAGTTACAACA   5    1    0.16   0.54    4.79     0.21     0.0032   8            2
        ATAAACGGCCC   6    3    0.31   1.68    4.76     0.71     0.0008   9            1
        TGACAACGTCA   7    5    0.51   1.786   4.72     1.81     0.0073   10           3
  • 26. Identification of matS in E. coli
      Exceptional 11-mers: GACACTGTCAC, TGACACTGTCA, GACAGTGTCAC, GACGTTGTCAC, GACAACGTCAC, TGACAACGTCA.
      Consensus: GTGACRNYGTCAC.
      matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein, which structures the Ter domain [Mercier et al. (08)].
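The relation between the six exceptional 11-mers and the 13-nt consensus can be checked mechanically. The sketch below slides each 11-mer along GTGACRNYGTCAC and reports the offsets at which it is compatible with the IUPAC codes (R = A or G, Y = C or T, N = any base). The function names are illustrative and are not taken from R’MES.

```python
# IUPAC codes needed for the matS consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def matches_at(consensus: str, word: str, offset: int) -> bool:
    """True if word is compatible with consensus starting at offset."""
    return all(base in IUPAC[consensus[offset + i]]
               for i, base in enumerate(word))


def compatible_offsets(consensus: str, word: str) -> list[int]:
    """All offsets at which word fits inside the consensus."""
    return [offset
            for offset in range(len(consensus) - len(word) + 1)
            if matches_at(consensus, word, offset)]


consensus = "GTGACRNYGTCAC"
words = ["GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
         "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA"]
for word in words:
    # Indent each word by its offset to display the alignment.
    offsets = compatible_offsets(consensus, word)
    print(" " * offsets[0] + word, offsets)
```

Each of the six words fits the consensus at exactly one offset (1 or 2), which is consistent with the slide presenting them as shifted views of the same 13-nt motif.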
  • 27. Acknowledgments: Françoise Gélis (R’MES 1.0), Annie Bouvier (R’MES 2.0), Mark Hoebeke (R’MES 3.0). http://genome.jouy.inra.fr/ssb/rmes/