1. R’MES
Finding Exceptional Motifs in Sequences
S. Schbath
INRA, Jouy-en-Josas, France
http://genome.jouy.inra.fr/ssb/rmes/
BOSC, Stockholm, June 27-28, 2009 – p.1
3. DNA and motifs
• DNA: Long molecule, sequence of
nucleotides
• Nucleotides: A(denine), C(ytosine),
G(uanine), T(hymine).
• Motif (= oligonucleotide): short
sequence of nucleotides, e.g.
CAGTAG
• Functional motif: recognized by
proteins or enzymes to initiate a
biological process
TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA. . .
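As a minimal illustration of what "occurrences of a motif" means here, the sketch below locates every (possibly overlapping) occurrence of CAGTAG in the example sequence above. The function name `find_occurrences` is an illustrative choice, not part of R’MES.

```python
def find_occurrences(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start positions of all occurrences of motif.

    Occurrences are allowed to overlap, which matters for
    self-overlapping words such as "AA" in "AAAA".
    """
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one position later so overlapping hits are not skipped.
        start = sequence.find(motif, start + 1)
    return positions


seq = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
hits = find_occurrences(seq, "CAGTAG")
print(len(hits), hits)  # 3 [16, 23, 30]
```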
5. Some functional motifs
• Restriction sites: recognized by specific bacterial restriction enzymes ⇒
double-strand DNA break.
E.g. GAATTC recognized by EcoRI
→ very rare along bacterial genomes
• Chi motif: recognized by an enzyme that progresses along the DNA sequence
and degrades it ⇒ the degradation activity of the enzyme is stopped and DNA repair is
stimulated by recombination.
E.g. GCTGGTGG recognized by RecBCD (E. coli)
→ very frequent along the E. coli genome
• parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome
into macro-domains.
TGTTAACACGTGAAACA (degenerate: alternative nucleotides t, c, c, t, t are allowed at some positions)
→ very frequent in the ORI domain, rare elsewhere
• promoter: structured motif recognized by the RNA polymerase to initiate
gene transcription.
E.g. TTGAC -(16;18)- TATAAT (E. coli), i.e. two boxes separated by a spacer of 16 to 18 nucleotides.
→ located mainly upstream of genes
6. Prediction of functional motifs
Most of the functional motifs are unknown in the different species.
For instance,
• What would be the Chi motif of S. aureus? [Halpern et al. (07)]
• Is there an equivalent of parS in E. coli? [Mercier et al. (08)]
Statistical approach: identify candidate motifs based on their statistical
properties.
The most over-represented 8-letter words under M1
E. coli (sequence length $4.6 \times 10^6$)
word       obs    exp     score
gctggtgg   762     84.9   73.5
ggcgctgg   828    125.9   62.6
cgctggcg   870    150.8   58.6
gctggcgg   723    125.9   53.3
cgctggtg   619    101.7   51.3

The most over-represented families of the form anbcdefg (n = any nucleotide) under M1
H. influenzae (sequence length $1.8 \times 10^6$)
motif      obs    exp     score
gntggtgg   223     55.3   22.33
anttcatc   469    180.3   21.59
anatcgcc   288     87.8   21.38
tnatcgcc   279     84.5   21.18
gnagaaga   270     83.6   20.10
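To get a feeling for the E. coli columns, the sketch below recomputes the observed/expected ratio and a simple Poisson-like normalisation, (obs - exp)/sqrt(exp), for the five words listed. This normalisation is an illustrative assumption: the slides do not give the variance used by R’MES, and the score it reports comes from its own approximations. For these five rows the simple value happens to land within about 0.1 of the reported score.

```python
import math

# (word, observed count, expected count under M1, reported score),
# copied from the E. coli table.
ECOLI_ROWS = [
    ("gctggtgg", 762, 84.9, 73.5),
    ("ggcgctgg", 828, 125.9, 62.6),
    ("cgctggcg", 870, 150.8, 58.6),
    ("gctggcgg", 723, 125.9, 53.3),
    ("cgctggtg", 619, 101.7, 51.3),
]


def poisson_like_score(observed: float, expected: float) -> float:
    """Illustrative normalisation only: treats the variance as equal to the mean."""
    return (observed - expected) / math.sqrt(expected)


for word, obs, exp, reported in ECOLI_ROWS:
    simple = poisson_like_score(obs, exp)
    print(f"{word}  obs/exp={obs / exp:5.2f}  simple={simple:6.2f}  reported={reported}")
```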
8. Statistical questions addressed by R’MES
Questions related to the significance of the number of occurrences of motifs w
in sequences:
• Is $N^{\mathrm{obs}}(w)$ significantly high?
• Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?
→ If $w' = \bar{w}$ (the reverse complement of w): is w significantly skewed (strand bias)?
• Is $N_1^{\mathrm{obs}}(w)$ significantly more unexpected than $N_2^{\mathrm{obs}}(w)$?
Several types of motifs w:
• fixed words (e.g. gctggtgg),
• degenerate patterns (e.g. gntggtgg),
• sets of words (e.g. $\{w, \bar{w}\}$).
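The three motif types can be counted with the same machinery. The sketch below is one possible way to do it, not the R’MES implementation: a degenerate pattern written with IUPAC nucleotide codes is turned into a regular expression, and a lookahead is used so that overlapping occurrences are all counted. The helper names and the toy sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a fixed word or degenerate pattern into an overlapping matcher."""
    body = "".join(IUPAC[ch] for ch in pattern.lower())
    # A zero-width lookahead reports a hit at every start position.
    return re.compile(f"(?=({body}))")


def count_pattern(sequence: str, pattern: str) -> int:
    """Count occurrences (overlaps included) of a word or degenerate pattern."""
    return sum(1 for _ in pattern_to_regex(pattern).finditer(sequence.lower()))


def count_word_set(sequence: str, words: list[str]) -> int:
    """Count occurrences of any word of a set of distinct, equal-length words."""
    return sum(count_pattern(sequence, word) for word in set(words))


toy = "ttgctggtggaagatggtggcc"  # hypothetical toy sequence
print(count_pattern(toy, "gctggtgg"))                  # fixed word
print(count_pattern(toy, "gntggtgg"))                  # degenerate pattern
print(count_word_set(toy, ["gctggtgg", "gatggtgg"]))   # set of words
```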
11. Is $N^{\mathrm{obs}}(w)$ significantly high?
• One needs to calculate the p-value $P(N(w) \ge N^{\mathrm{obs}}(w))$, where $N(w)$ is the
count (a random variable) of w in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
sequence composition in oligonucleotides of length 1 up to (m + 1).
It is possible to take the phase in coding sequences into account (Mm_3).
• R’MES approximates the p-value by using
• either a Gaussian approximation of $N(w)$ (when $E(N(w))$ is large)
[Prum et al. (95)], [Schbath et al. (95)]
• or a compound Poisson approximation of $N(w)$ (when $E(N(w))$ is small)
[Schbath (95)], [Roquain and Schbath (07)]
(see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)
• R’MES produces scores of exceptionality (probit transformation of the p-value).
High positive (resp. negative) scores correspond to exceptionally frequent
(resp. rare) motifs.
rmes --gauss -s seqfile -m m -l wordlength -o outputfile
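Two ingredients of this slide can be sketched in a few lines. First, the usual plug-in estimate of the expected count of a word under an order-m Markov model fitted on the sequence itself: the product of the counts of its (m+1)-letter subwords divided by the product of the counts of its interior m-letter subwords. Second, a probit transformation, assumed here to map an upper-tail p-value p to the score u such that a standard Gaussian exceeds u with probability p. Both functions are illustrative assumptions written for this note; they are not taken from the R’MES source, and the variance needed for the Gaussian approximation is not shown.

```python
from collections import Counter
from statistics import NormalDist


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count all overlapping k-letter words of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def expected_count(sequence: str, word: str, m: int) -> float:
    """Plug-in estimate of E(N(word)) under an order-m Markov model."""
    if len(word) < m + 2:
        raise ValueError("the word must have at least m + 2 letters")
    numerators = kmer_counts(sequence, m + 1)
    expected = 1.0
    for i in range(len(word) - m):
        expected *= numerators[word[i:i + m + 1]]
    if expected == 0.0:
        return 0.0
    denominators = kmer_counts(sequence, m) if m > 0 else None
    for i in range(1, len(word) - m):
        # For m = 0 the "interior subwords" are empty; use the sequence length.
        expected /= denominators[word[i:i + m]] if m > 0 else len(sequence)
    return expected


def probit_score(upper_tail_p_value: float) -> float:
    """Score u with P(Z >= u) = p for a standard Gaussian Z (0 < p < 1)."""
    # -inv_cdf(p) equals inv_cdf(1 - p) but keeps precision for tiny p.
    return -NormalDist().inv_cdf(upper_tail_p_value)


toy = "acgtacgtacgt"  # hypothetical toy sequence
print(expected_count(toy, "acg", m=1))   # N(ac) * N(cg) / N(c)
print(expected_count(toy, "ac", m=0))    # N(a) * N(c) / length
print(round(probit_score(0.025), 2))     # small p-value -> positive score
```

With this convention a p-value close to 1 (a motif that is rare compared with the model) gives a negative score, which matches the sign convention stated on the slide.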
14. Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(\bar{w})$?
• One needs to calculate the p-value
$P\left( \frac{N(w)}{N(\bar{w})} \ge \frac{N^{\mathrm{obs}}(w)}{N^{\mathrm{obs}}(\bar{w})} \right)$,
where $N(\cdot)$ is the count (a random variable) in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
sequence composition in oligonucleotides of length 1 up to (m + 1).
It is possible to take the phase in coding sequences into account (Mm_3).
• R’MES approximates the p-value by using
• the 2-dimensional Gaussian approximation of $(N(w), N(\bar{w}))$ (when
$E(N(w))$ and $E(N(\bar{w}))$ are large)
[Prum et al. (95)], [Schbath et al. (95)]
• R’MES produces scores of exceptional skew (probit transformation):
High positive (resp. negative) scores correspond to motifs significantly more
frequent (resp. rare) along the sequence than along the complementary strand.
rmes --skew --seq seqfile -m m -l wordlength -o outputfile
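The observed quantity being tested is easy to compute by hand. The sketch below counts a word and its reverse complement on one strand and returns their ratio; counting the reverse complement on the given strand is the same as counting the word on the complementary strand. Function names and the toy sequence are hypothetical, and the significance of the ratio (the part R’MES computes) is not addressed here.

```python
import math

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(word: str) -> str:
    """Complement every nucleotide, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def count_overlapping(sequence: str, word: str) -> int:
    """Count occurrences of word, overlaps included."""
    count, start = 0, sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + 1)
    return count


def observed_skew(sequence: str, word: str) -> float:
    """Ratio of the count of word to the count of its reverse complement."""
    n_word = count_overlapping(sequence, word)
    n_revcomp = count_overlapping(sequence, reverse_complement(word))
    if n_revcomp == 0:
        # The ratio is undefined (0/0) or unbounded; flag it explicitly.
        return math.inf if n_word > 0 else math.nan
    return n_word / n_revcomp


toy = "gctggtggaagctggtggttccaccagc"  # hypothetical toy sequence
print(reverse_complement("gctggtgg"))   # ccaccagc
print(observed_skew(toy, "gctggtgg"))   # 2 occurrences vs 1 -> 2.0
```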
17. Is $N_1^{\mathrm{obs}}(w)$ significantly more exceptional than $N_2^{\mathrm{obs}}(w)$?
• One wants to compare the exceptionality of a motif w in two different
sequences (two observed counts $N_1^{\mathrm{obs}}(w)$ and $N_2^{\mathrm{obs}}(w)$).
• R’MES computes a test statistic and its associated p-value to test
H0 : {w is equally exceptional in both sequences}
against
H1 : {w is more exceptional in the first sequence}
[Robin et al. (08)]
• The test is performed by modelling the occurrence processes as Poisson
processes whose intensities take the sequence compositions in oligonucleotides of
length 1 up to (m + 1) into account.
• Option --seq2 soon available in R’MES.
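The slides do not spell out the test statistic. One classical way to compare two Poisson counts, shown here purely as an assumption and not as the R’MES method, is to condition on their total: if both counts exceed their expectations by the same factor, then the first count given the total is binomial with success probability exp1 / (exp1 + exp2), where exp1 and exp2 are the expected counts under each sequence's own model. A small upper-tail probability then supports H1.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def compare_exceptionality(obs1: int, exp1: float, obs2: int, exp2: float) -> float:
    """One-sided p-value for H1: w is more exceptional in sequence 1.

    Assumes Poisson counts and conditions on obs1 + obs2 (illustrative test).
    """
    total = obs1 + obs2
    p_null = exp1 / (exp1 + exp2)
    return binomial_upper_tail(obs1, total, p_null)


# Hypothetical numbers: same expectation in both sequences.
print(compare_exceptionality(5, 1.0, 5, 1.0))   # equal counts: no evidence for H1
print(compare_exceptionality(10, 1.0, 0, 1.0))  # all occurrences in sequence 1
```

Direct summation is adequate for moderate counts; for counts in the thousands a log-space or library implementation would be preferable.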
20. Chi motifs in bacterial genomes
• Motif involved in the repair of double-strand DNA breaks.
Chi needs to be frequent along bacterial genomes.
• Chi motifs have been identified in only a few bacterial species. They are not
conserved across species.
• Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
• Moreover, Chi activity is strongly orientation-dependent (direction of DNA
replication).
Chi is preferentially present on the leading strands (high skew).
21. E. coli as a learning case
• 8-letter word GCTGGTGG
• 762 occurrences on the leading strands (sequence length $4.6 \times 10^6$)
• Among the most over-represented 8-letter words (whatever the model Mm)
⇒ its frequency cannot be explained by the genome composition.
• Its rank improves if one analyzes only the backbone genome (the part of the genome
conserved in several strains of the species).
• Its skew equals 3.20 (p-value of $3.3 \times 10^{-11}$).
The skew of a motif w is defined by $N^{\mathrm{obs}}(w)/N^{\mathrm{obs}}(\bar{w})$, where $\bar{w}$ is the reverse
complement of w.
22. Identification of Chi motif in S. aureus
Halpern et al. (07)
• Analysis of the S. aureus backbone (sequence length $2.44 \times 10^6$).
• 8-letter words: none of the most over-represented and skewed motifs were
frequent enough.
• 7-letter words:
A=gaaaatg (1067), B=ggattag (266), C=gaagcgg (272), D=gaattag (614)
24. Organization of the Ter macrodomain in E. coli
The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)].
How is such structure ensured?
Bacillus subtilis as a learning case:
• In B. subtilis, the parS motif is responsible for structuring the
chromosomal domain surrounding the origin of replication [Lin and
Grossman (98)].
• The parS motif is 16 nt long; its sequence is partially degenerate and rather
palindromic.
TGTTAACACGTGAAACA (degenerate: alternative nucleotides t, c, c, t, t are allowed at some positions)
• It is recognized by Spo0J in both directions.
• One of its 11-mers is the most exceptional 11-mer pair $(w, \bar{w})$ in the origin domain.
26. Identification of matS in E. coli
GACACTGTCAC
TGACACTGTCA
GACAGTGTCAC
GACGTTGTCAC
GACAACGTCAC
TGACAACGTCA
GTGACRNYGTCAC
matS is the 13-nt motif GTGACRNYGTCAC; it is recognized by the MatP protein, which
structures the Ter domain [Mercier et al. (08)].
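The relation between the 11-letter sequences listed above and the 13-nt consensus can be checked mechanically. The sketch below, with hypothetical helper names, slides each 11-mer along the consensus and reports the offsets at which every letter is compatible with the IUPAC code at that position (R = A or G, Y = C or T, N = any nucleotide).

```python
# Only the IUPAC codes that appear in the consensus are needed here.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def is_compatible(word: str, pattern: str) -> bool:
    """True if word matches the degenerate pattern letter by letter."""
    return len(word) == len(pattern) and all(
        base in IUPAC_SETS[code] for base, code in zip(word, pattern)
    )


def matching_offsets(word: str, consensus: str) -> list[int]:
    """0-based offsets at which word is compatible with a window of consensus."""
    width = len(word)
    return [
        i
        for i in range(len(consensus) - width + 1)
        if is_compatible(word, consensus[i:i + width])
    ]


CONSENSUS = "GTGACRNYGTCAC"
ELEVEN_MERS = [
    "GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
    "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA",
]

for word in ELEVEN_MERS:
    print(word, matching_offsets(word, CONSENSUS))
```

Each of the six sequences is compatible with exactly one window of the consensus, at offset 1 or 2.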
27. Acknowledgment
Françoise Gélis (R’MES 1.0)
Annie Bouvier (R’MES 2.0)
Mark Hoebeke (R’MES 3.0)
http://genome.jouy.inra.fr/ssb/rmes/