Detection of Genetic
Motifs
Bioinfo talk 11
Promoters – Biology
Information theory
Random Projections
Composed motif detection
Motifs and promoters
DNA sequence
gene
junk DNA
gene
UTR-5' UTR-3'
exon
intron
Promoter module Promoter module
TSS
TFBS TFBS
TATA box
INR DSE
e1 e2 e5e4e3
Distal Promoter Proximal promoter Core promoter
INR DSE
INR = Initiator Region
DSE = DownStream region
TSS = Transcription Start Site
TFBS = Transcription Factor Binding Site
 Short strings (12 to 20 nucleotides long)
 spread over up to 5 kb before the TSS
 The string structure selects the protein that will bind, on the basis of Van der Waals interactions
[Figure: Van der Waals interactions with the protein that is going to bind]
Example of a Transcription Factor Binding Site (TFBS): ACCGATTATCA
Assembly of the promoter protein
complex of transcription
Transcription factors TF
TFIID
TBP
Transcription factor Binding Sites TFBS
TATA box
INR
TSS
DNA
1st stage
2nd stage
INR
TSS
TFIID
TBP
TATA
core
promoter
DNA
Assembly of the promoter protein
complex of transcription
TFIID
RNA Poly II
TFIIA
INR
TSS
TATA
core
promoter
TFIIE
TFIIH
Distal promoter/
enhancer
Proximal
promoter
TF1
TF2
TF3
TF4
TF5
DNA looping
TFIIB
TBP
TFIIB
DNA
Information based motif
detection
known
unknown
TFBS's with the same colour are correlated
The set of all TFBS (for a certain
class of genes, organism or other)
Unknown Known
A T G C T C
Protein of the
Promoter complex
A T C C T G
Protein of the
Promoter complex
Example
Entropy
 Given a probability distribution, we want a function
representing the quantity of information stored in the
distribution.
 We define the entropy (H) as:
$$H = -\int p(x)\,\log(p(x))\,dx \quad\text{or}\quad H = -\sum_i p(i)\,\log(p(i))$$
 For the sake of simplicity, from now on we will use the discrete definition.
Observed entropy
 The real distribution is usually unknown, but we can replace it by the observed distribution f(x). The resulting entropy is:
$$H(x) = -\sum_x f(x)\,\log(f(x))$$
 For a multi-dimensional probability distribution it is:
$$H(x,y) = -\sum_{x,y} f(x,y)\,\log(f(x,y)) = -\sum_x f(x) \sum_y f(y|x)\,\log(f(x,y))$$
Mutual Information
 X and Y are strings of equal length, S={A, C, G, T}, x
and y belong to S
 f(x,y) is the relative joint frequency of x,y in X and Y
 f(x) is the relative frequency of x in X
 f(y) is the relative frequency of y in Y
$$\begin{aligned}
I(x,y) &= \sum_{x,y} f(x,y)\,\log\left(\frac{f(x,y)}{f(x)\,f(y)}\right) \\
&= \sum_{x,y} f(x,y)\,[\log(f(x,y)) - \log(f(x)) - \log(f(y))] \\
&= H(x) + H(y) - H(x,y)
\end{aligned}$$
Information divergence
 Given two distributions P and Q:
$$D(P,Q) = \sum_x p(x)\,\log\left(\frac{p(x)}{q(x)}\right) = \sum_x p(x)\,\log(p(x)) - \sum_x p(x)\,\log(q(x))$$
Not for exam
Example of calculation
X = A C A T T T A C C A T A G A C A A C T A
Y = A C T T T T A C G A T G G A A A C C T G

Joint counts f(x,y) (rows: letter x in X, columns: letter y in Y)

          y = A   y = C   y = G   y = T   f(x)
x = A       5       1       2       1       9
x = C       1       3       1       0       5
x = G       0       0       1       0       1
x = T       0       0       0       5       5
f(y)        6       4       4       6      20

Divide by 20 to obtain relative frequencies
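The calculation for these two strings can be sketched in a few lines of Python. This is only an illustration: the function name `mutual_information` is hypothetical, and base-2 logarithms are an assumption, since the slides do not fix the base. It computes I as the sum of f(x,y) log(f(x,y) / (f(x) f(y))) over the aligned letter pairs.

```python
from collections import Counter
from math import log2

def mutual_information(x: str, y: str) -> float:
    """I(x, y) from the relative frequencies of aligned letter pairs (log base 2 assumed)."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # joint counts f(x, y) before normalisation
    fx = Counter(x)             # marginal counts of X
    fy = Counter(y)             # marginal counts of Y
    return sum(
        (c / n) * log2((c / n) / ((fx[a] / n) * (fy[b] / n)))
        for (a, b), c in joint.items()
    )

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"
joint_counts = Counter(zip(X, Y))  # reproduces the joint count table
mi = mutual_information(X, Y)
print(joint_counts[("A", "A")], joint_counts[("T", "T")], round(mi, 3))
```

Pairs that never occur contribute nothing to the sum, so iterating over the observed pairs only is sufficient.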
Algorithm for finding new TFBS
1) select a true TFBS (for example ACATTTACCATAGACAACT) (from a data bank such as IUPAC or TRANSFAC) as a probe;
2) shift the probe over a non-coding zone;
3) evaluate step by step the mutual information I(P,S), where P is the probe and S is the current adjacent string on the sequence;
4) select the positions (and the corresponding adjacent strings) for which I(P,S) > threshold;
5) the strings starting from these positions are candidate TFBS, which need to be validated in vitro.
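Steps 2 to 4 amount to a sliding-window scan. The sketch below is one possible reading, not the author's implementation: `scan`, the toy `sequence` and the way `threshold` is chosen are all hypothetical, and base-2 logarithms are assumed.

```python
from collections import Counter
from math import log2

def mutual_information(x: str, y: str) -> float:
    """I(x, y) from relative frequencies of aligned letter pairs (log base 2 assumed)."""
    n = len(x)
    joint, fx, fy = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * log2((c / n) / ((fx[a] / n) * (fy[b] / n)))
        for (a, b), c in joint.items()
    )

def scan(probe: str, sequence: str, threshold: float):
    """Return (position, window, I) for every window with I(P, S) above the threshold."""
    w = len(probe)
    hits = []
    for pos in range(len(sequence) - w + 1):
        window = sequence[pos:pos + w]
        score = mutual_information(probe, window)
        if score > threshold:
            hits.append((pos, window, score))
    return hits

probe = "ACATTTACCATAGACAACT"
sequence = "GGGG" + probe + "CCCC"                   # hypothetical non-coding zone
threshold = 0.9 * mutual_information(probe, probe)   # hypothetical calibration
hits = scan(probe, sequence, threshold)

# A letter-by-letter relabelling (here the complement) keeps I unchanged.
complement = probe.translate(str.maketrans("ACGT", "TGCA"))
same_information = abs(
    mutual_information(probe, complement) - mutual_information(probe, probe)
) < 1e-9
```

Because mutual information only looks at how letters co-occur, the complement of the probe scores as high as the probe itself, which a Hamming-distance scan would not report.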
CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT   (the same string)
TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC
CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT   (1 error)
TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG
CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT   (2 errors)
TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT
CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC   (5 errors)
CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA
GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT   (C <--> G, C and G exchanged)
TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA
GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT   (C becomes G)
CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT
CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT   (some C <-> G)
ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG
GTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA   (complementary)
GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA
GTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA   (complementary + 1 error)
CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG
GTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA   (complementary + 2 errors)
GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG
GTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG   (complementary + 5 errors)
GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG
CACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT   (1 letter more)
CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG
CACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT   (2 letters more)
GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC
CACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT   (3 letters more)
CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT   (probe)
Example
[Figure: detected values for I(P,S), on an axis from 0.2 to 1, for each planted string: the same string; 1 error; 2 errors; 5 errors; C becomes G; C and G exchanged; 4 C become G and 5 G become C; complementary; complementary + 1 error; complementary + 2 errors; complementary + 5 errors; 1 letter more; 2 letters more; 3 letters more]
Conclusions:
 Use mutual information as a tool to capture strings that are correlated with a true TFBS used as a probe.
 Validate in vitro the candidates so obtained.
 This is more flexible than the use of Hamming or Levenshtein distance, since correlated strings can be very distant from one another.
Drawbacks:
1. The method needs a precise calibration of the threshold.
2. It does not include gaps.
Random Projection Approach to
Motif Finding
daf-19 Binding Sites in C. elegans
che-2     GTTGTCATGGTGAC
daf-19    GTTTCCATGGAAAC
osm-1     GCTACCATGGCAAC
osm-6     GTTACCATAGTAAC
F02D8.3   GTTTCCATGGTAAC
(figure axis: positions -150 to -1)
The (l,d) Planted Motif Problem
 Generate a random length l consensus
sequence C.
 Generate 20 instances, each differing from C
by d random mutations.
 Plant one at a random position in each of
N=20 random sequences of length n=600.
 Can you find the planted instances?
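A small generator makes the problem statement concrete. This is a sketch under stated assumptions: "d random mutations" is read as exactly d substitutions at distinct positions, the planted instance overwrites part of the random background so that each sequence keeps length n, and the names `mutate` and `planted_motif_instance` are hypothetical.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus: str, d: int, rng: random.Random) -> str:
    """Copy of consensus with exactly d positions changed to a different letter."""
    letters = list(consensus)
    for pos in rng.sample(range(len(letters)), d):
        letters[pos] = rng.choice([b for b in ALPHABET if b != letters[pos]])
    return "".join(letters)

def planted_motif_instance(l: int, d: int, N: int = 20, n: int = 600, seed: int = 0):
    """Return (consensus, sequences, planted positions) for an (l, d) problem."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(consensus, d, rng)  # plant one instance
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions

consensus, sequences, positions = planted_motif_instance(l=15, d=4)
```

A motif finder can then be evaluated by checking whether it recovers `positions` without being given `consensus`.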
Planted Motifs
AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACTGGATGACT
GGATATACATGAACACGGTGGGAAAACCCTGAC
 Each instance differs from ACAGGATCA by 2
mutations
 Remaining sequence random
Random Projection Algorithm
 Buhler and Tompa (2001)
 Guiding principle: Some instances of a motif
agree on a subset of positions.
 Use information from multiple motif instances
to construct model.
ATGCGTC
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac... (7,2) motif
x(1)
x(2)
x(5)
x(8)
=M
k-Projections
 Choose k positions in string of length l.
 Concatenate nucleotides at chosen k
positions to form k-tuple.
 In l-dimensional Hamming space, projection
onto k dimensional subspace.
ATGGCATTCAGATTC → TGCTGAT
l = 15, k = 7
P = (2, 4, 5, 7, 11, 12, 13)
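The projection itself is a one-liner. The sketch below uses the 1-based positions from the slide; the function names `project` and `random_projection` are hypothetical.

```python
import random

def project(lmer: str, positions) -> str:
    """Concatenate the letters of lmer found at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l: int, k: int, rng: random.Random):
    """Choose k of the l positions uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT

Q = random_projection(15, 7, random.Random(0))
```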
Random Projection Algorithm
 Choose a projection by
selecting k positions
uniformly at random.
 For each l-tuple in input
sequences, hash into
bucket based on letters
at k selected positions.
 Recover motif from
bucket containing
multiple l-tuples.
Bucket TGCT
TGCACCT
Input sequence x(i):
…TCAATGCACCTAT...
Example
 l = 7 (motif size) , k = 4 (projection size)
 Choose projection (1,2,5,7)
GCTC
...TAGACATCCGACTTGCCTTACTAC...
Buckets
Input Sequence
ATGC
ATCCGAC
GCCTTAC
Hashing and Buckets
 Hash function h(x) obtained from k positions
of projection.
 Buckets are labeled by values of h(x).
 Enriched buckets: contain at least s l-tuples,
for some parameter s.
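Hashing every l-tuple into its bucket and keeping the enriched buckets can be sketched as follows. The names `hash_lmers` and `enriched` are hypothetical, and the four toy sequences are the (7,2) motif instances shown earlier with their flanking letters in upper case.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map h(x) -> list of (sequence index, start, l-mer) for every l-mer x."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)  # h(x), 1-based positions
            buckets[key].append((i, start, lmer))
    return buckets

def enriched(buckets, s):
    """Buckets that contain at least s l-mers."""
    return {key: items for key, items in buckets.items() if len(items) >= s}

projection = (1, 2, 5, 7)
single = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], 7, projection)
toy = hash_lmers(["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"], 7, projection)
toy_enriched = enriched(toy, s=4)
```

With projection (1,2,5,7) the l-mers ATCCGAC and GCCTTAC of the input sequence land in buckets ATGC and GCTC, as on the example slide.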
ATTCCATCGCTCATGC
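Frequency Matrix Model From Bucket
Bucket ATGC:
ATCCGAC
ATGAGGC
ATAAGTC
ATGTGAC

Frequency matrix W (rows: nucleotides, columns: motif positions 1 to 7)

      1      2      3      4      5      6      7
A     1      0      0.25   0.5    0      0.5    0
C     0      0      0.25   0.25   0      0      1
G     0      0      0.5    0      1      0.25   0
T     0      1      0      0.25   0      0.25   0

Frequency matrix W → EM algorithm → Refined matrix W*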
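The frequency matrix W is just a column-wise count of the l-mers in the bucket. A minimal sketch, with the hypothetical name `frequency_matrix`:

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """W[base][j] = fraction of the l-mers that carry `base` at position j (0-based)."""
    n = len(lmers)
    l = len(lmers[0])
    return {
        base: [sum(1 for m in lmers if m[j] == base) / n for j in range(l)]
        for base in alphabet
    }

W = frequency_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
for base in "ACGT":
    print(base, W[base])
```

The k projected positions (1, 2, 5 and 7 here) are fully conserved by construction; the other l-k columns carry the information that the refinement step starts from.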
Motif Refinement
 How do we recover the motif from the
sequences in the enriched buckets?
 k nucleotides are known from hash value of
bucket.
 Use information in other l-k positions as
starting point for local refinement scheme,
e.g. EM or Gibbs sampler
Local refinement algorithm
ATGCGTC
Candidate motif
ATGC
ATCCGAC
ATGAGGC
ATAAGTC
ATGTGAC
Expectation Maximization (EM)
 S = { x(1), …, x(N)} : set of input sequences
 Given:
 W = An initial probabilistic motif model
 P0 = background probability distribution.
 Find value Wmax that maximizes the likelihood ratio:
$$\frac{\Pr(S \mid W_{max})}{\Pr(S \mid P_0)}$$
 EM is a local optimization scheme. It requires a starting value W.
EM Motif Refinement
 For each bucket h containing at least s sequences, form weight matrix W_h
 Use the EM algorithm with starting point W_h to obtain the refined weight matrix model W_h*
 For each input sequence x(i), return the l-tuple y(i) which maximizes the likelihood ratio:
Pr(y(i) | W_h*) / Pr(y(i) | P0).
 T = {y(1), y(2), …, y(N)}
 C(T) = consensus string
What Is the Best Motif?
 Compute score S for each motif:
 Generate W, an initial PSSM from the returned l-mers {y(1), y(2), …, y(N)}
$$\text{Score} = \sum_i \log\frac{P(y(i) \mid W)}{P(y(i) \mid P_0)}$$
 Return motif with maximal score
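The selection of the best l-mer per sequence and the score can be sketched as below. Two points are assumptions rather than part of the slides: a pseudocount is added so that no matrix entry is zero (otherwise the logarithm is undefined), and the background P0 is taken as uniform. All function names are hypothetical.

```python
from math import log

def weight_matrix(lmers, pseudocount=0.5, alphabet="ACGT"):
    """Position-specific frequencies with a pseudocount, so that no entry is zero."""
    total = len(lmers) + pseudocount * len(alphabet)
    return {
        base: [(sum(m[j] == base for m in lmers) + pseudocount) / total
               for j in range(len(lmers[0]))]
        for base in alphabet
    }

def log_ratio(y, W, P0):
    """log( P(y | W) / P(y | P0) ), treating positions as independent."""
    return sum(log(W[base][j] / P0[base]) for j, base in enumerate(y))

def best_lmer(x, W, P0):
    """The l-mer y of sequence x that maximizes the likelihood ratio."""
    l = len(W["A"])
    return max((x[i:i + l] for i in range(len(x) - l + 1)),
               key=lambda y: log_ratio(y, W, P0))

def motif_score(sequences, W, P0):
    """Chosen l-mers y(1..N) and Score = sum of their log likelihood ratios."""
    chosen = [best_lmer(x, W, P0) for x in sequences]
    return chosen, sum(log_ratio(y, W, P0) for y in chosen)

P0 = {base: 0.25 for base in "ACGT"}  # assumed uniform background
W = weight_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
sequences = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
chosen, score = motif_score(sequences, W, P0)
```

Working with the sum of logarithms rather than the product of ratios avoids numerical underflow for long motifs.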
Iterations
 Single iteration.
 Choose a random k-projection.
 Hash each l-mer x in input sequence into bucket
labelled by h(x).
 From each bucket B with at least s sequences, form
weight matrix model, and perform EM/Gibbs sampler
refinement.
 Candidate motif is the best one found from refinement
of all enriched buckets.
 Multiple iterations.
 Repeat process for multiple projections.
Parameter Selection
 Projection size k
 Choose k small so several motif instances
hash to same bucket. (k < l - d)
 Choose k large to avoid contamination by spurious l-mers. E > (N (n - l + 1)) / 4^k
 Bucket threshold s: (s = 3, s = 4)
How Many Iterations?
 Planted bucket : bucket with hash value h(M),
where M is motif.
 Choose m = number of iterations, such that
Pr(planted bucket contains ≥ s sequences in
at least one of m iterations) ≥ 0.95.
 Probability is readily computable since
iterations form a sequence of independent
trials.
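One way to carry out this computation is sketched below. It rests on assumptions that go beyond the slide: every motif instance has exactly d mutated positions, an instance reaches the planted bucket when none of the k projected positions is mutated, and instances behave independently. The function name `iterations_needed` is hypothetical.

```python
from math import ceil, comb, log

def iterations_needed(l, d, k, s, N, confidence=0.95):
    """Smallest m with Pr(planted bucket holds >= s instances in some iteration) >= confidence."""
    # Probability that the k projected positions avoid all d mutated positions.
    p_hit = comb(l - d, k) / comb(l, k)
    # Probability that fewer than s of the N instances hash to the planted bucket.
    p_fail = sum(comb(N, i) * p_hit**i * (1 - p_hit)**(N - i) for i in range(s))
    if p_fail == 0.0:
        return 1
    if p_fail >= 1.0:
        raise ValueError("the planted bucket can never reach s instances with this k")
    # Iterations are independent trials: need p_fail**m <= 1 - confidence.
    return max(1, ceil(log(1 - confidence) / log(p_fail)))

m_easy = iterations_needed(l=15, d=2, k=7, s=4, N=20)
m_hard = iterations_needed(l=15, d=4, k=7, s=4, N=20)
```

More mutations per instance lower the chance of hitting the planted bucket, so the required number of iterations grows with d.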
Composite motifs
detection
Question
Monad detection
Mitra
monad patterns
 Short contiguous strings
 Appear surprisingly many times (in a statistically significant way)
 S =
AGTCAGTCTTGCTAGTCAGTCCGTAATATCCGGATAGAATAATGATC
GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC
AAGATGTACTAGAGTCAGTCACGTAGCTAGTCAGTCATCTATACGAGAG
TCTCGATGTAGTAGCTATCGATCGTAGCTAGAGTCAGTCCGTAGC
AGCTAGTATCGTAGTGAGCAACATGAGTCAGTCCAGTGCATAA
GTCGTCAGCTCATGAGTCAGTCGCATAGTCAGTC
P = AGTC
Introduction
 However, many of the actual regulatory
signals are composite patterns.
 Groups of monad patterns
 Occur relatively near each other
 An example of a composite pattern is a dyad
signal.
 S=ACGTAAATCACGTTGACTAGCTAGCACGAG
CTAGCATAATCACACTTTGACGAGTCGACTGC
ATGCATTGACGCAGTGCATTGCTAGCATGGG
TAATCAAACGTTGGCTAGCTAGCATGCATCTG
AGCATGCTAGCTACGTACTAGCGCGATAGTC
TACTACAAATCACCCATTGCGAGCTACGTAG
CTAGCTAGCTAGCTAGCTAGTGATGCATGCTA
GAATCCGATCTTGCGATCGAT
Composite Pattern
CP = AATCxxxxTTG
Introduction
 A possible approach is to find each part of the pattern separately and then reconstruct the composite pattern.
 However, such approaches often fail to output composite regulatory patterns consisting of weak monad parts.
Introduction
 A better approach would be to detect both parts of a
composite pattern at the same time.
 Two steps in the proposed algorithm:
 Preprocessing the sample creates a set of ‘virtual’ monads.
 Apply an exhaustive monad discovery algorithm to the ‘virtual’ monad problem.
 By preprocessing, the original problem can be transformed into a larger monad discovery problem.
Monad Pattern Discovery
 Canonical pattern lmer
 A contiguous string of length l
 (l, d)-neighbourhood of an lmer P
 All possible lmers with up to d mismatches as compared to P
 The number of such lmers is:
$$\sum_{i=0}^{d} \binom{l}{i}\, 3^i$$
[Figure: example 3mer A C A]
 (l,d)-k patterns
 Given a sequence S, find all lmers that occur with up to d mismatches at least k times in the sample
 A variant: the sample is split into several sequences, and the task is to find all lmers that occur with up to d mismatches in at least k sequences
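The size of the (l, d)-neighbourhood, the sum over i from 0 to d of C(l, i) 3^i, can be checked against an explicit enumeration. A minimal sketch with hypothetical function names:

```python
from itertools import combinations, product
from math import comb

def neighbourhood_size(l: int, d: int) -> int:
    """Number of l-mers within d mismatches of a given l-mer."""
    return sum(comb(l, i) * 3**i for i in range(d + 1))

def neighbourhood(lmer: str, d: int, alphabet: str = "ACGT"):
    """All l-mers with at most d mismatches compared to lmer."""
    result = {lmer}
    for i in range(1, d + 1):
        for positions in combinations(range(len(lmer)), i):
            # At each chosen position, try the three other letters.
            choices = [[b for b in alphabet if b != lmer[p]] for p in positions]
            for replacement in product(*choices):
                letters = list(lmer)
                for p, b in zip(positions, replacement):
                    letters[p] = b
                result.add("".join(letters))
    return result

print(neighbourhood_size(3, 1), sorted(neighbourhood("ACA", 1)))
```

For the 3mer ACA and d = 1 this gives 1 + 3·3 = 10 lmers.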
Pattern Driven Approach(PDA)
 (Pevzner, 2000)
 Examines all 4^l patterns of fixed length l in lexical order, compares each pattern to every lmer in the sample, and returns all (l, d)-k patterns
 (Waterman et al., 1984 and Galas et al., 1985)
 Bypasses the excessive time requirement
 Most of the 4^l patterns are not worth examining, since neither these patterns nor their neighbours appear in the sample
 SDA was therefore designed to explore only the lmers appearing in the sample and their neighbours.
Sample Driven Approach(SDA)
 First initializes a table of size 4^l
 Each table entry corresponds to a pattern
 For each lmer in the sample, SDA generates its (l, d)-neighbourhood
 The table entry of each neighbour is incremented by a certain amount
 After all lmers are processed, SDA returns all patterns whose table entries have scores exceeding the threshold
[Figure: table with 4^l entries, e.g. AAAAA 3, AAAAC 1, AAACC 2, …]
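A compact sketch of the sample-driven idea follows. It departs from the slide in two labelled ways: a dictionary stands in for the full 4^l table (only touched entries are stored), and each neighbour is incremented by 1, which is one possible choice for the "certain amount". The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product

def neighbourhood(lmer, d, alphabet="ACGT"):
    """All l-mers with at most d mismatches compared to lmer."""
    result = {lmer}
    for i in range(1, d + 1):
        for positions in combinations(range(len(lmer)), i):
            choices = [[b for b in alphabet if b != lmer[p]] for p in positions]
            for replacement in product(*choices):
                letters = list(lmer)
                for p, b in zip(positions, replacement):
                    letters[p] = b
                result.add("".join(letters))
    return result

def sample_driven(sample, l, d, k):
    """Patterns whose table entry reaches k after processing every l-mer of the sample."""
    table = defaultdict(int)  # sparse stand-in for the 4^l table
    for start in range(len(sample) - l + 1):
        for pattern in neighbourhood(sample[start:start + l], d):
            table[pattern] += 1  # assumed increment of 1 per occurrence
    return {pattern for pattern, count in table.items() if count >= k}

found = sample_driven("AGTATCAGTT", l=4, d=1, k=2)
print(sorted(found))
```

Every entry stays in memory until the last l-mer has been processed, which is the memory cost the slides attribute to this approach.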
Sample Driven Approach(SDA)
 Faster, but still requires a large 4^l table
 Not practical for long patterns in the mid-1980s
 Not mainstream, and no tool available
 (Today gigabytes of RAM are available, so l can be increased even without a memory-efficient algorithm)
SDA Iterations
 First, explore all neighbours of the first lmer from the sample.
 Second, explore all neighbours of the second lmer
 If an lmer P belongs to the neighbourhoods of the lmers appearing at positions i1, …, ik in the sample → info about P is collected at iterations i1, …, ik.
 So the Waterman approach updates the info about P k times → the memory slot for P is occupied the whole time, even if P is not an “interesting” lmer
 Most of the lmers explored are not interesting, so they waste memory slots
To improve SDA
 Better solution:
 Collect info about all P at the same time
 to remove the need to keep the info in memory
 but require a new approach to navigate the space of
all lmers
 MITRA runs faster than PDA and SDA, and uses
only a fraction of the memory of the SDA
Pattern-finding vs. profile-based
 Is the profile-based approach more biologically relevant for finding motifs in biological samples?
 This is probably the reason the Waterman algorithm has not been popular in the last decade
 Sagot and colleagues were the first to rebut this opinion
 Develop an efficient version of Waterman’s
Pattern-based vs. profile-based
 Similarities
 A pattern-based approach generates the profile
 Every profile of length l corresponds to a pattern of length l formed by the most frequent nucleotides in every position.
 Pattern-driven approaches are at least as good as profile-based ones
 Even better on simulated samples with implanted patterns
 Though the profile-implantation model is somewhat limited
 Today there is little evidence that profile-based approaches perform any better on either biological or simulated samples
Mitra
Mismatch Tree Algorithm
Mismatch Tree Algorithm
(MITRA)
 MITRA uses a mismatch tree data structure to
split the space of all possible patterns into disjoint
subspaces that start with a given prefix.
 This reduces pattern discovery to smaller sub-problems.
 MITRA also takes advantage of pair-wise
similarity between instances.
Splitting Pattern Space
 A pattern is called weak if it has less than k
( l ,d )-neighbours in the sample.
 A subspace is called weak if all patterns in this
subspace are weak.
Splitting pattern space
 A pattern is called weak if it has less than k
( l ,d )-neighbours in the sample.
l = 3 …. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
d =1 ; k =2
( l ,d )-neighbours in the sample = { GTA, ATC, GTT }
Sequence = AGTATCAGTT
P= GTC
Not weak
Splitting pattern space
 A pattern is called weak if it has less than k
( l ,d )-neighbours in the sample.
l = 3 …. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
d =1 ; k =2
( l ,d )-neighbours in the sample = { CAG }
Sequence = AGTATCAGTT
P= CAG
weak
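The two examples (P = GTC is not weak, P = CAG is weak) can be reproduced with a direct count. A minimal sketch with hypothetical function names:

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def neighbours_in_sample(pattern: str, sequence: str, d: int):
    """The l-mers of the sequence that are within d mismatches of the pattern."""
    l = len(pattern)
    lmers = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    return [m for m in lmers if hamming(pattern, m) <= d]

def is_weak(pattern: str, sequence: str, d: int, k: int) -> bool:
    """A pattern is weak if it has fewer than k (l, d)-neighbours in the sample."""
    return len(neighbours_in_sample(pattern, sequence, d)) < k

sequence = "AGTATCAGTT"
print(neighbours_in_sample("GTC", sequence, 1), is_weak("GTC", sequence, 1, 2))
print(neighbours_in_sample("CAG", sequence, 1), is_weak("CAG", sequence, 1, 2))
```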
Splitting pattern space
 A subspace is called weak if all patterns in this
subspace are weak.
•subspaceA = { AAA, AAT, AAC, AAG ………..AGG }
•subspaceT = { TAA, TAT, TAC, TAG ………..TGG }
•subspaceC = { CAA, CAT, CAC, CAG ………..CGG }
•subspaceGG = { GGA, GGT, GGC, GGG}
Sequence = AGTATCAGTT
Question
 Input:
 S, l, d, k
 Output:
 All l mers that occur with up to d mismatches
at least k times in the sample.
Solution
 Naïve:
 Test every l-mer in the space
 If it occurs with up to d mismatches at least k times in the sample, then output this l-mer.
space = { AAA, AAT, AAC, AAG ………..AGG
TAA, TAT, TAC, TAG ………..TGG
CAA, CAT, CAC, CAG ………..CGG
GAA, GAT, GAC, GAG ………..GGG }
sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
Splitting pattern space
 if we are looking for patterns of length l we would first
split the space of all l mers into 4 disjoint subspaces.
 Subspace of all l mers starting with A,
 Subspace of all l mers starting with T,
 Subspace of all l mers starting with C,
 Subspace of all l mers starting with G,
Splitting pattern space
 if we are looking for patterns of length l we would first
split the space of all l mers into 4 disjoint subspaces.
Space:
A*
SubspaceA
T* C* G*
Splitting pattern space
 we further determine whether the subspace
contains a ( l ,d )-k pattern.
Space:
A*
Can’t rule out
Splitting pattern space
 we further determine whether the subspace
contains a ( l ,d )-k pattern.
Space:
AT*
AA* AC*
AG*
Can rule out
Splitting pattern space
 we further determine whether the subspace
contains a ( l ,d )-k pattern.
Space:
Can’t rule out
Splitting pattern space
 we further determine whether the subspace
contains a ( l ,d )-k pattern.
 If we can rule out that this subspace contains such a pattern
 we stop searching in this subspace;
 release the memory slot;
 If we can’t rule out that this subspace contains such a pattern
 we split this subspace again on the next symbol;
 and repeat;
Mismatch tree data structure
 A mismatch tree is a rooted tree where each internal
node has 4 branches labeled with a symbol in
{A,C,T,G}
 The maximum depth of the tree is l.
 Each node in the mismatch tree corresponds to the
subspace of patterns P with a fixed prefix.
 Each node contains pointers to all l-mer instances from the sample that are within d mismatches of a pattern P.
Mismatch tree data structure
 MITRA starts by examining the root node of the mismatch tree, which corresponds to the space of all patterns.
 When examining a node, MITRA tries to prove that it corresponds to a weak subspace.
 If we can’t prove it:
 we expand the node’s children and examine each of them.
 Whenever we reach a node corresponding to a weak subspace, we backtrack.
 The intuition is that many of the nodes correspond to weak subspaces and can be ruled out.
 This allows us to avoid searching much of the pattern space.
Mismatch tree data structure
 If we reach depth l and the number of instances is not less than k:
 the l-mer corresponding to the path from the root to the leaf is an (l,d)-k pattern;
 the pointers from this node correspond to the instances of this pattern.
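The traversal can be sketched as a depth-first recursion in which each node keeps only the l-mers that are still within d mismatches of its prefix. This is a simplified illustration of the mismatch tree (no graph-based pruning), and the names `mitra` and `expand` are hypothetical.

```python
def mitra(sequence: str, l: int, d: int, k: int, alphabet: str = "ACGT"):
    """All l-mers that occur with up to d mismatches at least k times in the sequence."""
    lmers = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    patterns = set()

    def expand(prefix, alive):
        # alive: (l-mer, mismatches against prefix) pairs still within d mismatches
        if len(alive) < k:
            return                      # weak subspace: backtrack
        depth = len(prefix)
        if depth == l:
            patterns.add(prefix)        # leaf with at least k instances
            return
        for symbol in alphabet:
            child = []
            for lmer, mismatches in alive:
                mismatches += lmer[depth] != symbol
                if mismatches <= d:
                    child.append((lmer, mismatches))
            expand(prefix + symbol, child)

    expand("", [(m, 0) for m in lmers])
    return patterns

print(sorted(mitra("AGTATCAGTT", l=4, d=1, k=2)))
```

Only the nodes on the current root-to-leaf path are alive at any time, so memory stays proportional to the depth of the tree rather than to 4^l.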
Example
 Consider a very simple example: finding the patterns of length 4 that occur with up to 1 mismatch at least 2 times in the sample S = “AGTATCAGTT”.
 The substrings (4mers) in S are
{ AGTA, GTAT, TATC, ATCA, TCAG,
CAGT, AGTT }
Not for exam
[Figure: mismatch tree for S = “AGTATCAGTT” with l = 4 and d = 1. Each node lists the 4-mers that are still within 1 mismatch of the node's prefix, together with their mismatch counts; k is the number of such 4-mers.]
[Branch A → A: node A has k = 7 (mismatch counts 0 1 1 0 1 1 0); node AA has k = 5 (counts 1 2 1 1 2 1 1); its children AAA, AAC, AAG, AAT have k = 0, 1, 1, 3; the children of AAT (AATA, AATC, AATG, AATT) have k = 1, 1, 0, 1, so nothing is output in this branch.]
[Branch A → G → T: node A has k = 7; node AG has k = 3 (counts 0 2 2 1 2 2 0); node AGT has k = 2 (counts 0 2 0); the leaves AGTA, AGTC, AGTG, AGTT each have k = 2. Output: AGTA, AGTC, AGTG, AGTT.]
Overall complexity
 Space = O(l^2 × |S|)
 Time = O(4^l × |S|)
 Number of nodes = O(4^l)
 Number of comparisons in each node = O(|S|)
[Figure: a root-to-leaf path of l nodes; each node holds the O(|S|) l-mers of the sample, each of length O(l), with their mismatch counts]
Take a Closer Look
 In the mismatch tree algorithm, we cannot start ruling out a node until we traverse to depth d + 1.
[Figure: the nodes A (k = 7) and AA (k = 5) of the example tree, with the mismatch counts of the seven 4-mers]
MITRA Graph
 Information about pairwise similarities between instances of the pattern can significantly speed up the sample-driven approach.
 The graph that is constructed to model this pairwise similarity is called the MITRA-Graph.
MITRA Graph
 Given a pattern P and sample S we can construct
a graph G(P, S) where each vertex is an lmer in
the sample and there is an edge connecting two
lmers if P is within d mismatches from both
lmers.
[Figure: S = TAACA, P = TAC; vertices TAA (1 mismatch from P), AAC (1 mismatch from P) and ACA (3 mismatches from P)]
MITRA Graph
 For an (l,d) – k pattern P the corresponding graph
contains a clique of size k.
[Figure: S = TAACA, P = AAA; vertices TAA, AAC and ACA, each 1 mismatch from P]
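The graph G(P, S) for the toy sample can be built directly from the definition. In this sketch vertices are (position, lmer) pairs so that repeated lmers stay distinct; that representation and the name `pattern_graph` are assumptions made for illustration.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def pattern_graph(pattern: str, sample: str, d: int):
    """Vertices: every l-mer of the sample. Edge: both l-mers are within d mismatches of P."""
    l = len(pattern)
    vertices = [(i, sample[i:i + l]) for i in range(len(sample) - l + 1)]
    close = [v for v in vertices if hamming(pattern, v[1]) <= d]
    edges = set(combinations(close, 2))
    return vertices, edges

_, edges_tac = pattern_graph("TAC", "TAACA", d=1)
_, edges_aaa = pattern_graph("AAA", "TAACA", d=1)
print(edges_tac)
print(len(edges_aaa))
```

For P = TAC only TAA and AAC are joined; for P = AAA all three lmers are within one mismatch, so the graph is a clique of size 3.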
MITRA Graph
 Given a set of patterns 𝒫 and a sample S, define a graph G(𝒫, S) whose edge set is the union of the edge sets of the graphs G(P, S) for P ∈ 𝒫.
 Each vertex of G(𝒫, S) is an lmer in the sample, and there is an edge connecting two lmers if there is a pattern P ∈ 𝒫 that is within d mismatches from both lmers.
 If for a subspace of patterns we can rule out the existence of a clique of size k, then the subspace contains no (l,d)-k pattern.
The WINNOWER Algorithm
 The WINNOWER algorithm by Pevzner and Sze (2000) constructs the following graph: each lmer in the sample is a vertex, and an edge connects two vertices if the corresponding lmers have at most 2d mismatches.
 Instances of an (l,d)-k pattern form a clique of size k in this graph.
The WINNOWER Algorithm (cont’d)
 Since cliques are difficult to find, WINNOWER takes the approach of trying to remove edges that do not correspond to a clique.
[Figure: example graph, k = 4]
Improvements by MITRA-Graph
1. Construct a graph at each node in the mismatch tree.
2. Remove edges which are not part of a clique.
3. If no potential clique remains, rule out the subspace corresponding to the node and backtrack.
4. If we cannot rule out a clique, split the subspace of patterns and examine the child nodes.
[Figure: the graph at node A of the example tree, where the seven 4-mers have mismatch counts 0 1 1 0 1 1 0]
MISMATCH TREE ALGORITHM —
Improvements over WINNOWER
 At each node of the tree, we remove edges
by computing the degree of each vertex.
 If the degree of the vertex is less than k-1,
we can remove all edges incident to it since
we know it is not part of a clique.
 We repeat this procedure until we cannot
remove any more edges.
 If the number of edges remaining is less than
the minimum number of edges in a clique,
we can rule out the existence of a clique and
backtrack.
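The degree-based pruning described above can be sketched as follows. Edges are given as pairs of hashable vertex labels; the names `prune` and `may_contain_clique` are hypothetical, and k(k-1)/2 is used as the minimum number of edges of a clique of size k.

```python
def prune(edges, k):
    """Repeatedly remove all edges incident to vertices of degree below k - 1."""
    edges = {frozenset(e) for e in edges}
    while True:
        degree = {}
        for e in edges:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        low = {v for v, deg in degree.items() if deg < k - 1}
        if not low:
            return edges              # nothing left to remove
        edges = {e for e in edges if not (e & low)}

def may_contain_clique(edges, k):
    """False means a clique of size k is ruled out; True means it cannot be ruled out."""
    return len(prune(edges, k)) >= k * (k - 1) // 2

triangle_with_tail = [(1, 2), (2, 3), (1, 3), (3, 4)]
path = [(1, 2), (2, 3), (3, 4)]
print(may_contain_clique(triangle_with_tail, 3), may_contain_clique(path, 3))
```

The test is necessary but not sufficient: passing it does not prove that a clique exists, it only means the node must be split further.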
MISMATCH TREE ALGORITHM —
Improvements over WINNOWER
 The problem with this approach is how to
efficiently construct the graph at each node in
the mismatch tree.
 Instead of constructing the graph from scratch,
we construct it based on the graph at the
parent node
 Consider an edge connecting two l-mers:
 the first l-mer matches the prefix of the pattern subspace with d1 mismatches;
 the second l-mer matches it with d2 mismatches.
MISMATCH TREE ALGORITHM —
Improvements over WINNOWER
 Denote by m the number of mismatches between the tails of the first and the second l-mers.
 The edge between these two l-mers exists in the pattern subspace if and only if d1 <= d, d2 <= d and d1+d2+m <= 2d.
[Figure: the first l-mer and the second l-mer aligned against the prefix of the pattern subspace]
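The edge condition translates into a short predicate. In this sketch the "tail" of an l-mer is taken to be everything after the first len(prefix) letters, and the function name `edge_in_subspace` is hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def edge_in_subspace(prefix: str, first: str, second: str, d: int) -> bool:
    """True iff d1 <= d, d2 <= d and d1 + d2 + m <= 2d for this pattern prefix."""
    p = len(prefix)
    d1 = hamming(prefix, first[:p])     # mismatches of the first l-mer against the prefix
    d2 = hamming(prefix, second[:p])    # mismatches of the second l-mer against the prefix
    m = hamming(first[p:], second[p:])  # mismatches between the two tails
    return d1 <= d and d2 <= d and d1 + d2 + m <= 2 * d

# 4-mers AGTA and ATCA from S = "AGTATCAGTT", with d = 1
print(edge_in_subspace("", "AGTA", "ATCA", 1))    # root: only m <= 2d is required
print(edge_in_subspace("AG", "AGTA", "ATCA", 1))  # AGCA is within 1 mismatch of both
print(edge_in_subspace("AA", "AGTA", "ATCA", 1))  # d1 + d2 + m = 3 exceeds 2d
```

With an empty prefix the condition reduces to m <= 2d, and it can only become stricter as the prefix grows.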
MISMATCH TREE ALGORITHM —
Improvements over WINNOWER (cont’d)
 In the root node, since d1 = d2 = 0, an edge exists only if m <= 2d, which gives the same graph as WINNOWER.
 Moving down the tree, the condition becomes much stronger than in WINNOWER.
 We can compute the edges of a node based on the edges of the node’s parent by keeping track of the quantities d1, d2, and m for each edge.
MISMATCH TREE ALGORITHM —
Improvements over WINNOWER
 To summarize, the MITRA-Graph algorithm works as
follows
 We first compute the set of edges at the root node by
performing pairwise comparisons between all l mers
due to d1 = d2 = 0.
 We traverse the tree in a depth first order, passing on
the valid edges and keeping track of the quantities d1,
d2, and m for each of them.
 At each node, we prune the graph by eliminating any
edges incident to vertices that have degrees of less
than k-1.
 If there are less than the minimum number of edges for
a clique, we backtrack.
 If we reach a leaf of the tree (depth l), then we output
the corresponding pattern.
Discovering dyad signals
DISCOVERING DYAD SIGNALS
 For dyad signals, we are interested in
discovering two monads that occur a certain
length apart
 We use the notation (l1-(s1,s2)-l2,d)-k pattern to
denote a dyad signal
[Figure: three dyads, each made of a part of length l1 and a part of length l2 separated by a spacer s]
DISCOVERING DYAD SIGNALS
 The MITRA-Dyad algorithm casts the dyad
discovery problem into a monad discovery
problem by preprocessing the input and
creating a “virtual” sample to solve the
(l1+l2,d)-k monad pattern discovery problem in
this sample
 For each l1-mer in the sample and for each s in [s1,s2], we create an (l1+l2)-mer, which is the l1-mer concatenated with the l2-mer upstream s nucleotides of the l1-mer.
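The preprocessing step can be sketched as below. One assumption is made about orientation, since the slide does not pin it down: the l2-mer is taken to start s nucleotides after the end of the l1-mer in reading order, as in the composite pattern AATCxxxxTTG. The function name `virtual_sample` is hypothetical.

```python
def virtual_sample(sequence: str, l1: int, l2: int, s1: int, s2: int):
    """(start, s, virtual (l1+l2)-mer) for every l1-mer and every spacer s in [s1, s2]."""
    virtual = []
    for start in range(len(sequence) - l1 + 1):
        first = sequence[start:start + l1]
        for s in range(s1, s2 + 1):
            second_start = start + l1 + s          # assumed position of the l2-mer
            second = sequence[second_start:second_start + l2]
            if len(second) == l2:                  # skip spacers that run off the end
                virtual.append((start, s, first + second))
    return virtual

print(virtual_sample("AATCGGGGTTG", l1=4, l2=3, s1=4, s2=4))
```

Keeping `start` and `s` alongside each virtual l-mer is what allows a monad found in the virtual sample to be mapped back to a dyad in the original one.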
DISCOVERING DYAD SIGNALS
 The number of elements in the “virtual” sample will be approximately (s2-s1+1) times larger.
 An (l1+l2,d)-k pattern in the “virtual” sample will
correspond to a (l1-(s1,s2)-l2,d)-k pattern in the
original sample, and we can easily map the
solution from the monad problem to the dyad
one.
 An important feature of MITRA-Dyad is an
ability to search for long patterns.
DISCOVERING DYAD SIGNALS
 If the range s2-s1+1 of acceptable distances between monad parts in a composite pattern is large, the MITRA-Dyad algorithm becomes inefficient
 A simple approach to detect these patterns is
to generate a long ranked list of candidate
monad patterns using MITRA.
 Then check each occurrence of each pair from
the list to see if they occur within the
acceptable distance.

Weitere ähnliche Inhalte

Was ist angesagt?

Magnet Design - Hollow Cylindrical Conductor
Magnet Design - Hollow Cylindrical ConductorMagnet Design - Hollow Cylindrical Conductor
Magnet Design - Hollow Cylindrical ConductorPei-Che Chang
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
"Exact and approximate algorithms for resultant polytopes."
"Exact and approximate algorithms for resultant polytopes." "Exact and approximate algorithms for resultant polytopes."
"Exact and approximate algorithms for resultant polytopes." Vissarion Fisikopoulos
 
Presentacion limac-unc
Presentacion limac-uncPresentacion limac-unc
Presentacion limac-uncPucheta Julian
 
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014
Pythran: Static compiler for high performance by Mehdi Amini PyData SV 2014PyData
 
Brief Introduction About Topological Interference Management (TIM)
Brief Introduction About Topological Interference Management (TIM)Brief Introduction About Topological Interference Management (TIM)
Brief Introduction About Topological Interference Management (TIM)Pei-Che Chang
 
Class 28: Entropy
Class 28: EntropyClass 28: Entropy
Class 28: EntropyDavid Evans
 
Pairwise testing sagar_hadawale
Pairwise  testing sagar_hadawalePairwise  testing sagar_hadawale
Pairwise testing sagar_hadawaleSagar Hadawale
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMijfls
 

Was ist angesagt? (20)

DSP 08 _ Sheet Eight
DSP 08 _ Sheet EightDSP 08 _ Sheet Eight
DSP 08 _ Sheet Eight
 
Magnet Design - Hollow Cylindrical Conductor
Magnet Design - Hollow Cylindrical ConductorMagnet Design - Hollow Cylindrical Conductor
Magnet Design - Hollow Cylindrical Conductor
 
6th Semester Electronic and Communication Engineering (2013-December) Questio...
6th Semester Electronic and Communication Engineering (2013-December) Questio...6th Semester Electronic and Communication Engineering (2013-December) Questio...
6th Semester Electronic and Communication Engineering (2013-December) Questio...
 
Finite frequency H∞ control for wind turbine systems in T-S form
Finite frequency H∞ control for wind turbine systems in T-S formFinite frequency H∞ control for wind turbine systems in T-S form
Finite frequency H∞ control for wind turbine systems in T-S form
 
Matched filter
Matched filterMatched filter
Matched filter
 
SPSF03 - Numerical Integrations
SPSF03 - Numerical IntegrationsSPSF03 - Numerical Integrations
SPSF03 - Numerical Integrations
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
"Exact and approximate algorithms for resultant polytopes."
"Exact and approximate algorithms for resultant polytopes." "Exact and approximate algorithms for resultant polytopes."
"Exact and approximate algorithms for resultant polytopes."
 
RNA synthesis

  • 1. Detection of Genetic Motifs (Bioinfo talk 11). Outline: Promoters (biology); Information theory; Random Projections; Composed motif detection
  • 3. [Figure: structure of a DNA sequence. Labels: gene, junk DNA, UTR-5', UTR-3', exon (e1 to e5), intron, promoter module, TFBS, TSS, TATA box, INR, DSE, distal promoter, proximal promoter, core promoter.] INR = Initiator Region; DSE = DownStream region; TSS = Transcription Start Site
  • 4. TFBS: Transcription Factor Binding Site
      • Short strings (12 to 20 nucleotides long)
      • Spread over up to 5 kb before the TSS
      • The string structure selects the protein that will bind, on the basis of Van der Waals interactions
      • Example of a TFBS: ACCGATTATCA
      • [Figure labels: Van der Waals interactions; protein that is going to bind]
  • 5. Assembly of the promoter protein complex of transcription, 1st stage. [Figure labels: transcription factors (TF): TFIID, TBP; transcription factor binding sites (TFBS): TATA box, INR; TSS; DNA.]
  • 6. Assembly of the promoter protein complex of transcription, 2nd stage. [Figure labels: TFIID, TBP, TATA, INR, TSS, core promoter, DNA.]
  • 7. Assembly of the promoter protein complex of transcription. [Figure labels: TFIID, TBP, TFIIA, TFIIB, TFIIE, TFIIH, RNA Poly II, TATA, INR, TSS, core promoter, proximal promoter, distal promoter/enhancer, TF1 to TF5, DNA looping, DNA.]
  • 9. [Figure: the set of all TFBS (for a certain class of genes, organism or other), divided into known and unknown. TFBS's with the same colour are correlated.]
  • 10. Example. [Figure: two strings, A T G C T C and A T C C T G, each shown with a protein of the promoter complex.]
  • 11. Entropy
      • Given a probability distribution, we want a function representing the quantity of information stored in the distribution.
      • We define the entropy (H) as:
        $H = -\sum_i p(i)\,\log(p(i))$  or  $H = -\int p(x)\,\log(p(x))\,dx$
      • For the sake of simplicity, we will use the discrete definition from now on.
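As a small illustration of the discrete definition, the sketch below evaluates H for two distributions over the four nucleotides. The function name `entropy` and the use of base-2 logarithms (bits) are assumptions; the slide does not fix the base of the logarithm.

```python
import math

def entropy(p):
    """Discrete entropy H = -sum_i p(i) * log2(p(i)), in bits.

    Terms with p(i) == 0 are skipped (0 * log 0 is taken as 0).
    Base 2 is an assumption; the slide leaves the base unspecified.
    """
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # A, C, G, T equally likely
peaked = [0.97, 0.01, 0.01, 0.01]    # almost always the same letter

print(entropy(uniform))  # 2.0 bits, the maximum for four symbols
print(entropy(peaked))   # much smaller: little uncertainty is left
```

A flat distribution carries the most uncertainty, while a distribution concentrated on one letter has entropy close to zero.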
  • 12. Observed entropy
      • The real distribution is usually unknown, but we can replace it by the observed distribution f(x). The resulting entropy is:
        $H(x) = -\sum_x f(x)\,\log(f(x))$
      • For a multi-dimensional probability distribution it is:
        $H(x,y) = -\sum_{x,y} f(x,y)\,\log(f(x,y)) = -\sum_{x,y} f(x)\,f(y|x)\,\log(f(x,y))$
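The observed entropies can be computed directly from letter counts. The following is a minimal sketch, assuming base-2 logarithms and position-wise pairing of two equal-length strings; the function names are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def observed_entropy(s):
    """H(x) = -sum_x f(x) log2 f(x), with f(x) the relative frequency of x in s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def observed_joint_entropy(xs, ys):
    """H(x,y) from the relative frequencies of the pairs (xs[i], ys[i])."""
    if len(xs) != len(ys):
        raise ValueError("strings must have equal length")
    n = len(xs)
    pairs = Counter(zip(xs, ys))
    return -sum((c / n) * math.log2(c / n) for c in pairs.values())

print(observed_entropy("ACGTACGT"))             # 2.0: four letters, equally frequent
print(observed_joint_entropy("AACC", "AACC"))   # 1.0: only the pairs (A,A) and (C,C) occur
```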
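  • 13. Mutual Information
      • X and Y are strings of equal length; S = {A, C, G, T}; x and y belong to S
      • f(x,y) is the relative joint frequency of x, y in X and Y
      • f(x) is the relative frequency of x in X
      • f(y) is the relative frequency of y in Y
        $I(x,y) = -\sum_{x,y} f(x,y)\,\log\left(\frac{f(x,y)}{f(x)\,f(y)}\right)$
        $\quad = -\sum_{x,y} f(x,y)\,[\log(f(x,y)) - \log(f(x)) - \log(f(y))]$
        $\quad = H(x,y) - H(x) - H(y)$
      • Note: with this sign convention I(x,y) ≤ 0; it is the negative of the usual mutual information H(x) + H(y) - H(x,y), which is ≥ 0.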
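The chain of equalities can be checked numerically. The sketch below is written under two assumptions: base-2 logarithms, and the conventional non-negative sign, I = H(x) + H(y) - H(x,y), which is the negative of the expression as written on the slide. X and Y are the two strings of the calculation example in this talk; the function names are illustrative.

```python
import math
from collections import Counter

def H(counts, n):
    """Entropy in bits of a table of counts summing to n."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """Conventional I = sum f(x,y) log2( f(x,y) / (f(x) f(y)) ), always >= 0."""
    n = len(xs)
    fx, fy, fxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((fx[x] / n) * (fy[y] / n)))
        for (x, y), c in fxy.items()
    )

def mutual_information_from_entropies(xs, ys):
    """Same quantity through the identity I = H(x) + H(y) - H(x,y)."""
    n = len(xs)
    return H(Counter(xs), n) + H(Counter(ys), n) - H(Counter(zip(xs, ys)), n)

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"
print(mutual_information(X, Y))
print(mutual_information_from_entropies(X, Y))  # agrees up to rounding
```

Both routes give the same number, which confirms that the sum over pairs and the entropy form are the same quantity.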
  • 14. Information divergence (not for exam)
      • Given two distributions P and Q:
        $D(P,Q) = \sum_x p(x)\,\log\left(\frac{p(x)}{q(x)}\right) = \sum_x p(x)\,\log(p(x)) - \sum_x p(x)\,\log(q(x))$
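A short sketch of the divergence for two distributions over the nucleotides follows. The distributions P and Q are made-up illustrative values, and base-2 logarithms are an assumption.

```python
import math

def divergence(p, q):
    """D(P,Q) = sum_x p(x) * log2(p(x) / q(x)).

    p and q are dicts mapping symbols to probabilities. Terms with
    p(x) == 0 contribute 0; q(x) must be > 0 wherever p(x) > 0.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Illustrative distributions (assumed values, not from the slides).
P = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
Q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(divergence(P, P))  # 0.0: a distribution does not diverge from itself
print(divergence(P, Q))  # positive
print(divergence(Q, P))  # positive, but in general a different value
```

The last two lines show that D(P,Q) is not symmetric in its arguments, so it is not a distance in the usual sense.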
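  • 15. Example of calculation
      X = ACATTTACCATAGACAACTA
      Y = ACTTTTACGATGGAAACCTG
      Joint counts of (x, y) at the same position (rows: x from X; columns: y from Y), with marginal counts:

                     y=A  y=C  y=G  y=T | count of x
        x=A           5    1    2    1  |  9
        x=C           1    3    1    0  |  5
        x=G           0    0    1    0  |  1
        x=T           0    0    0    5  |  5
        count of y    6    4    4    6  | 20

      Divide by 20 to obtain the relative frequencies f(x,y), f(x) and f(y).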
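The count table of this example can be reproduced in a few lines. This is only a sketch: the variable names are illustrative, and the two strings are read off the slide with the spacing removed.

```python
from collections import Counter

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"
ALPHABET = "ACGT"

joint = Counter(zip(X, Y))   # counts of (x, y) at the same position
count_x = Counter(X)         # row totals
count_y = Counter(Y)         # column totals
n = len(X)

# Print the count table, one row per letter x of X.
print("    " + "  ".join(ALPHABET))
for x in ALPHABET:
    row = [joint[(x, y)] for y in ALPHABET]
    print(x, "  ", "  ".join(str(c) for c in row), " |", count_x[x])
print("column totals:", [count_y[y] for y in ALPHABET])

# Relative frequencies: divide every count by n = 20.
f_xy = {pair: c / n for pair, c in joint.items()}
print(f_xy[("A", "A")])  # 5 / 20 = 0.25
```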
  • 16. Algorithm for finding new TFBS
      1) Select a true TFBS (for example ACATTTACCATAGACAACT), from a data bank such as IUPAC or TRANSFAC, as a probe;
      2) shift the probe over a non-coding zone;
      3) evaluate step by step the mutual information I(P,S), where P is the probe and S is the string of the sequence currently aligned with the probe;
      4) select the positions (and the corresponding aligned strings) for which I(P,S) > threshold;
      5) the strings starting from these positions are candidate TFBS, which need to be validated in vitro.
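Steps 2 to 4 can be sketched as a sliding-window scan. This is a minimal illustration, not the authors' implementation: it assumes base-2 logarithms, the conventional non-negative sign for I(P,S), a step of one nucleotide, an arbitrary filler sequence, and an ad hoc threshold. The names `scan`, `flank` and `sequence` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Conventional (non-negative) mutual information in bits of two
    equal-length strings, from position-wise pair frequencies."""
    n = len(xs)
    fx, fy, fxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (fx[x] * fy[y]))
        for (x, y), c in fxy.items()
    )

def scan(probe, sequence, threshold):
    """Slide the probe over the sequence one position at a time and
    return (position, window, I) wherever I(P,S) > threshold."""
    w = len(probe)
    hits = []
    for pos in range(len(sequence) - w + 1):
        window = sequence[pos:pos + w]
        score = mutual_information(probe, window)
        if score > threshold:
            hits.append((pos, window, score))
    return hits

probe = "ACATTTACCATAGACAACT"          # example probe from step 1
flank = "GGCGCTAACGAATACTTCAAGGCC"     # arbitrary filler (assumption)
sequence = flank + probe + flank

# Threshold chosen ad hoc for this toy input (assumption).
for pos, window, score in scan(probe, sequence, threshold=1.5):
    print(pos, window, round(score, 3))
```

The planted copy at position `len(flank)` reaches the largest possible score, the entropy of the probe itself. With windows this short, unrelated strings can also score noticeably above zero, so the threshold has to be calibrated with care.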
  • 17. Example. Variants of the probe (labelled) embedded between random stretches (unlabelled):
      CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT  (the same string)
      TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC
      CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT  (1 error)
      TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG
      CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT  (2 errors)
      TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT
      CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC  (5 errors)
      CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA
      GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT  (C <--> G)
      TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA
      GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT  (C <-- G)
      CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT
      CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT  (some C <-> G)
      ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG
      GTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA  (complementary)
      GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA
      GTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA  (compl + 1 error)
      CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG
      GTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA  (compl + 2 errors)
      GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG
      GTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG  (compl + 5 errors)
      GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG
      CACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT  (1 letter more)
      CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG
      CACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT  (2 letters more)
      GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC
      CACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT  (3 letters more)
      CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT  (probe)
  • 18. [Figure: Detected values for I(P,S), on a scale from 0.2 to 1, for each variant of the probe: the same string; 1 error; 2 errors; 5 errors; C becomes G; C and G exchanged; 4 C become G and 5 G become C; complementary; complementary + 1 error; complementary + 2 errors; complementary + 5 errors; 1 letter more; 2 letters more; 3 letters more.]
  • 19. Conclusions: Use mutual information as a tool to capture strings that are correlated with a true TFBS used as a probe. Validate in vitro the candidates so obtained. This is more flexible than the use of Hamming or Levenshtein distance, since correlated strings can be very distant from one another. Drawbacks: 1. The method needs a precise calibration of the threshold. 2. It does not handle gaps.
  • 20. Random Projection Approach to Motif Finding
  • 21. daf-19 Binding Sites in C. elegans. Sites, located between positions -150 and -1: GTTGTCATGGTGAC, GTTTCCATGGAAAC, GCTACCATGGCAAC, GTTACCATAGTAAC, GTTTCCATGGTAAC. Genes, in the order listed on the slide: che-2, daf-19, osm-1, osm-6, F02D8.3.
  • 22. The (l,d) Planted Motif Problem  Generate a random length l consensus sequence C.  Generate 20 instances, each differing from C by d random mutations.  Plant one at a random position in each of N=20 random sequences of length n=600.  Can you find the planted instances?
  • 24. Random Projection Algorithm. Buhler and Tompa (2001). Guiding principle: some instances of a motif agree on a subset of positions. Use information from multiple motif instances to construct a model. Example: the (7,2) motif M = ATGCGTC, with instances x(1) = ...ccATCCGACca..., x(2) = ...ttATGAGGCtc..., x(5) = ...ctATAAGTCgc..., x(8) = ...tcATGTGACac...
  • 25. k-Projections. Choose k positions in a string of length l. Concatenate the nucleotides at the chosen k positions to form a k-tuple. In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace. Example: with l = 15, k = 7 and P = (2, 4, 5, 7, 11, 12, 13), the string ATGGCATTCAGATTC projects to TGCTGAT.
  • 26. Random Projection Algorithm. Choose a projection by selecting k positions uniformly at random. For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions. Recover the motif from a bucket containing multiple l-tuples. Example: in the input sequence x(i) = …TCAATGCACCTAT..., the l-tuple TGCACCT is hashed into bucket TGCT.
  • 27. Example. l = 7 (motif size), k = 4 (projection size). Choose projection (1,2,5,7). In the input sequence ...TAGACATCCGACTTGCCTTACTAC..., the l-tuple ATCCGAC goes to bucket ATGC and the l-tuple GCCTTAC goes to bucket GCTC.
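The projection on this slide can be checked in a few lines. A minimal sketch, assuming 1-based positions as on the slide; `project` and `random_projection` are illustrative names.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k distinct positions out of 1..l uniformly at random."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
```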
  • 28. Hashing and Buckets  Hash function h(x) obtained from k positions of projection.  Buckets are labeled by values of h(x).  Enriched buckets: contain at least s l-tuples, for some parameter s. ATTCCATCGCTCATGC
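Hashing every l-mer into a bucket and keeping the enriched buckets might look as follows. This is a sketch, not the authors' implementation; it uses a plain dictionary keyed by the projected k-tuple, and the function names are assumptions.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map each projected k-tuple h(x) to the list of l-mers x that hash to it."""
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            key = "".join(lmer[p - 1] for p in positions)  # 1-based positions
            buckets[key].append(lmer)
    return buckets

def enriched(buckets, s):
    """Keep only the buckets that contain at least s l-mers."""
    return {key: lmers for key, lmers in buckets.items() if len(lmers) >= s}

buckets = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], 7, (1, 2, 5, 7))
print(buckets["ATGC"], buckets["GCTC"])
```

With the projection (1,2,5,7) from the earlier example, ATCCGAC lands in bucket ATGC and GCCTTAC in bucket GCTC.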
  • 29. Frequency Matrix Model From Bucket. The sequences in bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) give the frequency matrix W, which the EM algorithm turns into the refined matrix W*:
             1     2     3     4     5     6     7
        A    1     0    .25   .5     0    .5     0
        C    0     0    .25   .25    0     0     1
        G    0     0    .5     0     1    .25    0
        T    0     1     0    .25    0    .25    0
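The frequency matrix W can be computed directly from the four sequences in the bucket. A minimal sketch without pseudocounts (an assumption; a real EM starting point may add them to avoid zero entries).

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """Column-wise relative frequency of each letter over equal-length strings."""
    n = len(lmers)
    length = len(lmers[0])
    return {a: [sum(s[j] == a for s in lmers) / n for j in range(length)]
            for a in alphabet}

W = frequency_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
for letter, row in W.items():
    print(letter, row)
```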
  • 30. Motif Refinement. How do we recover the motif from the sequences in the enriched buckets? k nucleotides are known from the hash value of the bucket. Use the information in the other l-k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler. Example: the sequences ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC in bucket ATGC are passed to the local refinement algorithm, which returns the candidate motif ATGCGTC.
  • 31. Expectation Maximization (EM). S = {x(1), …, x(N)}: set of input sequences. Given: W = an initial probabilistic motif model; P0 = background probability distribution. Find the value Wmax that maximizes the likelihood ratio $\frac{\Pr(S \mid W_{max})}{\Pr(S \mid P_0)}$. EM is a local optimization scheme and requires a starting value W.
  • 32. EM Motif Refinement. For each bucket h containing more than s sequences, form the weight matrix $W_h$. Use the EM algorithm with starting point $W_h$ to obtain the refined weight matrix model $W_h^*$. For each input sequence x(i), return the l-tuple y(i) which maximizes the likelihood ratio $\Pr(y(i) \mid W_h^*) / \Pr(y(i) \mid P_0)$. T = {y(1), y(2), …, y(N)}. C(T) = consensus string.
  • 33. What Is the Best Motif? Compute a score for each motif: generate W, an initial PSSM, from the returned l-mers {y(1), y(2), …, y(N)}, and compute $Score = \sum_i \log\frac{P(y(i) \mid W)}{P(y(i) \mid P_0)}$. Return the motif with maximal score.
  • 34. Iterations  Single iteration.  Choose a random k-projection.  Hash each l-mer x in input sequence into bucket labelled by h(x).  From each bucket B with at least s sequences, form weight matrix model, and perform EM/Gibbs sampler refinement.  Candidate motif is the best one found from refinement of all enriched buckets.  Multiple iterations.  Repeat process for multiple projections.
  • 35. Parameter Selection. Projection size k: choose k small so that several motif instances hash to the same bucket ($k < l - d$); choose k large to avoid contamination by spurious l-mers ($E > N(n - l + 1)/4^k$, where E bounds the expected number of random l-mers per bucket). Bucket threshold s: (s = 3, s = 4).
  • 36. How Many Iterations?  Planted bucket : bucket with hash value h(M), where M is motif.  Choose m = number of iterations, such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95.  Probability is readily computable since iterations form a sequence of independent trials.
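The number of iterations m can be computed along these lines. This is a sketch under assumptions not spelled out on the slide: every planted instance carries exactly d mutations, an instance hashes to the planted bucket exactly when none of the k projected positions is mutated, and the N instances behave independently. Function names and the parameter values in the usage line are illustrative.

```python
from math import comb, ceil, log

def planted_bucket_prob(l, d, k):
    """Probability that k random positions avoid all d mutated positions."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than_s(N, p, s):
    """Binomial probability that fewer than s of N instances hit the planted bucket."""
    return sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(s))

def iterations_needed(l, d, k, N, s, confidence=0.95):
    """Smallest m with Pr(planted bucket holds >= s instances in some iteration) >= confidence."""
    p = planted_bucket_prob(l, d, k)
    if p == 0:
        raise ValueError("k must be at most l - d")
    fail = prob_fewer_than_s(N, p, s)   # one iteration fails
    if fail == 0:
        return 1
    return ceil(log(1 - confidence) / log(fail))

m = iterations_needed(l=15, d=4, k=7, N=20, s=4)
print(m)
```

Because the iterations are independent trials, the probability that all m of them fail is the single-iteration failure probability raised to the power m, which is what the last function inverts.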
  • 38. monad patterns  Short contiguous strings  Appear surprisingly many times( in a statistically significant way)  S = AGTCAGTCTTGCTAGTCAGTCCGTAATATCCGGATAGAATAATGATC GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC AAGATGTACTAGAGTCAGTCACGTAGCTAGTCAGTCATCTATACGAGAG TCTCGATGTAGTAGCTATCGATCGTAGCTAGAGTCAGTCCGTAGC AGCTAGTATCGTAGTGAGCAACATGAGTCAGTCCAGTGCATAA GTCGTCAGCTCATGAGTCAGTCGCATAGTCAGTC P = AGTC
  • 39. Introduction  However, many of the actual regulatory signals are composite patterns.  Groups of monad patterns  Occur relatively near each other  An example of a composite pattern is a dyad signal.
  • 41. Introduction. A possible approach is to find each part of the pattern separately and then reconstruct the composite pattern. However, such approaches often fail to output composite regulatory patterns consisting of weak monad parts.
  • 42. Introduction  A better approach would be to detect both parts of a composite pattern at the same time.  Two steps in the proposed algorithm:  Preprocessing the sample creates a set of ‘virtual’ monads.  Apply an exhaustive monad discovery algorithm to the ’virtual’ monad problem.  By preprocessing, original problem can be transformed into a larger monad discovery problem.
  • 43. Monad Pattern Discovery. Canonical pattern l-mer: a contiguous string of length l (example of a 3-mer: ACA). (l,d)-neighbourhood of an l-mer P: all possible l-mers with up to d mismatches as compared to P. The number of such l-mers is $\sum_{i=0}^{d} \binom{l}{i} 3^i$. (l,d)-k patterns: given a sequence S, find all l-mers that occur with up to d mismatches at least k times in the sample. A variant: the sample is split into several sequences, and the task is to find all l-mers that occur with up to d mismatches in at least k sequences.
  • 44. Pattern Driven Approach (PDA). (Pevzner, 2000) Examines all $4^l$ patterns of fixed length l in lexical order, compares each pattern to every l-mer in the sample, and returns all (l,d)-k patterns. (Waterman et al., 1984 and Galas et al., 1985) Bypass the excessive time requirement: most of the $4^l$ patterns are not worth examining, since neither these patterns nor their neighbours appear in the sample. SDA was therefore designed to explore only the l-mers appearing in the sample and their neighbours.
  • 45. Sample Driven Approach (SDA). First initializes a table of size $4^l$; each table entry corresponds to a pattern. For each l-mer in the sample, SDA generates its (l,d)-neighbourhood, and the table entry of each pattern in it is incremented by a certain amount. After all l-mers are processed, SDA returns all patterns whose table entries have scores exceeding the threshold. Example table entries: AAAAA 3, AAAAC 1, AAACC 2, … ($4^l$ entries in total).
  • 46. Sample Driven Approach (SDA). Faster, but still requires a large $4^l$ table. Not practical for long patterns in the mid 1980s. Did not become mainstream and no tool was built on it. (Today gigabytes of RAM are available, so l can be increased even without a memory-efficient algorithm.)
  • 47. SDA Iterations. First, explore all neighbours of the first l-mer from the sample. Second, explore all neighbours of the second l-mer, and so on. If an l-mer P belongs to the neighbourhoods of the l-mers appearing at positions i1, …, ik in the sample, information about P is collected at iterations i1, …, ik. So the Waterman approach updates the information about P k times, and the memory slot for P is occupied during the whole run even if P is not an “interesting” l-mer. Most of the l-mers explored are not interesting, so their memory slots are wasted.
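The size of the (l,d)-neighbourhood, the sum over i from 0 to d of C(l,i) times 3^i, can be cross-checked by enumerating the neighbourhood explicitly. A small sketch; the function names are illustrative.

```python
from math import comb
from itertools import combinations, product

def neighbourhood_size(l, d):
    """Number of l-mers within d mismatches of a fixed l-mer over a 4-letter alphabet."""
    return sum(comb(l, i) * 3**i for i in range(d + 1))

def neighbourhood(lmer, d, alphabet="ACGT"):
    """All l-mers that differ from lmer in at most d positions."""
    result = {lmer}
    for i in range(1, d + 1):
        for positions in combinations(range(len(lmer)), i):
            # at each chosen position, substitute one of the 3 other letters
            choices = [[a for a in alphabet if a != lmer[p]] for p in positions]
            for letters in product(*choices):
                s = list(lmer)
                for p, a in zip(positions, letters):
                    s[p] = a
                result.add("".join(s))
    return result

print(neighbourhood_size(3, 1), len(neighbourhood("ACA", 1)))
```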
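The table-filling loop of SDA might be sketched like this. Two assumptions are made for illustration: the table is a dictionary that only stores patterns actually touched (instead of a full table with 4^l entries), and each table entry is incremented by 1 per neighbouring l-mer. The example uses the sequence AGTATCAGTT that appears later in the deck.

```python
from collections import Counter

def neighbours(lmer, d, alphabet="ACGT"):
    """Yield every string within d mismatches of lmer exactly once."""
    if not lmer:
        yield ""
        return
    for rest in neighbours(lmer[1:], d, alphabet):
        yield lmer[0] + rest                 # keep the first letter
    if d > 0:
        for a in alphabet:
            if a != lmer[0]:
                for rest in neighbours(lmer[1:], d - 1, alphabet):
                    yield a + rest           # spend one mismatch here

def sda(sample, l, d, k):
    """Return every pattern with at least k sample l-mers in its (l,d)-neighbourhood."""
    table = Counter()
    for i in range(len(sample) - l + 1):
        for pattern in neighbours(sample[i:i + l], d):
            table[pattern] += 1
    return {p: c for p, c in table.items() if c >= k}

patterns = sda("AGTATCAGTT", l=3, d=1, k=2)
print(patterns["GTC"])
```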
  • 48. To improve SDA  Better solution:  Collect info about all P at the same time  to remove the need to keep the info in memory  but require a new approach to navigate the space of all lmers  MITRA runs faster than PDA and SDA, and uses only a fraction of the memory of the SDA
  • 49. Pattern-finding vs. profile-based  Profile-based is more biologically relevant for finding motifs in biological samples?  Probably the reason Waterman algorithm not popular in the last decade  Sagot and colleagues were the first to rebut this opinion  Develop an efficient version of Waterman’s
  • 50. Pattern-based vs. profile-based  Similarities  Pattern-based generate the profile  Every profile of length l corresponds to a pattern of length l formed by the most frequent nucleotides in every position.  Pattern-driven at least as good as profile-based  Even better on simulated samples with implanted patterns  Though profile-implantation model is somehow limited  Today little evidence profile-based perform any better on either biological or simulated samples
  • 52. Mismatch Tree Algorithm (MITRA)  MITRA uses a mismatch tree data structure to split the space of all possible patterns into disjoint subspaces that start with a given prefix.  For reducing the pattern discovery into smaller sub-problems.  MITRA also takes advantage of pair-wise similarity between instances.
  • 53. Splitting Pattern Space  A pattern is called weak if it has less than k ( l ,d )-neighbours in the sample.  A subspace is called weak if all patterns in this subspace are weak.
  • 54. Splitting pattern space. A pattern is called weak if it has fewer than k (l,d)-neighbours in the sample. Example: Sequence = AGTATCAGTT, l = 3, sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }, d = 1, k = 2. For P = GTC, the (l,d)-neighbours in the sample are { GTA, ATC, GTT }, so P is not weak.
  • 55. Splitting pattern space. A pattern is called weak if it has fewer than k (l,d)-neighbours in the sample. Example: Sequence = AGTATCAGTT, l = 3, sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }, d = 1, k = 2. For P = CAG, the (l,d)-neighbours in the sample are { CAG }, so P is weak.
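The two examples above can be verified by counting, for a given pattern, the sample l-mers within d mismatches. A minimal sketch; the helper names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def sample_neighbours(pattern, lmers, d):
    """Sample l-mers within d mismatches of the pattern (with multiplicity)."""
    return [x for x in lmers if hamming(pattern, x) <= d]

def is_weak(pattern, lmers, d, k):
    """A pattern is weak if it has fewer than k (l,d)-neighbours in the sample."""
    return len(sample_neighbours(pattern, lmers, d)) < k

sequence = "AGTATCAGTT"
lmers = [sequence[i:i + 3] for i in range(len(sequence) - 3 + 1)]
print(sample_neighbours("GTC", lmers, 1), is_weak("GTC", lmers, 1, 2))
print(sample_neighbours("CAG", lmers, 1), is_weak("CAG", lmers, 1, 2))
```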
  • 56. Splitting pattern space  A subspace is called weak if all patterns in this subspace are weak. •subspaceA = { AAA, AAT, AAC, AAG ………..AGG } •subspaceT = { TAA, TAT, TAC, TAG ………..TGG } •subspaceC = { CAA, CAT, CAC, CAG ………..CGG } •subspaceGG = { GGA, GGT, GGC, GGG} Sequence = AGTATCAGTT
  • 57. Question  Input:  S, l, d, k  Output:  All l mers that occur with up to d mismatches at least k times in the sample.
  • 58. Solution. Naïve: test every l-mer in the space; if it occurs with up to d mismatches at least k times in the sample, then output this l-mer. space = { AAA, AAT, AAC, AAG ……….. AGG, TAA, TAT, TAC, TAG ……….. TGG, CAA, CAT, CAC, CAG ……….. CGG, GAA, GAT, GAC, GAG ……….. GGG }. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
  • 59. Splitting pattern space  if we are looking for patterns of length l we would first split the space of all l mers into 4 disjoint subspaces.  Subspace of all l mers starting with A,  Subspace of all l mers starting with T,  Subspace of all l mers starting with C,  Subspace of all l mers starting with G,
  • 60. Splitting pattern space  if we are looking for patterns of length l we would first split the space of all l mers into 4 disjoint subspaces. Space: A* SubspaceA T* C* G*
  • 61. Splitting pattern space  we further determine whether the subspace contains a ( l ,d )-k pattern. Space: A* Can’t rule out
  • 62. Splitting pattern space  we further determine whether the subspace contains a ( l ,d )-k pattern. Space: AT* AA* AC* AG* Can rule out
  • 63. Splitting pattern space  we further determine whether the subspace contains a ( l ,d )-k pattern. Space: Can’t rule out
  • 64. Splitting pattern space  we further determine whether the subspace contains a ( l ,d )-k pattern.  If we can rule out this subspace contains such a pattern  we stop searching in this subspace;  release the memory slot;  If we can’t rule out this subspace contains such a pattern  we split this subspace again on the next symbol;  and repeat;
  • 65. Mismatch tree data structure  A mismatch tree is a rooted tree where each internal node has 4 branches labeled with a symbol in {A,C,T,G}  The maximum depth of the tree is l.  Each node in the mismatch tree corresponds to the subspace of patterns P with a fixed prefix.  Each node contains pointers to all l mers instances from the sample that are within d mismatches from a pattern p.
  • 66. Mismatch tree data structure. MITRA starts by examining the root node of the mismatch tree, which corresponds to the space of all patterns. When examining a node, MITRA tries to prove that it corresponds to a weak subspace. If we cannot prove it, we expand the node’s children and examine each of them. Whenever we reach a node corresponding to a weak subspace, we backtrack. The intuition is that many of the nodes correspond to weak subspaces and can be ruled out. This allows us to avoid searching much of the pattern space.
  • 67. Mismatch tree data structure. If we reach depth l and the number of instances is not less than k, we output the l-mer corresponding to the path from the root to the leaf; the pointers from this node correspond to the instances of this pattern.
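The traversal described on these slides can be sketched as a depth-first search in which each node keeps the l-mer instances still within d mismatches of its prefix. This is an illustrative reading of the basic mismatch tree (without the graph-based pruning discussed later), not the MITRA source code; the names are assumptions.

```python
def mismatch_tree(sample, l, d, k, alphabet="ACGT"):
    """Return {pattern: instances} for every (l,d)-k pattern in the sample."""
    lmers = [sample[i:i + l] for i in range(len(sample) - l + 1)]
    found = {}

    def visit(prefix, alive):
        # alive: (l-mer index, mismatches against the prefix so far)
        if len(alive) < k:
            return                      # weak subspace: backtrack
        depth = len(prefix)
        if depth == l:                  # leaf: the path spells a pattern
            found[prefix] = [lmers[i] for i, _ in alive]
            return
        for a in alphabet:              # split the subspace on the next symbol
            child = []
            for i, mm in alive:
                mm2 = mm + (lmers[i][depth] != a)
                if mm2 <= d:
                    child.append((i, mm2))
            visit(prefix + a, child)

    visit("", [(i, 0) for i in range(len(lmers))])
    return found

result = mismatch_tree("AGTATCAGTT", l=4, d=1, k=2)
print(sorted(result))
```

For S = AGTATCAGTT with l = 4, d = 1, k = 2 the search covers the whole tree, so besides the patterns under the prefix AGT it also reports any pattern found in other branches.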
  • 68. Example  Consider a very simple example of finding the pattern of length 4 with up to 1 mismatch and at least 2 times in the sample S = “AGTATCAGTT”.  The substrings (4mers) in S are { AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT } Not for exam
  • 69. [Figure: mismatch tree for S = AGTATCAGTT (l = 4, d = 1), subtree of prefix AA. Each node lists the remaining 4-mers with their mismatch counts and the number k of instances still within d mismatches. Root: k = 7. Prefix A: mismatch counts 0 1 1 0 1 1 0, k = 7. Prefix AA: counts 1 2 1 1 2 1 1, k = 5. Prefix AAA: k = 0. AAC: k = 1. AAG: k = 1. AAT: k = 3. AATA: k = 1. AATC: k = 1. AATG: k = 0. AATT: k = 1.]
  • 70. [Figure: mismatch tree for S = AGTATCAGTT (l = 4, d = 1), subtree of prefix AG. Prefix A: k = 7. Prefix AG: mismatch counts 0 2 2 1 2 2 0, k = 3. Prefix AGT: k = 2. Leaves AGTA, AGTC, AGTG, AGTT: k = 2 each. Output: AGTA, AGTC, AGTG, AGTT.]
  • 71. Overall complexity. The tree has depth l and the number of nodes is $O(4^l)$; each node holds O(|S|) instances and the number of comparisons in each node is O(|S|). Space = $O(l^2 \times |S|)$. Time = $O(4^l \times |S|)$.
  • 72. Take a Closer Look. In the mismatch tree algorithm, we cannot start ruling out a node until we traverse to depth d + 1. In the example (S = AGTATCAGTT, d = 1): at the root k = 7, at prefix A still k = 7 (mismatch counts 0 1 1 0 1 1 0), and only at prefix AA does k drop to 5 (counts 1 2 1 1 2 1 1).
  • 73. MITRA Graph. Information about pairwise similarities between instances of the pattern can significantly speed up the sample-driven approach. The graph that is constructed to model this pairwise similarity is called the MITRA-Graph.
  • 74. MITRA Graph  Given a pattern P and sample S we can construct a graph G(P, S) where each vertex is an lmer in the sample and there is an edge connecting two lmers if P is within d mismatches from both lmers. ACA (d=3) TAA (d=1) AAC (d=1) P = TAC S = TAACA
  • 75. MITRA Graph. For an (l,d)-k pattern P the corresponding graph contains a clique of size k. Example: P = AAA, S = TAACA; the l-mers TAA, AAC and ACA are each within 1 mismatch of P, so with d = 1 they form a clique of size 3.
  • 76. MITRA Graph. Given a set of patterns P and a sample S, define a graph G(P , S) whose edge set is the union of the edge sets of the graphs G(P, S) for P∈P . Each vertex of G(P , S) is an l-mer in the sample and there is an edge connecting two l-mers if there is a pattern P∈P that is within d mismatches from both l-mers. If for a subspace of patterns we can rule out the existence of a clique of size k, then the subspace contains no (l,d)-k pattern.
  • 77. The WINNOWER Algorithm. The WINNOWER algorithm by Pevzner and Sze (2000) constructs the following graph: each l-mer in the sample is a vertex, and an edge connects two vertices if the corresponding l-mers have at most 2d mismatches (two instances of the same pattern, each within d mismatches of it, can differ in at most 2d positions). Instances of an (l,d)-k pattern form a clique of size k in this graph.
  • 78. The WINNOWER Algorithm (cont’d). Since cliques are difficult to find, WINNOWER takes the approach of trying to remove edges that do not correspond to a clique. [Figure: example graph with k = 4.]
  • 79. Improvements by MITRA-Graph 1. Construct a graph at each node in the mismatch tree. A 0 1 1 0 1 1 0 A G T A T C A G T A T C A G T A T C A G T A T C A G T T
  • 80. Improvements by MITRA-Graph 2. Remove edges which are not part of a clique. A
  • 81. Improvements by MITRA-Graph 3. If no potential clique remains, rule out the subspace corresponding to the node and backtrack. A A
  • 82. Improvements by MITRA-Graph 4. If we cannot rule out a clique, split the subspace of patterns and examine the child nodes A
  • 83. MISMATCH TREE ALGORITHM — Improvements over WINNOWER  At each node of the tree, we remove edges by computing the degree of each vertex.  If the degree of the vertex is less than k-1, we can remove all edges incident to it since we know it is not part of a clique.  We repeat this procedure until we cannot remove any more edges.  If the number of edges remaining is less than the minimum number of edges in a clique, we can rule out the existence of a clique and backtrack.
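The degree-based pruning step can be sketched as follows. This is a minimal illustration on an abstract undirected graph given as a list of vertex pairs; the function names are assumptions, and the minimum number of edges in a clique of size k is taken to be k(k-1)/2.

```python
def prune(edges, k):
    """Repeatedly drop all edges incident to vertices of degree < k - 1."""
    edges = {frozenset(e) for e in edges}
    while True:
        degree = {}
        for e in edges:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        low = {v for v, deg in degree.items() if deg < k - 1}
        if not low:
            return edges                 # nothing more can be removed
        edges = {e for e in edges if not (e & low)}

def may_contain_clique(edges, k):
    """False means a clique of size k is ruled out and we can backtrack."""
    return len(prune(edges, k)) >= k * (k - 1) // 2

# A triangle 1-2-3 with a pendant vertex 4 attached to vertex 3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(sorted(tuple(sorted(e)) for e in prune(edges, 3)))
```

The test is only a necessary condition: surviving edges do not prove that a clique exists, which is why the algorithm continues by splitting the subspace.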
  • 84. MISMATCH TREE ALGORITHM: Improvements over WINNOWER. The problem with this approach is how to efficiently construct the graph at each node in the mismatch tree. Instead of constructing the graph from scratch, we construct it based on the graph at the parent node. Consider an edge connecting two l-mers: the first l-mer matches the prefix of the pattern subspace with d1 mismatches, and the second l-mer matches it with d2 mismatches.
  • 85. MISMATCH TREE ALGORITHM: Improvements over WINNOWER. Denote the number of mismatches between the tails of the first and the second l-mers (the parts beyond the prefix) as m. The edge between these two l-mers exists in the pattern subspace if and only if d1 <= d, d2 <= d and d1 + d2 + m <= 2d.
  • 86. MISMATCH TREE ALGORITHM: Improvements over WINNOWER (cont’d). In the root node, since d1 = d2 = 0, an edge exists if and only if m <= 2d, which gives the same graph as WINNOWER. As we move down the tree, the condition becomes much stronger than that of WINNOWER. We can compute the edges of a node from the edges of the node’s parent by keeping track of the quantities d1, d2, and m for each edge.
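The edge condition d1 <= d, d2 <= d, d1 + d2 + m <= 2d can be written out directly. For clarity this sketch recomputes d1, d2 and m from scratch for a given prefix rather than updating them incrementally from the parent node, which is a simplification; the function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def edge_exists(prefix, a, b, d):
    """Edge between l-mers a and b in the subspace of patterns starting with prefix."""
    p = len(prefix)
    d1 = hamming(prefix, a[:p])      # mismatches of the first l-mer against the prefix
    d2 = hamming(prefix, b[:p])      # mismatches of the second l-mer against the prefix
    m = hamming(a[p:], b[p:])        # mismatches between the two tails
    return d1 <= d and d2 <= d and d1 + d2 + m <= 2 * d

# Two 4-mers from AGTATCAGTT that differ in two positions, with d = 1.
print(edge_exists("", "AGTA", "ATCA", 1))     # root: the WINNOWER condition m <= 2d
print(edge_exists("AA", "AGTA", "ATCA", 1))   # deeper prefix: the condition is stricter
```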
  • 87. MISMATCH TREE ALGORITHM — Improvements over WINNOWER  To summarize, the MITRA-Graph algorithm works as follows  We first compute the set of edges at the root node by performing pairwise comparisons between all l mers due to d1 = d2 = 0.  We traverse the tree in a depth first order, passing on the valid edges and keeping track of the quantities d1, d2, and m for each of them.  At each node, we prune the graph by eliminating any edges incident to vertices that have degrees of less than k-1.  If there are less than the minimum number of edges for a clique, we backtrack.  If we reach a leaf of the tree (depth l), then we output the corresponding pattern.
  • 89. DISCOVERING DYAD SIGNALS. For dyad signals, we are interested in discovering two monads that occur a certain distance apart. We use the notation (l1-(s1,s2)-l2,d)-k pattern to denote a dyad signal: a monad of length l1 and a monad of length l2 separated by a spacer of s nucleotides, with s in [s1,s2].
  • 90. DISCOVERING DYAD SIGNALS. The MITRA-Dyad algorithm casts the dyad discovery problem into a monad discovery problem by preprocessing the input and creating a “virtual” sample, and then solves the (l1+l2,d)-k monad pattern discovery problem in this sample. For each l1-mer in the sample and for each s in [s1,s2], we create an (l1+l2)-mer which is the l1-mer concatenated with the l2-mer upstream s nucleotides of the l1-mer.
  • 91. DISCOVERING DYAD SIGNALS. The number of elements in the “virtual” sample will be approximately (s2-s1+1) times larger than in the original sample. An (l1+l2,d)-k pattern in the “virtual” sample corresponds to an (l1-(s1,s2)-l2,d)-k pattern in the original sample, and we can easily map the solution from the monad problem to the dyad one. An important feature of MITRA-Dyad is its ability to search for long patterns.
  • 92. DISCOVERING DYAD SIGNALS. If the range s2-s1+1 of acceptable distances between the monad parts of a composite pattern is large, the MITRA-Dyad algorithm becomes inefficient. A simple approach to detect these patterns is to generate a long ranked list of candidate monad patterns using MITRA, and then check each occurrence of each pair from the list to see whether they occur within the acceptable distance.
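Building the “virtual” sample could be sketched as below. The orientation is an assumption made for illustration: the second monad is taken to start s nucleotides after the end of the l1-mer. The function name and the returned tuple layout (start, spacer, virtual l-mer) are hypothetical.

```python
def virtual_sample(sequence, l1, l2, s1, s2):
    """Concatenate each l1-mer with the l2-mer found s nucleotides away, for s in [s1, s2]."""
    virtual = []
    for i in range(len(sequence) - l1 + 1):
        first = sequence[i:i + l1]
        for s in range(s1, s2 + 1):
            j = i + l1 + s                    # assumed start of the second monad
            if j + l2 <= len(sequence):
                virtual.append((i, s, first + sequence[j:j + l2]))
    return virtual

v = virtual_sample("ACGTACGTAC", l1=2, l2=2, s1=1, s2=2)
print(len(v), v[0])
```

Keeping the start position and spacer with every virtual (l1+l2)-mer makes it straightforward to map a monad pattern found in the virtual sample back to a dyad in the original sequence.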