SlideShare a Scribd company logo
1 of 161
Download to read offline
www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms
Protein Sequencing and
Identification by Mass
Spectrometry
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
• Tandem Mass Spectrometry
• De Novo Peptide Sequencing
• Spectrum Graph
• Protein Identificationvia Database Search
• IdentifyingPost Translationally Modified Peptides
• Spectral Convolution
• Spectral Alignment
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Different Amino Acid Have Different Masses
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
AA residuei-1 AA residuei AA residuei+1
N-terminus C-terminus
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Fragmentation
• Peptides tend to fragment along the backbone.
• Mass spectrometer is a sophisticated
(and rather expensive!) scale to measure
the masses of these fragments
H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
H+
Prefix Fragment Suffix Fragment
Collision Induced Dissociation
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Breaking Protein into Peptides and
Peptides into Fragment Ions
• Most mass spectrometers can only measure masses of
short peptides (e.g., 20 amino acids or less) rather than
masses of entire proteins (usually hundreds of amino
acids). That‟s why:
• Proteases, e.g. trypsin, break protein into short peptides.
• A Tandem Mass Spectrometer further breaks the peptides
down into fragment ions and measures the mass of each
piece.
• Mass Spectrometer accelerates the fragmented ions;
heavier ions accelerate slower than lighter ones.
• Mass Spectrometer measure mass/charge ratio of an ion.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
N- and C-terminal Peptides
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Terminal peptides and ion types
Peptide
Mass (D) 57 + 97 + 147 + 114 = 415
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Masses of fragment ions
Peptide
Mass (D) 57 + 97 + 147 + 114 = 415
Peptide
Mass (D) 57 + 97 + 147 + 114 – 18 = 397
without
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
N- and C-terminal Peptides
415
486
301
154
57
71
185
332
429
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
N- and C-terminal Peptides
415
486
301
154
57
71
185
332
429
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Theoretical Spectrum
415
486
301
154
57
71
185
332
429
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 154 185 301 332 415 429 486
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 154 185 301 332 415 429 486
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486154
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 370 387 409 415 423 429 460 472 486
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Fragmentation
y3
b2
y2 y1
b3a2 a3
HO NH3
+
| |
R1 O R2 O R3 O R4
| || | || | || |
H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH
| | | | | | |
H H H H H H H
b2-H2O
y3 -H2O
b3- NH3
y2 - NH3
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass Spectra
G V D L K
mass
0
57 Da = „G‟ 99 Da = „V‟
LK D V G
• The peaks in the mass spectrum:
• Prefix
• Fragments with neutral losses (-H2O, -NH3)
• Noise and missing peaks.
and Suffix Fragments.
D
H2O
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Protein Identification with MS/MS
mass
0
Intensity
mass
0
MS/MS
Peptide
Identification
Protein database
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tandem Mass Spectrum
• Tandem Mass Spectrometry mainly generates N-
and C-terminal fragment ions
• Chemical noise often complicates the spectrum.
• Represented in 2-D: mass/charge axis vs. intensity
axis
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
850.3
687.3
588.1
851.4
425.0
949.4
326.0
524.9
589.2
1048.6
397.1226.9
1049.6
489.1
629.0
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tandem Mass-Spectrometry
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Breaking Proteins into Peptides
peptides
MPSER
……
GTDIMR
PAKID
……
HPLC
To
MS/MSMPSERGTDIMRPAKID......
protein
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass Spectrometry
Matrix-Assisted Laser Desorption/Ionization (MALDI)
From lectures by Vineet Bafna (UCSD)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tandem Mass Spectrometry
RT: 0.01- 80.02
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
Time(min)
0
10
20
30
40
50
60
70
80
90
100
RelativeAbundance
1389
1991
1409 2149
1615 1621
1411
2147
1611
19951655
1593
1387
21551435 1987
2001 2177
1445 1661
1937
2205
1779 2135
2017
1313 22071307 2329
1105 17071095
2331
NL:
1.52E8
BasePeakF:+
cFullms[
300.00 -
2000.00]
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
850.3
687.3
588.1
851.4
425.0
949.4
326.0
524.9
589.2
1048.6
397.1226.9
1049.6
489.1
629.0
Scan 1708
LC
S#: 1707 RT: 54.44 AV: 1 NL: 2.41E7
F: + c Full ms [ 300.00 - 2000.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
638.0
801.0
638.9
1173.8
872.3 1275.3
687.6
944.7 1884.51742.11212.0783.3 1048.3 1413.9 1617.7
Scan 1707
MS
MS/MS
Ion
Source
MS-1
collision
cell MS-2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Protein Identification by Tandem
Mass Spectrometry (MS/MS)
S
e
q
u
e
n
c
e
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
850.3
687.3
588.1
851.4
425.0
949.4
326.0
524.9
589.2
1048.6
397.1226.9
1049.6
489.1
629.0
MS/MS instrument
database search
•Sequest, Mascot, etc
de novo interpretation
•Lutefisk, Peaks, etc
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
850.3
687.3
588.1
851.4
425.0
949.4
326.0
524.9
589.2
1048.6
397.1226.9
1049.6
489.1
629.0
W
R
A
C
V
G
E
K
DW
L
P
T
L T
W
R
A
C
V
G
E
K
DW
L
P
T
L T
De Novo
AVGELTK
Database
Search
Database of all peptides = 20n
AAAAAAAA,AAAAAAAC,AAAAAAAD,AAAAAAAE,
AAAAAAAG,AAAAAAAF,AAAAAAAH,AAAAAAI,
AVGELTI, AVGELTK , AVGELTL, AVGELTM,
YYYYYYYS,YYYYYYYT,YYYYYYYV,YYYYYYYY
Database of
known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
Database of
known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
Mass, Score
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• The database of all peptides is huge ≈ 20n peptides of length n
• The database of all known peptides is much smaller ≈ 108 peptides
• However, de novo algorithms can be much faster, even though their
search space is much larger!
• A database search scans all peptides in the database of all known
peptides to find best one.
• De novo eliminates the need to scan database of all peptides by
modeling the problem as a graph search.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Three Algorithmic Problems
• Searching for a million words in a text. Suppose it takes
1 sec to find a word in a text. How much time would it take
to find 1 million words in the text?
• Searching for a word without even looking at 99.999%
of the text. Suppose you search for a word in a text.
Would it be possible to ignore 99.999% of the text, scan
only the remaining part and guarantee that the word you
are looking for will be found?
• Finding spelling errors in a book written in an
unknown language. Given a book (in an unknown
language) and a misspelled word (with insertions,
deletions, and substitutions of letters) correct spelling
errors in the word.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Three Algorithmic Problems
• Searching for a million words in a text. Suppose it takes
1 sec to find a word in a text. How much time would it take
to find 1 million words in the text? 1 millionseconds?
• Searching for a word without even looking at 99.999%
of the text. Suppose you search for a word in a text.
Would it be possible to ignore 99.999% of the text, scan
only the remaining part and guarantee that the word you
are looking for will be found?
• Finding spelling errors in a book written in an
unknown language. Given a book (in an unknown
language) and a misspelled word (with insertions,
deletions, and substitutions of letters) correct spelling
errors in the word.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Genomics: Problems Solved.
• Searching for a million words in a text.
Aho-Corasik algorithm takes roughly the same time
with a million words as it takes with a single word.
• Searching for a word without even looking at
99.999% of the text.
Filtration algorithms (like FASTA or BLAST) ignore
99.999% of the text.
• Finding spelling errors.
Sequence alignment algorithms (like Smith-
Waterman) do it in quadratic time
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Proteomics: Three Problems
• Comparing a million spectra against a database. Suppose it
takes 1 sec to interpret a spectrum. How much time would it take to
interpret 1 million spectra?
• Mass-spectrometry database search without even looking at
99.999% of the database. Suppose you compare a spectrum
against a database. Would it be possible to ignore 99.999% of the
database, scan only the remaining part and guarantee that you still
can identify a peptide of interest?
• Blind PTM search and discovery of new PTM types. Given a
spectrum of a peptide with unknown PTM types, find this peptide in
the database. Discover new PTM types by data mining of large
MS/MS datasets.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Three Solutions
• Comparing a million spectra against a database.
InsPecT (Tanner et al., Anal. Chem, 2005)
• MS/MS database search without even looking at
99.999% of the database.
InsPecT (Tanner et al., Anal. Chem, 2005)
• Blind PTM search and discovery of new PTM types. Given a
spectrum of a peptide with unknown PTM types, find this
peptide in the database. Discover new PTM types by data
mining of large MS/MS datasets.
MS-Alignment (Tsur et al., Nature Biotech., 2005)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration: Combining De Novo Sequencing and
Database Search in Mass-Spectrometry
• So far de novo and database search were presented as
two separate techniques
• Database search is rather slow: many labs generate
more than 100,000 spectra per day. SEQUEST takes
approximately 1 minute to compare a single spectrum
against SWISS-PROT (54Mb) on a desktop.
• It will take SEQUEST more than 2 months to analyze the
MS/MS data produced in a single day.
• Can slow database search be combined with fast de
novo analysis?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De novo Peptide Sequencing
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
850.3
687.3
588.1
851.4
425.0
949.4
326.0
524.9
589.2
1048.6
397.1226.9
1049.6
489.1
629.0
Sequence
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Building Spectrum Graph
• How to create vertices (from masses)
• How to create edges (from mass differences)
• How to score vertices
• How to score paths
• How to find the best path
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S E Q U E N C E
b-ions (prefix or N-terminal ions)
Mass/Charge (M/Z)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
a-ions = b-ions - CO = b-ions - 28
Mass/Charge (M/Z)
S E Q U E N C E
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S E Q U E N C E
Mass/Charge (M/Z)
Shifting Peaks: a-ions = b-ions - CO = b-ions - 28
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
y-ions (suffix of C-terminal ions)
Mass/Charge (M/Z)
E C N E U Q E S
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass/Charge (M/Z)
Intensity
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass/Charge (M/Z)
Intensity
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
noise
Mass/Charge (M/Z)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MS/MS Spectrum
Mass/Charge (M/z)
Intensity
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Some Mass Differences between Peaks
Correspond to Amino Acids
s
s
s
e
e
e
e
e
e
e
e
q
q
qu
u
u
n
n
n
e
c
c
c
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Some Mass Differences between Peaks
Correspond to Amino Acids
s
s
s
e
e
e
e
e
e
e
e
q
q
qu
u
u
n
n
n
e
c
c
c
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ion Types
• Some masses correspond to fragment ions,
others are just random noise
• Knowing ion types Δ={δ1, δ2,…, δk} allows one to
distinguish fragment ions from noise
• We can learn ion types δi and their probabilities qi
by analyzing a large test sample of annotated
spectra.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example of Ion Type
• Δ={δ1, δ2,…, δk}
• Ion types
{b, b-NH3, b-H2O, b-CO}
correspond to
Δ={0, 17, 18, 28}
*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Match between Spectra and the
Shared Peak Count
• The match between two spectra is the number of masses (peaks)
they share (Shared Peak Count or SPC)
• In practice mass-spectrometrists use the weighted SPC that
reflects intensities of the peaks
• Match between experimental and theoretical spectra is defined
similarly
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
Output:
• A peptide whose theoretical spectrum
matches the experimental spectrum the best
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S E Q U E N C E
Mass/Charge (M/Z)
Shifting Peaks: a-ions = b-ions - CO = b-ions - 28
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reverse Shifts
Shift in H2O+NH3
Shift in H2O
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Vertices of the Spectrum Graph
• Masses of potential N-terminalpeptides
• Vertices aregenerated by reverseshifts correspondingto ion types
Δ={δ1, δ2,…, δk}
• Every N-terminalpeptidecan generateup to k ions
m-δ1, m-δ2, …, m-δk
• Every mass s in an MS/MS spectrumgenerates k vertices
V(s) = {s+δ1, s+δ2, …, s+δk}
correspondingto potentialN-terminalpeptides
• Vertices of the spectrum graph:
{initialvertex} V(s1) V(s2) ... V(sm) {terminalvertex}
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Edges of the Spectrum Graph
• Two vertices with mass difference corresponding to
an amino acid A:
• Connect with an edge labeled by A
• Gap edges corresponding to the mass of pairs of
amino acids (optional)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Paths
• Paths in the labeled graph spell out amino
acid sequences
• There are many paths, how to find the correct
one?
• We need scoring to evaluate paths
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Path Score
• p(P,S) = probability that peptide P produces
spectrum S= {s1,s2,…sq}
• p(P, s) = the probability that peptide P
generates a peak s
• Scoring = computing probabilities
• p(P,S) = πsєS p(P, s)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• For a position t that represents ion type dj :
qj, if peak is generated at t
p(P,st) =
1-qj , otherwise
Peak Score
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peak Score (cont’d)
• For a position t that is not associated with an
ion type:
qR , if peak is generated at t
pR(P,st) =
1-qR , otherwise
• qR = the probability of a noisy peak that does
not correspond to any ion type
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a
peptide P’ maximizing p(P,S) over all
possible peptides P:
• Peptides = paths in the spectrum graph
• P’ = the optimal path in the spectrum graph
p(P,S)p(P',S) Pmax
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ions and Probabilities
• Tandem mass spectrometry is characterized
by a set of ion types {δ1,δ2,..,δk} and their
probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced
independently with probabilities qi
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ions and Probabilities
• A peptide has all k peaks with probability
• and no peaks with probability
• A peptide also produces a ``random noise''
with uniform probability qR in any position.
k
i
iq
1
k
i
iq
1
)1(
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions
and penalties for missing ions.
• Example: for k=4, assume that for a partial
peptide P‟ we only see ions δ1,δ2,δ4.
The score is calculated as:
RRRR q
q
q
q
q
q
q
q 4321
)1(
)1(
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Peptides
• T- set of all positions.
• Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that
represent ions of partial peptides Pi.
• A peak at position tδj is generated with
probability qj.
• R=T- U Ti - set of positions that are not
associated with any partial peptides (noise).
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Probabilistic Model
• For a position t δj Ti the probability p(t, P,S) that
peptide P produces a peak at position t.
• Similarly, for t R, the probability that P produces a
random noise peak at t is:
otherwise1
position tatgeneratedispeakaif
),,(
j
j
j
q
q
SPtP
otherwise1
position tatgeneratedispeakaif
)(
R
R
R
q
q
tP
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Probabilistic Score
• For a peptide P with n amino acids, the score
for the whole peptides is expressed by the
following ratio test:
n
i
k
j iR
i
R j
j
tp
SPtp
Sp
SPp
1 1 )(
),,(
)(
),(
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
200 400 600 800 1000 1200 1400 1600 1800 2000
m/z
0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
RelativeAbundance
850.3
687.3
588.1
851.4
425.0
949.4
326.0
524.9
589.2
1048.6
397.1226.9
1049.6
489.1
629.0
W
R
A
C
V
G
E
K
DW
L
P
T
L T
W
R
A
C
V
G
E
K
DW
L
P
T
L T
De Novo
AVGELTK
Database
Search
Database of
known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
Database of
known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• de novo algorithms are much faster, even though their
search space is much larger!
• A database search scans all peptides to find the best one.
• De novo eliminates the need to scan all peptides by
modeling the problem as a graph search.
Why not sequence de novo?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Why Not Sequence De Novo?
• De novo sequencing is not very accurate:
• Less than 30% of the peptides sequenced
were completely correct!
Algorithm Amino
Acid
Accuracy
Whole Peptide
Accuracy
Lutefisk, 1997 0.566 0.189
SHERENGA, 1999 0.690 0.289
Peaks, 2003 0.673 0.246
PepNovo, 2005 0.727 0.296
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Pros and Cons of de novo Sequencing
• Advantage:
• Gets the sequences that are not necessarily in the
database.
• An additional similarity search step using these sequences
may identify the related proteins in the database.
• Disadvantage:
• Requires higher quality spectra to be accurate.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
Output:
• A peptide whose theoretical spectrum
matches the experimental S spectrum the best
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Identification Problem
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
Output:
• A peptide from the database whose theoretical
spectrum matches the experimental S
spectrum the best
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MS/MS Database Search
Database search in mass-spectrometry has been
successful in identification of already known proteins.
Experimental spectrum can be compared with theoretical
spectra of database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But reliable algorithms for identification of modified
peptides is a much more difficult problem.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The dynamic nature of the proteome
• The proteome of the cell
is changing
• Various extra-cellular,
and other signals
activate various protein
pathways.
• A key mechanism of
protein activation is
post-translational
modification (PTM)
• These pathways may
lead to other genes
being switched on or off
• Mass spectrometry is
key to probing the
proteome and detecting
PTMs
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Post-Translational Modifications
Proteins are involved in cellular signaling and
metabolic regulation.
They are subject to a large number of biological
modifications.
Most protein sequences are post-translationally
modified and 600 types (!) of modifications of
amino acid residues are known.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Examples of Post-Translational
Modification
Post-translational modifications increase the number of “letters” in
amino acid alphabet and lead to a combinatorial explosion in both
database search and de novo approaches.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Masses
415
486
301
154
57
71
185
332
429
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Masses: Modification +16 on N
415
486
301
154
57
71
185
332
429
+16
+16
+16
+16
+16
+16
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Modified Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 154 185 301 332 415 429 486
+16 +16 +16 +16 +16
57 71 154 201 301 348 431 445 502
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Modified Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486
57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 348 387 409 415 423 431 445 472 502
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Identification Problem Revisited
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
Output:
• A peptide from the database whose theoretical
spectrum matches the experimental S
spectrum the best
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal
match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• Parameter k (# of mutations/modifications)
Output:
• A peptide that is at most k mutations/modifications apart
from a database peptide and whose theoretical spectrum
matches the experimental S spectrum the best
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Search for Modified Peptides:
Virtual Database Approach
• YFDSTDYNMAK
• 25=32 possibilities,
with
2 types of
modifications!
Phosphorylation?
Oxidation?
Yates et al.,1995: an exhaustive search in a
virtual database of all modified peptides.
Combinatorial explosion, even for a small
set of modifications types.
A larger set of spurious matches must be
filtered out. It‟s much more likely that
incorrect matches will have high scores.
Problem. Extend the virtual database
approach to a large set of modifications.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Restrictive vs Unrestrictive (Blind)
Search for Modified Peptides
• Restrictive search requires the researcher to
guess which modification types are present in
the sample
• How would you feel if you were allowed to
perform only 10 out of 20x19=380 possible
point mutations (with all other forbidden)
while aligning two proteins?
• This is exactly what the restrictive PTM
search algorithms do.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Restrictive vs Unrestrictive (Blind) Search
for Modified Peptides
• Restrictive search requires the researcher to guess
which modification types are present in the sample
• Blind search performs an unrestrictive search for all
possible modification offsets at once.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Database Search:
Sequence Analysis vs. MS/MS Analysis
Sequence analysis:
similar peptides (that a few mutations apart) have similar sequences
MS/MS analysis:
similar peptides (that a few mutations apart) have dissimilar spectra
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Identification Problem: Challenge
Very similar peptides may have very different
spectra!
Goal: Define a notion of spectral similarity that
correlates well with the sequence similarity.
If peptides are a few mutations/modifications
apart, the spectral similarity between their
spectra should be high.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Deficiency of the Shared Peaks Count
Shared peaks count (SPC): intuitive measure
of spectral similarity.
Problem: SPC diminishes very quickly as the
number of mutations increases.
Only a small portion of correlations between
the spectra of mutated peptides is captured
by SPC.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
SPC Diminishes Quickly
S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}
S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}
no mutations
SPC=10
1 mutation
SPC=5
2 mutations
SPC=2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Convolution
)0)((
))((
,
12
12
122211
22111212
:
SS
xSS
ssSsSs
}S,sS:ss{sSS
x
:peak)(SPCcountpeakssharedThe
withpairsofNumber
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Elements of S2 S1 represented as elements of a difference matrix. The
elements with multiplicity >2 are colored; the elements with multiplicity =2
are circled. The SPC takes into account only the red entries
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
1
2
3
4
5
0
-150 -100 -50 0 50 100
150
x
Spectral Convolution: An Example
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Comparison: Difficult Case
S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Which of the spectra
S’ = {10, 20, 30, 40, 50, 55, 65, 75,85, 95}
or
S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}
fits the spectrum S the best?
SPC: both S’ and S” have 5 peaks in common with S.
Spectral Convolution: reveals the peaks at 0 and 5.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Comparison: Difficult Case
S S’
S S’’
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Limitations of the Spectrum Convolutions
Spectral convolution does not reveal that
spectra S and S’ are similar, while spectra S
and S” are not.
Clumps of shared peaks: the matching
positions in S’ come in clumps while the
matching positions in S” don't.
This important property was not captured by
spectral convolution.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Sequence Alignment=Path in a Grid
A R N G A L R
A 1 1
R 1 1
N 1
G 1
Z
A 1 1
L 1
R 1 1
Finding similarities between
two peptides
is equivalent to finding an optimal path in
a Manhattan-like grid (sequence
alignment).
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Sequence Alignment=Path in a Grid
A R N G A L R
A 1 1
R 1 1
N 1
G 1
Z
A 1 1
L 1
R 1 1
Finding similarities between
two peptides
is equivalent to finding an optimal path in
a Manhattan-like grid (sequence
alignment). Every horizontal/vertical
segment in this path corresponds to
insertion/deletion of an amino acid.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Sequence Alignment=Path in a Grid
A R N G A L R
A 1 1
R 1 1
N 1
G 1
Z
A 1 1
L 1
R 1 1
Finding similarities between
two peptides
is equivalent to finding an optimal path in a
Manhattan-like grid (sequence alignment).
Every horizontal/vertical segment in this path
corresponds to insertion/deletion of an amino
acid.
Can we find similarities between
a spectrum and a peptide
using a similar approach (spectral alignment)?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Converting Spectra into 0-1 Sequences
• Convert spectrum into a 0-1 string with 1s
corresponding to the positions of the peaks.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Modified peptide
Modifications are modeled as insertion (or deletions)
of blocks of zeroes
000101001010000000110000001001 Spectrum
00010100001-----00010000001001 Peptide
A modification with positive offset - inserting a block of 0s
A modification with negative offset - deleting a block of 0s
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectra Comparison vs. String Comparison
• Comparison of theoretical and
experimental spectra (represented as 0-1
strings) corresponds to a (somewhat
unusual) edit distance/alignment
between 0-1 strings where elementary
edit operations are insertions and
deletions of blocks of 0s
• Use sequence alignment algorithms!
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment Graph
0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
0
0
0
0
1 1 1 1 1 1
0
0
0
0
1 1 1 1 1 1
0
0
1 1 1 1 1 1
0
0
0
0
1 1 1 1 1 1
0
0
0
0
1 1 1 1 1 1
Horizontal axis:
Experimental spectrum
Vertical axis:
Theoretical spectrum of a
peptide
A
B
C
D
E
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment Graph
0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
0
0
0
0
1 1 1 1 1 1
0
0
0
0
1 1 1 1 1 1
0
0
1 1 1 1 1 1
0
0
0
0
1 1 1 1 1 1
0
0
0
0
1 1 1 1 1 1
Like in alignment algorithms,
every path in the spectral
alignment graph represents a
possible interpretation of a
spectra.
A path covering maximal
number of 1s is the “best”
interpretation of the spectrum.
Vertical / horizontal segment
in the optimal path are
modifications
A
B
C
D
E
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment vs. Sequence Alignment
• Alignment graph with different alphabet and
scoring.
• Movement can be diagonal (matching
masses) or horizontal/vertical
(insertions/deletions corresponding to PTMs).
• At most k horizontal/vertical moves.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Shifts
A = {a1 < … < an} : an ordered set of natural
numbers.
A shift (i, ) is characterized by two parameters,
the position (i) and the length ( ).
The shift (i, ) transforms
{a1, …., an}
into
{a1, ….,ai-1,ai+ ,…,an+ }
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Shifts: An Example
The shift (i, ) transforms {a1, …., an}
into {a1, ….,ai-1,ai+ ,…,an+ }
e.g.
10 20 30 40 50 60 70 80 90
10 20 30 35 45 55 65 75 85
10 20 30 35 45 55 62 72 82
shift (4, -5)
shift (7,-3)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment Problem
• Find a series of k shifts that make the sets
A={a1, …., an} and B={b1,….,bn}
as similar as possible.
• k-similarity between sets
• D(k) - the maximum number of elements in
common between sets after k shifts.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Comparing Spectra=Comparing 0-1 Strings
• A modification with positive offset corresponds to
inserting a block of 0s
• A modification with negative offset corresponds to
deleting a block of 0s
• Comparison of theoretical and experimental spectra
(represented as 0-1 strings) corresponds to a
(somewhat unusual) edit distance/alignment
problem where elementary edit operations are
insertions/deletions of blocks of 0s
• Use sequence alignment algorithms!
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Product
A={a1,…., an} and B={b1,…., bn}
SpectralproductA B: two-dimensional matrix with
nm 1s corresponding to all pairs of
indices (ai,bj) and remaining
elements being 0s.
10 20 30 40 50 55 65 75 85 95
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
SPC: the number of 1s at
the main diagonal.
-shifted SPC: the number
of 1s on the diagonal (i,i+ )
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment: k-similarity
k-similarity between spectra: the maximum number
of 1s on a path through this graph that uses at most
k+1 diagonals.
k-optimal spectral
alignment = a path.
The spectral alignment
allows one to detect
more and more subtle
similarities between
spectra by increasing k.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
SPC reveals only
D(0)=3 matching
peaks.
Spectral Alignment
reveals more
hidden similarities
between spectra:
D(1)=5 and D(2)=8
and detects
corresponding
mutations.
Use of k-Similarity
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Black line represent the path for k=0
Red lines represent the path for k=1
Blue lines (right) represents the path for k=2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Convolution’ Limitation
The spectral convolution considers diagonals
separately without combining them into feasible
mutation scenarios.
D(1) =10 shift function score = 10 D(1) =6
10 20 30 40 50 55 65 75 85 95
10
20
30
40
50
60
70
80
90
100
10 15 30 35 50 55 70 75 90 95
10
20
30
40
50
60
70
80
90
100
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Dynamic Programming for Spectral Alignment
Dij(k): the maximum number of 1s on a path to
(ai,bj) that uses at most k+1 diagonals.
(i’,j’) ~ (i,j): the points are located on the same diagonal
Running time: O(n4 k)
otherwisekD
jijiifkD
kD
ji
ji
jiji
ij
,1)1(
),(~)','(,1)(
max)(
''
''
),()','(
{
)(max)( kDkD ij
ij
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Edit Graph for Fast Spectral Alignment
diag(i,j) – the position
of previous 1 on the
same diagonal as (i,j)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Fast Spectral Alignment Algorithm
1)1(
1)(
max)(
1,1
),(
kM
kD
kD
ji
jidiag
ij
)(max)( ''
),()','(
kDkM ji
jiji
ij
)(
)(
)(
max)(
1,
,1
kM
kM
kD
kM
ji
ji
ij
ij
Running time: O(n2 k)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment: Complications
Spectra are combinations of an increasing (N-
terminal ions) and a decreasing (C-terminal
ions) number series.
These series form two diagonals in the
spectral product, the main diagonal and the
perpendicular diagonal.
The described algorithm deals with the main
diagonal only.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment: Complications
• Simultaneous analysis of N- and C-terminal
ions
• Taking into account the intensities and
charges
• Analysis of minor ions
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MS/MS and East-West Traveling Salesman Problem
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
EE SE E SE D E SE D T E S
129
244
345
47487
216
317
432
E D T
ES
S
E S
T D E
E
S
Anti-symmetric paths
E D T
ES
S
E S
T D E
E
S
Anti-symmetric path problem (Tim Chen, SODA 2001) avoids reusing the same
peak twice (as both b- and y-ions) in the optimal path
Without anti-symmetric condition, the correct peptide SETDEwould be
missed and the algorithm would output a symmetric peptide:
ESESE
Why?
De-novo problem
Input: Spectrum graph
Output: Anti-symmetric longestpath
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
PTM Frequency Matrix
A C D E F G H I K L M N P Q R S T V W Y
10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 0
11 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 1
12 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 0
13 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 1
14 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 3
15 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 1
16 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 6
17 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 3
18 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 1
19 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 0
20 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 0
21 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 2
22 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 13
23 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 2
24 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 1
25 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 0
26 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 1
27 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 0
28 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 2
29 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 1
30 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 1
31 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 0
32 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 0
33 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 3
34 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 2
35 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2
50,000 spectra (IKKb sample)
were searched in blind mode,
and identifications with p-value
<0.05 were retained
Shading of the cell (x,y) reflects
the number of annotations with
modification:
(offset x, amino acid y)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
PTM Frequency Matrix
A C D E F G H I K L M N P Q R S T V W Y
10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 0
11 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 1
12 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 0
13 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 1
14 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 3
15 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 1
16 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 6
17 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 3
18 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 1
19 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 0
20 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 0
21 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 2
22 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 13
23 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 2
24 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 1
25 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 0
26 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 1
27 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 0
28 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 2
29 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 1
30 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 1
31 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 0
32 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 0
33 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 3
34 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 2
35 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2
16 on W - Oxidation
16 on M - Oxidation
14 on K - Methylation
22 - Sodium
32 on M - Double oxidation
28 on K - Dimethylation
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration in Similarity Searches
Scoring
Protein Query
SequenceAlignment – Smith-Waterman Algorithm
Sequence matches
Filtration
SequenceAlignment: – BLAST
Database
actgcgctagctacggatagctgatcc
agatcgatgccataggtagctgatcc
atgctagcttagacataaagcttgaat
cgatcgggtaacccatagctagctcg
atcgacttagacttcgattcgatcgaat
tcgatctgatctgaatatattaggtccg
atgctagctgtggtagtgatgtaaga
• BLAST filters out very few correct
matches and is almost as accurate as
Smith – Waterman algorithm.
Database
actgcgctagctacggatagctgatcc
agatcgatgccataggtagctgatcc
atgctagcttagacataaagcttgaat
cgatcgggtaacccatagctagctcg
atcgacttagacttcgattcgatcgaat
tcgatctgatctgaatatattaggtccg
atgctagctgtggtagtgatgtaaga
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration and MS/MS
Scoring
MS/MS spectrum
Peptide Sequencing – SEQUEST / Mascot
Sequence matches
Filtration
Database
MDERHILNMKLQWVCSDLPT
YWASDLENQIKRSACVMTLA
CHGGEMNGALPQWRTHLLE
RTYKMNVVGGPASSDALITG
MQSDPILLVCATRGHEWAILF
GHNLWACVNMLETAIKLEGVF
GSVLRAEKLNKAAPETYIN..
Database
MDERHILNMKLQWVCSDLPT
YWASDLENQIKRSACVMTLA
CHGGEMNGALPQWRTHLLE
RTYKMNVVGGPASSDALITG
MQSDPILLVCATRGHEWAILF
GHNLWACVNMLETAIKLEGVF
GSVLRAEKLNKAAPETYIN..
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration in MS/MS Sequencing
• Filtration in MS/MS is more difficult than in BLAST.
• Early approaches using Peptide Sequence Tags were
not able to substitute the complete database search.
• Can we design a filtration based search that can replace
the database search, and is orders of magnitude faster?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Asking the Old Question Again: Why
Not Sequence De Novo?
• De novo sequencing is still not very accurate!
Algorithm Amino Acid
Accuracy
Whole Peptide
Accuracy
Lutefisk (Taylor and Johnson, 1997). 0.566 0.189
SHERENGA (Dancik et. al., 1999). 0.690 0.289
Peaks (Ma et al., 2003). 0.673 0.246
PepNovo (Frank and Pevzner, 2005). 0.727 0.296
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
So What Can be Done with De Novo?
• Given an MS/MS spectrum:
• Can de novo predict the entire peptide sequence?
• Can de novo predict partial sequences?
• Can de novo predict a set of partial sequences, that with
high probability, contains at least one correct tag?
A Covering Set of Tags
- No!
(accuracy is less than 30%).
- No!
(accuracy is 50% for GutenTag and 80% for PepNovo )
- Yes!
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Sequence Tags
• A Peptide Sequence Tag is short substring of
a peptide.
Example: G V D L K
G V D
V D L
D L K
Tags:
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration with Peptide Sequence Tags
• Peptide sequence tags can be used as filters in
database searches.
• The Filtration: Consider only database peptides
that contain the tag (in its correct relative mass
location).
• First suggested by Mann and Wilm (1994).
• Similar concepts also used by:
• GutenTag - Tabb et. al. 2003.
• MultiTag - Sunayev et. al. 2003.
• OpenSea - Searle et. al. 2004.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Why Filter Database Candidates?
• Filtration makes genomic database searches practical
(BLAST).
• Effective filtration can greatly speed-up the process,
enabling expensive searches involving post-translational
modifications.
• Goal: generate a small set of covering tags and use them
to filter the database peptides.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag Generation - Global Tags
• Parse tags from de novo reconstruction.
• Only a small number of tags can be generated.
• If the de novo sequence is completely incorrect,
none of the tags will be correct.
W
R
A
C
V
G
E
K
DW
L
P
T
L
T
AVGELTK
TAG Prefix Mass
AVG 0.0
VGE 71.0
GEL 170.1
ELT 227.1
LTK 356.2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag Generation - Local Tags
• Extract the highest scoring
subspaths from the spectrum graph.
• Sometimes gets misled by locally
promising-looking “garden paths”.
W
R
A
C
V
G
E
K
DW
L
P
T
L T
TAG Prefix Mass
AVG 0.0
WTD 120.2
PET 211.4
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ranking Tags
• Each additional tag used to filter increases the
number of database hits and slows down the
database search.
• Tags can be ranked according to their scores,
however this ranking is not very accurate.
• It is better to determine for each tag the
“probability” that it is correct, and choose most
probable tags.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reliability of Amino Acids in Tags
• For each amino acid in a tag we want to assign a
probability that it is correct.
• Each amino acid, which corresponds to an edge in the
spectrum graph, is mapped to a feature space that consists
of the features that correlate with reliability of amino acid
prediction, e.g. score reduction due to edge removal
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Score Reduction Due to Edge Removal
• The removal of an edge corresponding to a
genuine amino acid usually leads to a reduction
in the score of the de novo path.
• However, the removal of an edge that does not
correspond to a genuine amino acid tends to
leave the score unchanged.
W
R
A
C
V
G
K
DW
L
P
T
L T
W
R
A
C
V
G
K
DW
L
P
T
L T
W
R
A
C
V
G
K
DW
L
P
T
L T
E
W
W
R
A
C
V
G
K
D
L
P
T
L T
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Probabilities of Tags
• How do we determine the probability of a
predicted tag ?
• We use the predicted probabilities of its amino
acids and follow the concept:
a chain is only as strong as its weakest link
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Experimental Results
Length 3 Length 4 Length 5
Algorithm  #tags 1 10 1 10 1 10
GlobalTag 0.80 0.94 0.73 0.87 0.66 0.80
LocalTag+ 0.75 0.96 0.70 0.90 0.57 0.80
GutenTag 0.49 0.89 0.41 0.78 0.31 0.64
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag-based Database Search
Tag filter SignificanceScore
Tag
extension
De novo
Db
55M
peptides
Candidate Peptides (700)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Matching Multiple Tags
• Matching of a sequence tag against a database is
fast
• Even matching many tags against a database is
fast
• k tags can be matched against a database in time
proportional to database size, but independent of
the number of tags.
• keyword trees (Aho-Corasick algorithm)
• Scan time can be amortized by combining scans for
many spectra all at once.
• build one keyword tree from multiple spectra
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees
Y A K
SN
N
F
F
AT
YFAK
YFNS
FNTA
…..Y F R A Y F N T A…..
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag Extension
Filter SignificanceScoreExtension
De novo
Db
55M
peptides
Candidate
Peptides
(700)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• Given:
• tag with prefix and suffix masses <mP> xyz <mS>
• match in the database
• Compute if a suffix and prefix match with allowable
modifications.
• Compute a candidate peptide with most likely
positions of modifications (attachment points).
Fast Extension
xyz
<mP>xyz<mS>
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Modified Peptides
Filter SignificanceScoreExtension
De novo
Db
55M
peptides
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring
• Input:
• Candidate peptide with attached modifications
• Spectrum
• Output:
• Score function that normalizes for length, as
variable modifications can change peptide length.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Assessing Reliability of Identifications
Filter SignificanceScoreextension
De novo
Db
55M
peptides
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Selecting Features for Separating
Correct and Incorrect Predictions
• Features:
• Score S: as computed
• Explained Intensity I: fraction of total intensity explained by
annotated peaks.
• b-y score B: fraction of b+y ions annotated
• Explained peaks P: fraction of top 25 peaks annotated.
• Each of I,S,B,P features is normalized (subtract mean
and divide by s.d.)
• Problem: separate correct and incorrect identifications
using I,S,B,P
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Separating power of features
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Separating power of features
Quality scores:
Q = wI I + wS S + wB B + wP P
The weights are chosen to minimize
the mis-classification error
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Distribution of Quality Scores
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Results on ISB data-set
• All ISB spectra were searched.
• The top match is valid for 2978 spectra (2765 for Sequest)
• InsPecT-Sequest: 644 spectra (I-S dataset)
• Sequest-InsPecT: 422 spectra (S-I dataset)
• Average explained intensity of I-S = 52%
• Average explained intensity of S-I = 28%
• Average explained intensity I S = 58%
• ~70 Met. Oxidations
• Run time is 0.7 secs. per spectrum (2.7 secs. for Sequest)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration Results
The search was done against SWISS-PROT (54Mb).
• With 10 tags of length 3:
• The filtration is 1500 more efficient.
• Less than 4% of spectra are filtered out.
• The search time per spectrum is reduced by two orders of magnitude
as compared to SEQUEST.
PTMs Tag
Length
# Tags Filtration InsPecT
Runtime
SEQUEST
Runtime
None 3 1 3.4 10-7 0.17 sec > 1 minute
3 10 1.6 10-6 0.27 sec
Phosphory
lation
3 1 5.8 10-7 0.21 sec > 2 minutes
3 10 2.7 10-6 0.38 sec
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
SPIDER: Yet Another Application of de novo
Sequencing
• Suppose you have a good MS/MS spectrum of an
elephant peptide
• Suppose you even have a good de novo
reconstruction of this spectra
• However, until elephant genome is sequenced, it is
hard to verify this de novo reconstruction
• Can you search de novo reconstruction of a peptide
from elephant against human protein database?
• SPIDER algorithm addresses this comparative
proteomics problem
Slides from Bin Ma, University of Western Ontario
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Common de novo sequencing errors
GG
N and GG
have the
same mass
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
From de novo Reconstruction to Database
Candidate through Real Sequence
• Given a sequence with errors, search for
the similar sequences in a DB.
(Seq) X: LSCFAV
(Real) Y: SLCFAV
(Match) Z: SLCF-V
sequencing error
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV mass(LS)=mass(EA)
Homology mutations
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
• If real sequence Y is unknown then the distance between de
novo candidate X and database candidate Z:
• d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) )
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
• If real sequence Y is unknown then the distance between de
novo candidate X and database candidate Z:
• d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) )
• Problem: search a database for Z that minimizes d(X,Z)
• The core problem is to compute d(X,Z) for given X and Z.
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Computing seqError(X,Y)
• Align X and Y (according to mass).
• A segment of X can be aligned to a segment of
Y only if their mass is the same!
• For each erroneous mass block (Xi,Yi), the cost is
f(Xi,Yi)=f(mass(Xi)).
• f(m) depends on how often de novo sequencing
makes errors on a segment with mass m.
• seqError(X,Y) is the sum of all f(mass(Xi)).
X
Y
Z
seqError
editDist
(Seq) X: LSCFAV
(Real) Y: EACFAV
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Computing d(X,Z)
• Dynamic Programming:
• Let D[i,j]=d(X[1..i], Z[1..j])
• We examine the last block of the alignment of
X[1..i] and Z[1..j].
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Dynamic Programming: Four Cases
• Cases A, B, C - no
de novo sequencing
errors
• Case D: de novo
sequencing error
D[i,j]=D[i,j-1]+indel D[i,j]=D[i-1,j]+indel
D[i,j]=D[i-1,j-1]+dist(X[i],Z[j]) D[i,j]=D[i’-1,j’-1]+alpha(X[i’..i],Z[j’..j])
• D[i,j]is the
minimum of the
four cases.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Computing alpha(.,.)
• alpha(X[i’..i],Z[j’..j])
= min m(y)=m(X[i’..i]) [seqError (X[i’..i],y)+editDist(y,Z[j’..j])]
= min m(y)=m[i’..i] [f(m[i’..i])+editDist(y,Z[j’..j])].
= f(m[i’..i]) + min m(y)=m[i’..i] editDist(y,Z[j’..j]).
• This is like to align a mass with a string.
• Mass-alignment Problem: Given a mass m and a
peptide P, find a peptide of mass m that is most
similar to P (among all possible peptides)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Solving Mass-Alignment Problem
])[,()])1..([),((min
)])1..([,(
])..[),((min
min])..[,(
jZydistjiZymm
indeljiZm
indeljiZymm
jiZm
y
y
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Improving the Efficiency
• Homology Match mode:
• Assumes tagging (only peptides that share a tag of
length 3 with de novo reconstruction are considered)
and extension of found hits by dynamic programming
around the hits.
• Non-gapped homology match mode:
• Sequencing error and homology mutations do not
overlap.
• Segment Match mode:
• No homology mutations.
• Exact Match mode:
• No sequencing errors and homology mutations.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Experiment Result
• The correct peptide sequence for each spectrum is known.
• The proteins are all in Swissprot but not in Human
database.
• SPIDER searches 144 spectra against both Swissprot and
human databases
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example
• Using de novo reconstruction X=CCQWDAEACAFNNPGK,
the homolog Z was found in human database. At the same
time, the correct sequence Y, was found in SwissProt
database.
Seq(X): CCQ[W ]DAEAC[AF]<NN><PG>K
Real(Y): CCK AD DAEAC FA VE GP K
Database(Z): CCK[AD]DKETC[FA]<EE><GK>K
sequencing errors
homology mutations

More Related Content

Viewers also liked

mass spectrometry for pesticides residue analysis- L4
mass spectrometry for pesticides residue analysis- L4mass spectrometry for pesticides residue analysis- L4
mass spectrometry for pesticides residue analysis- L4sherif Taha
 
Protein sequencing
Protein   sequencingProtein   sequencing
Protein sequencingStudent
 
Protein sequence determinatiom
Protein sequence determinatiomProtein sequence determinatiom
Protein sequence determinatiomdravidjanardhan
 
Rapid analysis of polymers and petroleum by Ion-mobility mass spectrometry
Rapid analysis of polymers and petroleum by Ion-mobility mass spectrometryRapid analysis of polymers and petroleum by Ion-mobility mass spectrometry
Rapid analysis of polymers and petroleum by Ion-mobility mass spectrometryWaters Corporation - Chemical Materials
 
Data Integration, Mass Spectrometry Proteomics Software Development
Data Integration, Mass Spectrometry Proteomics Software DevelopmentData Integration, Mass Spectrometry Proteomics Software Development
Data Integration, Mass Spectrometry Proteomics Software DevelopmentNeil Swainston
 
Ion mobility spectrometery
Ion mobility spectrometeryIon mobility spectrometery
Ion mobility spectrometeryAshish Sharma
 
The determination of amino acid sequences presentation autumne 2015
The determination of amino acid sequences presentation autumne 2015The determination of amino acid sequences presentation autumne 2015
The determination of amino acid sequences presentation autumne 2015Richard Trinh
 
Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...
Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...
Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...Shimadzu Scientific Instruments
 
MASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRY
MASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRYMASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRY
MASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRYErin Davis
 

Viewers also liked (19)

Analysis of Milk and Egg Allergens in Wine using UPLC-MS - Waters Corporation...
Analysis of Milk and Egg Allergens in Wine using UPLC-MS - Waters Corporation...Analysis of Milk and Egg Allergens in Wine using UPLC-MS - Waters Corporation...
Analysis of Milk and Egg Allergens in Wine using UPLC-MS - Waters Corporation...
 
mass spectrometry for pesticides residue analysis- L4
mass spectrometry for pesticides residue analysis- L4mass spectrometry for pesticides residue analysis- L4
mass spectrometry for pesticides residue analysis- L4
 
The Analysis of Allergens in Raw and Roasted Peanuts using Nanoscale UPLC & T...
The Analysis of Allergens in Raw and Roasted Peanuts using Nanoscale UPLC & T...The Analysis of Allergens in Raw and Roasted Peanuts using Nanoscale UPLC & T...
The Analysis of Allergens in Raw and Roasted Peanuts using Nanoscale UPLC & T...
 
Protein sequencing
Protein   sequencingProtein   sequencing
Protein sequencing
 
Mass spectroscopy
Mass spectroscopyMass spectroscopy
Mass spectroscopy
 
Protein sequence determinatiom
Protein sequence determinatiomProtein sequence determinatiom
Protein sequence determinatiom
 
Rapid analysis of polymers and petroleum by Ion-mobility mass spectrometry
Rapid analysis of polymers and petroleum by Ion-mobility mass spectrometryRapid analysis of polymers and petroleum by Ion-mobility mass spectrometry
Rapid analysis of polymers and petroleum by Ion-mobility mass spectrometry
 
Data Integration, Mass Spectrometry Proteomics Software Development
Data Integration, Mass Spectrometry Proteomics Software DevelopmentData Integration, Mass Spectrometry Proteomics Software Development
Data Integration, Mass Spectrometry Proteomics Software Development
 
Discovery and Analysis of Peanut Allergens using Proteomic approaches with Io...
Discovery and Analysis of Peanut Allergens using Proteomic approaches with Io...Discovery and Analysis of Peanut Allergens using Proteomic approaches with Io...
Discovery and Analysis of Peanut Allergens using Proteomic approaches with Io...
 
Ion mobility spectrometery
Ion mobility spectrometeryIon mobility spectrometery
Ion mobility spectrometery
 
Amino acid sequencing
Amino acid sequencingAmino acid sequencing
Amino acid sequencing
 
The determination of amino acid sequences presentation autumne 2015
The determination of amino acid sequences presentation autumne 2015The determination of amino acid sequences presentation autumne 2015
The determination of amino acid sequences presentation autumne 2015
 
Probiotics ppt
Probiotics pptProbiotics ppt
Probiotics ppt
 
Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...
Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...
Analysis of Trace Elements in Water by EPA Method 200.8 using ICP Mass Spectr...
 
Protein sequencing
Protein sequencingProtein sequencing
Protein sequencing
 
MASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRY
MASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRYMASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRY
MASS SPECTROMETRY IN THE FIELD OF FOOD INDUSTRY
 
Overview of Food Allergen Detection using Mass Spectrometry - Waters Corporat...
Overview of Food Allergen Detection using Mass Spectrometry - Waters Corporat...Overview of Food Allergen Detection using Mass Spectrometry - Waters Corporat...
Overview of Food Allergen Detection using Mass Spectrometry - Waters Corporat...
 
Probiotics
ProbioticsProbiotics
Probiotics
 
Analysis of Mycotoxins by UPLC and Tandem Quadrupole Mass Spectrometry - Wate...
Analysis of Mycotoxins by UPLC and Tandem Quadrupole Mass Spectrometry - Wate...Analysis of Mycotoxins by UPLC and Tandem Quadrupole Mass Spectrometry - Wate...
Analysis of Mycotoxins by UPLC and Tandem Quadrupole Mass Spectrometry - Wate...
 

Similar to Bioalgo 2012-03-massspec

Using text mining to inform genetic variant interpretation
Using text mining to inform genetic variant interpretationUsing text mining to inform genetic variant interpretation
Using text mining to inform genetic variant interpretationKarin Verspoor
 
2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekinge2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekingeProf. Wim Van Criekinge
 
2015 bioinformatics score_matrices_wim_vancriekinge
2015 bioinformatics score_matrices_wim_vancriekinge2015 bioinformatics score_matrices_wim_vancriekinge
2015 bioinformatics score_matrices_wim_vancriekingeProf. Wim Van Criekinge
 
2015 bioinformatics protein_structure_wimvancriekinge
2015 bioinformatics protein_structure_wimvancriekinge2015 bioinformatics protein_structure_wimvancriekinge
2015 bioinformatics protein_structure_wimvancriekingeProf. Wim Van Criekinge
 
2016 bioinformatics i_proteins_wim_vancriekinge
2016 bioinformatics i_proteins_wim_vancriekinge2016 bioinformatics i_proteins_wim_vancriekinge
2016 bioinformatics i_proteins_wim_vancriekingeProf. Wim Van Criekinge
 
The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule
The Next, Next Generation of Sequencing - From Semiconductor to Single MoleculeThe Next, Next Generation of Sequencing - From Semiconductor to Single Molecule
The Next, Next Generation of Sequencing - From Semiconductor to Single MoleculeJustin Johnson
 
EVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGSEVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGSAksw Group
 
The Cardiac Organellar Peptide Spectral Library
The Cardiac Organellar Peptide Spectral LibraryThe Cardiac Organellar Peptide Spectral Library
The Cardiac Organellar Peptide Spectral LibraryRafael C. Jimenez
 
Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013
Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013
Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013Prof. Wim Van Criekinge
 
Bacterial transcriptome profiling using Ion Torrent Proton™ technology
Bacterial transcriptome profiling using Ion Torrent Proton™ technologyBacterial transcriptome profiling using Ion Torrent Proton™ technology
Bacterial transcriptome profiling using Ion Torrent Proton™ technologyThermo Fisher Scientific
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentationlordjoe
 
Computational Biology thesis defense
Computational Biology thesis defenseComputational Biology thesis defense
Computational Biology thesis defensecsfunk
 
IRSAE aquatic ecology 28 June 2018 metabolomics
IRSAE aquatic ecology 28 June 2018 metabolomicsIRSAE aquatic ecology 28 June 2018 metabolomics
IRSAE aquatic ecology 28 June 2018 metabolomicsPanagiotis Arapitsas
 
A Pde Silva Slintec
A Pde Silva SlintecA Pde Silva Slintec
A Pde Silva SlintecSLINTEC
 

Similar to Bioalgo 2012-03-massspec (20)

Using text mining to inform genetic variant interpretation
Using text mining to inform genetic variant interpretationUsing text mining to inform genetic variant interpretation
Using text mining to inform genetic variant interpretation
 
Cheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural ProductsCheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural Products
 
P7 2017 biopython3
P7 2017 biopython3P7 2017 biopython3
P7 2017 biopython3
 
2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekinge2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekinge
 
2015 bioinformatics score_matrices_wim_vancriekinge
2015 bioinformatics score_matrices_wim_vancriekinge2015 bioinformatics score_matrices_wim_vancriekinge
2015 bioinformatics score_matrices_wim_vancriekinge
 
P7 2018 biopython3
P7 2018 biopython3P7 2018 biopython3
P7 2018 biopython3
 
2015 bioinformatics protein_structure_wimvancriekinge
2015 bioinformatics protein_structure_wimvancriekinge2015 bioinformatics protein_structure_wimvancriekinge
2015 bioinformatics protein_structure_wimvancriekinge
 
Using online chemistry databases to facilitate structure identification in ma...
Using online chemistry databases to facilitate structure identification in ma...Using online chemistry databases to facilitate structure identification in ma...
Using online chemistry databases to facilitate structure identification in ma...
 
2016 bioinformatics i_proteins_wim_vancriekinge
2016 bioinformatics i_proteins_wim_vancriekinge2016 bioinformatics i_proteins_wim_vancriekinge
2016 bioinformatics i_proteins_wim_vancriekinge
 
The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule
The Next, Next Generation of Sequencing - From Semiconductor to Single MoleculeThe Next, Next Generation of Sequencing - From Semiconductor to Single Molecule
The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule
 
EVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGSEVOLUTION OF ONTOLOGY-BASED MAPPINGS
EVOLUTION OF ONTOLOGY-BASED MAPPINGS
 
The Cardiac Organellar Peptide Spectral Library
The Cardiac Organellar Peptide Spectral LibraryThe Cardiac Organellar Peptide Spectral Library
The Cardiac Organellar Peptide Spectral Library
 
Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013
Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013
Bioinformatica t3-scoring matrices-wim_vancriekinge_v2013
 
Bacterial transcriptome profiling using Ion Torrent Proton™ technology
Bacterial transcriptome profiling using Ion Torrent Proton™ technologyBacterial transcriptome profiling using Ion Torrent Proton™ technology
Bacterial transcriptome profiling using Ion Torrent Proton™ technology
 
Use of spark for proteomic scoring seattle presentation
Use of spark for  proteomic scoring   seattle presentationUse of spark for  proteomic scoring   seattle presentation
Use of spark for proteomic scoring seattle presentation
 
Computational Biology thesis defense
Computational Biology thesis defenseComputational Biology thesis defense
Computational Biology thesis defense
 
Data integration and building a profile for yourself as an online scientist
Data integration and building a profile for yourself as an online scientistData integration and building a profile for yourself as an online scientist
Data integration and building a profile for yourself as an online scientist
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
IRSAE aquatic ecology 28 June 2018 metabolomics
IRSAE aquatic ecology 28 June 2018 metabolomicsIRSAE aquatic ecology 28 June 2018 metabolomics
IRSAE aquatic ecology 28 June 2018 metabolomics
 
A Pde Silva Slintec
A Pde Silva SlintecA Pde Silva Slintec
A Pde Silva Slintec
 

More from BioinformaticsInstitute

Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsBioinformaticsInstitute
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...BioinformaticsInstitute
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкBioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр ПредеусBioinformaticsInstitute
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...BioinformaticsInstitute
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...BioinformaticsInstitute
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)BioinformaticsInstitute
 

More from BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Bioalgo 2012-03-massspec

  • 1. www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms Protein Sequencing and Identification by Mass Spectrometry
  • 2. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Outline • Tandem Mass Spectrometry • De Novo Peptide Sequencing • Spectrum Graph • Protein Identificationvia Database Search • IdentifyingPost Translationally Modified Peptides • Spectral Convolution • Spectral Alignment
  • 3. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Different Amino Acid Have Different Masses H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 AA residuei-1 AA residuei AA residuei+1 N-terminus C-terminus
  • 4. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Fragmentation • Peptides tend to fragment along the backbone. • Mass spectrometer is a sophisticated (and rather expensive!) scale to measure the masses of these fragments H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 H+ Prefix Fragment Suffix Fragment Collision Induced Dissociation
  • 5. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Breaking Protein into Peptides and Peptides into Fragment Ions • Most mass spectrometers can only measure masses of short peptides (e.g., 20 amino acids or less) rather than masses of entire proteins (usually hundreds of amino acids). That‟s why: • Proteases, e.g. trypsin, break protein into short peptides. • A Tandem Mass Spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece. • Mass Spectrometer accelerates the fragmented ions; heavier ions accelerate slower than lighter ones. • Mass Spectrometer measure mass/charge ratio of an ion.
  • 6. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info N- and C-terminal Peptides
  • 7. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Terminal peptides and ion types Peptide Mass (D) 57 + 97 + 147 + 114 = 415
  • 8. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Masses of fragment ions Peptide Mass (D) 57 + 97 + 147 + 114 = 415 Peptide Mass (D) 57 + 97 + 147 + 114 – 18 = 397 without
  • 9. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info N- and C-terminal Peptides 415 486 301 154 57 71 185 332 429
  • 10. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info N- and C-terminal Peptides 415 486 301 154 57 71 185 332 429
  • 11. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Theoretical Spectrum 415 486 301 154 57 71 185 332 429 Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 154 185 301 332 415 429 486
  • 12. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reconstructing Peptides Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 154 185 301 332 415 429 486
  • 13. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reconstructing Peptides Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486154
  • 14. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reconstructing Peptides Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 370 387 409 415 423 429 460 472 486
  • 15. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Fragmentation y3 b2 y2 y1 b3a2 a3 HO NH3 + | | R1 O R2 O R3 O R4 | || | || | || | H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH | | | | | | | H H H H H H H b2-H2O y3 -H2O b3- NH3 y2 - NH3
  • 16. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Mass Spectra G V D L K mass 0 57 Da = „G‟ 99 Da = „V‟ LK D V G • The peaks in the mass spectrum: • Prefix • Fragments with neutral losses (-H2O, -NH3) • Noise and missing peaks. and Suffix Fragments. D H2O
  • 17. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Protein Identification with MS/MS mass 0 Intensity mass 0 MS/MS Peptide Identification Protein database
  • 18. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tandem Mass Spectrum • Tandem Mass Spectrometry mainly generates N- and C-terminal fragment ions • Chemical noise often complicates the spectrum. • Represented in 2-D: mass/charge axis vs. intensity axis S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 850.3 687.3 588.1 851.4 425.0 949.4 326.0 524.9 589.2 1048.6 397.1226.9 1049.6 489.1 629.0
  • 19. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tandem Mass-Spectrometry
  • 20. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Breaking Proteins into Peptides peptides MPSER …… GTDIMR PAKID …… HPLC To MS/MSMPSERGTDIMRPAKID...... protein
  • 21. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Mass Spectrometry Matrix-Assisted Laser Desorption/Ionization (MALDI) From lectures by Vineet Bafna (UCSD)
  • 22. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tandem Mass Spectrometry RT: 0.01- 80.02 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 Time(min) 0 10 20 30 40 50 60 70 80 90 100 RelativeAbundance 1389 1991 1409 2149 1615 1621 1411 2147 1611 19951655 1593 1387 21551435 1987 2001 2177 1445 1661 1937 2205 1779 2135 2017 1313 22071307 2329 1105 17071095 2331 NL: 1.52E8 BasePeakF:+ cFullms[ 300.00 - 2000.00] S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 850.3 687.3 588.1 851.4 425.0 949.4 326.0 524.9 589.2 1048.6 397.1226.9 1049.6 489.1 629.0 Scan 1708 LC S#: 1707 RT: 54.44 AV: 1 NL: 2.41E7 F: + c Full ms [ 300.00 - 2000.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 638.0 801.0 638.9 1173.8 872.3 1275.3 687.6 944.7 1884.51742.11212.0783.3 1048.3 1413.9 1617.7 Scan 1707 MS MS/MS Ion Source MS-1 collision cell MS-2
  • 23. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Protein Identification by Tandem Mass Spectrometry (MS/MS) S e q u e n c e S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 850.3 687.3 588.1 851.4 425.0 949.4 326.0 524.9 589.2 1048.6 397.1226.9 1049.6 489.1 629.0 MS/MS instrument database search •Sequest, Mascot, etc de novo interpretation •Lutefisk, Peaks, etc
  • 24. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info De Novo vs. Database Search S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 850.3 687.3 588.1 851.4 425.0 949.4 326.0 524.9 589.2 1048.6 397.1226.9 1049.6 489.1 629.0 W R A C V G E K DW L P T L T W R A C V G E K DW L P T L T De Novo AVGELTK Database Search Database of all peptides = 20n AAAAAAAA,AAAAAAAC,AAAAAAAD,AAAAAAAE, AAAAAAAG,AAAAAAAF,AAAAAAAH,AAAAAAI, AVGELTI, AVGELTK , AVGELTL, AVGELTM, YYYYYYYS,YYYYYYYT,YYYYYYYV,YYYYYYYY Database of known peptides MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. Database of known peptides MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. Mass, Score
  • 25. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info De Novo vs. Database Search: A Paradox • The database of all peptides is huge ≈ 20n peptides of length n • The database of all known peptides is much smaller ≈ 108 peptides • However, de novo algorithms can be much faster, even though their search space is much larger! • A database search scans all peptides in the database of all known peptides to find best one. • De novo eliminates the need to scan database of all peptides by modeling the problem as a graph search.
  • 26. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Three Algorithmic Problems • Searching for a million words in a text. Suppose it takes 1 sec to find a word in a text. How much time would it take to find 1 million words in the text? • Searching for a word without even looking at 99.999% of the text. Suppose you search for a word in a text. Would it be possible to ignore 99.999% of the text, scan only the remaining part and guarantee that the word you are looking for will be found? • Finding spelling errors in a book written in an unknown language. Given a book (in an unknown language) and a misspelled word (with insertions, deletions, and substitutions of letters) correct spelling errors in the word.
  • 27. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Three Algorithmic Problems • Searching for a million words in a text. Suppose it takes 1 sec to find a word in a text. How much time would it take to find 1 million words in the text? 1 millionseconds? • Searching for a word without even looking at 99.999% of the text. Suppose you search for a word in a text. Would it be possible to ignore 99.999% of the text, scan only the remaining part and guarantee that the word you are looking for will be found? • Finding spelling errors in a book written in an unknown language. Given a book (in an unknown language) and a misspelled word (with insertions, deletions, and substitutions of letters) correct spelling errors in the word.
  • 28. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Genomics: Problems Solved. • Searching for a million words in a text. Aho-Corasik algorithm takes roughly the same time with a million words as it takes with a single word. • Searching for a word without even looking at 99.999% of the text. Filtration algorithms (like FASTA or BLAST) ignore 99.999% of the text. • Finding spelling errors. Sequence alignment algorithms (like Smith- Waterman) do it in quadratic time
  • 29. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Proteomics: Three Problems • Comparing a million spectra against a database. Suppose it takes 1 sec to interpret a spectrum. How much time would it take to interpret 1 million spectra? • Mass-spectrometry database search without even looking at 99.999% of the database. Suppose you compare a spectrum against a database. Would it be possible to ignore 99.999% of the database, scan only the remaining part and guarantee that you still can identify a peptide of interest? • Blind PTM search and discovery of new PTM types. Given a spectrum of a peptide with unknown PTM types, find this peptide in the database. Discover new PTM types by data mining of large MS/MS datasets.
  • 30. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Three Solutions • Comparing a million spectra against a database. InsPecT (Tanner et al., Anal. Chem, 2005) • MS/MS database search without even looking at 99.999% of the database. InsPecT (Tanner et al., Anal. Chem, 2005) • Blind PTM search and discovery of new PTM types. Given a spectrum of a peptide with unknown PTM types, find this peptide in the database. Discover new PTM types by data mining of large MS/MS datasets. MS-Alignment (Tsur et al., Nature Biotech., 2005)
  • 31. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Filtration: Combining De Novo Sequencing and Database Search in Mass-Spectrometry • So far de novo and database search were presented as two separate techniques • Database search is rather slow: many labs generate more than 100,000 spectra per day. SEQUEST takes approximately 1 minute to compare a single spectrum against SWISS-PROT (54Mb) on a desktop. • It will take SEQUEST more than 2 months to analyze the MS/MS data produced in a single day. • Can slow database search be combined with fast de novo analysis?
  • 32. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info De novo Peptide Sequencing S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 850.3 687.3 588.1 851.4 425.0 949.4 326.0 524.9 589.2 1048.6 397.1226.9 1049.6 489.1 629.0 Sequence
  • 33. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Building Spectrum Graph • How to create vertices (from masses) • How to create edges (from mass differences) • How to score vertices • How to score paths • How to find the best path
  • 34. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info S E Q U E N C E b-ions (prefix or N-terminal ions) Mass/Charge (M/Z)
  • 35. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info a-ions = b-ions - CO = b-ions - 28 Mass/Charge (M/Z) S E Q U E N C E
  • 36. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info S E Q U E N C E Mass/Charge (M/Z) Shifting Peaks: a-ions = b-ions - CO = b-ions - 28
  • 37. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info y-ions (suffix of C-terminal ions) Mass/Charge (M/Z) E C N E U Q E S
  • 38. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Mass/Charge (M/Z) Intensity
  • 39. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Mass/Charge (M/Z) Intensity
  • 40. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info noise Mass/Charge (M/Z)
  • 41. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info MS/MS Spectrum Mass/Charge (M/z) Intensity
  • 42. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Some Mass Differences between Peaks Correspond to Amino Acids s s s e e e e e e e e q q qu u u n n n e c c c
  • 43. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Some Mass Differences between Peaks Correspond to Amino Acids s s s e e e e e e e e q q qu u u n n n e c c c
  • 44. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Ion Types • Some masses correspond to fragment ions, others are just random noise • Knowing ion types Δ={δ1, δ2,…, δk} allows one to distinguish fragment ions from noise • We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.
  • 45. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Example of Ion Type • Δ={δ1, δ2,…, δk} • Ion types {b, b-NH3, b-H2O, b-CO} correspond to Δ={0, 17, 18, 28} *Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
  • 46. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Match between Spectra and the Shared Peak Count • The match between two spectra is the number of masses (peaks) they share (Shared Peak Count or SPC) • In practice mass-spectrometrists use the weighted SPC that reflects intensities of the peaks • Match between experimental and theoretical spectra is defined similarly
  • 47. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Sequencing Problem Goal: Find a peptide with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • Δ: set of possible ion types Output: • A peptide whose theoretical spectrum matches the experimental spectrum the best
  • 48. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info S E Q U E N C E Mass/Charge (M/Z) Shifting Peaks: a-ions = b-ions - CO = b-ions - 28
  • 49. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reverse Shifts Shift in H2O+NH3 Shift in H2O
  • 50. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Vertices of the Spectrum Graph • Masses of potential N-terminalpeptides • Vertices aregenerated by reverseshifts correspondingto ion types Δ={δ1, δ2,…, δk} • Every N-terminalpeptidecan generateup to k ions m-δ1, m-δ2, …, m-δk • Every mass s in an MS/MS spectrumgenerates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} correspondingto potentialN-terminalpeptides • Vertices of the spectrum graph: {initialvertex} V(s1) V(s2) ... V(sm) {terminalvertex}
  • 51. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Edges of the Spectrum Graph • Two vertices with mass difference corresponding to an amino acid A: • Connect with an edge labeled by A • Gap edges corresponding to the mass of pairs of amino acids (optional)
  • 52. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Paths • Paths in the labeled graph spell out amino acid sequences • There are many paths, how to find the correct one? • We need scoring to evaluate paths
  • 53. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Path Score • p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq} • p(P, s) = the probability that peptide P generates a peak s • Scoring = computing probabilities • p(P,S) = πsєS p(P, s)
  • 54. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info • For a position t that represents ion type dj : qj, if peak is generated at t p(P,st) = 1-qj , otherwise Peak Score
  • 55. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peak Score (cont’d) • For a position t that is not associated with an ion type: qR , if peak is generated at t pR(P,st) = 1-qR , otherwise • qR = the probability of a noisy peak that does not correspond to any ion type
  • 56. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Finding Optimal Paths in the Spectrum Graph • For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P: • Peptides = paths in the spectrum graph • P’ = the optimal path in the spectrum graph p(P,S)p(P',S) Pmax
  • 57. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Ions and Probabilities • Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk} • δi-ions of a partial peptide are produced independently with probabilities qi
  • 58. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Ions and Probabilities • A peptide has all k peaks with probability • and no peaks with probability • A peptide also produces a ``random noise'' with uniform probability qR in any position. k i iq 1 k i iq 1 )1(
  • 59. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Ratio Test Scoring for Partial Peptides • Incorporates premiums for observed ions and penalties for missing ions. • Example: for k=4, assume that for a partial peptide P‟ we only see ions δ1,δ2,δ4. The score is calculated as: RRRR q q q q q q q q 4321 )1( )1(
  • 60. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Scoring Peptides • T- set of all positions. • Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that represent ions of partial peptides Pi. • A peak at position tδj is generated with probability qj. • R=T- U Ti - set of positions that are not associated with any partial peptides (noise).
  • 61. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Probabilistic Model • For a position t δj Ti the probability p(t, P,S) that peptide P produces a peak at position t. • Similarly, for t R, the probability that P produces a random noise peak at t is: otherwise1 position tatgeneratedispeakaif ),,( j j j q q SPtP otherwise1 position tatgeneratedispeakaif )( R R R q q tP
  • 62. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Probabilistic Score • For a peptide P with n amino acids, the score for the whole peptides is expressed by the following ratio test: n i k j iR i R j j tp SPtp Sp SPp 1 1 )( ),,( )( ),(
  • 63. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info De Novo vs. Database Search S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 200 400 600 800 1000 1200 1400 1600 1800 2000 m/z 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 RelativeAbundance 850.3 687.3 588.1 851.4 425.0 949.4 326.0 524.9 589.2 1048.6 397.1226.9 1049.6 489.1 629.0 W R A C V G E K DW L P T L T W R A C V G E K DW L P T L T De Novo AVGELTK Database Search Database of known peptides MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. Database of known peptides MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
  • 64. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info De Novo vs. Database Search: A Paradox • de novo algorithms are much faster, even though their search space is much larger! • A database search scans all peptides to find the best one. • De novo eliminates the need to scan all peptides by modeling the problem as a graph search. Why not sequence de novo?
  • 65. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Why Not Sequence De Novo? • De novo sequencing is not very accurate: • Less than 30% of the peptides sequenced were completely correct! Algorithm Amino Acid Accuracy Whole Peptide Accuracy Lutefisk, 1997 0.566 0.189 SHERENGA, 1999 0.690 0.289 Peaks, 2003 0.673 0.246 PepNovo, 2005 0.727 0.296
  • 66. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Pros and Cons of de novo Sequencing • Advantage: • Gets the sequences that are not necessarily in the database. • An additional similarity search step using these sequences may identify the related proteins in the database. • Disadvantage: • Requires higher quality spectra to be accurate.
  • 67. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Sequencing Problem Goal: Find a peptide with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • Δ: set of possible ion types Output: • A peptide whose theoretical spectrum matches the experimental S spectrum the best
  • 68. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Identification Problem Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • database of peptides • Δ: set of possible ion types Output: • A peptide from the database whose theoretical spectrum matches the experimental S spectrum the best
  • 69. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info MS/MS Database Search Database search in mass-spectrometry has been successful in identification of already known proteins. Experimental spectrum can be compared with theoretical spectra of database peptides to find the best fit. SEQUEST (Yates et al., 1995) But reliable algorithms for identification of modified peptides is a much more difficult problem.
  • 70. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info The dynamic nature of the proteome • The proteome of the cell is changing • Various extra-cellular, and other signals activate various protein pathways. • A key mechanism of protein activation is post-translational modification (PTM) • These pathways may lead to other genes being switched on or off • Mass spectrometry is key to probing the proteome and detecting PTMs
  • 71. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Post-Translational Modifications Proteins are involved in cellular signaling and metabolic regulation. They are subject to a large number of biological modifications. Most protein sequences are post-translationally modified and 600 types (!) of modifications of amino acid residues are known.
  • 72. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Examples of Post-Translational Modification Post-translational modifications increase the number of “letters” in amino acid alphabet and lead to a combinatorial explosion in both database search and de novo approaches.
  • 73. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Masses 415 486 301 154 57 71 185 332 429
  • 74. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Masses: Modification +16 on N 415 486 301 154 57 71 185 332 429 +16 +16 +16 +16 +16 +16
  • 75. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reconstructing Modified Peptides Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 154 185 301 332 415 429 486 +16 +16 +16 +16 +16 57 71 154 201 301 348 431 445 502
  • 76. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reconstructing Modified Peptides Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486 57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 348 387 409 415 423 431 445 472 502
  • 77. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Identification Problem Revisited Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • database of peptides • Δ: set of possible ion types Output: • A peptide from the database whose theoretical spectrum matches the experimental S spectrum the best
  • 78. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Modified Peptide Identification Problem Goal: Find a modified peptide from the database with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • database of peptides • Δ: set of possible ion types • Parameter k (# of mutations/modifications) Output: • A peptide that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum matches the experimental S spectrum the best
  • 79. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Search for Modified Peptides: Virtual Database Approach • YFDSTDYNMAK • 25=32 possibilities, with 2 types of modifications! Phosphorylation? Oxidation? Yates et al.,1995: an exhaustive search in a virtual database of all modified peptides. Combinatorial explosion, even for a small set of modifications types. A larger set of spurious matches must be filtered out. It‟s much more likely that incorrect matches will have high scores. Problem. Extend the virtual database approach to a large set of modifications.
  • 80. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Restrictive vs Unrestrictive (Blind) Search for Modified Peptides • Restrictive search requires the researcher to guess which modification types are present in the sample • How would you feel if you were allowed to perform only 10 out of 20x19=380 possible point mutations (with all other forbidden) while aligning two proteins? • This is exactly what the restrictive PTM search algorithms do.
  • 81. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Restrictive vs Unrestrictive (Blind) Search for Modified Peptides • Restrictive search requires the researcher to guess which modification types are present in the sample • Blind search performs an unrestrictive search for all possible modification offsets at once.
  • 82. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Database Search: Sequence Analysis vs. MS/MS Analysis Sequence analysis: similar peptides (that a few mutations apart) have similar sequences MS/MS analysis: similar peptides (that a few mutations apart) have dissimilar spectra
  • 83. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Identification Problem: Challenge Very similar peptides may have very different spectra! Goal: Define a notion of spectral similarity that correlates well with the sequence similarity. If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.
  • 84. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Deficiency of the Shared Peaks Count Shared peaks count (SPC): intuitive measure of spectral similarity. Problem: SPC diminishes very quickly as the number of mutations increases. Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.
  • 85. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info SPC Diminishes Quickly S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632} S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682} S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583} no mutations SPC=10 1 mutation SPC=5 2 mutations SPC=2
  • 86. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Convolution )0)(( ))(( , 12 12 122211 22111212 : SS xSS ssSsSs }S,sS:ss{sSS x :peak)(SPCcountpeakssharedThe withpairsofNumber
  • 87. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Elements of S2 S1 represented as elements of a difference matrix. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries
  • 88. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info 1 2 3 4 5 0 -150 -100 -50 0 50 100 150 x Spectral Convolution: An Example
  • 89. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Comparison: Difficult Case S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} Which of the spectra S’ = {10, 20, 30, 40, 50, 55, 65, 75,85, 95} or S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95} fits the spectrum S the best? SPC: both S’ and S” have 5 peaks in common with S. Spectral Convolution: reveals the peaks at 0 and 5.
  • 90. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Comparison: Difficult Case S S’ S S’’
  • 91. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Limitations of the Spectrum Convolutions Spectral convolution does not reveal that spectra S and S’ are similar, while spectra S and S” are not. Clumps of shared peaks: the matching positions in S’ come in clumps while the matching positions in S” don't. This important property was not captured by spectral convolution.
  • 92. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Sequence Alignment=Path in a Grid A R N G A L R A 1 1 R 1 1 N 1 G 1 Z A 1 1 L 1 R 1 1 Finding similarities between two peptides is equivalent to finding an optimal path in a Manhattan-like grid (sequence alignment).
  • 93. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Sequence Alignment=Path in a Grid A R N G A L R A 1 1 R 1 1 N 1 G 1 Z A 1 1 L 1 R 1 1 Finding similarities between two peptides is equivalent to finding an optimal path in a Manhattan-like grid (sequence alignment). Every horizontal/vertical segment in this path corresponds to insertion/deletion of an amino acid.
  • 94. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Sequence Alignment=Path in a Grid A R N G A L R A 1 1 R 1 1 N 1 G 1 Z A 1 1 L 1 R 1 1 Finding similarities between two peptides is equivalent to finding an optimal path in a Manhattan-like grid (sequence alignment). Every horizontal/vertical segment in this path corresponds to insertion/deletion of an amino acid. Can we find similarities between a spectrum and a peptide using a similar approach (spectral alignment)?
  • 95. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Converting Spectra into 0-1 Sequences • Convert spectrum into a 0-1 string with 1s corresponding to the positions of the peaks.
  • 96. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Modified peptide Modifications are modeled as insertion (or deletions) of blocks of zeroes 000101001010000000110000001001 Spectrum 00010100001-----00010000001001 Peptide A modification with positive offset - inserting a block of 0s A modification with negative offset - deleting a block of 0s
  • 97. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectra Comparison vs. String Comparison • Comparison of theoretical and experimental spectra (represented as 0-1 strings) corresponds to a (somewhat unusual) edit distance/alignment between 0-1 strings where elementary edit operations are insertions and deletions of blocks of 0s • Use sequence alignment algorithms!
  • 98. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment Graph 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 Horizontal axis: Experimental spectrum Vertical axis: Theoretical spectrum of a peptide A B C D E
  • 99. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment Graph 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 Like in alignment algorithms, every path in the spectral alignment graph represents a possible interpretation of a spectra. A path covering maximal number of 1s is the “best” interpretation of the spectrum. Vertical / horizontal segment in the optimal path are modifications A B C D E
  • 100. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment vs. Sequence Alignment • Alignment graph with different alphabet and scoring. • Movement can be diagonal (matching masses) or horizontal/vertical (insertions/deletions corresponding to PTMs). • At most k horizontal/vertical moves.
  • 101. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Shifts A = {a1 < … < an} : an ordered set of natural numbers. A shift (i, ) is characterized by two parameters, the position (i) and the length ( ). The shift (i, ) transforms {a1, …., an} into {a1, ….,ai-1,ai+ ,…,an+ }
  • 102. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Shifts: An Example The shift (i, ) transforms {a1, …., an} into {a1, ….,ai-1,ai+ ,…,an+ } e.g. 10 20 30 40 50 60 70 80 90 10 20 30 35 45 55 65 75 85 10 20 30 35 45 55 62 72 82 shift (4, -5) shift (7,-3)
  • 103. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment Problem • Find a series of k shifts that make the sets A={a1, …., an} and B={b1,….,bn} as similar as possible. • k-similarity between sets • D(k) - the maximum number of elements in common between sets after k shifts.
  • 104. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Comparing Spectra=Comparing 0-1 Strings • A modification with positive offset corresponds to inserting a block of 0s • A modification with negative offset corresponds to deleting a block of 0s • Comparison of theoretical and experimental spectra (represented as 0-1 strings) corresponds to a (somewhat unusual) edit distance/alignment problem where elementary edit operations are insertions/deletions of blocks of 0s • Use sequence alignment algorithms!
  • 105. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Product A={a1,…., an} and B={b1,…., bn} SpectralproductA B: two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai,bj) and remaining elements being 0s. 10 20 30 40 50 55 65 75 85 95 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 SPC: the number of 1s at the main diagonal. -shifted SPC: the number of 1s on the diagonal (i,i+ )
  • 106. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment: k-similarity k-similarity between spectra: the maximum number of 1s on a path through this graph that uses at most k+1 diagonals. k-optimal spectral alignment = a path. The spectral alignment allows one to detect more and more subtle similarities between spectra by increasing k.
  • 107. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info SPC reveals only D(0)=3 matching peaks. Spectral Alignment reveals more hidden similarities between spectra: D(1)=5 and D(2)=8 and detects corresponding mutations. Use of k-Similarity
  • 108. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Black line represent the path for k=0 Red lines represent the path for k=1 Blue lines (right) represents the path for k=2
  • 109. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Convolution’ Limitation The spectral convolution considers diagonals separately without combining them into feasible mutation scenarios. D(1) =10 shift function score = 10 D(1) =6 10 20 30 40 50 55 65 75 85 95 10 20 30 40 50 60 70 80 90 100 10 15 30 35 50 55 70 75 90 95 10 20 30 40 50 60 70 80 90 100
  • 110. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Dynamic Programming for Spectral Alignment Dij(k): the maximum number of 1s on a path to (ai,bj) that uses at most k+1 diagonals. (i’,j’) ~ (i,j): the points are located on the same diagonal Running time: O(n4 k) otherwisekD jijiifkD kD ji ji jiji ij ,1)1( ),(~)','(,1)( max)( '' '' ),()','( { )(max)( kDkD ij ij
  • 111. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Edit Graph for Fast Spectral Alignment diag(i,j) – the position of previous 1 on the same diagonal as (i,j)
  • 112. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Fast Spectral Alignment Algorithm 1)1( 1)( max)( 1,1 ),( kM kD kD ji jidiag ij )(max)( '' ),()','( kDkM ji jiji ij )( )( )( max)( 1, ,1 kM kM kD kM ji ji ij ij Running time: O(n2 k)
  • 113. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment: Complications Spectra are combinations of an increasing (N- terminal ions) and a decreasing (C-terminal ions) number series. These series form two diagonals in the spectral product, the main diagonal and the perpendicular diagonal. The described algorithm deals with the main diagonal only.
  • 114. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spectral Alignment: Complications • Simultaneous analysis of N- and C-terminal ions • Taking into account the intensities and charges • Analysis of minor ions
  • 115. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info MS/MS and East-West Traveling Salesman Problem
  • 116. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info EE SE E SE D E SE D T E S 129 244 345 47487 216 317 432 E D T ES S E S T D E E S Anti-symmetric paths E D T ES S E S T D E E S Anti-symmetric path problem (Tim Chen, SODA 2001) avoids reusing the same peak twice (as both b- and y-ions) in the optimal path Without anti-symmetric condition, the correct peptide SETDEwould be missed and the algorithm would output a symmetric peptide: ESESE Why? De-novo problem Input: Spectrum graph Output: Anti-symmetric longestpath
  • 117. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info PTM Frequency Matrix A C D E F G H I K L M N P Q R S T V W Y 10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 0 11 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 1 12 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 0 13 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 1 14 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 3 15 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 1 16 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 6 17 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 3 18 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 1 19 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 0 20 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 0 21 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 2 22 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 13 23 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 2 24 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 1 25 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 0 26 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 1 27 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 0 28 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 2 29 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 1 30 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 1 31 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 0 32 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 0 33 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 3 34 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 2 35 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2 50,000 spectra (IKKb sample) were searched in blind mode, and identifications with p-value <0.05 were retained Shading of the cell (x,y) reflects the number of annotations with modification: (offset x, amino acid y)
  • 118. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info PTM Frequency Matrix A C D E F G H I K L M N P Q R S T V W Y 10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 0 11 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 1 12 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 0 13 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 1 14 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 3 15 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 1 16 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 6 17 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 3 18 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 1 19 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 0 20 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 0 21 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 2 22 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 13 23 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 2 24 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 1 25 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 0 26 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 1 27 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 0 28 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 2 29 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 1 30 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 1 31 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 0 32 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 0 33 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 3 34 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 2 35 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2 16 on W - Oxidation 16 on M - Oxidation 14 on K - Methylation 22 - Sodium 32 on M - Double oxidation 28 on K - Dimethylation
  • 119. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Filtration in Similarity Searches Scoring Protein Query SequenceAlignment – Smith-Waterman Algorithm Sequence matches Filtration SequenceAlignment: – BLAST Database actgcgctagctacggatagctgatcc agatcgatgccataggtagctgatcc atgctagcttagacataaagcttgaat cgatcgggtaacccatagctagctcg atcgacttagacttcgattcgatcgaat tcgatctgatctgaatatattaggtccg atgctagctgtggtagtgatgtaaga • BLAST filters out very few correct matches and is almost as accurate as Smith – Waterman algorithm. Database actgcgctagctacggatagctgatcc agatcgatgccataggtagctgatcc atgctagcttagacataaagcttgaat cgatcgggtaacccatagctagctcg atcgacttagacttcgattcgatcgaat tcgatctgatctgaatatattaggtccg atgctagctgtggtagtgatgtaaga
  • 120. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Filtration and MS/MS Scoring MS/MS spectrum Peptide Sequencing – SEQUEST / Mascot Sequence matches Filtration Database MDERHILNMKLQWVCSDLPT YWASDLENQIKRSACVMTLA CHGGEMNGALPQWRTHLLE RTYKMNVVGGPASSDALITG MQSDPILLVCATRGHEWAILF GHNLWACVNMLETAIKLEGVF GSVLRAEKLNKAAPETYIN.. Database MDERHILNMKLQWVCSDLPT YWASDLENQIKRSACVMTLA CHGGEMNGALPQWRTHLLE RTYKMNVVGGPASSDALITG MQSDPILLVCATRGHEWAILF GHNLWACVNMLETAIKLEGVF GSVLRAEKLNKAAPETYIN..
  • 121. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Filtration in MS/MS Sequencing • Filtration in MS/MS is more difficult than in BLAST. • Early approaches using Peptide Sequence Tags were not able to substitute the complete database search. • Can we design a filtration based search that can replace the database search, and is orders of magnitude faster?
  • 122. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Asking the Old Question Again: Why Not Sequence De Novo? • De novo sequencing is still not very accurate! Algorithm Amino Acid Accuracy Whole Peptide Accuracy Lutefisk (Taylor and Johnson, 1997). 0.566 0.189 SHERENGA (Dancik et. al., 1999). 0.690 0.289 Peaks (Ma et al., 2003). 0.673 0.246 PepNovo (Frank and Pevzner, 2005). 0.727 0.296
  • 123. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info So What Can be Done with De Novo? • Given an MS/MS spectrum: • Can de novo predict the entire peptide sequence? • Can de novo predict partial sequences? • Can de novo predict a set of partial sequences, that with high probability, contains at least one correct tag? A Covering Set of Tags - No! (accuracy is less than 30%). - No! (accuracy is 50% for GutenTag and 80% for PepNovo ) - Yes!
  • 124. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Peptide Sequence Tags • A Peptide Sequence Tag is short substring of a peptide. Example: G V D L K G V D V D L D L K Tags:
  • 125. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Filtration with Peptide Sequence Tags • Peptide sequence tags can be used as filters in database searches. • The Filtration: Consider only database peptides that contain the tag (in its correct relative mass location). • First suggested by Mann and Wilm (1994). • Similar concepts also used by: • GutenTag - Tabb et. al. 2003. • MultiTag - Sunayev et. al. 2003. • OpenSea - Searle et. al. 2004.
  • 126. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Why Filter Database Candidates? • Filtration makes genomic database searches practical (BLAST). • Effective filtration can greatly speed-up the process, enabling expensive searches involving post-translational modifications. • Goal: generate a small set of covering tags and use them to filter the database peptides.
  • 127. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tag Generation - Global Tags • Parse tags from de novo reconstruction. • Only a small number of tags can be generated. • If the de novo sequence is completely incorrect, none of the tags will be correct. W R A C V G E K DW L P T L T AVGELTK TAG Prefix Mass AVG 0.0 VGE 71.0 GEL 170.1 ELT 227.1 LTK 356.2
  • 128. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tag Generation - Local Tags • Extract the highest scoring subspaths from the spectrum graph. • Sometimes gets misled by locally promising-looking “garden paths”. W R A C V G E K DW L P T L T TAG Prefix Mass AVG 0.0 WTD 120.2 PET 211.4
  • 129. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Ranking Tags • Each additional tag used to filter increases the number of database hits and slows down the database search. • Tags can be ranked according to their scores, however this ranking is not very accurate. • It is better to determine for each tag the “probability” that it is correct, and choose most probable tags.
  • 130. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reliability of Amino Acids in Tags • For each amino acid in a tag we want to assign a probability that it is correct. • Each amino acid, which corresponds to an edge in the spectrum graph, is mapped to a feature space that consists of the features that correlate with reliability of amino acid prediction, e.g. score reduction due to edge removal
  • 131. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Score Reduction Due to Edge Removal • The removal of an edge corresponding to a genuine amino acid usually leads to a reduction in the score of the de novo path. • However, the removal of an edge that does not correspond to a genuine amino acid tends to leave the score unchanged. W R A C V G K DW L P T L T W R A C V G K DW L P T L T W R A C V G K DW L P T L T E W W R A C V G K D L P T L T
  • 132. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Probabilities of Tags • How do we determine the probability of a predicted tag ? • We use the predicted probabilities of its amino acids and follow the concept: a chain is only as strong as its weakest link
  • 133. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Experimental Results Length 3 Length 4 Length 5 Algorithm #tags 1 10 1 10 1 10 GlobalTag 0.80 0.94 0.73 0.87 0.66 0.80 LocalTag+ 0.75 0.96 0.70 0.90 0.57 0.80 GutenTag 0.49 0.89 0.41 0.78 0.31 0.64
  • 134. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tag-based Database Search Tag filter SignificanceScore Tag extension De novo Db 55M peptides Candidate Peptides (700)
  • 135. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Matching Multiple Tags • Matching of a sequence tag against a database is fast • Even matching many tags against a database is fast • k tags can be matched against a database in time proportional to database size, but independent of the number of tags. • keyword trees (Aho-Corasick algorithm) • Scan time can be amortized by combining scans for many spectra all at once. • build one keyword tree from multiple spectra
  • 136. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Keyword Trees Y A K SN N F F AT YFAK YFNS FNTA …..Y F R A Y F N T A…..
  • 137. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Tag Extension Filter SignificanceScoreExtension De novo Db 55M peptides Candidate Peptides (700)
  • 138. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info • Given: • tag with prefix and suffix masses <mP> xyz <mS> • match in the database • Compute if a suffix and prefix match with allowable modifications. • Compute a candidate peptide with most likely positions of modifications (attachment points). Fast Extension xyz <mP>xyz<mS>
  • 139. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Scoring Modified Peptides Filter SignificanceScoreExtension De novo Db 55M peptides
  • 140. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Scoring • Input: • Candidate peptide with attached modifications • Spectrum • Output: • Score function that normalizes for length, as variable modifications can change peptide length.
  • 141. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Assessing Reliability of Identifications Filter SignificanceScoreextension De novo Db 55M peptides
  • 142. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Selecting Features for Separating Correct and Incorrect Predictions • Features: • Score S: as computed • Explained Intensity I: fraction of total intensity explained by annotated peaks. • b-y score B: fraction of b+y ions annotated • Explained peaks P: fraction of top 25 peaks annotated. • Each of I,S,B,P features is normalized (subtract mean and divide by s.d.) • Problem: separate correct and incorrect identifications using I,S,B,P
  • 143. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Separating power of features
  • 144. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Separating power of features Quality scores: Q = wI I + wS S + wB B + wP P The weights are chosen to minimize the mis-classification error
  • 145. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Distribution of Quality Scores
  • 146. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Results on ISB data-set • All ISB spectra were searched. • The top match is valid for 2978 spectra (2765 for Sequest) • InsPecT-Sequest: 644 spectra (I-S dataset) • Sequest-InsPecT: 422 spectra (S-I dataset) • Average explained intensity of I-S = 52% • Average explained intensity of S-I = 28% • Average explained intensity I S = 58% • ~70 Met. Oxidations • Run time is 0.7 secs. per spectrum (2.7 secs. for Sequest)
  • 147. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Filtration Results The search was done against SWISS-PROT (54Mb). • With 10 tags of length 3: • The filtration is 1500 more efficient. • Less than 4% of spectra are filtered out. • The search time per spectrum is reduced by two orders of magnitude as compared to SEQUEST. PTMs Tag Length # Tags Filtration InsPecT Runtime SEQUEST Runtime None 3 1 3.4 10-7 0.17 sec > 1 minute 3 10 1.6 10-6 0.27 sec Phosphory lation 3 1 5.8 10-7 0.21 sec > 2 minutes 3 10 2.7 10-6 0.38 sec
  • 148. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info SPIDER: Yet Another Application of de novo Sequencing • Suppose you have a good MS/MS spectrum of an elephant peptide • Suppose you even have a good de novo reconstruction of this spectra • However, until elephant genome is sequenced, it is hard to verify this de novo reconstruction • Can you search de novo reconstruction of a peptide from elephant against human protein database? • SPIDER algorithm addresses this comparative proteomics problem Slides from Bin Ma, University of Western Ontario
  • 149. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Common de novo sequencing errors GG N and GG have the same mass
  • 150. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info From de novo Reconstruction to Database Candidate through Real Sequence • Given a sequence with errors, search for the similar sequences in a DB. (Seq) X: LSCFAV (Real) Y: SLCFAV (Match) Z: SLCF-V sequencing error (Seq) X: LSCF-AV (Real) Y: EACF-AV (Match) Z: DACFKAV mass(LS)=mass(EA) Homology mutations
  • 151. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Alignment between de novo Candidate and Database Candidate • If real sequence Y is known then: d(X,Z) = seqError(X,Y) + editDist(Y,Z) (Seq) X: LSCF-AV (Real) Y: EACF-AV (Match) Z: DACFKAV
  • 152. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Alignment between de novo Candidate and Database Candidate • If real sequence Y is known then: d(X,Z) = seqError(X,Y) + editDist(Y,Z) • If real sequence Y is unknown then the distance between de novo candidate X and database candidate Z: • d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) ) (Seq) X: LSCF-AV (Real) Y: EACF-AV (Match) Z: DACFKAV
  • 153. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Alignment between de novo Candidate and Database Candidate • If real sequence Y is known then: d(X,Z) = seqError(X,Y) + editDist(Y,Z) • If real sequence Y is unknown then the distance between de novo candidate X and database candidate Z: • d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) ) • Problem: search a database for Z that minimizes d(X,Z) • The core problem is to compute d(X,Z) for given X and Z. (Seq) X: LSCF-AV (Real) Y: EACF-AV (Match) Z: DACFKAV
  • 154. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Computing seqError(X,Y) • Align X and Y (according to mass). • A segment of X can be aligned to a segment of Y only if their mass is the same! • For each erroneous mass block (Xi,Yi), the cost is f(Xi,Yi)=f(mass(Xi)). • f(m) depends on how often de novo sequencing makes errors on a segment with mass m. • seqError(X,Y) is the sum of all f(mass(Xi)). X Y Z seqError editDist (Seq) X: LSCFAV (Real) Y: EACFAV
  • 155. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Computing d(X,Z) • Dynamic Programming: • Let D[i,j]=d(X[1..i], Z[1..j]) • We examine the last block of the alignment of X[1..i] and Z[1..j]. (Seq) X: LSCF-AV (Real) Y: EACF-AV (Match) Z: DACFKAV
  • 156. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Dynamic Programming: Four Cases • Cases A, B, C - no de novo sequencing errors • Case D: de novo sequencing error D[i,j]=D[i,j-1]+indel D[i,j]=D[i-1,j]+indel D[i,j]=D[i-1,j-1]+dist(X[i],Z[j]) D[i,j]=D[i’-1,j’-1]+alpha(X[i’..i],Z[j’..j]) • D[i,j]is the minimum of the four cases.
  • 157. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Computing alpha(.,.) • alpha(X[i’..i],Z[j’..j]) = min m(y)=m(X[i’..i]) [seqError (X[i’..i],y)+editDist(y,Z[j’..j])] = min m(y)=m[i’..i] [f(m[i’..i])+editDist(y,Z[j’..j])]. = f(m[i’..i]) + min m(y)=m[i’..i] editDist(y,Z[j’..j]). • This is like to align a mass with a string. • Mass-alignment Problem: Given a mass m and a peptide P, find a peptide of mass m that is most similar to P (among all possible peptides)
  • 158. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Solving Mass-Alignment Problem ])[,()])1..([),((min )])1..([,( ])..[),((min min])..[,( jZydistjiZymm indeljiZm indeljiZymm jiZm y y
  • 159. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Improving the Efficiency • Homology Match mode: • Assumes tagging (only peptides that share a tag of length 3 with de novo reconstruction are considered) and extension of found hits by dynamic programming around the hits. • Non-gapped homology match mode: • Sequencing error and homology mutations do not overlap. • Segment Match mode: • No homology mutations. • Exact Match mode: • No sequencing errors and homology mutations.
  • 160. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Experiment Result • The correct peptide sequence for each spectrum is known. • The proteins are all in Swissprot but not in Human database. • SPIDER searches 144 spectra against both Swissprot and human databases
  • 161. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Example • Using de novo reconstruction X=CCQWDAEACAFNNPGK, the homolog Z was found in human database. At the same time, the correct sequence Y, was found in SwissProt database. Seq(X): CCQ[W ]DAEAC[AF]<NN><PG>K Real(Y): CCK AD DAEAC FA VE GP K Database(Z): CCK[AD]DKETC[FA]<EE><GK>K sequencing errors homology mutations