PROTEIN FUNCTIONAL SITE
PREDICTION USING THE SHORTEST-
PATH GRAPH KERNEL METHOD
Presented by :: Malinda Sanjaka
Major Advisor:: Dr. Changhui Yan
Graduate Committee Members::
Dr. Juan (Jen) Li
Dr. Jun Kong
Dr. Nan Yu
Date:: 04/22/2013
1
Outline
Problem Statement
Introduction
Materials and Methods
Results and Discussion
Conclusion
Future Work
2
Problem Statement
 Problem: Prediction of functional sites on protein
structures
 What are functional sites?
 Functional sites are the small portions of a protein where substrate
molecules bind and undergo a chemical reaction.
 Example: [Figure: a phosphorylation site on a protein 3D structure]
3
Problem Statement(2)
Importance of Functional Site Prediction
 To understand protein function
 To support structure-based drug design
 To design new proteins
4
Introduction
[Figure: the 20 amino acids and a protein]
6
Introduction(2)
Protein Functional Sites
 Functional sites are the small portions of a protein where substrate molecules bind
and undergo a chemical reaction.
 Catalytic active site (Catalytic Site Atlas)
 Phosphorylation site: addition of a phosphate to an amino acid
 DNA-binding site
 Zinc-binding site
7
Introduction(3)
Laboratory Methods for Functional Site Determination
 X-ray Crystallography
 Nuclear Magnetic Resonance (NMR)
 Challenges
 Time-consuming
 High cost
 Lack of support for some proteins
 Requires skilled professionals
8
Introduction(4)
The Need for Computational Methods
Structural Genomics (SG) projects reveal a large number of protein structures
but provide little understanding of protein function.
 Advantages
 Low cost
 Less execution time
 Less environmental impact
 Results can be optimized by repeated runs
 Reusable
 Can run as simulations
 Reduces human error
 Disadvantage
 Accuracy is lower than that of laboratory experimental results
 Computational methods provide helpful guidelines for experimental approaches
9
Introduction(5)
Computational Methods for Functional Site Prediction
 Template-based
 Identify a structurally similar template
 Align the target with the template
 Predict functional groups
 Microenvironment-based
 Focus on a single residue or position
 Use structural and physicochemical properties
 Supervised machine learning approaches
 Macroenvironment-based
 A local structural region is involved
 Protein-protein interaction
 Structure-based drug design
 DNA-binding sites and ligand-binding sites
10
Introduction(6)
Overview of Our Approach
We used graphs to represent each residue and its contacting neighbors in a
protein structure.
[Figure: a central residue (+/functional or -/non-functional) surrounded by its contacting residues; each residue consists of a number of atoms]
11
Introduction(7)
Overview of Our Approach – Prediction
[Figure: a target graph (functional or non-functional) is compared with experimentally verified database knowledge, made up of positive (functional/active) and negative (non-functional/non-active) graphs; similarity is computed with the shortest-path graph kernel and the prediction is made with the nearest neighbor method]
12
Materials and Methods
Datasets
 How to get protein structure
 Download::
[http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]
 How to get the protein sequence
 PDB Database ::
[ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt].
 PDB ID and Chain ID :: 101m_A
 FASTA Format:: >101m_A mol:protein length:154 MYOGLOBIN
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKH
14
Materials and Methods(2)
Catalytic Binding Site (CSA)
[http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl]
 73 Protein Chains
 201 Active Catalytic Sites
 20398 Non-Active Residues
 Balanced Dataset
 201 Active Catalytic Sites
 201 Non-Active Residues
Phosphorylation Site
 Section 3.3.4 of this paper
[http://www.informatics.indiana.edu/predrag/publications.htm].
 679 Protein Chains
 2062 Active Phosphorylation Site Residues
 139795 Non-Active Residues
 Balanced Dataset
 2062 Active Phosphorylation Site Residues
 2062 Non-Active Residues
15
Materials and Methods(3)
Graph Representation
 Definition
 A graph G=<V, E>
 V is the set of vertices (nodes) and E is the set of edges (arcs)
 A path in G is a sequence of vertices
<v0, v1, v2, ..., vn>
 Directed Graph
 Undirected Graph
 Adjacency Matrix
[Figure: example graph with labeled nodes and weighted edges]
16
Materials and Methods(4)
Graph Representation Contd.
 Node
– A contacting residue, labeled with its PSSM
 Edge
– An edge (arc) of weight 1
 Labels
– PSSM (position-specific scoring matrix), which captures the biological conservation of amino acids
– Computed from the protein sequences with blast-2.2.25+ and the NR database
 Contact calculation
– Distance d1 between two atoms with PDB coordinates (x1, y1, z1) and (x2, y2, z2):
$$d_1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
– R1 and R2 are the VDW radii of the two atoms (van der Waals-VDW.radii file)
– Residue1.Atom1 and Residue2.Atom1 are in contact if d1 <= (R1+R2+0.5)
17
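The contact rule d1 <= (R1+R2+0.5) can be sketched in Python as follows. This is a minimal illustration, not the code used in this work: the function names are hypothetical, and treating two residues as contacting when any pair of their atoms is in contact is an assumption.

```python
import math

CONTACT_MARGIN = 0.5  # slack added to the sum of the two VDW radii


def atom_distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def atoms_in_contact(coord1, radius1, coord2, radius2, margin=CONTACT_MARGIN):
    """True when d1 <= R1 + R2 + margin."""
    return atom_distance(coord1, coord2) <= radius1 + radius2 + margin


def residues_in_contact(atoms1, atoms2):
    """Assumption: residues touch if any atom pair is in contact.

    Each argument is a list of ((x, y, z), vdw_radius) tuples.
    """
    return any(
        atoms_in_contact(c1, r1, c2, r2)
        for c1, r1 in atoms1
        for c2, r2 in atoms2
    )
```

With radii of 1.65 (N) and 1.87 (CA), the threshold is 4.02, so a pair 1.33 apart is in contact while a pair 4.14 apart is not.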
Materials and Methods(6)
Shortest-Path Graph Kernel
 What is a kernel?
 Put simply, a kernel is represented as a matrix
 For inputs v1, ..., vn, the n x n matrix holds one element for each pair <vi, vj>
 What is a graph kernel?
 It uses graphs g1, ..., gn as inputs instead of vectors
 What is the shortest-path graph kernel?
 It compares each pair of nodes using the
shortest path between them
[Figure: kernel matrices with rows and columns v1, v2, ..., vn (vectors) and g1, g2, ..., gn (graphs)]
18
Materials and Methods(7)
Shortest-Path Graph Kernel Contd.
 The original graphs G1 and G2 are converted into shortest-path graphs S1 (V1, E1) and S2
(V2, E2)
 The Floyd-Warshall algorithm
 The kernel function is used to calculate the similarity between G1 and G2 by
comparing all pairs of edges between S1 and S2.
 Calculation
$$K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{edge}(e_1, e_2)$$
where k_edge( ) is a kernel function for comparing two edges
[Figure: edge e1 between nodes v1 and w1, and edge e2 between nodes v2 and w2]
19
Materials and Methods(8)
Shortest-Path Graph Kernel Contd.
Let e1 be the edge between nodes v1 and w1, and e2 be the edge between nodes v2 and w2. Then,
$$k_{edge}(e_1, e_2) = k_{node}(v_1, v_2) \cdot k_{weight}(e_1, e_2) \cdot k_{node}(w_1, w_2)$$
where k_node( ) is a kernel function for comparing the labels of two nodes, and k_weight( ) is a
kernel function for comparing the weights of two edges. These two functions are defined as
in Borgwardt et al. (2005):
$$k_{node}(v, w) = \exp\left(-\frac{\lVert labels(v) - labels(w) \rVert^2}{2\sigma^2}\right)$$
where labels(v) returns the vector of attributes associated with node v. Note that k_node( ) is a Gaussian
kernel function. $\frac{1}{2\sigma^2}$ was set to 72 by trying different values between 32 and 128 with increments of 2.
$$k_{weight}(e_1, e_2) = \max(0, c - |weight(e_1) - weight(e_2)|)$$
where weight(e) returns the weight of edge e. k_weight( ) is a Brownian bridge kernel that assigns the
highest value to edges that are identical in length. Constant c was set to 2 as in Borgwardt et
al. (2005).
[Figure: edges e1 = 1 and e2 = 1 joining nodes v1, w1, v2, and w2, which are labeled <Pssm1> to <Pssm4>]
20
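The node, weight, and edge kernels can be combined into a short Python sketch. This is an illustration under stated assumptions rather than the implementation from this work: the function names are hypothetical, `gamma` stands for the Gaussian width factor, the distance matrices are assumed to be precomputed shortest-path matrices, and taking the better of the two node orientations follows the implementation slide later in the deck.

```python
import math


def k_node(label_v, label_w, gamma):
    """Gaussian kernel on two node label vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(label_v, label_w))
    return math.exp(-gamma * squared)


def k_weight(weight1, weight2, c=2.0):
    """Brownian bridge kernel: largest when the two edge lengths match."""
    return max(0.0, c - abs(weight1 - weight2))


def shortest_path_kernel(labels1, dist1, labels2, dist2, gamma=72.0, c=2.0):
    """Sum k_edge over all pairs of shortest-path edges of two graphs.

    labels*: one label vector per node; dist*: all-pairs shortest-path
    matrix (math.inf where two nodes are not connected).
    """
    total = 0.0
    n1, n2 = len(labels1), len(labels2)
    for v1 in range(n1):
        for w1 in range(v1 + 1, n1):
            if math.isinf(dist1[v1][w1]):
                continue  # no shortest-path edge between v1 and w1
            for v2 in range(n2):
                for w2 in range(v2 + 1, n2):
                    if math.isinf(dist2[v2][w2]):
                        continue
                    kw = k_weight(dist1[v1][w1], dist2[v2][w2], c)
                    if kw == 0.0:
                        continue
                    # Undirected edges: try both ways of pairing the end nodes.
                    direct = (k_node(labels1[v1], labels2[v2], gamma)
                              * k_node(labels1[w1], labels2[w2], gamma))
                    swapped = (k_node(labels1[v1], labels2[w2], gamma)
                               * k_node(labels1[w1], labels2[v2], gamma))
                    total += kw * max(direct, swapped)
    return total
```

The four nested loops make the cost grow with the product of the squared node counts, which is acceptable for small residue neighborhoods.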
Materials and Methods(9)
Prediction Methods
 Nearest Neighbor Algorithm
 Classify a new example x by finding the training
example <xi, yi> that is nearest to x (classically by
Euclidean distance; here, by graph kernel similarity). Three variants:
 NNM_Max
 NNM_AVE
 NNM_TOP10AVE
[Figure: similarity between a test-set graph (?) and the positive (functional/active) and negative (non-functional/non-active) graphs of the experimentally verified training set]
21
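The slide names the three variants without defining them, so the following Python sketch rests on an assumption: each variant summarizes the similarities of the test graph to the positive and to the negative training graphs (by maximum, by average, or by the average of the ten largest values) and predicts the class with the larger summary. All names are hypothetical.

```python
def summarize(similarities, variant, top=10):
    """Reduce a list of kernel similarities to one score."""
    ranked = sorted(similarities, reverse=True)
    if variant == "max":
        return ranked[0]
    if variant == "ave":
        return sum(ranked) / len(ranked)
    if variant == "top10ave":
        best = ranked[:top]
        return sum(best) / len(best)
    raise ValueError("unknown variant: " + variant)


def predict_functional(sim_to_positive, sim_to_negative, variant="max"):
    """True when the test graph looks more like the positive training set."""
    positive_score = summarize(sim_to_positive, variant)
    negative_score = summarize(sim_to_negative, variant)
    return positive_score > negative_score
```

The variants can disagree: one very similar positive neighbor wins under "max", while a uniformly closer negative set wins under "ave".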
Materials and Methods(10)
Evaluation of Predictors
 K-fold Cross-Validation
 Leave-One-Out Cross-Validation
22
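Leave-one-out cross-validation over a precomputed kernel matrix can be sketched as below. This is a minimal sketch under assumptions: the kernel matrix is a list of lists indexed by graph, labels are booleans, and `classify` is any hypothetical rule that takes the similarities to the positive and negative training graphs and returns a boolean.

```python
def leave_one_out(kernel_matrix, labels, classify):
    """Predict each example from all the others.

    kernel_matrix[i][j]: similarity between graphs i and j.
    labels[i]: True for functional, False for non-functional.
    classify(sim_to_pos, sim_to_neg) -> bool
    """
    count = len(labels)
    predictions = []
    for held_out in range(count):
        others = [j for j in range(count) if j != held_out]
        sim_to_pos = [kernel_matrix[held_out][j] for j in others if labels[j]]
        sim_to_neg = [kernel_matrix[held_out][j] for j in others if not labels[j]]
        predictions.append(classify(sim_to_pos, sim_to_neg))
    return predictions
```

Because the kernel matrix is computed once, each fold only selects a row of it, so no graph comparison is repeated.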
Materials and Methods(11)
Measurements for Evaluation
True Positive/ False Positive
Sensitivity
Specificity
Accuracy
23
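The three measures follow the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + FN + FP + TN). A small Python helper, with a hypothetical name, makes the arithmetic explicit:

```python
def evaluate(tp, fn, fp, tn):
    """Return sensitivity, specificity, and accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }
```

For example, counts of TP = 155, FN = 46, FP = 46, and TN = 155 give an accuracy of about 0.771.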
Results and Discussion
Enzyme Catalytic Site
| Method | TP | TP% | FN | FN% | FP | FP% | TN | TN% | Contact | Not Contact | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NNM_Max | 150 | 74.5% | 51 | 25.3% | 64 | 31.8% | 137 | 68.1% | 5 | 59 | 71.3% | 74.5% | 68.1% |
| NNM_Ave | 155 | 77.1% | 46 | 22.8% | 46 | 22.8% | 155 | 77.1% | 5 | 41 | 77.1% | 77.1% | 77.1% |
| NNM_Top10Ave | 156 | 77.6% | 45 | 22.3% | 51 | 25.3% | 150 | 74.6% | 5 | 46 | 76.1% | 77.6% | 74.6% |
Phosphorylation Site
| Method | TP | TP% | FN | FN% | FP | FP% | TN | TN% | Contact | Not Contact | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NNM_Max | 1104 | 53.5% | 958 | 46.4% | 758 | 36.7% | 1304 | 63.2% | 73 | 685 | 58.3% | 53.5% | 63.2% |
| NNM_Ave | 1054 | 51.1% | 1008 | 48.8% | 482 | 23.3% | 1580 | 76.6% | 54 | 428 | 63.8% | 51.1% | 76.6% |
| NNM_Top10Ave | 1085 | 52.6% | 977 | 47.3% | 667 | 32.3% | 1395 | 67.6% | 60 | 607 | 60.1% | 52.6% | 67.6% |
25
Results and Discussion(2)
Percentile Ranking
 Used full dataset
 Ordered list
 Position ranking
 The majority of functional sites
fall below the 10% percentile
 NNM_MAX
 NNM_AVE
 NNM_TOP10AVE
26
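The ordered-list and position-ranking steps can be sketched in Python. The slides do not give the exact formula, so this sketch assumes that residues are sorted by descending prediction score and that the percentile of a residue is its 1-based position divided by the list length; the function name is hypothetical.

```python
def percentile_ranks(scores):
    """Map each score to position / N in the descending ordered list."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position / len(scores)
    return ranks
```

Under this convention a residue with a percentile below 0.1 is among the top tenth of the ranked residues.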
Results and Discussion(3)
Percentile Result (CSA), Active (Functional)
[Figure: three bar charts of the number of active residues (y-axis, 0 to 80) versus percentile (x-axis, bins 0.0-0.1 through 0.9-1.0), one each for Max, Ave, and Top 10 Ave]
27
Results and Discussion(4)
Percentile Result (CSA), Non-Active (Non-Functional)
[Figure: three bar charts of the number of non-active residues (y-axis, 18.5 to 21) versus percentile (x-axis, bins 0.0-0.1 through 0.9-1.0), one each for Max, Ave, and Top 10 Ave]
28
Conclusions
 We developed an innovative graph method to represent the protein
surface based on how amino acid residues contact each other.
 We implemented a shortest-path graph kernel method and used it
to compute the similarity between graphs.
 We developed three nearest neighbor variants to make predictions on both
datasets based on the similarity matrix that the graph kernel method
produced.
 The predictors were able to predict catalytic sites with an accuracy of up
to 77.1%.
 This work showed that the proposed methods were able to capture
the similarity between enzyme catalytic sites and would provide a
useful tool for catalytic site prediction.
30
Future Work
Add more parameters to the labels (graphs, nodes)
Offer the program as a web service
Work with other kernel methods, such
as minimum spanning tree
Optimize the algorithm for large datasets
32
Acknowledgements
I would like to express my deep gratitude to my adviser, Dr.
Changhui Yan, for his continuous
encouragement, guidance, and support in completing this
paper successfully.
My sincere thanks also go to my committee members, Dr. Juan
(Jen) Li, Dr. Jun Kong, and Dr. Nan Yu, for their willingness to
serve as committee members.
33
Thank you.
?
34
Introduction …
.vdw
.PDB
NR
Database
Blast
35
Protein
…-CUA-AAA-GAA-GGU-GUU-AGC-AAG
…-L-K-E-G-V-S-K-D-…
DNA
protein sequence
36
Importance of Functional Site
Prediction
Understanding Protein Function
Reveal the Protein Structure
Drug Design
Design New Proteins
37
Rationale for Understanding Protein Structure and
Function
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
Methods linking sequence, structure, and function: structure determination, structure prediction, homology, rational mutagenesis, biochemical analysis, model studies
38
Existing Applications for Protein
Active Sites Prediction
39
Our Approach
 Shortest-path Distance Theory
 Graph with Adjacency Matrix and Graph Kernel
 Nearest Neighbor Variant (Max, Ave, Top10 Ave)
 Leave-one-out Cross-Validation
 True Positive & False Positive
 Increment percentile
40
Literature Review
 Graph
 Adjacency Matrix
 Shortest Distance Path Algorithm
 Cross Validation
 True Positive vs. False Positive
 Percentile Ranking
41
Graph
 A graph G=<V, E>
 V is the set of vertices (nodes) and E is the set of edges (arcs)
 A path in G is a sequence of vertices <v0, v1, v2, ..., vn>
 Directed Graph
 Undirected Graph
42
Adjacency Matrix
 The adjacency matrix of a simple graph is a matrix with rows and columns
labeled by graph vertices
1 = Adjacent
0 = Not Adjacent
0s on the diagonal
43
Shortest Distance Path Algorithm
 Used in communications, transportation, electronics, and
bioinformatics problems.
 The all-pairs shortest-path problem involves finding the
shortest path between all pairs of vertices in a graph.
A_ij = 1 if there is an edge (v_i, v_j); otherwise, A_ij = 0
44
Percentile Ranking
 There is no single standard definition for percentile
calculation
 Ordered List
 Position Ranking
 Max, Ave, Top10
45
Materials and Methods
 Data Gathering
 Identify the Active Residues
 Balance Dataset
 Generating a Map File
 Generate Set of Graphs
 Development of Graph Kernel
46
Data Gathering
Catalytic Binding Site (CSA)
http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl
 EC1, EC2…EC6
 HTML
 Regular Expression
 Finding Large Single Group
 Selected EC 3.4
 73 Protein chains
 201 Active Catalytic Sites
 20398 Non-Active Residues
47
Data Gathering..
Phosphorylation Site
 Section 3.3.4 of This Paper
[http://www.informatics.indiana.edu/predrag/publications.htm].
 679 protein chains
 2062 Active Phosphorylation Site Residues
 139795 Non-Active Residues
48
Identify the Active Residues
Catalytic Binding Site (CSA)
 CSA Annotation –Database(CSA_2_2_12.dat)
[ http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Download.pl]
 251777 Records
 List of Active Residue(201)
Phosphorylation Site
[http://www.informatics.indiana.edu/predrag/publications.htm]
 List of Active Residue(2062)
49
Balance Dataset
Computation Time
Leave-One-Out Cross-Validation
Random Selection
Catalytic Binding Site (CSA)
-Active 201 , Non Active 201
Phosphorylation Site
-Active 2062, Non Active 2062
50
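The random selection used to balance the datasets can be sketched in Python as random undersampling of the non-active residues. The function name and the fixed seed are assumptions added for reproducibility; they are not taken from the slides.

```python
import random


def balance_dataset(active, non_active, seed=0):
    """Keep all active residues and sample an equal number of non-active ones."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    sampled = rng.sample(non_active, len(active))  # sampling without replacement
    return list(active), sampled
```

For the catalytic site data this would keep all 201 active residues and draw 201 of the 20398 non-active residues, which also keeps leave-one-out cross-validation affordable.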
Generating a Map File
 Map Protein PDB IDs to Protein Sequences
 Atomic Solvent Accessible Area Calculations (RASA)
 Position-Specific Scoring Matrix Calculations (PSSM)
 Active Residues
51
Map Protein PDB IDs to Protein
Sequences
 PDB ID and Chain ID
101m_A
 PDB Database
[ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt].
FASTA Format
>101m_A mol:protein length:154 MYOGLOBIN
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKA
SEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GNFGADAQGAMNKALELFRKDIAAKYKELGYQG
52
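Splitting a header line into PDB ID, chain ID, and the remaining fields can be sketched in Python. The sketch assumes the whitespace-separated layout `>101m_A mol:protein length:154  MYOGLOBIN`; the function name and the returned keys are hypothetical.

```python
def parse_seqres_header(header):
    """Split a '>101m_A mol:protein length:154  MYOGLOBIN' style header."""
    fields = header.lstrip(">").split(None, 3)
    pdb_id, _, chain_id = fields[0].partition("_")
    attributes = dict(item.split(":", 1) for item in fields[1:3])
    return {
        "pdb_id": pdb_id,
        "chain_id": chain_id,
        "mol": attributes.get("mol", ""),
        "length": int(attributes.get("length", 0)),
        "name": fields[3].strip() if len(fields) > 3 else "",
    }
```

The `pdb_id` and `chain_id` values together form the key, such as 101m_A, that ties a sequence to its entry in the map file.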
Atomic Solvent Accessible Area
Calculations (RASA)
 Calculate the Solvent Accessible Area (RASA) of each
Protein
 Naccess V2.11 Program
– Linux/Unix systems /Cygwin
– [http://www.bioinf.manchester.ac.uk/naccess/]
– ./naccess 1a91.pdb & ./naccess 1afo.pdb & ./naccess 1aig.pdb
 PDB DATA Bank –PDB File
– [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]
ncbi-blast-2.2.24+
RASA >0
53
Position-Specific Scoring Matrix
Calculations (PSSM)
 Download PDB Files
 blast-2.2.25+ Program
– Microsoft Windows
 NR Database (non-redundant protein sequence)
// Requires: using System.Diagnostics;
Process p = new Process();
p.StartInfo.UseShellExecute = false;
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.FileName = @"C:\blast-2.2.25+\bin\psiblast.exe";
// Run two PSI-BLAST iterations against the NR database and write the PSSM as ASCII
p.StartInfo.Arguments = "-query " + FileNameIN + @" -db C:\blast-2.2.25+\db\nr -num_iterations 2 -out_ascii_pssm " + FileNameOUT;
p.Start();
• Example: Sample record of .PSSM
1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J
54
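Reading one record of the ASCII PSSM file can be sketched in Python. The sketch assumes the column layout of the sample record above: position, residue letter, then 20 integer scores, followed by further columns that are ignored here. The function name is hypothetical.

```python
def parse_pssm_row(line):
    """Return (position, residue, scores) from one ASCII PSSM record."""
    tokens = line.split()
    position = int(tokens[0])
    residue = tokens[1]
    scores = [int(token) for token in tokens[2:22]]  # 20 amino acid scores
    return position, residue, scores
```

The 20 scores are what a node label vector would hold when residues are labeled with their PSSM rows.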
Sample Mapping File
>1neg_A
Seq :
KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLAAAWSHPQF
SUR :
11101011111111111111111111111111111111011111111011110111111111111
Site :
00000000000000000000000000000000000000000000000000000000000010000
rASA
:115.47,81.22,64.82,.00,20.59,.00,41.60,111.13,56.32,14.17,124.18,35.41,127.39,43.03,111.84,1
60.37,10.00,.71,33.57,1.82,120.20,91.83,15.89,41.40,69.81,.77,20.31,2.22,49.44,65.40,30.56,97
.39,80.11,152.72,75.17,80.10,47.20,64.49,.00,57.09,16.33,101.38,111.31,104.16,71.57,2.73,60.8
4,.00,18.67,8.04,64.07,71.08,.00,125.10,66.68,24.97,32.49,79.86,65.19,179.94,87.62,51.01,109.
35,145.21,71.53,
entropy
:0.80,0.85,0.25,0.92,0.44,1.48,1.02,2.42,1.57,2.01,0.44,0.93,0.49,0.73,0.73,0.83,1.72,1.46,0.59,
2.15,0.72,0.98,1.99,1.65,0.60,1.20,0.35,0.94,0.66,0.65,0.51,0.23,1.04,0.45,1.09,4.74,3.91,0.67,1
.38,0.61,0.45,0.75,1.43,0.49,0.36,2.32,0.72,1.63,3.17,0.46,1.53,2.78,1.61,0.38,0.45,0.26,0.15,0.
51,0.17,0.38,0.47,0.46,0.93,2.04,1.73,
pdbindex
:6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,
38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68
,69,70,
55
Generate Set of Graphs
Shortest Distance Path (Dijkstra Theory)
Adjacency Matrix Theory
Contacting Neighbor Residues
Labeled
Weighted
Various Numbers of Nodes and Edges
Graph Normalization
– Linear Normalization: X1 = (X - Min) / (Max - Min)
56
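The linear normalization X1 = (X - Min) / (Max - Min) can be written as a short Python function. The function name is hypothetical, and returning zeros when all values are equal is an assumption made to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # assumption: constant input maps to 0
    return [(value - low) / (high - low) for value in values]
```

For example, the values 5, -3, and 1 become 1.0, 0.0, and 0.5.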
Calculate Distance between Atoms
and Check for Contact
$$D_1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
 PDB File
 VDW
(van der Waals-VDW.radii file)
D1 <= (R1+R2+0.5)
Example of a contact residue
2 A _ 3 A! : 1.33441
Example of a non-contact residue.
4 A _ 2 A : 4.14432
57
Structure of a Graph
58
Development of Graph Kernel
The original graphs G1 and G2 are converted into
shortest-path graphs S1 (V1, E1) and S2 (V2, E2)
The Floyd-Warshall algorithm
The kernel function is used to calculate the
similarity between G1 and G2 by comparing
all pairs of edges between S1 and S2.
59
The Floyd-Warshall Algorithm
for i = 1 to N
    for j = 1 to N
        if i = j
            dist[0][i][j] = 0
        else if there is an edge from i to j
            dist[0][i][j] = the length of the edge from i to j
        else
            dist[0][i][j] = INFINITY
for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            dist[k][i][j] = min(dist[k-1][i][j], dist[k-1][i][k] + dist[k-1][k][j])
Finds the shortest paths between all pairs of vertices v ∈ V for a weighted graph G = (V, E).
D(k)ij = the weight of the shortest path from vertex i to vertex j for which all intermediate
vertices are in the set {1, 2, ..., k}
60
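The pseudocode translates directly into Python. This sketch keeps a single distance matrix that is updated in place rather than one matrix per k, a common space-saving variant; the function name and the use of `math.inf` for missing edges are assumptions.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest paths.

    weights[i][j] is the edge length from i to j, or math.inf if absent.
    Returns a new matrix of shortest-path lengths.
    """
    size = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is not modified
    for i in range(size):
        dist[i][i] = 0.0
    for k in range(size):          # allow vertex k as an intermediate
        for i in range(size):
            for j in range(size):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist
```

Applied to a contact graph whose edges all have weight 1, the result gives the number of contacts on the shortest route between any two residues.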
Implementation
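/* Squared Euclidean distance between the PSSM rows of two residues */
double Pssm(int ResidueA, int ResidueB)
{
    int i;
    double sum = 0;
    for (i = 0; i < 20; i++)
    {
        sum += pow((double)(seq_a_pssm[ResidueA][i] - seq_b_pssm[ResidueB][i]), 2);
    }
    return sum;
}

/* Node kernel (Gaussian) for the node pair (i, j) */
dis += Pssm(i, j);
attr_dis[i][j] = exp((-1) * parm_gamma * dis);

/* Sum the edge kernel over all pairs of shortest-path edges (i, k) and (j, r) */
sum = 0;
for (i = 0; i < seq_a_len; i++)
    for (j = 0; j < seq_b_len; j++)
        for (k = i + 1; k < seq_a_len; k++)
            for (r = j + 1; r < seq_b_len; r++)
            {
                xx1 = seq_a_dist[i][k] - seq_b_dist[j][r];
                Klen = MaxValue(0, CC - fabs(xx1));           /* edge weight kernel */
                product1 = attr_dis[i][j] * attr_dis[k][r];   /* pair i with j, k with r */
                product2 = attr_dis[k][j] * attr_dis[i][r];   /* pair k with j, i with r */
                value = MaxValue(product1, product2);
                sum += value * Klen;
            }
return sum;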
61
Compare Similarity
Max
Ave
Top 10 Ave
62
Result and Discussion
Comparison Similarity (TP/FP)
– Max
– Ave
– Top 10 Ave
Percentile Ranking calculation
 RASA Value
63
Percentile Result(CSA)
64
rASA Vs. Active Residues
65
66
static IEnumerable<string> SortByLength(IEnumerable<string> e)
{
    // Sort the strings from longest to shortest
    var sorted = from s in e
                 orderby s.Length descending
                 select s;
    return sorted;
}
Section 3.4
67
Protein Chain (CSA)
68
List of
Phosphorylation
Site
69
Catalytic Binding Site (CSA)-Active Residue
70
Phosphorylation Site-Active Residues
71
van der Waals-VDW.radii file
RESIDUE ATOM ALA 5
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
RESIDUE ATOM ARG 11
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.87 0
ATOM NE 1.65 1
ATOM CZ 1.76 0
ATOM NH1 1.65 1
ATOM NH2 1.65 1
RESIDUE ATOM ASP 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.76 0
ATOM OD1 1.40 1
ATOM OD2 1.40 1
RESIDUE ATOM ASN 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.76 0
ATOM OD1 1.40 1
ATOM ND2 1.65 1
RESIDUE ATOM CYS 6
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM SG 1.85 0
RESIDUE ATOM GLU 9
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.76 0
ATOM OE1 1.40 1
ATOM OE2 1.40 1
RESIDUE ATOM GLN 9
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.76 0
ATOM OE1 1.40 1
ATOM NE2 1.65 1
RESIDUE ATOM GLY 4
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
RESIDUE ATOM HIS 10
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.76 0
ATOM ND1 1.65 1
ATOM CD2 1.76 0
ATOM CE1 1.76 0
ATOM NE2 1.65 1
RESIDUE ATOM ILE 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG1 1.87 0
ATOM CG2 1.87 0
ATOM CD1 1.87 0
RESIDUE ATOM LEU 8
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD1 1.87 0
ATOM CD2 1.87 0
RESIDUE ATOM LYS 9
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.87 0
ATOM CE 1.87 0
ATOM NZ 1.50 1
72
PDB FILE SAMPLE
73
Distance File Example
74

Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

Protein Functional Site Prediction Using Shortest-Path Graph Kernel

  • 1. PROTEIN FUNCTIONAL SITE PREDICTION USING THE SHORTEST-PATH GRAPH KERNEL METHOD. Presented by: Malinda Sanjaka. Major Advisor: Dr. Changhui Yan. Graduate Committee Members: Dr. Juan (Jen) Li, Dr. Jun Kong, Dr. Nan Yu. Date: 04/22/2013
  • 2. Outline: Problem Statement; Introduction; Materials and Methods; Results and Discussion; Conclusion; Future Work
  • 3. Problem Statement. Problem: prediction of functional sites on protein structures. What are functional sites? A functional site is the small portion of a protein where substrate molecules bind and undergo a chemical reaction. Example (figure): a phosphorylation site on a protein 3D structure.
  • 4. Problem Statement (2): Importance of Functional Site Prediction. To understand protein function; to support structure-based drug design; to design new proteins.
  • 7. Introduction (2): Protein Functional Sites. A functional site is the small portion of a protein where substrate molecules bind and undergo a chemical reaction. Examples: catalytic active sites (Catalytic Site Atlas); phosphorylation sites; DNA-binding sites; zinc-binding sites. Figure: addition of a phosphate to an amino acid.
  • 8. Introduction (3): Laboratory Methods for Functional Site Determination. X-ray crystallography; nuclear magnetic resonance (NMR). Challenges: time-consuming; high cost; not applicable to some proteins; requires skilled specialists.
  • 9. Introduction (4): The Need for Computational Methods. Structural Genomics (SG) projects reveal large numbers of protein structures, but the function of many of these proteins remains poorly understood. Advantages: low cost; shorter execution time; lower environmental impact; results can be refined by repeated runs; reusable; can be run as simulations; fewer human mistakes. Disadvantage: accuracy is lower than that of laboratory experiments. Computational methods provide helpful guidelines for experimental approaches.
  • 10. Introduction (5): Computational Methods for Functional Site Prediction. Template-based: identify a structurally similar template; align the target with the template; predict functional groups. Microenvironment-based: focus on a single residue or position; use structural and physicochemical properties; supervised machine-learning approaches. Macroenvironment-based: a local structural region is involved; protein-protein interaction; structure-based drug design; DNA-binding sites and ligand-binding sites.
  • 11. Introduction (6): Overview of Our Approach. We used graphs to represent each residue together with its contacting neighbors in a protein structure. Each residue consists of a number of atoms. Figure labels: a central residue (+, functional) with its contacting residues; a central residue (-, non-functional) with its contacting residues.
  • 12. Introduction (7): Overview of Our Approach, Prediction. Database knowledge (experimentally verified): positive (functional/active) and negative (non-functional/non-active) graphs. Target graph: functional or non-functional? Similarity: shortest-path graph kernel. Prediction: nearest neighbor method.
  • 14. Materials and Methods: Datasets. How to get protein structures: download from [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]. How to get protein sequences: PDB database [ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt]. PDB ID and chain ID: 101m_A. FASTA format:
      >101m_A mol:protein length:154 MYOGLOBIN
      MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKH
  • 15. Materials and Methods (2). Catalytic Binding Site (CSA) [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl]: 73 protein chains; 201 active catalytic sites; 20398 non-active residues; balanced dataset of 201 active catalytic sites and 201 non-active residues. Phosphorylation Site, from Section 3.3.4 of this paper [http://www.informatics.indiana.edu/predrag/publications.htm]: 679 protein chains; 2062 active phosphorylation site residues; 139795 non-active residues; balanced dataset of 2062 active phosphorylation site residues and 2062 non-active residues.
  • 16. Materials and Methods (3): Graph Representation. Definition: a graph G = <V, E> consists of a set V of vertices (nodes) and a set E of edges (arcs). A path in G is a sequence of vertices <v0, v1, v2, ..., vn>. Directed graph; undirected graph; adjacency matrix. Figure labels: node (label), edge (weight).
  • 17. Materials and Methods (4): Graph Representation, Contd. Node: a contacting residue, labeled with its PSSM (position-specific scoring matrix, which reflects the biological conservation of amino acids) values, computed from the protein sequences with blast-2.2.25+ against the NR database. Edge (arc): weight 1. Distance calculation, using the <x, y, z> atom coordinates from the PDB file (Residue1.Atom1, Residue2.Atom1):
      $d_1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$
      Contact criterion, where $R_1$ and $R_2$ are the van der Waals (VDW) radii of the two atoms (taken from the VDW.radii file):
      $d_1 \le R_1 + R_2 + 0.5$
  • 18. Materials and Methods (6): Shortest-Path Graph Kernel. What is a kernel? Put simply, a kernel yields a matrix: for items v1, ..., vn, the matrix element at row i and column j compares vi with vj. What is a graph kernel? A kernel that takes graphs (g1, ..., gn) instead of vectors. What is the shortest-path graph kernel? It compares two graphs by comparing the shortest paths between each pair of nodes.
  • 19. Materials and Methods (7): Shortest-Path Graph Kernel, Contd. The original graphs G1 and G2 are converted into shortest-path graphs S1 (V1, E1) and S2 (V2, E2) using the Floyd-Warshall algorithm. The kernel function calculates the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2:
      $K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{edge}(e_1, e_2)$
      where $k_{edge}(\cdot)$ is a kernel function for comparing two edges. Figure labels: edge e1 between v1 and w1; edge e2 between v2 and w2.
  • 20. Materials and Methods (8): Shortest-Path Graph Kernel, Contd. Let $e_1$ be the edge between nodes $v_1$ and $w_1$, and $e_2$ be the edge between nodes $v_2$ and $w_2$. Then
      $k_{edge}(e_1, e_2) = k_{node}(v_1, v_2) \cdot k_{weight}(e_1, e_2) \cdot k_{node}(w_1, w_2)$
      where $k_{node}(\cdot)$ is a kernel function for comparing the labels of two nodes, and $k_{weight}(\cdot)$ is a kernel function for comparing the weights of two edges. These two functions are defined as in Borgwardt et al. (2005):
      $k_{node}(v, w) = \exp\left(-\frac{\|labels(v) - labels(w)\|^2}{2\sigma^2}\right)$
      where labels(v) returns the vector of attributes associated with node v. Note that $k_{node}(\cdot)$ is a Gaussian kernel function. $\frac{1}{2\sigma^2}$ was set to 72 by trying different values between 32 and 128 with increments of 2.
      $k_{weight}(e_1, e_2) = \max(0, c - |weight(e_1) - weight(e_2)|)$
      where weight(e) returns the weight of edge e. $k_{weight}(\cdot)$ is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. The constant c was set to 2 as in Borgwardt et al. (2005). Figure labels: edges e1 = 1 (between v1 and w1) and e2 = 1 (between v2 and w2), with node labels <Pssm1> to <Pssm4>.
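
The kernel functions on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the function names, the tuple layout `(v, w, weight)` for edges, the dictionary of node labels, and the two-dimensional toy labels are all hypothetical, and `gamma` stands for 1/(2σ²).

```python
import math

def k_node(label_v, label_w, gamma):
    """Gaussian kernel on two label vectors; gamma plays the role of 1/(2*sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(label_v, label_w))
    return math.exp(-gamma * sq_dist)

def k_weight(weight1, weight2, c=2.0):
    """Brownian bridge kernel: largest when the two edge weights are equal."""
    return max(0.0, c - abs(weight1 - weight2))

def k_edge(e1, e2, labels1, labels2, gamma, c=2.0):
    """Compare edge e1 = (v1, w1, weight) of S1 with edge e2 = (v2, w2, weight) of S2."""
    v1, w1, weight1 = e1
    v2, w2, weight2 = e2
    return (k_node(labels1[v1], labels2[v2], gamma)
            * k_weight(weight1, weight2, c)
            * k_node(labels1[w1], labels2[w2], gamma))

def sp_kernel(edges1, labels1, edges2, labels2, gamma, c=2.0):
    """Sum k_edge over all pairs of edges from the two shortest-path graphs.

    Simplification (assumption): each edge is stored once as an ordered
    (v, w) tuple; matching both orientations of an undirected edge is omitted.
    """
    return sum(k_edge(e1, e2, labels1, labels2, gamma, c)
               for e1 in edges1 for e2 in edges2)

# Toy graphs with hypothetical 2-dimensional labels instead of 20 PSSM values.
labels_a = {0: [1.0, 0.0], 1: [0.0, 1.0]}
labels_b = {0: [1.0, 0.0], 1: [0.0, 1.0]}
edges_a = [(0, 1, 1.0)]
edges_b = [(0, 1, 1.0)]
similarity = sp_kernel(edges_a, labels_a, edges_b, labels_b, gamma=0.5)
```

With identical labels and equal edge weights, both node kernels evaluate to 1 and the weight kernel evaluates to c, so the toy similarity equals c.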
  • 21. Materials and Methods (9): Prediction Methods. Nearest neighbor algorithm: classify a new example x by finding the training example <xi, yi> that is nearest to x according to Euclidean distance. Variants: NNM_Max; NNM_Ave; NNM_Top10Ave. Figure: a test-set graph of unknown class (?) is compared by similarity with the training set (experimentally verified) of positive (functional/active) and negative (non-functional/non-active) graphs.
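
The slide names the three variants without defining them. The sketch below shows one plausible reading, which is an assumption rather than the presenter's definition: the kernel similarities between a target graph and each class are summarised by their maximum, their average, or the average of the ten largest values, and the class with the higher summary wins. All function names are hypothetical.

```python
def nnm_score(similarities, variant):
    """Summarise the kernel similarities between a target graph and one class."""
    ranked = sorted(similarities, reverse=True)
    if variant == "max":
        return ranked[0]
    if variant == "ave":
        return sum(ranked) / len(ranked)
    if variant == "top10ave":
        top = ranked[:10]          # fewer than 10 values: use all of them
        return sum(top) / len(top)
    raise ValueError(f"unknown variant: {variant}")

def predict(sim_to_positive, sim_to_negative, variant):
    """Return True (functional) when the positive class has the higher score."""
    return nnm_score(sim_to_positive, variant) > nnm_score(sim_to_negative, variant)

# Toy similarities of one target graph to three positive and three negative graphs.
sim_pos = [0.9, 0.2, 0.1]
sim_neg = [0.6, 0.6, 0.5]
by_max = predict(sim_pos, sim_neg, "max")
by_ave = predict(sim_pos, sim_neg, "ave")
```

The toy data is chosen so that the variants disagree: a single very similar positive graph decides the Max variant, while the averages favour the negative class.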
  • 22. Materials and Methods (10): Evaluation of Predictors. K-fold cross-validation; leave-one-out cross-validation.
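
Leave-one-out cross-validation over a precomputed similarity matrix can be sketched as below. This is a generic illustration, assuming a symmetric kernel matrix and a plain nearest-neighbor rule; the names and the toy matrix are hypothetical and not taken from the deck.

```python
def leave_one_out(kernel_matrix, labels, classify):
    """Predict each example from all the others; return the list of predictions."""
    n = len(labels)
    predictions = []
    for i in range(n):
        # Similarities of example i to every other example, paired with their labels.
        sims = [(kernel_matrix[i][j], labels[j]) for j in range(n) if j != i]
        predictions.append(classify(sims))
    return predictions

def nearest_neighbor(sims):
    """Label of the most similar training example."""
    return max(sims, key=lambda pair: pair[0])[1]

# Toy 4x4 similarity matrix: examples 0 and 1 are positive, 2 and 3 negative.
K = [[1.0, 0.8, 0.1, 0.2],
     [0.8, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.7],
     [0.2, 0.1, 0.7, 1.0]]
y = [1, 1, 0, 0]
preds = leave_one_out(K, y, nearest_neighbor)
```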
  • 23. Materials and Methods (11): Measurements for Evaluation. True positives and false positives; sensitivity; specificity; accuracy.
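
The three measures follow the standard confusion-matrix definitions, sketched below. The function name is hypothetical; the counts in the example are the ones reported in the deck's results for NNM_Ave on the catalytic-site dataset.

```python
def evaluate(tp, fn, fp, tn):
    """Standard definitions based on confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of active residues found
    specificity = tn / (tn + fp)                 # fraction of non-active residues rejected
    accuracy = (tp + tn) / (tp + fn + fp + tn)   # fraction of all residues classified correctly
    return sensitivity, specificity, accuracy

sens, spec, acc = evaluate(tp=155, fn=46, fp=46, tn=155)
```

Rounded to one decimal place, all three values come out at 77.1%, matching the reported row.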
  • 25. Results and Discussion
      Enzyme Catalytic Site
      Method        TP    TP%    FN    FN%    FP   FP%    TN    TN%    Contact  Not Contact  Accuracy  Sensitivity  Specificity
      NNM_Max       150   74.5%  51    25.3%  64   31.8%  137   68.1%  5        59           71.3%     74.5%        68.1%
      NNM_Ave       155   77.1%  46    22.8%  46   22.8%  155   77.1%  5        41           77.1%     77.1%        77.1%
      NNM_Top10Ave  156   77.6%  45    22.3%  51   25.3%  150   74.6%  5        46           76.1%     77.6%        74.6%
      Phosphorylation Site
      Method        TP    TP%    FN    FN%    FP   FP%    TN    TN%    Contact  Not Contact  Accuracy  Sensitivity  Specificity
      NNM_Max       1104  53.5%  958   46.4%  758  36.7%  1304  50.1%  73       685          58.3%     53.5%        50.1%
      NNM_Ave       1054  51.1%  1008  48.8%  482  23.3%  1580  76.6%  54       428          63.8%     51.1%        76.6%
      NNM_Top10Ave  1085  52.6%  977   47.3%  667  32.3%  1395  67.6%  60       607          60.1%     52.6%        67.6%
  • 26. Results and Discussion (2): Percentile Ranking. Used the full dataset; ordered list; position ranking; the majority of functional sites fall below the 10% percentile. Variants: NNM_Max; NNM_Ave; NNM_Top10Ave.
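
The deck does not give a formula for the percentile (a later slide notes that there is no single definition). The sketch below assumes one simple convention: residues are sorted by prediction score, highest first, and the percentile of a residue is its position in that ordered list divided by the list length. The function name and scores are hypothetical.

```python
def percentile_ranks(scores):
    """Map each item index to (position in the ordered list) / N, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    return {idx: (pos + 1) / n for pos, idx in enumerate(order)}

# Toy prediction scores for four residues.
ranks = percentile_ranks([0.2, 0.9, 0.5, 0.1])
```

Under this convention a small percentile means the residue sits near the top of the ranking, which is the region where most functional sites are reported to fall.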
  • 27. Results and Discussion (3): Percentile Result (CSA), Active (Functional). Figure: three histograms of the number of active residues (y-axis, 0 to 80) versus percentile bin (x-axis, 0.0-0.1 through 0.9-1.0), one each for Max, Ave, and Top 10 Ave.
  • 28. Results and Discussion (4): Percentile Result (CSA), Non-Active (Non-Functional). Figure: three histograms of the number of non-active residues (y-axis, 18.5 to 21) versus percentile bin (x-axis, 0.0-0.1 through 0.9-1.0), one each for Max, Ave, and Top 10 Ave.
  • 30. Conclusions. We developed an innovative graph method to represent the protein surface based on how amino acid residues contact each other. We implemented a shortest-path graph kernel method and used it to compute the similarity between graphs. We developed three nearest neighbor variants to make predictions on both datasets based on the similarity matrix that the graph kernel method produced. The predictors were able to predict catalytic sites with accuracy up to 77.1%. This work showed that the proposed methods were able to capture the similarity between enzyme catalytic sites and would provide a useful tool for catalytic site prediction.
  • 32. Future Work. Add more parameters to the labels (graphs, nodes); offer the program as a web service; work with other kernel methods, such as minimum spanning tree; optimize the algorithm for large datasets.
  • 33. Acknowledgements. I would like to express my deep gratitude to my adviser, Dr. Changhui Yan, for his continuous encouragement, guidance, and support in completing this paper successfully. My sincere thanks also go to my committee members, Dr. Juan (Jen) Li, Dr. Jun Kong, and Dr. Nan Yu, for their willingness to serve as committee members.
  • 37. Importance of Functional Site Prediction. Understanding protein functionalities; revealing protein structure; drug design; designing new proteins.
  • 38. Rationale for Understanding Protein Structure and Function. Protein sequence: large numbers of sequences, including whole genomes. Protein structure: three-dimensional; complicated. Protein function: rational drug design and treatment of disease; protein and genetic engineering; building networks to model cellular pathways; studying organismal function and evolution. Links in the diagram: structure determination; structure prediction; homology; rational mutagenesis; biochemical analysis; model studies.
  • 39. Existing Applications for Protein Active Site Prediction
  • 40. Our Approach. Shortest-path distance theory; graph with adjacency matrix and graph kernel; nearest neighbor variants (Max, Ave, Top10Ave); leave-one-out cross-validation; true positives and false positives; incremental percentile.
  • 41. Literature Review. Graph; adjacency matrix; shortest-path algorithm; cross-validation; true positives vs. false positives; percentile ranking.
  • 42. Graph. A graph G = <V, E> consists of a set V of vertices (nodes) and a set E of edges (arcs). A path in G is a sequence of vertices <v0, v1, v2, ..., vn>. Directed graph; undirected graph.
  • 43. Adjacency Matrix. The adjacency matrix of a simple graph is a matrix with rows and columns labeled by the graph vertices: 1 = adjacent, 0 = not adjacent, with 0s on the diagonal.
  • 44. Shortest-Path Algorithm. Used in communications, transportation, electronics, and bioinformatics problems. The all-pairs shortest-path problem involves finding the shortest path between all pairs of vertices in a graph. In the adjacency matrix, $A_{ij} = 1$ if there is an edge $(V_i, V_j)$; otherwise $A_{ij} = 0$.
  • 45. Percentile Ranking. There is no single standard definition of the percentile calculation. Ordered list; position ranking; Max, Ave, Top10.
  • 46. Materials and Methods. Data gathering; identifying the active residues; balancing the dataset; generating a map file; generating the set of graphs; development of the graph kernel.
  • 47. Data Gathering: Catalytic Binding Site (CSA) [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl]. EC1, EC2, ..., EC6; HTML; regular expressions; finding a large single group; selected EC 3.4; 73 protein chains; 201 active catalytic sites; 20398 non-active residues.
  • 48. Data Gathering: Phosphorylation Site. From Section 3.3.4 of this paper [http://www.informatics.indiana.edu/predrag/publications.htm]: 679 protein chains; 2062 active phosphorylation site residues; 139795 non-active residues.
  • 49. Identify the Active Residues. Catalytic Binding Site (CSA): CSA annotation database (CSA_2_2_12.dat) [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Download.pl]; 251777 records; list of active residues (201). Phosphorylation Site [http://www.informatics.indiana.edu/predrag/publications.htm]: list of active residues (2062).
  • 50. Balanced Dataset. Computation time; leave-one-out cross-validation; random selection. Catalytic Binding Site (CSA): 201 active, 201 non-active. Phosphorylation Site: 2062 active, 2062 non-active.
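
Balancing by random selection can be sketched as random undersampling of the non-active class. This is an illustration under assumptions: the function name, the fixed seed, and the placeholder residue identifiers are hypothetical; only the class sizes come from the deck.

```python
import random

def balance(active, non_active, seed=0):
    """Keep all active residues and draw an equally sized random sample of non-active ones."""
    rng = random.Random(seed)                        # fixed seed for a repeatable selection
    sampled = rng.sample(non_active, k=len(active))  # sampling without replacement
    return active, sampled

# Placeholder identifiers with the CSA class sizes from the deck.
active = [f"active_{i}" for i in range(201)]
non_active = [f"non_active_{i}" for i in range(20398)]
pos, neg = balance(active, non_active)
```

The balanced set has 402 residues instead of 20599, which keeps the number of pairwise kernel evaluations needed for leave-one-out cross-validation manageable.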
  • 51. Generating a Map File. Mapping protein PDB IDs to protein sequences; atomic solvent accessible area calculations (RASA); position-specific scoring matrix calculations (PSSM); active residues.
  • 52. Mapping Protein PDB IDs to Protein Sequences. PDB ID and chain ID: 101m_A. PDB database [ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt]. FASTA format:
      >101m_A mol:protein length:154 MYOGLOBIN
      MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
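
A record in this header layout can be split into its fields as sketched below. The parser is a hypothetical helper written for this example, assuming the header fields are separated by whitespace in the order shown on the slide.

```python
def parse_seqres_record(header, sequence):
    """Split a header such as '>101m_A mol:protein length:154 MYOGLOBIN' into fields."""
    fields = header.lstrip(">").split(None, 3)
    pdb_id, chain_id = fields[0].split("_")
    mol = fields[1].split(":")[1]
    length = int(fields[2].split(":")[1])
    name = fields[3].strip() if len(fields) > 3 else ""
    return {"pdb_id": pdb_id, "chain": chain_id, "mol": mol,
            "length": length, "name": name, "sequence": sequence}

header = ">101m_A mol:protein length:154  MYOGLOBIN"
sequence = ("MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKA"
            "SEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP"
            "GNFGADAQGAMNKALELFRKDIAAKYKELGYQG")
record = parse_seqres_record(header, sequence)
```

Comparing the declared length with the actual sequence length is a cheap consistency check before the sequence is used for PSSM or rASA mapping.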
  • 53. Atomic Solvent Accessible Area Calculations (RASA). Calculate the solvent accessible area (RASA) of each protein. Naccess V2.11 program, for Linux/Unix systems or Cygwin [http://www.bioinf.manchester.ac.uk/naccess/]. Commands: ./naccess 1a91.pdb & ./naccess 1afo.pdb & ./naccess 1aig.pdb. PDB Data Bank, PDB files [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]. ncbi-blast-2.2.24+. RASA > 0.
  • 54. Position-Specific Scoring Matrix Calculations (PSSM). Download PDB files. blast-2.2.25+ program, Microsoft Windows. NR database (non-redundant protein sequences).
      // Requires: using System.Diagnostics;
      Process p = new Process();
      p.StartInfo.UseShellExecute = false;
      p.StartInfo.RedirectStandardOutput = true;
      p.StartInfo.FileName = @"C:\blast-2.2.25+\bin\psiblast.exe";
      p.StartInfo.Arguments = string.Format("{0}", "-query " + FileNameIN + @" -db C:\blast-2.2.25+\db\nr -num_iterations 2 -out_ascii_pssm " + FileNameOUT);
      p.Start();
      Example: sample record of a .PSSM file
      1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J
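
A residue row like the sample record can be read into a 20-value node label as sketched below. The helper name is hypothetical. The column interpretation (position, residue, 20 log-odds scores, then 20 weighted observed percentages, then two trailing statistics) follows the usual layout of PSI-BLAST ASCII PSSM output and is stated here as an assumption about the file the presenter used.

```python
def parse_pssm_row(line):
    """Parse one residue row of an ASCII PSSM file."""
    tokens = line.split()
    position = int(tokens[0])
    residue = tokens[1]
    scores = [int(t) for t in tokens[2:22]]         # 20 log-odds scores
    percentages = [int(t) for t in tokens[22:42]]   # 20 weighted observed percentages
    # The two trailing statistics are ignored; on Windows the last one can
    # appear as a non-numeric token such as '1.#J'.
    return position, residue, scores, percentages

row = ("1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 "
       "77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J")
position, residue, scores, percentages = parse_pssm_row(row)
```

Reading only the first 42 tokens sidesteps the non-numeric final field in the sample record.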
  • 55. Sample Mapping File
      >1neg_A
      Seq : KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLAAAWSHPQF
      SUR : 11101011111111111111111111111111111111011111111011110111111111111
      Site : 00000000000000000000000000000000000000000000000000000000000010000
      rASA : 115.47,81.22,64.82,.00,20.59,.00,41.60,111.13,56.32,14.17,124.18,35.41,127.39,43.03,111.84,160.37,10.00,.71,33.57,1.82,120.20,91.83,15.89,41.40,69.81,.77,20.31,2.22,49.44,65.40,30.56,97.39,80.11,152.72,75.17,80.10,47.20,64.49,.00,57.09,16.33,101.38,111.31,104.16,71.57,2.73,60.84,.00,18.67,8.04,64.07,71.08,.00,125.10,66.68,24.97,32.49,79.86,65.19,179.94,87.62,51.01,109.35,145.21,71.53,
      entropy : 0.80,0.85,0.25,0.92,0.44,1.48,1.02,2.42,1.57,2.01,0.44,0.93,0.49,0.73,0.73,0.83,1.72,1.46,0.59,2.15,0.72,0.98,1.99,1.65,0.60,1.20,0.35,0.94,0.66,0.65,0.51,0.23,1.04,0.45,1.09,4.74,3.91,0.67,1.38,0.61,0.45,0.75,1.43,0.49,0.36,2.32,0.72,1.63,3.17,0.46,1.53,2.78,1.61,0.38,0.45,0.26,0.15,0.51,0.17,0.38,0.47,0.46,0.93,2.04,1.73,
      pdbindex : 6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
  • 56. Generating the Set of Graphs. Shortest-path distances (Dijkstra's algorithm); adjacency matrix; contacting neighbor residues; labeled and weighted graphs; varying numbers of nodes and edges; graph normalization by linear normalization: $X_1 = (X - Min) / (Max - Min)$.
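
The linear normalization formula maps the smallest value to 0 and the largest to 1. A minimal sketch follows; the function name is hypothetical, and returning zeros when all values are equal is an assumption made here to avoid division by zero, since the slide does not cover that case.

```python
def linear_normalize(values):
    """Apply X1 = (X - Min) / (Max - Min) to every value in the list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant input is mapped to zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

normalized = linear_normalize([2, 4, 6])
```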
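  • 57. Calculate the Distance between Atoms and Check for Contact. Distance, from the atom coordinates in the PDB file:
      $d_1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$
      Contact criterion, with van der Waals (VDW) radii $R_1$ and $R_2$ from the VDW.radii file:
      $d_1 \le R_1 + R_2 + 0.5$
      Example of a contact residue: 2 A _ 3 A! : 1.33441
      Example of a non-contact residue: 4 A _ 2 A : 4.14432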
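
The distance and contact test can be sketched as follows. The function names and toy coordinates are hypothetical, and treating two residues as contacting when any pair of their atoms satisfies the criterion is an assumption; the slide states the criterion only for a pair of atoms. The radii 1.65 and 1.76 are the N and C values listed in the deck's VDW.radii excerpt.

```python
import math

def atoms_in_contact(coord1, radius1, coord2, radius2, tolerance=0.5):
    """True when the Euclidean distance d1 satisfies d1 <= R1 + R2 + tolerance."""
    d1 = math.dist(coord1, coord2)
    return d1 <= radius1 + radius2 + tolerance

def residues_in_contact(atoms1, atoms2, tolerance=0.5):
    """Each residue is a list of ((x, y, z), vdw_radius) tuples.

    Assumption: residues are in contact if any atom pair is in contact.
    """
    return any(atoms_in_contact(c1, r1, c2, r2, tolerance)
               for c1, r1 in atoms1 for c2, r2 in atoms2)

# Toy atoms: a nitrogen (radius 1.65) and a carbon (radius 1.76).
near = atoms_in_contact((0.0, 0.0, 0.0), 1.65, (1.33, 0.0, 0.0), 1.76)
far = atoms_in_contact((0.0, 0.0, 0.0), 1.65, (5.0, 0.0, 0.0), 1.76)
```

Each contacting residue pair found this way would become an edge of weight 1 in the residue graph.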
  • 58. Structure of a Graph
  • 59. Development of the Graph Kernel. The original graphs G1 and G2 are converted into shortest-path graphs S1 (V1, E1) and S2 (V2, E2) using the Floyd-Warshall algorithm. The kernel function calculates the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2.
  • 60. The Floyd-Warshall Algorithm. Finds the shortest paths between all pairs of vertices v ∈ V of a weighted graph G = (V, E). $D^{(k)}_{ij}$ is the weight of the shortest path from vertex i to vertex j for which all intermediate vertices are in the set {1, 2, ..., k}.
      for i = 1 to N
          for j = 1 to N
              if there is an edge from i to j
                  dist[0][i][j] = the length of the edge from i to j
              else
                  dist[0][i][j] = INFINITY
      for k = 1 to N
          for i = 1 to N
              for j = 1 to N
                  dist[k][i][j] = min(dist[k-1][i][j], dist[k-1][i][k] + dist[k-1][k][j])
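
A runnable version of the pseudocode is sketched below. It uses the common in-place variant with a single two-dimensional table rather than one table per value of k, and it sets the distance from each vertex to itself to zero; both are choices made for this example, and the toy graph is hypothetical.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest-path lengths; weights[i][j] is math.inf when there is no edge."""
    n = len(weights)
    dist = [row[:] for row in weights]   # copy so the input matrix is not modified
    for i in range(n):
        dist[i][i] = 0.0
    for k in range(n):                   # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist

# Toy path graph 0 - 1 - 2 with unit edge weights.
INF = math.inf
adjacency = [[INF, 1.0, INF],
             [1.0, INF, 1.0],
             [INF, 1.0, INF]]
shortest = floyd_warshall(adjacency)
```

The resulting table holds one entry for every vertex pair, including pairs that are not directly connected, such as vertices 0 and 2 here at distance 2.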
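  • 61. Implementation
      /* Squared Euclidean distance between the 20-value PSSM labels of two residues. */
      double Pssm(int ResidueA, int ResidueB)
      {
          int i;
          double sum = 0;
          for (i = 0; i < 20; i++) {
              sum += pow((double)(seq_a_pssm[ResidueA][i] - seq_b_pssm[ResidueB][i]), 2);
          }
          return sum;
      }

      /* Excerpt: Gaussian node kernel between residue i of graph A and residue j of graph B. */
      dis += Pssm(i, j);
      attr_dis[i][j] = exp((-1) * parm_gamma * dis);

      /* Excerpt: sum over node pairs (i, k) of graph A and (j, r) of graph B. */
      sum = 0;
      for (i = 0; i < seq_a_len; i++)
          for (j = 0; j < seq_b_len; j++)
              for (k = i + 1; k < seq_a_len; k++)
                  for (r = j + 1; r < seq_b_len; r++) {
                      /* Brownian bridge kernel on the two shortest-path lengths. */
                      xx1 = seq_a_dist[i][k] - seq_b_dist[j][r];
                      Klen = MaxValue(0, CC - fabs(xx1));
                      /* Try both orientations of the undirected edge and keep the better match. */
                      product1 = attr_dis[i][j] * attr_dis[k][r];
                      product2 = attr_dis[k][j] * attr_dis[i][r];
                      value = MaxValue(product1, product2);
                      sum += value * Klen;
                  }
      return sum;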
  • 63. Results and Discussion. Comparison of similarity (TP/FP): Max; Ave; Top 10 Ave. Percentile ranking calculation. RASA value.
  • 65. rASA vs. Active Residues
  • 67. Section 3.4
      static IEnumerable<string> SortByLength(IEnumerable<string> e)
      {
          var sorted = from s in e
                       orderby s.Length descending
                       select s;
          return sorted;
      }
  • 70. Catalytic Binding Site (CSA): Active Residues
  • 72. van der Waals VDW.radii file
      RESIDUE ATOM ALA 5 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0
      RESIDUE ATOM ARG 11 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.87 0 ATOM CD 1.87 0 ATOM NE 1.65 1 ATOM CZ 1.76 0 ATOM NH1 1.65 1 ATOM NH2 1.65 1
      RESIDUE ATOM ASP 8 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.76 0 ATOM OD1 1.40 1 ATOM OD2 1.40 1
      RESIDUE ATOM ASN 8 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.76 0 ATOM OD1 1.40 1 ATOM ND2 1.65 1
      RESIDUE ATOM CYS 6 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM SG 1.85 0
      RESIDUE ATOM GLU 9 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.87 0 ATOM CD 1.76 0 ATOM OE1 1.40 1 ATOM OE2 1.40 1
      RESIDUE ATOM GLN 9 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.87 0 ATOM CD 1.76 0 ATOM OE1 1.40 1 ATOM NE2 1.65 1
      RESIDUE ATOM GLY 4 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1
      RESIDUE ATOM HIS 10 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.76 0 ATOM ND1 1.65 1 ATOM CD2 1.76 0 ATOM CE1 1.76 0 ATOM NE2 1.65 1
      RESIDUE ATOM ILE 8 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG1 1.87 0 ATOM CG2 1.87 0 ATOM CD1 1.87 0
      RESIDUE ATOM LEU 8 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.87 0 ATOM CD1 1.87 0 ATOM CD2 1.87 0
      RESIDUE ATOM LYS 9 ATOM N 1.65 1 ATOM CA 1.87 0 ATOM C 1.76 0 ATOM O 1.40 1 ATOM CB 1.87 0 ATOM CG 1.87 0 ATOM CD 1.87 0 ATOM CE 1.87 0 ATOM NZ 1.50 1

Editor's Notes

  1. Hi all, good morning. My research title is protein functional site prediction using the shortest-path graph kernel method.
  2. This is my presentation outline. (Read list.)
  3. The problem statement of our research is how to determine the functional sites in a protein structure. What are functional sites? A functional site is a residue or group in a protein that actively participates in a biochemical reaction with another element, such as magnesium, zinc, or a phosphate group. The picture shows an example of a functional site (a phosphorylation site) in purple. I will give more details about functional sites in the rest of my presentation.
  4. Functional site prediction is important for several reasons: 1. It helps us understand protein functionalities. 2. Drug design companies can use protein functional information to design new structure-based drugs. 3. Protein engineers use functional site information to design new proteins based on strongly identified functionalities.
  5. The next section of my presentation is the introduction.
  6. Proteins are very important molecules in biological cells. They are involved in virtually all cell functions, and each protein within the body has a specific role. A protein consists of one or more of the 20 amino acids shown in the table; that is, a protein is a sequence of amino acids. Each amino acid in a protein sequence is commonly called a residue. In other words, each residue in a protein represents one amino acid. The protein itself has a three-dimensional structure, which is used to identify the functionally important groups, in other words, the active sites.
  7. As I mentioned in the problem statement section, a functional site is a part of a protein that is involved in various biochemical reactions. A few examples of functional sites are shown here (read list), but in our approach we considered only the first two, that is, the phosphorylation site and the catalytic active site. This picture shows an example of a reaction that happens at a phosphorylation site: the addition or removal of a phosphate on a protein structure. Similarly, other functional sites are also involved in various biochemical reactions.
  8. Now that we have a brief idea of why protein functional site prediction is important, we need to know how to determine these functionally important groups in a protein. One popular way is to conduct laboratory experiments such as X-ray crystallography or NMR, but laboratory methods come with some challenges, which are shown in this list. A laboratory method may be time-consuming or need expensive equipment. Furthermore, some proteins do not support laboratory processes. For example, NMR needs every protein to be in a liquid state, but in reality not every protein can be converted into a liquid state. On the next slide I will explain the available alternatives.
  9. A large number of structural genomics projects are working on finding protein structures, and the protein data banks already hold a large number of structures. The problem is the lack of knowledge about the functionality of those protein structures. In brief, we already have a large number of protein structures without knowing their functionalities, and the gap between known protein structures and known functionalities grows day by day. An alternative is needed to narrow this gap, so computing professionals try to provide high-accuracy methods as a solution to this problem. We can also identify some advantages of computational methods compared with laboratory methods, such as ... (read list). They also have a disadvantage: accuracy. Most of the time laboratory methods are more accurate than computational methods, but the information that a computational method discovers about the functionally important groups in a protein can be a good guideline for laboratory technicians in their research.
  10. Nowadays, a few computational methods are in use, such as the template-based method, the microenvironment-based method, and the macroenvironment-based method. Briefly, the template-based method needs to find a similar, experimentally verified template in a protein database and then uses an alignment method to determine the functionality of a target protein structure. The microenvironment-based method basically uses the nearest neighbor method to determine the unknown functionality of a target protein by comparing the structural and physicochemical properties of its neighbors. I will explain more about the nearest neighbor method in the next few slides. The macroenvironment-based method uses the same process as the microenvironment-based method; the only difference is that the number of neighbor residues in the macroenvironment-based method is comparatively higher. In our approach we used the macroenvironment-based method to predict functional sites from a protein structure.
  11. We proposed a graph-kernel-based computational approach. We generate a set of graphs, one for each residue; each residue is either positive or negative, and its functionality is experimentally verified. The number of nodes in a graph depends on the number of residues in contact with the central node of the graph, and the number of graphs is equal to the number of residues in the protein sequence. This is only an overview of our approach; I will explain the process in detail in the materials and methods section.
  12. As mentioned in the previous slides, we generated a set of graphs, one for each residue, and each residue is either positive or negative. The functionality of each residue is verified by experimental methods, so the set of graphs in the training set can be considered a knowledge base that consists of functional site graphs and non-functional site graphs. We used two knowledge bases, one for catalytic site prediction and the other for phosphorylation site prediction. The knowledge base is used to calculate the similarity between each residue in the training set and a target residue, and we used the nearest neighbor method as the predictor of the proposed method. I will explain more about the similarity calculation and the prediction process in the materials and methods section.
  13. The next section is materials and methods.
  14. Materials and methods. We used the following databases to retrieve information about protein structures and sequences. 1. We used this link to download the PDB file of each protein. The PDB file provides information about the protein structure: it gives the geometric coordinates of each atom of each residue in the structure. We used this information to check whether any two given residues in a protein sequence are in contact with each other. 2. This link is used to download all protein sequences. It provides the sequences in FASTA format, so it is easy to map them to the PDB IDs in each dataset and retrieve the relevant protein sequences.
  15. In our research we used two datasets: one for catalytic binding sites and the other for phosphorylation sites. This link is used to download the PDB IDs of the catalytic binding site proteins, which are mapped to the PDB database to get the relevant protein sequences, as mentioned on the previous slide. We selected a dataset of 73 protein sequences, each containing at least one phosphorylation site, based on the information provided by the CSA.DAT database. The CSA.DAT database provides literature information about experimentally discovered catalytic site active residues. We then mapped each residue in the protein sequences, through the residue's index number, to the CSA.DAT database to identify catalytic active residues and non-active residues. We found 201 active residues and 2,398 non-active residues, but such an unbalanced dataset cannot give a reasonable prediction, so we randomly selected a balanced dataset based on the number of active residues. Our final balanced dataset contains 201 active residues and 201 non-active residues, in other words … functional. We used this link to download the phosphorylation site data, which also provides information about phosphorylation active residues. This dataset contains 679 protein sequences. We followed the same process as for the catalytic site database: we mapped the phosphorylation PDB IDs to the PDB database and then used the active residue list to find the active phosphorylation sites in each sequence. We found 2,062 active residues and ------- hundred and ninety-eight non-active residues, but this dataset is also too unbalanced to give a reasonable prediction, so we randomly selected a balanced dataset based on the number of active residues. Our final balanced dataset contains 2,062 active residues and 2,062 non-active residues.
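The balancing step in note 15 can be sketched as random undersampling of the non-active residues. This is only an illustration: the function name `balance_by_undersampling`, the tuple representation of a residue, and the fixed seed are assumptions, not details from the presentation.

```python
import random


def balance_by_undersampling(active, non_active, seed=0):
    """Keep every active residue and draw an equal-sized random sample
    (without replacement) from the non-active residues.

    Raises ValueError if there are fewer non-active than active residues.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    sampled = rng.sample(non_active, k=len(active))
    return list(active), sampled


# Placeholder residues sized like the catalytic-site dataset in the notes:
# 201 active and 2,398 non-active residues.
active = [("active", i) for i in range(201)]
non_active = [("non_active", i) for i in range(2398)]
positives, negatives = balance_by_undersampling(active, non_active)
```

After this step both classes have 201 members, matching the balanced catalytic dataset described above.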
  16. A graph can be defined by its vertices and edges. In our research we used undirected, labeled, and weighted graphs. A simple graph can also be represented by an adjacency matrix, and we used an adjacency matrix to represent the contacts between the residues in a graph.
  17. In our proposed method, we generate a set of graphs based on which residues are in contact with each other. Each node in a graph represents a residue of the protein structure, which may be positive or negative, in other words, a functional active site or a non-functional site. The nodes are labeled with PSSM values, which indicate the biological conservation of each amino acid. The edges are defined by the contacts between residues. Finding the contacts between residues is a little complicated because each residue consists of a number of atoms, so we need to compare every atom in one residue with every atom in another residue. If at least one atom in a residue is in contact with an atom in another residue, the two residues are considered to be in contact, and we create an edge between the two nodes. Each edge is weighted by the length between the two nodes; in our approach we always assume the length is equal to 1. The information needed to calculate the distance between two residues is provided by the PDB files and the VDW file.
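The contact rule in note 17 can be sketched as follows. The notes do not state the exact contact criterion, so the cutoff used here (sum of the two van der Waals radii plus a `tolerance` of 0.5 Å) is an assumption, as are the function names and the residue representation. The radii are the GLY and ALA values from the VDW.radii listing on slide 72.

```python
import math
from itertools import product

# (residue name, atom name) -> van der Waals radius, from the slide 72 listing.
VDW_RADII = {
    ("GLY", "N"): 1.65, ("GLY", "CA"): 1.87, ("GLY", "C"): 1.76, ("GLY", "O"): 1.40,
    ("ALA", "N"): 1.65, ("ALA", "CA"): 1.87, ("ALA", "C"): 1.76, ("ALA", "O"): 1.40,
    ("ALA", "CB"): 1.87,
}


def residues_in_contact(res_a, res_b, tolerance=0.5):
    """Two residues are in contact if at least one atom pair is close enough.

    A residue is (name, [(atom_name, (x, y, z)), ...]).
    The cutoff (radius + radius + tolerance) is an assumed criterion.
    """
    name_a, atoms_a = res_a
    name_b, atoms_b = res_b
    for (atom_a, xyz_a), (atom_b, xyz_b) in product(atoms_a, atoms_b):
        cutoff = VDW_RADII[(name_a, atom_a)] + VDW_RADII[(name_b, atom_b)] + tolerance
        if math.dist(xyz_a, xyz_b) <= cutoff:
            return True
    return False


def adjacency_matrix(residues, tolerance=0.5):
    """Symmetric 0/1 matrix; every edge has length 1, as in the notes."""
    n = len(residues)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if residues_in_contact(residues[i], residues[j], tolerance):
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy coordinates (not from a real PDB file): the first two residues are
# close to each other, the third is far from both.
residues = [
    ("GLY", [("N", (0.0, 0.0, 0.0)), ("CA", (1.5, 0.0, 0.0))]),
    ("ALA", [("N", (4.0, 0.0, 0.0)), ("CB", (5.5, 0.0, 0.0))]),
    ("GLY", [("N", (20.0, 0.0, 0.0))]),
]
adj = adjacency_matrix(residues)
```

In a real pipeline the coordinates would come from the ATOM records of a PDB file; the toy values here only exercise the any-atom-pair rule.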
  18. Put simply, a kernel is a matrix in which each element is the result of a vector product. A graph kernel is also a matrix, but it uses graphs instead of vectors: each element of the matrix is the similarity between two graphs, and the graph similarity is calculated by comparing each pair of nodes of both graphs. The shortest-path graph kernel is a graph kernel that calculates the similarity between two graphs based on the shortest paths between each pair of nodes in each graph.
  19. We used the Floyd-Warshall algorithm to convert each original graph into a shortest-path graph, and the shortest-path graph kernel is then used to calculate the similarity between two graphs. The graph similarity is calculated by comparing each pair of nodes in both graphs, based on the label values of the nodes and the weights of the edges between each pair of nodes. This function is used to calculate the similarity between two graphs; e1 and e2 denote the two edges being compared, each between a pair of nodes.
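A minimal sketch of the two steps in note 19, under stated assumptions. The Floyd-Warshall part is the standard algorithm. The kernel part follows the usual shortest-path kernel form (node kernel times edge kernel times node kernel, summed over all pairs of shortest-path edges); the Gaussian node kernel, its `gamma` value, and the delta edge kernel are illustrative choices, since the slide's actual kernel function is not reproduced in the notes.

```python
import math


def floyd_warshall(adj):
    """All-pairs shortest path lengths for a 0/1 (or weighted) adjacency matrix."""
    n = len(adj)
    dist = [[0.0 if i == j else (float(adj[i][j]) if adj[i][j] else math.inf)
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def node_kernel(x, y, gamma=0.1):
    """Gaussian kernel on node label vectors (assumed choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def edge_kernel(d1, d2):
    """Delta kernel on shortest-path lengths (assumed choice)."""
    return 1.0 if d1 == d2 else 0.0


def sp_edges(dist):
    """Edges of the shortest-path graph as ordered pairs (i, j, length)."""
    n = len(dist)
    return [(i, j, dist[i][j]) for i in range(n) for j in range(n)
            if i != j and math.isfinite(dist[i][j])]


def shortest_path_kernel(adj1, labels1, adj2, labels2):
    """Sum of node * edge * node kernels over all pairs of shortest-path edges.

    Ordered pairs are used in both graphs so the value does not depend on
    how the nodes happen to be numbered.
    """
    edges1 = sp_edges(floyd_warshall(adj1))
    edges2 = sp_edges(floyd_warshall(adj2))
    total = 0.0
    for v1, w1, d1 in edges1:
        for v2, w2, d2 in edges2:
            k_edge = edge_kernel(d1, d2)
            if k_edge:
                total += (node_kernel(labels1[v1], labels2[v2]) * k_edge
                          * node_kernel(labels1[w1], labels2[w2]))
    return total


# Three-node path graph 0 - 1 - 2 with identical one-dimensional labels.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
labels = [[1.0], [1.0], [1.0]]
sp = floyd_warshall(path)
self_similarity = shortest_path_kernel(path, labels, path, labels)
```

In the method described in the notes, the labels would be PSSM vectors and every edge would have length 1; the toy labels here only keep the arithmetic easy to check.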
  20. The nearest neighbor method classifies a target dataset using a training set, based on some property such as the distance to its neighbors. In our approach we used three nearest neighbor variants to classify the test set based on the graph similarity between the training set and the test set. The training dataset contains a set of functional site graphs and a set of non-functional site graphs, which are verified by laboratory experiments. The test set always consists of a single graph, which is either functional or non-functional and is also verified by laboratory experiment.
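The three nearest neighbor variants (max, average, and top 10 average) can be sketched as three ways of aggregating a test graph's similarities to the training graphs. The decision rule below, which predicts "functional" when the aggregated similarity to the functional graphs exceeds that to the non-functional graphs, is an assumption; the notes name the variants but do not spell out the rule.

```python
def aggregate(similarities, variant):
    """Combine a list of similarity values according to the chosen variant."""
    ranked = sorted(similarities, reverse=True)
    if variant == "max":
        return ranked[0]
    if variant == "ave":
        return sum(ranked) / len(ranked)
    if variant == "top10ave":
        top = ranked[:10]  # fewer than 10 values: average whatever is there
        return sum(top) / len(top)
    raise ValueError(f"unknown variant: {variant}")


def predict(sim_to_functional, sim_to_non_functional, variant="max"):
    """Assumed rule: pick the class whose aggregated similarity is larger."""
    positive = aggregate(sim_to_functional, variant)
    negative = aggregate(sim_to_non_functional, variant)
    return "functional" if positive > negative else "non-functional"


# Toy similarities of one test graph to three functional and three
# non-functional training graphs.
to_functional = [0.9, 0.2, 0.1]
to_non_functional = [0.6, 0.6, 0.5]
by_max = predict(to_functional, to_non_functional, "max")
by_ave = predict(to_functional, to_non_functional, "ave")
```

The toy values are chosen so that the variants disagree: one very similar functional neighbor wins under max, while the average favors the non-functional class.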
  21. There are two types of cross-validation. In k-fold cross-validation, the whole dataset is divided into k parts; one of them is used as the test dataset and the rest are used as the training set. When k is equal to the number of instances in the dataset, it becomes leave-one-out cross-validation. In our approach we used leave-one-out cross-validation for a better evaluation of the predictors. In other words, each time we use one graph as the test dataset while all the remaining graphs are used as the training set. However, we eliminate graphs of the same type from the same protein when we use the predictors.
  22. The next section is results and discussion.
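The leave-one-out procedure with the same-protein exclusion can be sketched as below. The exclusion is interpreted here as dropping training graphs that come from the same protein and have the same label as the test graph; that reading, the dictionary layout, and the toy one-dimensional classifier are all assumptions made for illustration.

```python
def training_set_for(graphs, test_index):
    """All graphs except the test graph and except graphs of the same type
    (label) from the same protein as the test graph."""
    test = graphs[test_index]
    return [g for j, g in enumerate(graphs)
            if j != test_index
            and not (g["protein"] == test["protein"] and g["label"] == test["label"])]


def leave_one_out_accuracy(graphs, classify):
    """classify(test, training) returns a predicted label."""
    correct = 0
    for i, test in enumerate(graphs):
        if classify(test, training_set_for(graphs, i)) == test["label"]:
            correct += 1
    return correct / len(graphs)


def nearest_by_feature(test, training):
    """Toy stand-in for the graph-kernel predictor: nearest value of 'x'."""
    return min(training, key=lambda g: abs(g["x"] - test["x"]))["label"]


graphs = [
    {"protein": "P1", "label": 1, "x": 1.0},
    {"protein": "P1", "label": 1, "x": 1.1},
    {"protein": "P2", "label": 1, "x": 0.9},
    {"protein": "P2", "label": 0, "x": 5.0},
    {"protein": "P3", "label": 0, "x": 5.2},
]
accuracy = leave_one_out_accuracy(graphs, nearest_by_feature)
```

In the actual method, `nearest_by_feature` would be replaced by a predictor built on shortest-path kernel similarities.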
  23. Results and discussion. We used two balanced datasets, one for catalytic enzyme active sites and the other for phosphorylation sites. Both datasets consist of functional sites and non-functional sites that are laboratory verified. The catalytic enzyme active site dataset consists of 201 functional sites and the same number of non-active sites. We used the nearest neighbor method for classification, with three variants based on similarity: max, average, and top 10 average. Based on this classification, we calculated the percentage accuracy of our prediction method. The best performance of the method on the catalytic site dataset is 77.1%. When we calculate the same value for the phosphorylation site dataset, the best performance is 63.8%. In summary, our method performs better on the catalytic enzyme dataset than on the phosphorylation dataset.
  24. To calculate the percentile data, we first sort the list by similarity value in ascending order and then divide the position of each element by the total number of elements in the list. In our approach we used the full dataset and calculated the percentile for all the nearest neighbor variants, in other words, the max, average, and top 10 average values. The results are shown on the next slide.
  25. The results clearly show that most of the active sites belong to the group under the 10% percentile.
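The percentile calculation in note 24 (sort ascending, then divide each element's position by the list length) can be sketched as follows. Counting positions from 1 and expressing the result as a percentage are assumptions; the notes do not say how positions are counted or how ties are handled.

```python
def percentile_ranks(scores):
    """Percentile of each score: its 1-based position in the ascending
    sort, divided by the number of scores, as a percentage."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices, ascending by score
    ranks = [0.0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = 100.0 * position / n
    return ranks


# Toy similarity values for four residues.
scores = [0.8, 0.1, 0.5, 0.3]
ranks = percentile_ranks(scores)
```

Each residue keeps its original index, so the percentile can be looked up per residue after sorting.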
  26. Opsitely no-active s
  27. BLAST: Basic Local Alignment Search Tool. The basic BLAST algorithm can be implemented in DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences.
  28. 1. For template-based modeling (TBM) and fold recognition methods, a prediction model can be built based on the coordinates of the appropriate template(s) [1]. These approaches generally involve four steps: 1) a representative protein structure database is searched to identify a template that is structurally similar to the protein target; 2) an alignment between the target and the template is generated that should align equivalent residues together as in the case of a structural alignment; 3) a prediction structure of the target is built based on the alignment and the selected template structure, and 4) model quality evaluation. The first two steps significantly affect the quality of the final model prediction in TBM methods.
      2. The main signature of residue microenvironment-based methods is the focus on a single residue or position in the structure and its surrounding environment. Usually, a set of structural, physicochemical and evolutionary properties are collected and encoded into a fixed-length vector. Sets of functional (positive) and non-functional (negative) residues are then incorporated into supervised machine learning approaches.
      3. Most methods discussed in this paper focus on the prediction of enzyme active sites, co-factor binding sites, or post-translational modification sites, where a relatively compact local structural region is involved. However, a large group of algorithms and tools have been developed to identify particular classes of larger structural neighborhoods, e.g. surface patches, pockets, cavities or clefts, which provide interfaces to ligands or macromolecular partners. These methods are highly valuable because protein-protein interactions lie at the center of almost every cellular process and protein-DNA binding is essential for genetic activities. Similarly, accurate identification of ligand-binding sites is valuable in the context of structure-based drug design. Residue macroenvironment-based methods have been reviewed recently, thus we provide only a brief summary and refer authors to relevant publications where appropriate.
      4. Based on the types of structural patterns they search for, graph-theoretic approaches can be used in any of the three main methodological groups (template, residue microenvironment, residue macroenvironment). However, these approaches represent a special category based on the distinct problem formulations and algorithmic approaches. Instead of using atomic coordinates directly, graph-theoretic methods start with transforming protein structures into graphs and then exploit various motif finders and graph similarity measures, combined with machine learning, to discover functional sites. Representative graph similarity measures involve subgraph enumeration, subgraph isomorphism, or identification of frequent subgraphs, although other measures, e.g. random walk-based scoring, can be applied as well.
  29. NR database: non-redundant protein sequence database.