2. Why Bacterial exotoxin identification?
Bacterial exotoxins are a major cause of disease, leading to symptoms and lesions
during infection
• It therefore becomes important to study their mechanisms in order to fight against them
Their toxins are specific to a species
• So species-specific information is needed
Exotoxins in particular, though completely neutralized in
vivo, are only partially inhibited in vitro
• Implying they are regulated by environmental signals as well; studying the
properties that interact with the environment becomes important
Most bacteria become resistant to antibiotics because of
mutation or genetic recombination
• This requires identification of new sequences
Further, inactivated exotoxins that form toxoids, while still retaining their
antigenic properties, can be used to cure certain diseases
3. Support Vector Machine?
Introduced by Vapnik in 1992
A set of related supervised learning methods that analyze data and
recognize patterns
Used for classification and regression analysis
A non-probabilistic binary linear classifier
Based on statistical learning and optimization theories
Can handle multiple, continuous as well as categorical, data
4. Principle
• Representation of examples as points in space
• Mapped such that examples of separate
categories are divided by a gap as wide as
possible
• Constructs a hyperplane, or a set of hyperplanes,
in a high- or infinite-dimensional space
• Such that the hyperplane is at maximum
distance from the nearest data points of either
class
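The max-margin principle above can be sketched with scikit-learn's `SVC` on a tiny hand-made 2-D dataset (all data points here are illustrative, not from the toxin study):

```python
# Minimal sketch of the max-margin principle with a linear SVM.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in the plane (toy data).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class +1
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The separating hyperplane is w.x + b = 0; the margin width is 2/||w||.
w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)
print(clf.predict([[1.2, 1.2], [5.8, 5.8]]))  # labels of the nearest clusters
```

New points are classified by which side of the hyperplane they fall on, and the fitted `w` maximizes the gap between the two clusters.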
5. Working:
Given a training set of instance–label pairs (xi, yi), i = 1, ..., l, where xi ∈ Rⁿ
and yi ∈ {1, −1}, as below:
Maximize the margin m (the distance from the nearest
data points of either class); for the canonical hyperplane,
yi(wᵀxi + b) = 1 at the nearest points, so m = 1/||w||
[Figure: points (x1, 1), ..., (xn, −1) separated by the hyperplane
wᵀx + b = 0, with margin m measured along the normal direction w/||w||]
The original problem in finite-dimensional
space may not be linearly separable, so it is
mapped to a higher-dimensional space
A kernel function is introduced to make
computations in the higher-dimensional
space easier
6. Optimization problem
Training requires the solution of the following optimization problem:
min over w, b, ξ of (1/2)wᵀw + C Σi ξi,
subject to yi(wᵀφ(xi) + b) ≥ 1 − ξi,
ξi ≥ 0, where
φ – function mapping from input space to feature space
C > 0 – the penalty parameter of the error term
ξi – slack (error) variables introduced
The dual of this optimization problem, found using Lagrange
multipliers, depends only on the inner products between the support
vectors and the new vector x whose class is to be determined.
The kernel function, given by K(x, z) = φ(x) · φ(z), lets the SVM learn in
the high-dimensional feature space without having to explicitly
calculate φ(x).
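The kernel trick can be verified numerically for a small case: for 2-D inputs, the degree-2 homogeneous polynomial kernel K(x, z) = (xᵀz)² equals φ(x) · φ(z) with the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²). A minimal check (the vectors are arbitrary examples):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, z):
    # Same quantity computed directly in input space: K(x, z) = (x.z)^2
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(K(x, z), np.dot(phi(x), phi(z)))  # both equal 121.0
```

The kernel evaluates the inner product in the 3-D feature space using only a 2-D dot product, which is the saving the slide describes.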
7. Kernel Function
A valid kernel function must satisfy Mercer's theorem, which requires the
corresponding kernel matrix to be symmetric positive semi-definite (zᵀKz ≥ 0).
The following are commonly used kernel functions:
linear: K(xi, xj) = xiᵀxj
polynomial: K(xi, xj) = (γxiᵀxj + r)ᵈ, γ > 0
radial basis function (RBF): K(xi, xj) = exp(−γ||xi − xj||²), γ > 0
sigmoid: K(xi, xj) = tanh(γxiᵀxj + r)
The effectiveness of an SVM depends on the selection of the kernel, the kernel
parameters, and the soft-margin parameter C.
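Mercer's condition can be checked empirically for the RBF kernel: build the kernel matrix for some random points and confirm it is symmetric with non-negative eigenvalues (random data and γ = 0.5 are illustrative choices):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), computed via the expansion
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = rbf_kernel_matrix(X)

# Mercer's condition: K symmetric, all eigenvalues >= 0 (up to rounding).
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min())
```

Any kernel matrix failing this check would not correspond to an inner product in any feature space.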
8. Data Collection
To train an SVM to classify human pathogenic bacterial toxins from nontoxins, two major
databases were compiled: one of bacterial toxins and one of nontoxins.
294 bacterial toxin sequences were taken from the Bacterial Toxin Database at
http://www.hpppi/iicb.res.in/btox
It contained representative protein sequences from 24 different genera of human pathogenic
bacteria in FASTA format
This database was created by evaluating and processing over 4750 toxin sequences from 24
different genera, retrieved from NCBI (www.ncbi.nlm.nih.gov), to remove redundancies
and obtain the representatives
9. Next, 2940 nontoxin sequences were manually assembled from NCBI,
selecting protein sequences significant to metabolic and other processes,
and then removing sequences with more than 90% sequence identity using CD-HIT.
Of the 294 toxin (positive) and 2940 nontoxin (negative) sequences,
44 toxins and 440 nontoxins were set apart for testing; the feature vectors of the
remaining 250 toxins and 2500 nontoxins were used for training.
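The train/test partition described above can be sketched as a random split that preserves the 1:10 toxin-to-nontoxin ratio (the sequence IDs below are placeholders; the real FASTA records are not shown here):

```python
import random

random.seed(42)
# Stand-ins for the real sequence identifiers.
toxins = [f"toxin_{i}" for i in range(294)]
nontoxins = [f"nontoxin_{i}" for i in range(2940)]

def split(seqs, n_test):
    # Randomly hold out n_test sequences for testing; the rest train.
    test = set(random.sample(seqs, n_test))
    train = [s for s in seqs if s not in test]
    return train, sorted(test)

toxin_train, toxin_test = split(toxins, 44)
nontoxin_train, nontoxin_test = split(nontoxins, 440)
print(len(toxin_train), len(toxin_test),
      len(nontoxin_train), len(nontoxin_test))  # 250 44 2500 440
```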
10. Feature Extraction
Twelve physicochemical properties were employed to describe
each protein
• Including hydrophobicity, contact features, absolute entropy, hydration
potential, isoelectric point, net charge, normalised flexibility parameters, relative
mutability, side-chain orientational preference, occurrence frequency, pKa (RCOOH), and
polarity
The ith feature in the feature vector of the jth protein sequence, for i = 1, 2,
..., 12, is given by
Fj(i) = Σk (prpk(i) · Nk)/N, where
• prpk(i) : ith property of the kth amino acid, ∀ k = 1, 2, ..., 20
• Nk : number of residues of the kth amino acid in the sequence
• N : length of the sequence
Dipeptide and tripeptide compositions were also used; to reduce the dimensionality
of the feature space, amino acids were grouped according to properties into 11
groups:
• FWY, R, K, DE, H, M, QN, ST, C, and AGILVP
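The formula Fj(i) above is a property-weighted residue composition. A minimal sketch, using hydrophobicity as the example property (the per-residue values and the sequence below are illustrative placeholders, not the scale used in the study):

```python
from collections import Counter

# Illustrative per-residue property values (a real amino-acid scale would
# cover all 20 residues; missing residues default to 0.0 here).
hydrophobicity = {"A": 1.8, "G": -0.4, "I": 4.5, "L": 3.8, "V": 4.2,
                  "K": -3.9, "D": -3.5, "E": -3.5, "S": -0.8, "T": -0.7}

def composition_feature(seq, prop):
    # F_j(i) = sum_k prop_k(i) * N_k / N : average property value per residue.
    counts = Counter(seq)
    N = len(seq)
    return sum(prop.get(aa, 0.0) * n for aa, n in counts.items()) / N

seq = "AGILVKDE"
print(composition_feature(seq, hydrophobicity))  # 0.375
```

Repeating this for each of the 12 properties yields the 12 physicochemical entries of the feature vector; dipeptide/tripeptide counts over the 11 residue groups fill the rest.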
11. LIBSVM tool
svmtrain: for preparing models (classifiers) trained from training sets
svmpredict: predicts the class of the test or experimental samples
Steps followed before applying the svmtrain module:
• checkdata.py, from the tools folder in the package, to check whether the data
instances are in an acceptable format
• Application of subset.py from the tools folder to split the data instances
into 80% and 20% subsets, for the training and testing modules
• Scale the data using svmscale
• Application of grid.py from the tools folder for selection of optimal
values for the kernel parameter and the penalty parameter C
The values for g and C were incremented stepwise (step 1) through a
combination of:
powers of 2 from −11 through +3 for g, and
powers of 2 from −9 to +5 for C, using the tool grid.py,
which used 5-fold cross-validation accuracy to select the optimal parameter
set.
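The grid.py search can be sketched with scikit-learn in place of the LIBSVM command-line tools (the synthetic dataset is a stand-in; the exponent ranges for g and C follow the text above):

```python
# Exhaustive grid over gamma = 2^-11..2^3 and C = 2^-9..2^5,
# scored by 5-fold cross-validation accuracy, as grid.py does.
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=120, n_features=10, random_state=0)

best = (None, None, 0.0)  # (gamma, C, cv accuracy)
for g_exp in range(-11, 4):
    for c_exp in range(-9, 6):
        clf = SVC(kernel="rbf", gamma=2.0**g_exp, C=2.0**c_exp)
        acc = cross_val_score(clf, X, y, cv=5).mean()
        if acc > best[2]:
            best = (2.0**g_exp, 2.0**c_exp, acc)
print(best)
```

The exponential grid covers several orders of magnitude cheaply, which is why LIBSVM's guide recommends it over a linear sweep.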
12. LIBSVM also provides a tool, fselect.py, to remove possibly redundant
features from the original feature set.
fselect.py ranks the features by assigning each an F-score value.
The higher the value, the more significant the feature is in predicting the classes.
Performance Evaluation
· Accuracy = (TP + TN)/(TP + TN + FP + FN)
· Balanced Accuracy, BAC = (Specificity + Sensitivity)/2, where
◦ Specificity = TN/(TN + FP)
◦ Sensitivity = TP/(TP + FN)
· AUC : area under the curve of sensitivity against (1 − specificity)
· Matthews correlation coefficient [1],
MCC = (TP·TN − FP·FN)/√((TN + FN)(TN + FP)(TP + FP)(TP + FN))
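The evaluation metrics above follow directly from the confusion-matrix counts; a small sketch (the TP/TN/FP/FN values below are made-up examples, not results from the study):

```python
def metrics(TP, TN, FP, FN):
    # Accuracy, balanced accuracy, and Matthews correlation coefficient
    # computed exactly as defined on the slide.
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    bac = (sensitivity + specificity) / 2
    mcc = (TP * TN - FP * FN) / (
        (TN + FN) * (TN + FP) * (TP + FP) * (TP + FN)
    ) ** 0.5
    return accuracy, bac, mcc

# Illustrative counts for a 44-toxin / 440-nontoxin test set.
acc, bac, mcc = metrics(TP=40, TN=420, FP=20, FN=4)
print(acc, bac, mcc)
```

Balanced accuracy and MCC matter here because the classes are imbalanced 1:10; plain accuracy would look good even for a classifier that mostly predicts "nontoxin".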
13. Result
• 92.27% average accuracy and 0.998 area under the curve (AUC) were
obtained when all 298 features were utilized, whereas
• 91.16% accuracy and 0.94 AUC were achieved with an optimized set of 114
features (supplementary file 2).
• Much higher accuracies were achieved (98.13% and 97.92% for 298 and 114
features, respectively) when an absolutely separate test set consisting of
39 toxins and 390 nontoxins (1:10 ratio) was used.
Conclusion
The top features can be studied to identify the important functionalities of the
toxic proteins.
The method is effective in identifying bacterial toxins while not being
computationally intensive.