International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 2, March – April (2013), pp. 142-157
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com



         GENERIC APPROACH FOR PREDICTING UNANNOTATED
              PROTEIN PAIR FUNCTION USING PROTEIN

                               Anjan Kumar Payra1, Sovan Saha1
                               1 Dept. of Computer Science & Engg
                 Dr. Sudhir Chandra Sur Degree Engineering College, DumDum
                                         Kolkata, India


  ABSTRACT

          Proteins are the most versatile macromolecules in living systems and serve crucial
  functions in essentially all biological processes. With the successful sequencing of several
  genomes, the challenging problem now is to determine the functions of proteins in the
  post-genomic era. Determining protein functions experimentally is a laborious and
  time-consuming task that requires many resources. Computational prediction of protein
  function is therefore an active research area: drugs for many diseases are still unknown,
  and the drug discovery process starts with protein identification, because proteins are
  responsible for many functions required for the maintenance of life. Protein identification
  in turn requires determination of protein function. Existing computational methods are
  based on sequence and structure, gene neighborhood, gene fusions, cellular localization,
  protein-protein interactions, etc. In this work, we present an approach to predict the
  functions of unannotated protein pairs based on their protein interaction network. The
  success rate obtained in our work is 94.4 %.

  Keywords: Protein interaction network, Unannotated protein pair function prediction,
  Functional groups, success rate.

  I. INTRODUCTION

          Proteins are the building blocks of life. The human body needs protein to repair and
  maintain itself, so proteins have versatile functions to perform. However, the concept of
  protein function is highly context-sensitive and not very well-defined. In fact, this concept
  typically acts as an umbrella term for all types of activities that a protein is involved in, be it
  cellular, molecular or physiological. One such categorization of the types of functions a
  protein can perform has been suggested by Bork et al. [1998]:


o Molecular function: The biochemical functions performed by a protein, such as ligand
binding, catalysis of biochemical reactions and conformational changes.
o Cellular function: Many proteins come together to perform complex physiological
functions, such as operation of metabolic pathways and signal transduction, to keep the
various components of the organism working well.
o Phenotypic function: The integration of the physiological subsystems, consisting of
various proteins performing their cellular functions, and the interaction of this integrated
system with environmental stimuli determines the phenotypic properties and behavior of the
organism.
In order to predict protein function we have to study the existing data types which can be
broadly classified under 8 sections:
       Amino acid sequences
       Protein structure
       Genome sequences
       Phylogenetic data
       Micro array expression data
       Protein interaction networks and protein complexes
       Biomedical literature
       Combination of multiple data types

      Amino acid sequences: An amino acid sequence is the order that amino acids join
together to form peptide chains, or polypeptides. If the peptide chain is a protein, this
sequence is often called the primary structure of the protein. Due to the structure of amino
acids and how they bond together, the order of the amino acids is only read in one direction
and is specific for the peptide being formed. It can be used to identify a protein or
homologous proteins through searches in databases and also to obtain information about
post-translational cleavage points. In addition, the sequence results provide information about the
purity of a preparation. The limits of detectable contamination depend on the sequences of the
analyzed proteins. The central dogma of molecular biology is the conversion of a gene to
protein via the transcription and translation phases as shown in Fig. 1. The result of this
process is a sequence constructed from twenty amino acids, and is known as the protein’s
primary structure. This sequence is the most fundamental form of information available about
the protein since it determines different characteristics of the protein such as its sub-cellular
localization, structure and function.




                         Fig. 1 Central dogma of molecular biology



The most popular experimental method for the identification of protein sequences is mass
spectrometry [Sickmann et al. 2003], which, in combination with algorithms such as
ProFound [Zhang and Chait 2000], comes in various flavors, such as peptide mass
fingerprinting, peptide fragmentation and other comparative methods. However, these methods are
low-throughput, and thus, with the exponential generation of genome sequences, the focus
has shifted to computational approaches that can identify genes from these genomes.
Specifically, techniques that predict protein function from sequence can be categorized into
three classes, namely, sequence homology-based approaches, subsequence-based approaches
and feature-based approaches, which are explained below:

Homology-based approaches: Homologous traits of organisms are those that are due to descent
from a common ancestor. Approaches in this category make the homology-based search process
more sensitive by multiple means, such as making the search probabilistic and adding evidence
from other sources of data to obtain more accurate and confident annotations for the query proteins.
Subsequence-based approaches: It has been reflected in several studies that often not the
whole sequence, but only some segments of it are important for determining the function of a
given protein. Consequently, the approaches in this category treat these segments or
subsequences as features of a protein sequence and construct models for the mapping of these
features to protein function. These models are then used to predict the function of a query
protein.
Feature-based approaches: The final category of approaches attempts to exploit the
perspective that the amino acid sequence is a unique characterization of a protein, and
determines several of its physical and functional features. These features are used to construct
a predictive model which can map the feature-value vector of a query protein to its function.

   Protein Structure: A protein is an organic biopolymer that is comprised of a set of amino
acids, and assumes a configuration in three-dimensional space due to interactions between
these constituents as shown in Fig. 2. Protein structures may be specified at multiple levels.
Usually, it is specified at three levels, with a fourth level being specified for some cases
[Schulz and Schirmer 1996]. Following is a brief description of these levels:
Primary structure: The primary structure of a protein is simply a sequence of amino acids.
Secondary structure: The sequence of a protein influences its conformation in
three-dimensional space via the formation of bonds between spatially close amino acids in the
sequence. This process is popularly known as protein folding, and leads to the creation of
substructures such as α-helices, β-sheets, turns and random coils, of which the first two are
the most common, while the last two are formed very rarely. The collection of these
substructures forms the secondary structure of a protein.
Tertiary structure: The attractive and repulsive forces among the substructures caused by
the folding balance each other and provide the protein with a relatively stable, though
complicated, three-dimensional structure. This structure is known as the tertiary structure of
the protein.
Quaternary structure: Some proteins, such as the spectrin protein [Fuller et al. 1974], consist
of multiple amino acid sequences, also known as protein subunits. Each of these sequences
folds to form its own tertiary structure, and these come together to produce the quaternary
structure of the protein.





The existing approaches in predicting protein functions from protein structure are:

Similarity-based approaches: Given the structure of a protein, these approaches identify the
protein with the most similar structure using structural alignment techniques, and transfer its
functional annotations to the query protein.




                                  Fig. 2 Structure of protein

Motif-based approaches: The approaches in this category attempt to identify
three-dimensional motifs, which are substructures conserved in a set of functionally related proteins,
and estimate a mapping between the function of a protein and the structural motifs it contains.
This mapping is then used to predict the functions of unannotated proteins.
Surface-based approaches: It is sometimes necessary to analyze the structure of a protein at
a higher resolution than that of distances between consecutive amino acids. This corresponds
to the modeling of a continuous surface for the structure and identifying features such as
voids or holes in these surfaces. The approaches in this category utilize these features to infer
a protein’s function.
Learning-based approaches: This category of recent approaches employs effective
classification methods, such as SVM and k-nearest neighbor, to identify the most appropriate
functional class for a protein from its most relevant structural features.
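
To make the learning-based idea concrete, the following is a minimal k-nearest-neighbor sketch in Python. The feature vectors, class labels and the function name knn_predict are hypothetical and chosen only for illustration; the surveyed methods use far richer structural features and are not reproduced here.

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3):
    """Predict a functional class by majority vote among the k nearest examples.

    `examples` is a list of (feature_vector, label) pairs. The features here
    are hypothetical stand-ins for structural descriptors of a protein.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (made-up numbers, two structural features per protein).
examples = [
    ((0.90, 0.10), "enzyme"),
    ((0.80, 0.20), "enzyme"),
    ((0.85, 0.15), "enzyme"),
    ((0.10, 0.90), "transporter"),
    ((0.20, 0.80), "transporter"),
]

print(knn_predict((0.70, 0.30), examples))
print(knn_predict((0.15, 0.85), examples))
```

With k = 3, the query is labeled by the majority class of its three closest training proteins in feature space.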

   Genomic sequences: Genome sequencing is a laboratory process that determines the
complete DNA sequence of an organism's genome at a single time. This entails sequencing
all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and,
for plants, in the chloroplast. Almost any biological sample containing a full copy of the
DNA—even a very small amount of DNA or ancient DNA—can provide the genetic material
necessary for full genome sequencing. DNA itself is typically a double-stranded molecule,
where one of the strands is constituted of four characters, namely A, T, C and G, which
denote the four nucleotides adenine, thymine, cytosine and guanine, and the other strand is
complementary to the first, owing to the complementarity of the A−T and C−G nucleotide
pairs as shown in Fig. 3. Several approaches have been proposed to accomplish the target of
deriving functional associations from genomic data, and possible function prediction
subsequently.
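
The relationship between the two strands can be illustrated with a short Python sketch. It assumes the standard Watson-Crick base pairs (A with T, C with G); the function names are illustrative and do not come from this paper.

```python
# Standard Watson-Crick base pairs (general biology, not specific to this paper).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]

print(complement("ATCG"))          # TAGC
print(reverse_complement("ATCG"))  # CGAT
```

Applying complement twice returns the original strand, which reflects the symmetry of the pairing.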






                                   Fig. 3 DNA molecules

These approaches largely fall into one of the following three categories [Marcotte 2000]:
Genome-wide homology-based annotation transfer: This category consists simply of the
use of larger databases for searching proteins homologous to the query proteins, and the
transfer of functional annotation from the closest results.
Gene neighborhood- or gene order-based approaches: These approaches are based on the
hypothesis that proteins, whose corresponding genes are located “close” to each other in
multiple genomes, are expected to interact functionally. This hypothesis is supported by the
concept of an operon, and its relevance to protein function [Salgado et al. 2000].
Gene fusion-based approaches: These approaches attempt to discover pairs or sets of genes
in one genome that are merged to form a single gene in another genome. The underlying
hypothesis here is that these sets of genes are functionally related, and is supported by
biochemical and structural evidence [Marcotte et al. 1999].
   Phylogenetic data: A phylogenetic tree or evolutionary tree is a branching diagram or
"tree" showing the inferred evolutionary relationships among various biological species or
other entities based upon similarities and differences in their physical and/or genetic
characteristics. The organisms that are joined together in the tree are implied to have descended
from a common ancestor. In a rooted phylogenetic tree, each node with descendants represents the
inferred most recent common ancestor of the descendants, and the edge lengths in some trees
may be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are
generally called hypothetical taxonomic units, as they cannot be directly observed.
Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint
absence of two traits across large numbers of species is used to infer a meaningful biological
connection, such as involvement of two different proteins in the same biological pathway. It
is essential to include the evolutionary perspective in any complete understanding of protein
function. As a result, several approaches for predicting protein function using evolution-based
data have recently been proposed. The field of biology that deals with the evolutionary
relationships among living organisms is also known as phylogenetics [Bittar and Sonderegger
2004]. The phylogenetic profile of a protein is (generally) a binary vector whose length is


the number of available genomes. The vector contains a 1 in the ith position if the ith genome
contains a homologue of the corresponding gene, else a 0. In several other studies, a more
extensive representation of evolutionary knowledge is used [Bittar and Sonderegger 2004].
This representation is known as a phylogenetic tree [Baldauf 2003], which is a standard tree
with respect to the graph theoretical definition, but whose nodes and branches carry special
meaning as shown in Fig. 4.
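
The binary phylogenetic profile described above can be sketched in a few lines of Python. The gene names and genome contents below are made up, and comparing profiles by Hamming distance is only one possible choice assumed here for illustration.

```python
def phylogenetic_profile(gene, genomes):
    """Binary vector with a 1 at position i if the i-th genome has a homologue of `gene`."""
    return [1 if gene in homologues else 0 for homologues in genomes]

def hamming_distance(p, q):
    """Number of positions at which two equal-length profiles differ."""
    return sum(a != b for a, b in zip(p, q))

# Hypothetical data: each set lists the genes that have a homologue in that genome.
genomes = [{"geneA", "geneB"}, {"geneC"}, {"geneA", "geneB", "geneC"}, set()]

profile_a = phylogenetic_profile("geneA", genomes)
profile_b = phylogenetic_profile("geneB", genomes)
profile_c = phylogenetic_profile("geneC", genomes)

print(profile_a)                               # [1, 0, 1, 0]
print(hamming_distance(profile_a, profile_b))  # 0
print(hamming_distance(profile_a, profile_c))  # 2
```

Genes with identical or near-identical profiles (here geneA and geneB) are the candidates for a shared pathway under the joint presence or absence hypothesis.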

   Micro array expression data: Protein synthesis from genes occurs in prokaryotic
organisms in two phases [Weaver 2002]. In the transcription phase, an mRNA is created from
the original gene by converting the latter to the corresponding RNA code. The protein is then
synthesized from mRNA by translating the RNA code to the corresponding amino acid
sequence according to the codon translation rules. Gene expression experiments are a method
to quantitatively measure the transcription phase of protein synthesis [Nguyen et al. 2002].
The most common category of these experiments uses square-shaped glass chips measuring
as little as 1 inch on either side, also known as cDNA micro arrays. An experiment using a micro
array is shown in Fig. 5. The experiment is carried out in the following stages.




                        Fig. 4 Constructing a simple phylogenetic tree

In the first stage, the chip is laid out with a matrix of dots of cDNAs, usually several
thousands in number, one corresponding to each of the genes being measured. In parallel,
mRNA is extracted from both the normal cells and the cells of the organism that have been
exposed to the condition being studied. These mRNAs are reverse transcribed to cDNA and
colored with green and red colors respectively. These colored cDNAs are then spread on the
micro array chip, leading to a hybridization of the cDNA already on the chip with those
produced by the genes in the two types of cells. This generates a spot of a certain color on the
chip for each gene which denotes its expression level. In the final stage of the experiment, the
intensity of this spot is measured by a laser scanner connected to a computer, which
generates a real-valued measurement of the expression of each gene as the logarithm of the
ratio of the red and green intensities in the spot. The result of the experiment thus is a
measurement of the transcription activity of the genes under the specified condition.
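
As a small worked example of the final measurement step, the sketch below computes a per-gene expression value from the red (condition) and green (normal) channel intensities. It assumes the common convention of a base-2 log ratio and adds a pseudocount to avoid division by zero; both choices, and the intensity values, are assumptions made for illustration rather than details from this paper.

```python
import math

def log_ratio(red, green, pseudocount=1.0):
    """Base-2 log of the red/green intensity ratio for one spot.

    Positive values suggest higher expression under the studied condition
    (red channel); negative values suggest higher expression in normal cells.
    """
    return math.log2((red + pseudocount) / (green + pseudocount))

# Made-up spot intensities: gene -> (red, green).
spots = {"gene1": (7.0, 1.0), "gene2": (3.0, 3.0), "gene3": (1.0, 7.0)}

for gene, (red, green) in spots.items():
    print(gene, log_ratio(red, green))
```

A value of zero corresponds to equal intensity in both channels, that is, no change in transcription under the condition.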






                                 Fig. 5 Micro array procedure

Existing approaches in gene expression data are:

Clustering-based approaches: An underlying hypothesis of gene expression analysis is that
functionally similar genes have similar expression profiles, since they are expected to be
activated and repressed under the same conditions. Because clustering is a natural approach
for grouping similar data points, approaches in this category cluster genes on the basis of
their gene expression profiles, and assign functions to the unannotated proteins using the
most dominant function for the respective clusters containing them.
Classification-based approaches: A more direct solution to the problem of predicting
protein function from gene expression profiles is the data mining approach of classification.
Thus, approaches in this category build various types of models for the expression-to-function
mapping using classifiers, such as neural networks, SVMs and the naive Bayes classifier, and
use these models to annotate novel proteins.
Temporal analysis-based approaches: Temporal gene expression experiments measure the
activity of genes at different instances of time, for instance, during a disease. This behavior
can also be used to predict protein function. Thus, approaches in this category derive features
from this temporal data and apply classification methods to them.
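
The annotation step of the clustering-based approaches above can be sketched as follows. The clusters are assumed to be already computed by some clustering algorithm, and the gene names and labels are hypothetical; only the assignment of the most dominant function within each cluster is shown.

```python
from collections import Counter

def annotate_by_cluster(clusters, annotations):
    """Assign each unannotated gene the most frequent function in its cluster."""
    predictions = {}
    for cluster in clusters:
        labels = Counter(annotations[g] for g in cluster if g in annotations)
        if not labels:
            continue  # no annotated member, so nothing can be transferred
        dominant = labels.most_common(1)[0][0]
        for gene in cluster:
            if gene not in annotations:
                predictions[gene] = dominant
    return predictions

# Hypothetical clusters of genes with similar expression profiles.
clusters = [["g1", "g2", "g3", "g4"], ["g5", "g6"]]
annotations = {
    "g1": "protein synthesis",
    "g2": "protein synthesis",
    "g3": "DNA repair",
    "g5": "cell polarity",
}

print(annotate_by_cluster(clusters, annotations))
```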

    Protein interaction networks and protein complexes: A protein almost never performs
its function in isolation. Rather, it usually interacts with other proteins in order to accomplish
a certain function. However, in keeping with the complexity of the biological machinery,
these interactions are of various kinds. At the highest level, they can be categorized into
genetic and physical interactions. Genetic interactions occur when the mutations in one gene
cause modifications in the behavior of another gene, which implies that these interactions are
only conceptual and do not occur physically in a genome. In our project we consider the
physical interactions between proteins, since they are more directly related to the process
through which a protein accomplishes its functions. Since a protein generally interacts with
more than one other protein, these interactions can be structured to form a network, hence
the name protein interaction network, as shown in Fig. 6 and Fig. 7.






                       Fig. 6 Organic View (Cytoscape) of our data set

Existing Approaches that attempt to predict function of proteins from a protein interaction
network can be broadly categorized into the following four categories:
Neighborhood-based approaches: These approaches utilize the neighborhood of the query
protein in the interaction network and the most “dominant” annotations among these
neighbors to predict its function.




                        Fig. 7 Circle View (Cytoscape) of our data set

Global optimization-based approaches: In many cases, the neighborhood of the query
protein may not contain enough information, such as annotated proteins, for determining the
function of the query protein robustly. Under these conditions, it may be advantageous to
consider the structure of the entire network and use the annotations of the proteins indirectly
connected to the query protein also. The approaches in this category are based on this idea,
and in most cases, are based on the optimization of an objective function based on the
annotations of the proteins in the network.




Clustering-based approaches: The approaches in this category are based on the
hypothesis that dense regions in the interaction network represent functional modules,
which are natural units in which proteins perform their function. Thus, these approaches
apply graph clustering algorithms to these networks and then determine the functions of
unannotated proteins in the extracted modules using measures such as majority.

Association-based approaches: Recently, several computationally efficient algorithms have
been proposed for finding frequently occurring patterns in data, in the field of association
analysis in data mining [Tan et al. 2005]. The approaches in this category use these
algorithms to detect frequently occurring sets of interactions in interaction networks of
protein complexes, and hypothesize that these sub graphs denote function modules. Function
prediction from these modules is performed as in the clustering based approaches.

    Biomedical literature: As in all other research communities, researchers in the fields of
biology and medicine publish the results of their research in various journals and conferences.
As a result, over the years, a huge repository of knowledge has been created in the form of
papers, books, reports, theses and other such texts. Clearly, these repositories contain a huge
amount of information about important biological concepts such as protein structure and
function, cancer-causing genes and several others. Thus, there is great utility in the mining of
these repositories and retrieval of useful information as shown in Fig. 8.
Multiple data types: With a plethora of data being generated by a wide spectrum of
proteomics experiments, it may be hypothesized that sometimes what can’t be discovered
from one source of information may become obvious when multiple sources are analyzed
simultaneously. This intuition has been concretized by Kemmeren and Holstege [2003], who
have suggested the following distinct advantages achieved by integrating functional genomics
data:




                                 Fig. 8 Biomedical literature




o Usually, individual biological data sets provide information about complementary
biological processes, such as gene expression and protein interaction networks. Thus,
combining them provides a global picture of the biological phenomena a set of genes is
involved in.
o Often, data quality varies between different types of data, as well as within different
sources of data of the same type. For instance, studies have shown significant variations
between the qualities of different protein interaction data sets [Deng et al. 2003]. Thus, the
combination of several data sources/types improves the quality of the overall data set, since
the errors in one data set may be corrected in another.
o The most important advantage of the integrative approach is that since only conclusions
valid over a set of data types are accepted, the predictions made by this approach are usually
more confident than those made on the basis of individual data sets.
     Having surveyed the different existing data types, we now turn to our own work. Our
objective is to assign un-annotated “protein pairs” to different functional groups. We
therefore focus on the existing computational techniques that use protein-protein interaction
(PPI) data to predict protein function. Protein functionality can be predicted by the
neighborhood property, which suggests that, in the PPI network, neighbors of a particular
protein have similar functions. In the work of Schwikowski [1], a neighborhood-counting
method is proposed to assign k functions to a protein by identifying the k most frequent
functional labels among its interacting partners. It is simple and effective, but the full
topology is not considered and no confidence scores are assigned for the annotations. In
the chi-square method, Hishigaki et al. [2] assign k functions to a protein
with the k largest chi-square scores. For a protein P, each function f is assigned a score
$\frac{(n_f - e_f)^2}{e_f}$, where $n_f$ is the number of proteins in the n-neighborhood of P that
have the function f. The value $e_f$ is the expectation of this number based on the frequency of
f among all
proteins in the network. Chen et al. [3] extend this neighborhood property to higher levels in
the network. They estimate the functional similarity between a protein and its level-1 and
level-2 neighbors, and their algorithm assigns a weight to each of these neighbors according
to its estimated functional similarity. Many graph algorithms have also been applied to
functional analysis. Vazquez et al. [4] assign proteins to functions so as to maximize the
connectivity among proteins assigned the same function. They map this problem into an
optimization problem, solved using simulated annealing, in which they maximize the number of
edges that connect proteins (un-annotated or previously annotated) assigned the same
function. Karaoz et al. [5] apply a similar approach to a collection of PPI data and gene
expression data. They construct a distinct network for each function in GO. For a particular
function f, the state of each annotated protein v equals +1 if v has function f and -1 if v
has a different function. Nabieva et al. [6] propose a flow-based approach to predict protein
function from the protein interaction network. Considering both the local and global
properties of the graph, this approach assigns functions to un-annotated proteins based on the
amount of flow they receive during simulation, where each annotated protein is a source of
functional flow. Deng et al. [7] propose an approach employing the theory of Markov
random fields, in which they estimate the posterior probability of a function for a protein of
interest. Letovsky and Kasif [8] use loopy belief propagation with the assumption of a
binomial model for local neighbors of a protein annotated with a given term. Similarly, Wu et
al. [9] propose a related probabilistic model to annotate functions of unknown proteins in PPI
networks based on the structure of the PPI network. Joshi et al. [10] develop a new integrated
probabilistic method for



cellular function prediction by combining information from protein-protein interactions, protein
complexes, micro array gene expression profiles and annotations of known proteins through an
integrative statistical model. In the work of Samanta et al. [11], a network-based statistical
algorithm is proposed, which assumes that if two proteins share a significantly larger number of
common interacting partners than expected, they share a common functionality. Another approach is
UVCLUSTER, proposed by Arnau et al. [12], which is based on bi-clustering and iteratively
explores distance datasets. Among graph clustering methods, at an early stage Bader and Hogue [13]
proposed Molecular Complex Detection (MCODE), where dense regions are detected
according to some parameters. Altaf-ul-Amin et al. [14] also use a clustering approach. It starts
from a single node in a graph, and clusters are gradually grown until the similarity of every
added node within a cluster and the density of the cluster reach a certain limit. Spirin and Mirny
[15] use a graph clustering approach in which they detect modules that are densely connected
within themselves but sparsely connected with the rest of the network, based on
superparamagnetic clustering and a Monte Carlo algorithm. Przulj et al. [16] use a graph theoretic
approach where clusters are identified using Leda’s routine components and those clusters are
analyzed by the Highly Connected Sub graphs (HCS) algorithm. Later, King et al. [17] partition
networks into clusters using a cost function, applying the Restricted Neighborhood Search
Clustering algorithm (RNSC). Clusters are filtered according to their size, density and
functional homogeneity. Krogan et al. [18] use the Markov clustering algorithm to predict
protein function.
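
The neighborhood-counting and chi-square methods reviewed above can be illustrated with the following minimal Python sketch. It is a simplification under stated assumptions, not the implementation of the cited authors: only level-1 neighbors are used as the neighborhood, the expected count for a function is taken as the number of neighbors times the fraction of all proteins in the network that carry that function, and the toy network and labels are made up.

```python
from collections import Counter

def neighborhood_counting(protein, neighbors, annotations, k):
    """Return the k most frequent functional labels among the direct neighbors."""
    counts = Counter(
        f for nb in neighbors[protein] for f in annotations.get(nb, ())
    )
    return [f for f, _ in counts.most_common(k)]

def chi_square_scores(protein, neighbors, annotations):
    """Score each function f by (n_f - e_f)^2 / e_f over the direct neighbors."""
    nbrs = neighbors[protein]
    if not nbrs:
        return {}
    total = len(neighbors)  # number of proteins in the network
    network_freq = Counter(f for fs in annotations.values() for f in fs)
    observed = Counter(f for nb in nbrs for f in annotations.get(nb, ()))
    scores = {}
    for f, count in network_freq.items():
        e_f = len(nbrs) * count / total        # expected count (assumption above)
        scores[f] = (observed[f] - e_f) ** 2 / e_f
    return scores

# Hypothetical network: P is unannotated and interacts with A, B and C.
neighbors = {"P": ["A", "B", "C"], "A": ["P"], "B": ["P"], "C": ["P"], "D": []}
annotations = {
    "A": {"transport"},
    "B": {"transport"},
    "C": {"folding"},
    "D": {"folding"},
}

print(neighborhood_counting("P", neighbors, annotations, k=1))
print(chi_square_scores("P", neighbors, annotations))
```

In this toy case both methods rank "transport" first for P; on real networks the chi-square score additionally corrects for how common each function is overall.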

II. PRESENT WORK

o Motivation: Many approaches over the protein-protein interaction (PPI) network have been
discussed in the previous section. A study of the literature shows that very few
investigations of PPI networks have considered protein pairs and the interconnections within
their PPI networks. This observation has encouraged us to work on the PPI network and to
predict the function of unannotated protein pairs using a generic approach, which is
discussed in the following sections.

o Dataset: In this work, the protein-protein interaction data of yeast (Saccharomyces
cerevisiae) is collected from ftp://ftpmips.gsf.de/yeast/PPI/; it contains 15613 genetic
and physical interactions. Self-interactions are discarded. A set of 12487 unique binary
interactions involving 4648 proteins is taken as data. In our proposed method 15 functional
groups are considered. They are cell cycle control (O1), cell polarity (O2), cell wall
organization and biogenesis (O3), chromatin chromosome structure (O4),
co-immunoprecipitation (O5), co-purification (O6), DNA repair (O7), lipid metabolism (O8),
nuclear-cytoplasmic transport (O9), pol II transcription (O10), protein folding (O11), protein
modification (O12), protein synthesis (O13), small molecule transport (O14) and vesicular
transport (O15). For each functional group, 90% of the protein pairs are taken as training samples
and the rest (2-8%) among them are considered as test samples.
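
The preprocessing described above (discarding self-interactions and keeping unique binary interactions) might look like the following Python sketch. The protein identifiers are placeholders, the file format of the source data is not modeled, and treating (a, b) and (b, a) as the same interaction is an assumption made here.

```python
def unique_binary_interactions(raw_pairs):
    """Drop self-interactions and duplicates from a list of protein pairs.

    Pairs are treated as unordered, so (a, b) and (b, a) count once
    (an assumption of this sketch).
    """
    unique = set()
    for a, b in raw_pairs:
        if a == b:
            continue  # self-interaction
        unique.add(tuple(sorted((a, b))))
    return unique

# Placeholder identifiers, not taken from the actual data set.
raw_pairs = [("P1", "P2"), ("P2", "P1"), ("P1", "P1"), ("P3", "P1")]

interactions = unique_binary_interactions(raw_pairs)
proteins = {p for pair in interactions for p in pair}

print(len(interactions), "unique interactions involving", len(proteins), "proteins")
```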

o Basic terminologies:

Protein interaction network: Protein-protein interactions occur when two or
more proteins bind together, often to carry out their biological function. Many of the most
important molecular processes in the cell, such as DNA replication, are carried out by large
molecular machines that are built from a large number of protein components organized by
their protein-protein interactions. These protein interactions form a network-like structure
known as the protein interaction network. Here the protein interaction network is
represented as a graph $G_P = (V, E)$, which consists of a set of vertices (nodes) $V$
connected by a set of edges (links) $E$. Each protein is represented as a node, and the
interactions between proteins are represented by edges.

Sub graph: A graph $G'_P$ is a sub graph of a graph $G_P$ if the vertex set of $G'_P$ is a
subset of the vertex set of $G_P$ and the edge set of $G'_P$ is a subset of the edge set of
$G_P$. That is, if $G'_P = (V', E')$ and $G_P = (V, E)$, then $G'_P$ is called a sub graph of
$G_P$ if $V' \subseteq V$ and $E' \subseteq E$. The protein pairs of $G'_P$ form the set
$K \cup U$, where $K$ represents the set of annotated protein pairs and $U$ represents the
set of un-annotated protein pairs.

Level-1 neighbors: In $G'_P$, the vertices directly connected to a particular vertex are
called its level-1 neighbors.
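
The two definitions above can be illustrated with a short Python sketch. It assumes the network is given as a list of (protein, protein) edges and that the sub graph of a protein contains every edge of the network whose two endpoints lie among the protein and its level-1 neighbors; the text does not state the second point explicitly. The function names and the toy network are hypothetical.

```python
def level1_neighbors(edges, protein):
    """Proteins directly connected to `protein` in the edge list."""
    return {b if a == protein else a
            for a, b in edges
            if protein in (a, b) and a != b}


def level1_subgraph(edges, protein):
    """Return (V', E') for `protein` together with its level-1 neighbors."""
    nodes = level1_neighbors(edges, protein) | {protein}
    sub_edges = {frozenset((a, b)) for a, b in edges
                 if a != b and a in nodes and b in nodes}
    return nodes, sub_edges


# Hypothetical toy network, used only for illustration.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
all_nodes = {p for e in edges for p in e}
all_edges = {frozenset(e) for e in edges}

sub_nodes, sub_edges = level1_subgraph(edges, "P1")
# Sub graph condition from the definition: V' is a subset of V, E' of E.
assert sub_nodes <= all_nodes and sub_edges <= all_edges
```

In this toy network P4 is not a level-1 neighbor of P1, so neither P4 nor the edge P3-P4 appears in the sub graph.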

o Proposed Work: The proposed work deduces the PPI network of each individual protein
belonging to an unannotated protein pair chosen from the data set described earlier. It then
identifies the common interactions between the deduced PPI networks, predicts the function
of the unannotated protein pair using the Generic Approach, and estimates the success rate
of these predictions.

o Method: Given $G'_P$, a sub graph of the protein interaction network whose protein pairs
are each associated with an element of the set O = {O1, O2, O3, ..., O15}, where Oi
represents a particular functional group, this method maps each element of the set of
un-annotated protein pairs U to an element of O. The steps of the method are as follows:

Step 1: Take any protein pair as an element from set U.
Step 2: Deduce the PPI network of each protein belonging to the
        protein pair selected in Step 1.
Step 3: Find the common interacting pairs between the PPI networks
        deduced in Step 2.
Step 4: Count the number of occurrences Si (i = 1, ..., 15) of each functional group Oi
        of set O among the common interacting pairs found in Step 3.
Step 5: Assign the Oi corresponding to Max(Si) (i = 1, ..., 15) to the
        unannotated protein pair considered in Step 1.
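
Steps 1 to 5 can be sketched in Python as shown below. This is one possible reading of the method, not the authors' implementation. It assumes that a "common interacting pair" is an interaction linking a level-1 neighbor of the first protein to a level-1 neighbor of the second, and that ties between functional groups go to the label that sorts first; the paper does not specify tie-breaking. All names (`build_adjacency`, `predict_pair_function`, `pair_labels`) and the toy network are hypothetical.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Map each protein to the set of its level-1 neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-interactions
            adj[a].add(b)
            adj[b].add(a)
    return adj


def predict_pair_function(pair, adj, pair_labels):
    """Apply Steps 2-5 to one unannotated pair; return a group or None."""
    a, b = pair
    neighbors_a = adj.get(a, set()) - {b}          # Step 2
    neighbors_b = adj.get(b, set()) - {a}
    # Step 3: interactions joining the two neighborhoods, each counted once.
    common = {frozenset((x, y))
              for x in neighbors_a
              for y in neighbors_b & adj.get(x, set())}
    # Step 4: occurrences of each functional group among the common pairs.
    counts = Counter(pair_labels[p] for p in common if p in pair_labels)
    if not counts:
        return None                                # no annotated common pair
    # Step 5: most frequent group; ties go to the label that sorts first.
    return max(sorted(counts), key=counts.__getitem__)


# Hypothetical toy network: A and B form the unannotated pair.
edges = [("A", "X1"), ("A", "X2"), ("B", "Y1"), ("B", "Y2"),
         ("X1", "Y1"), ("X2", "Y1"), ("X2", "Y2")]
pair_labels = {frozenset(("X1", "Y1")): "O7",
               frozenset(("X2", "Y1")): "O7",
               frozenset(("X2", "Y2")): "O2"}
adj = build_adjacency(edges)
print(predict_pair_function(("A", "B"), adj, pair_labels))  # O7 (2 votes against 1)
```

Returning `None` when no annotated common pair exists keeps that case visible to the caller instead of forcing an arbitrary assignment.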

o Illustration of the method with an example:
An un-annotated protein pair YAL011w-YDL181w is taken from our test dataset U; it is
shown in yellow in Fig. 9. From $G_P$, the sub graph $G'_{YAL011w}$ is taken, in which the
level-1 neighbors of YAL011w are YDR146c, YCR033w, YDR181c, YDL080c and YDR269w.
Similarly, the level-1 neighbors in $G'_{YDL181w}$ are YPL078c, YPL240c, YBR118w and
YER148w. Two functional groups (DNA repair and cell polarity) are involved at level 1, as
shown in Fig. 9.




     Fig. 9 Sub-graph $G'_P$ of protein pair YAL011w-YDL181w and its level-1 neighbors

Next, the common interacting pairs between $G'_{YAL011w}$ and $G'_{YDL181w}$ are
considered. Fig. 9 shows that there is only one common interacting pair, YDL080c-YPL078c,
which is marked in green. According to our dataset, the protein pair YDL080c-YPL078c
belongs to the functional group DNA repair (O7). The number of occurrences of each
functional group among the common interacting pairs is then counted, and the functional
group with the highest number of occurrences is assigned to the unannotated protein pair.
Since the only common interacting pair in Fig. 9 belongs to O7, we assign O7 to the
unannotated protein pair YAL011w-YDL181w.
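
The example can be replayed in a few lines of Python using only the identifiers given in the text. The sketch assumes that a common interacting pair has one endpoint in each neighbor list. The variable names are illustrative, and the single cross interaction is entered by hand rather than read from the dataset.

```python
from collections import Counter

# Level-1 neighbors listed in the text for the pair YAL011w-YDL181w.
neighbors = {
    "YAL011w": {"YDR146c", "YCR033w", "YDR181c", "YDL080c", "YDR269w"},
    "YDL181w": {"YPL078c", "YPL240c", "YBR118w", "YER148w"},
}
# Interactions between the two neighborhoods; the text reports exactly one.
cross_edges = {frozenset(("YDL080c", "YPL078c"))}
# Functional group of that pair according to the text: DNA repair (O7).
pair_group = {frozenset(("YDL080c", "YPL078c")): "O7"}

# Keep edges with one endpoint in each neighbor list (common interacting pairs).
common = {
    edge for edge in cross_edges
    if len(edge & neighbors["YAL011w"]) == 1
    and len(edge & neighbors["YDL181w"]) == 1
}
# Count functional groups among the common pairs and take the most frequent.
votes = Counter(pair_group[e] for e in common if e in pair_group)
predicted = votes.most_common(1)[0][0]
print(predicted)  # O7
```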




    Fig. 10 Sub-graph $G'_P$ of protein pair YMR236w-YHR099w and its level-1 neighbors

Another example of a sub graph obtained in our work is shown above in Fig. 10; the function of
YMR236w-YHR099w is predicted in the same way as described earlier. In our work, we select
unannotated protein pairs and predict their functional groups using the Generic Approach, as shown in TABLE-I.
By counting the matched and unmatched predictions for these protein pairs, we obtain the success rate, or
probability of success, shown in TABLE-II.

                                                        TABLE - I

No.  Unannotated protein pair  Original function              Predicted function                     Result (× = mismatch)
1    YNL250w|YKL101w           Cell cycle control             Cell cycle control
2    YBR023c|YER111c           Cell cycle control             Cell cycle control
3    YPL174c|YLR210w           Mitosis                        Mitosis
4    YLR229c|YPL161c           Two hybrid                     Two hybrid
5    YBR023c|YLR370c           Cell polarity                  Cell polarity
6    YNL233w|YCR009c           Cell polarity                  Cell wall organization and biogenesis  ×
7    YBL061c|YLR342w           Cell polarity                  Cell polarity
8    YFR036w|YLR127c           Coimmunoprecipitation          Coimmunoprecipitation
9    YDR108w|YML077w           Coimmunoprecipitation          Coimmunoprecipitation
10   YFR002w|YGR119c           two hybrid                     two hybrid
11   YBL014c|YML043c           Coimmunoprecipitation          affinity purification                  ×
12   YBR193c|YOL135c           Coimmunoprecipitation          Coimmunoprecipitation
13   YBL084c|YDR118w           Coimmunoprecipitation          Coimmunoprecipitation
14   YDR145w|YGR252w           copurification                 copurification
15   YHR099w|YOL148c           copurification                 copurification
16   YHR099w|YMR236w           copurification                 copurification
17   YGL112c|YHR099w           copurification                 copurification
18   YBR081c|YDR392w           copurification                 copurification
19   YGL097w|YIL063c           copurification                 copurification
20   YGL097w|YIL063c           synthetic lethal               synthetic lethal
21   YDR145w|YDR176w           copurification                 copurification
22   YDR145w|YLR055c           copurification                 copurification
23   YNL273w|YGL163c           DNA repair                     DNA repair
24   YCL061c|YMR190c           DNA repair                     DNA repair
25   YKL113c|YDR369c           DNA repair                     DNA repair
26   YGR078c|YFR019w           Lipid metabolism               Lipid metabolism
27   YBR023c|YFR019w           Lipid metabolism               Lipid metabolism
28   YCL061c|YAR002w           Nuclear-cytoplasmic transport  Nuclear-cytoplasmic transport
29   YLR418c|YLR384c           Pol II transcription           Pol II transcription
30   YLR418c|YJR140c           Pol II transcription           Pol II transcription
31   YPR135w|YGL244w           Pol II transcription           Pol II transcription
32   YPR135w|YHR200w           Pol II transcription           Pol II transcription
33   YOR070c|YJR032w           Protein folding                Protein folding
34   YDR420w|YDR245w           Protein modification           Protein modification
35   YLR418c|YDR363w-a         Vesicular transport            Vesicular transport
36   YLR039c|YLR360w           Vesicular transport            Vesicular transport


                                                        TABLE - II

Total no. of unannotated protein pairs    Matched    Unmatched    Success rate (%)
36                                        34         2            94.4
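
The figure in TABLE-II follows directly from the counts: the success rate is the number of pairs whose predicted function matches the original one, divided by the total number of unannotated pairs tested. A minimal Python sketch of this arithmetic is given below; the function name `success_rate` and the input checks are illustrative additions.

```python
def success_rate(matched, total):
    """Fraction of unannotated pairs whose predicted function matches."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= matched <= total:
        raise ValueError("matched must lie between 0 and total")
    return matched / total


# Counts reported in TABLE-II: 36 pairs, 34 matched, 2 unmatched.
rate = success_rate(matched=34, total=36)
print(f"{100 * rate:.1f}%")  # 94.4%
```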




III. RESULTS & DISCUSSION

The above method is evaluated by its success rate, which is defined as

$$\text{Success rate} = \frac{\text{number of protein pairs whose function is predicted correctly}}{\text{total number of unannotated protein pairs}}$$


In our work, we predict the functions of protein pairs using the Generic Approach algorithm
and estimate the success rate for the 15 functional groups considered. The probability of
success for six of these functional groups (co-purification (O6), co-immuno-precipitation
(O5), pol II transcription (O10), vesicular transport (O15), DNA repair (O7) and cell
polarity (O2)) is shown in tabular form in TABLE-III and pictorially in Fig. 12.
                                         TABLE - III

FUNCTIONAL GROUP    NUMBER OF UNANNOTATED    NUMBER OF MATCHED    PROBABILITY OF
                    PROTEIN PAIRS            PROTEIN PAIRS        SUCCESS
O6                  8                        8                    1
O5                  5                        4                    0.8
O10                 4                        4                    1
O15                 2                        2                    1
O2                  3                        2                    0.66
O7                  3                        3                    1
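
The per-group probabilities can be recomputed from the original and predicted functions listed in TABLE-I. The Python sketch below does this for the cell polarity (O2) and co-immuno-precipitation (O5) rows only (rows 5-9 and 11-13); the function name `per_group_success` is illustrative. Note that 2/3 is 0.666..., which TABLE-III reports truncated as 0.66.

```python
from collections import defaultdict

# (original, predicted) functions for rows 5-9 and 11-13 of TABLE-I.
rows = [
    ("Cell polarity", "Cell polarity"),
    ("Cell polarity", "Cell wall organization and biogenesis"),
    ("Cell polarity", "Cell polarity"),
    ("Coimmunoprecipitation", "Coimmunoprecipitation"),
    ("Coimmunoprecipitation", "Coimmunoprecipitation"),
    ("Coimmunoprecipitation", "affinity purification"),
    ("Coimmunoprecipitation", "Coimmunoprecipitation"),
    ("Coimmunoprecipitation", "Coimmunoprecipitation"),
]


def per_group_success(rows):
    """Return {group: (pairs, matched, matched / pairs)} per original group."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for original, predicted in rows:
        totals[original] += 1
        if original == predicted:
            matched[original] += 1
    return {g: (totals[g], matched[g], matched[g] / totals[g]) for g in totals}


for group, (pairs, hits, prob) in per_group_success(rows).items():
    print(group, pairs, hits, round(prob, 2))
```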




[Figure: bar chart comparing, for each functional group, the number of unannotated protein
pairs, the number of matched protein pairs and the probability of success; vertical axis
from 0 to 9]

          Fig. 12 Pictorial representation of success rate for six functional groups.

Our proposed work adds an extra dimension to existing graph-theoretic methods, as it
predicts the function of an unannotated protein pair, instead of a single protein, by
considering level-1 neighbors. We hope that the performance of the generic approach will
improve if a larger interaction network and level-2 neighbors are considered. In future, our
aim is to work with more functional groups and with different organisms.



REFERENCES

 [1] B. Schwikowski, P. Uetz and S. Fields, A network of protein-protein interactions in yeast.
     Nature Biotech. 18, 1257-1261, 2000.
 [2] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami, and T. Tagaki, Assessment of prediction
     accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531,
     2001.
 [3] J. Chen, W. Hsu, M. L. Lee, and S. K. Ng. Labeling network motifs in protein
     interactomes for protein function prediction. Proc 23rd International Conference on Data
     Engineering (ICDE). 546-555, 2007.
 [4] Vazquez, “Global Protein Function Prediction from Protein-Protein Interaction
     Networks,” Nature Biotechnology, vol. 21, pp. 697-700, June 2003.
 [5] U. Karaoz, T. M. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor, and S. Kasif.
     Whole-genome annotation by using evidence Integration in functional-linkage.
 [6] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, M. Singh. Whole Proteome prediction of
     protein functions via graph-theoretic analysis of interaction maps. Bioinformatics 21
     (Suppl 1): i302– i310, 2005.
 [7] M. Deng, Inferring domain-domain interactions from protein protein interactions.
     Genome Res. 12(10):1540-8, 2002.
 [8] S. Letovsky, S. Kasif. Predicting protein function from protein protein interaction data: a
     probabilistic approach. Bioinformatics 19 (Suppl 1): i197–i204, 2003.
 [9] D. D. Wu, X. Hu, An efficient approach to detect a protein community from a seed. 2005
     IEEE Symposium on Computational Intelligence in Bioinformatics and Computational
     Biology (CIBCB2005). La Jolla CA, USA: IEEE pp. 135–141, 2005.
 [10] Vazquez, “Global Protein Function Prediction from Protein-Protein Interaction
     Networks,” Nature Biotechnology, vol. 21, pp. 697-700, June 2003.
 [11] M. P. Samanta, S. Liang, Predicting protein functions from
     redundancies in large scale protein interaction networks. Proc Natl Acad Sci USA 100:
     12579–12583, 2003.
 [12] V. Arnau, S. Mars, I. Marin, Iterative cluster analysis of protein interaction data.
     Bioinformatics 21: 364–378, 2005.
 [13] G. D. Bader, C. W. Hogue, An automated method for finding molecular complexes in
     large protein interaction networks. BMC Bioinformatics 4: 2, 2003.
 [14] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, S. Kanaya, Development and
     implementation of an algorithm for detection of protein complexes in large interaction
     networks. BMC Bioinformatics 7: 207, 2006.
 [15] V. Spirin, L. A. Mirny, Protein complexes and functional modules in molecular
     networks. Proc Natl Acad Sci USA 100: 12123–12128, 2003.
 [16] A. D. King, N. Przulj, I. Jurisica, Protein complex prediction via cost-based clustering.
     Bioinformatics 20: 3013–3020, 2004.
 [17] S. Asthana, O. D. King, F. D. Gibbons, F. P. Roth, Predicting protein complex
     membership using probabilistic network reliability. Genome Res 14: 1170–1175, 2004.
 [18] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, Global
     landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637–
     643, 2006.
 [19] Deepalakshmi. R and Jothi Venkateswaran C, “A Survey on Mining Methods for
     Protein Sequence Analysis: An Aerial View”, International journal of Computer
     Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 28 - 34, ISSN Print:
     0976 – 6367, ISSN Online: 0976 – 6375.

A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...IAEME Publication
 
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...IAEME Publication
 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...IAEME Publication
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...IAEME Publication
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...IAEME Publication
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTIAEME Publication
 

Mehr von IAEME Publication (20)

IAEME_Publication_Call_for_Paper_September_2022.pdf
IAEME_Publication_Call_for_Paper_September_2022.pdfIAEME_Publication_Call_for_Paper_September_2022.pdf
IAEME_Publication_Call_for_Paper_September_2022.pdf
 
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
 
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURSA STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
 
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURSBROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
 
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONSDETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
 
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONSANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
 
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINOVOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
 
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
 
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMYVISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
 
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
 
GANDHI ON NON-VIOLENT POLICE
GANDHI ON NON-VIOLENT POLICEGANDHI ON NON-VIOLENT POLICE
GANDHI ON NON-VIOLENT POLICE
 
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
 
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
 
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
 
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
 
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
 

Generic approach for predicting unannotated protein pair function using protein

  • 1.
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April (2013), pp. 142-157, © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

GENERIC APPROACH FOR PREDICTING UNANNOTATED PROTEIN PAIR FUNCTION USING PROTEIN

Anjan Kumar Payra1, Sovan Saha1
1 Dept. of Computer Science & Engg, Dr. Sudhir Chandra Sur Degree Engineering College, DumDum, Kolkata, India

ABSTRACT

Proteins are the most versatile macromolecules in living systems and serve crucial functions in essentially all biological processes. With several genomes successfully sequenced, the challenge in the post-genomic era is to determine the functions of proteins. Determining protein functions experimentally is a laborious and time-consuming task involving many resources. Research is therefore going on to predict protein functions using various computational methods. The motivation is practical: drugs for various diseases are still unknown or yet to be discovered, and the drug discovery process starts with protein identification, because proteins are responsible for many functions required for the maintenance of life. Protein identification in turn requires the determination of protein function. The computational methods are based on sequence and structure, gene neighborhood, gene fusions, cellular localization, protein-protein interactions, etc. In this work, we present an approach that predicts the functions of an unannotated protein pair based on its protein interaction network. The success rate obtained in our work is 94.4%.

Keywords: Protein interaction network, Unannotated protein pair function prediction, Functional groups, Success rate.

I. INTRODUCTION

Proteins are the building blocks of life. The human body needs protein to repair and maintain itself, so proteins have versatile functions to perform. However, the concept of protein function is highly context-sensitive and not very well defined. In fact, it typically acts as an umbrella term for all types of activities that a protein is involved in, be they cellular, molecular or physiological. One categorization of the types of functions a protein can perform has been suggested by Bork et al. [1998]:
  • 2.
    o Molecular function: The biochemical functions performed by a protein, such as ligand binding, catalysis of biochemical reactions and conformational changes.
    o Cellular function: Many proteins come together to perform complex physiological functions, such as operation of metabolic pathways and signal transduction, to keep the various components of the organism working well.
    o Phenotypic function: The integration of the physiological subsystems, consisting of various proteins performing their cellular functions, and the interaction of this integrated system with environmental stimuli determine the phenotypic properties and behavior of the organism.

In order to predict protein function, we have to study the existing data types, which can be broadly classified under 8 sections:

    o Amino acid sequences
    o Protein structure
    o Genome sequences
    o Phylogenetic data
    o Micro array expression data
    o Protein interaction networks and protein complexes
    o Biomedical literature
    o Combination of multiple data types

Amino acid sequences: An amino acid sequence is the order in which amino acids join together to form peptide chains, or polypeptides. If the peptide chain is a protein, this sequence is often called the primary structure of the protein. Because of the structure of amino acids and how they bond together, the sequence is read in one direction only and is specific to the peptide being formed. It can be used to identify a protein or homologous proteins through database searches and to obtain information about post-translational cleavage points. In addition, the sequence results provide information about the purity of a preparation; the limits of detectable contamination depend on the sequences of the analyzed proteins.
The central dogma of molecular biology describes the conversion of a gene to a protein via the transcription and translation phases, as shown in Fig. 1. The result of this process is a sequence constructed from twenty amino acids, known as the protein's primary structure. This sequence is the most fundamental form of information available about the protein, since it determines different characteristics of the protein such as its sub-cellular localization, structure and function.

Fig. 1 Central dogma of molecular biology
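The two phases described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a DNA coding strand as input; the function names `transcribe` and `translate` are illustrative and do not come from the paper.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        # The lookup table is keyed by DNA letters, so map U back to T.
        codon = mrna[start:start + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon ends the primary structure
            break
        protein.append(residue)
    return "".join(protein)


primary_structure = translate(transcribe("ATGGCCTAA"))
```

Here the hypothetical gene fragment `ATGGCCTAA` is transcribed to `AUGGCCUAA` and translated until the stop codon, which yields a two-residue primary structure.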
  • 3.
The most popular experimental method for the identification of protein sequences is mass spectrometry [Sickmann et al. 2003], which, in combination with algorithms such as ProFound [Zhang and Chait 2000], comes in various flavors, such as peptide mass fingerprinting, peptide fragmentation and other comparative methods. However, these methods are low-throughput, and thus, with the exponential generation of genome sequences, the focus has shifted to computational approaches that can identify genes from these genomes. Specifically, techniques that predict protein function from sequence can be categorized into three classes, namely sequence homology-based approaches, subsequence-based approaches and feature-based approaches, which are explained below:

Homology-based approaches: Homologous traits of organisms are those due to descent from a common ancestor. Approaches in this category make the homology-based search process more sensitive by multiple means, such as making the search probabilistic and adding evidence from other sources of data, to obtain more accurate and confident annotations for the query proteins.

Subsequence-based approaches: Several studies have shown that often not the whole sequence, but only some segments of it, are important for determining the function of a given protein. Consequently, the approaches in this category treat these segments or subsequences as features of a protein sequence and construct models for the mapping of these features to protein function. These models are then used to predict the function of a query protein.

Feature-based approaches: The final category of approaches attempts to exploit the perspective that the amino acid sequence is a unique characterization of a protein and determines several of its physical and functional features.
These features are used to construct a predictive model that maps the feature-value vector of a query protein to its function.

Protein structure: A protein is an organic biopolymer composed of a set of amino acids, and it assumes a configuration in three-dimensional space due to interactions between these constituents, as shown in Fig. 2. Protein structure may be specified at multiple levels. Usually it is specified at three levels, with a fourth level specified in some cases [Schulz and Schirmer 1996]. Following is a brief description of these levels:

Primary structure: The primary structure of a protein is simply a sequence of amino acids.

Secondary structure: The sequence of a protein influences its conformation in three-dimensional space via the formation of bonds between spatially close amino acids in the sequence. This process is popularly known as protein folding, and leads to the creation of substructures such as α-helices, β-sheets, turns and random coils, of which the first two are the most common, while the last two are formed very rarely. The collection of these substructures forms the secondary structure of a protein.

Tertiary structure: The attractive and repulsive forces among the substructures caused by the folding balance each other and provide the protein with a relatively stable, though complicated, three-dimensional structure. This structure is known as the tertiary structure of the protein.

Quaternary structure: Some proteins, such as the spectrin protein [Fuller et al. 1974], consist of multiple amino acid sequences, also known as protein subunits. Each of these sequences folds to form its own tertiary structure, and these come together to produce the quaternary structure of the protein.
  • 4.
The existing approaches for predicting protein function from protein structure are:

Similarity-based approaches: Given the structure of a protein, these approaches identify the protein with the most similar structure using structural alignment techniques, and transfer its functional annotations to the query protein.

Fig. 2 Structure of protein

Motif-based approaches: The approaches in this category attempt to identify three-dimensional motifs, that is, substructures conserved in a set of functionally related proteins, and estimate a mapping between the function of a protein and the structural motifs it contains. This mapping is then used to predict the functions of unannotated proteins.

Surface-based approaches: It is sometimes necessary to analyze the structure of a protein at a higher resolution than that of distances between consecutive amino acids. This corresponds to modeling a continuous surface for the structure and identifying features such as voids or holes in this surface. The approaches in this category utilize these features to infer a protein's function.

Learning-based approaches: This category of recent approaches employs effective classification methods, such as SVM and k-nearest neighbor, to identify the most appropriate functional class for a protein from its most relevant structural features.

Genomic sequences: Genome sequencing is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast.
Almost any biological sample containing a full copy of the DNA, even a very small amount of DNA or ancient DNA, can provide the genetic material necessary for full genome sequencing. DNA itself is typically a double-stranded molecule in which one strand is written with four characters, namely A, T, C and G, which denote the four nucleotide bases adenine, thymine, cytosine and guanine, and the other strand is complementary to the first, owing to the complementarity of the A−T and C−G nucleotide pairs, as shown in Fig. 3. Several approaches have been proposed to derive functional associations from genomic data and subsequently predict function.
  • 5.
Fig. 3 DNA molecules

These approaches largely fall into one of the following three categories [Marcotte 2000]:

Genome-wide homology-based annotation transfer: This category consists simply of the use of larger databases for searching proteins homologous to the query proteins, and the transfer of functional annotation from the closest results.

Gene neighborhood- or gene order-based approaches: These approaches are based on the hypothesis that proteins whose corresponding genes are located “close” to each other in multiple genomes are expected to interact functionally. This hypothesis is supported by the concept of an operon and its relevance to protein function [Salgado et al. 2000].

Gene fusion-based approaches: These approaches attempt to discover pairs or sets of genes in one genome that are merged to form a single gene in another genome. The underlying hypothesis here is that these sets of genes are functionally related, and it is supported by biochemical and structural evidence [Marcotte et al. 1999].

Phylogenetic data: A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics. The organisms joined together in the tree are implied to have descended from a common ancestor. In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of the descendants, and the edge lengths in some trees may be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units, as they cannot be directly observed.
Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as involvement of two different proteins in the same biological pathway. It is essential to include the evolutionary perspective in any complete understanding of protein function. As a result, several approaches for predicting protein function using evolution-based data have recently been proposed. The field of biology that deals with the evolutionary relationships among living organisms is known as phylogenetics [Bittar and Sonderegger 2004]. The phylogenetic profile of a protein is (generally) a binary vector whose length is the number of available genomes. The vector contains a 1 in the ith position if the ith genome contains a homologue of the corresponding gene, else a 0. In several other studies, a more extensive representation of evolutionary knowledge is used [Bittar and Sonderegger 2004]. This representation is known as a phylogenetic tree [Baldauf 2003], which is a standard tree with respect to the graph-theoretical definition, but whose nodes and branches carry special meaning, as shown in Fig. 4.

  • 6.
Fig. 4 Constructing a simple phylogenetic tree

Micro array expression data: Protein synthesis from genes occurs in prokaryotic organisms in two phases [Weaver 2002]. In the transcription phase, an mRNA is created from the original gene by converting the latter to the corresponding RNA code. The protein is then synthesized from the mRNA by translating the RNA code to the corresponding amino acid sequence according to the codon translation rules. Gene expression experiments are a method to quantitatively measure the transcription phase of protein synthesis [Nguyen et al. 2002]. The most common category of these experiments uses square-shaped glass chips measuring as little as 1 inch on either side, also known as cDNA micro arrays. An experiment using a micro array is shown in Fig. 5. The experiment is carried out in the following stages.

In the first stage, the chip is laid out with a matrix of dots of cDNAs, usually several thousand in number, one corresponding to each of the genes being measured. In parallel, mRNA is extracted from both normal cells and cells of the organism that have been exposed to the condition being studied. These mRNAs are reverse transcribed to cDNA and colored green and red respectively.
These colored cDNAs are then spread on the micro array chip, leading to hybridization of the cDNA already on the chip with those produced by the genes in the two types of cells. This generates a spot of a certain color on the chip for each gene, which denotes its expression level. In the final stage of the experiment, the intensity of this spot is measured by a laser scanner connected to a computer, which generates a real-valued measurement of the expression of each gene as the log of the ratio of the red and green intensities in the region. The result of the experiment is thus a measurement of the transcription activity of the genes under the specified condition.
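The final measurement step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the common convention of a base-2 log of the red-to-green intensity ratio, where red marks the cells exposed to the condition and green marks the normal cells, and the gene names and intensity values are made up.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Log2 ratio of the exposed (red) to the normal (green) spot intensity."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


# Hypothetical (red, green) scanner intensities for three spots.
spots = {
    "geneA": (2000.0, 500.0),   # more transcript in the exposed cells
    "geneB": (300.0, 1200.0),   # less transcript in the exposed cells
    "geneC": (800.0, 800.0),    # unchanged
}
expression = {gene: log_ratio(r, g) for gene, (r, g) in spots.items()}
```

A positive value indicates that the gene is more active under the studied condition, a negative value that it is repressed, and zero that its activity is unchanged.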
  • 7.
Fig. 5 Micro array procedure

Existing approaches that use gene expression data are:

Clustering-based approaches: An underlying hypothesis of gene expression analysis is that functionally similar genes have similar expression profiles, since they are expected to be activated and repressed under the same conditions. Because clustering is a natural approach for grouping similar data points, approaches in this category cluster genes on the basis of their gene expression profiles, and assign functions to the unannotated proteins using the most dominant function of the clusters containing them.

Classification-based approaches: A more direct solution to the problem of predicting protein function from gene expression profiles is the data mining approach of classification. Approaches in this category build various types of models for the expression-to-function mapping using classifiers such as neural networks, SVMs and the naive Bayes classifier, and use these models to annotate novel proteins.

Temporal analysis-based approaches: Temporal gene expression experiments measure the activity of genes at different instants of time, for instance, during a disease. This behavior can also be used to predict protein function. Approaches in this category derive features from this temporal data and use classification to predict function.

Protein interaction networks and protein complexes: A protein almost never performs its function in isolation. Rather, it usually interacts with other proteins in order to accomplish a certain function. However, in keeping with the complexity of the biological machinery, these interactions are of various kinds. At the highest level, they can be categorized into genetic and physical interactions.
Genetic interactions occur when mutations in one gene cause modifications in the behavior of another gene, which implies that these interactions are only conceptual and do not occur physically in a genome. In our project we consider the physical interactions between proteins, since they are more directly related to the process through which a protein accomplishes its functions. Since a protein generally interacts with more than one other protein, these interactions can be structured to form a network, hence the name protein interaction network. The network for our data set is shown in Fig. 6 and Fig. 7.
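As a minimal sketch of how pairwise physical interactions form a network, the Python fragment below builds an undirected adjacency map from a list of interacting pairs. The protein labels `P1` to `P4` and the function name `build_network` are hypothetical and are not taken from our data set.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        network[a].add(b)
        network[b].add(a)
    return dict(network)


# Hypothetical interaction pairs.
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
network = build_network(pairs)

# Number of interaction partners of each protein.
degree = {protein: len(partners) for protein, partners in network.items()}
```

Storing the partners of each protein as a set makes the neighborhood lookups used by the prediction approaches discussed next straightforward.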
  • 8.
Fig. 6 Organic View (Cytoscape) of our data set

Existing approaches that attempt to predict the function of proteins from a protein interaction network can be broadly categorized into the following four categories:

Neighborhood-based approaches: These approaches utilize the neighborhood of the query protein in the interaction network and the most “dominant” annotations among these neighbors to predict its function.

Fig. 7 Circle View (Cytoscape) of our data set

Global optimization-based approaches: In many cases, the neighborhood of the query protein may not contain enough information, such as annotated proteins, for determining the function of the query protein robustly. Under these conditions, it may be advantageous to consider the structure of the entire network and also use the annotations of the proteins indirectly connected to the query protein. The approaches in this category are based on this idea, and in most cases rely on the optimization of an objective function defined over the annotations of the proteins in the network.
  • 9.
Clustering-based approaches: The approaches in this category are based on the hypothesis that dense regions in the interaction network represent functional modules, which are natural units in which proteins perform their function. Thus, these approaches apply graph clustering algorithms to these networks and then determine the functions of unannotated proteins in the extracted modules using measures such as majority.

Association-based approaches: Recently, several computationally efficient algorithms have been proposed for finding frequently occurring patterns in data, in the field of association analysis in data mining [Tan et al. 2005]. The approaches in this category use these algorithms to detect frequently occurring sets of interactions in interaction networks of protein complexes, and hypothesize that these subgraphs denote functional modules. Function prediction from these modules is performed as in the clustering-based approaches.

Biomedical literature: As in all other research communities, researchers in the fields of biology and medicine publish the results of their research in various journals and conferences. As a result, over the years, a huge repository of knowledge has been created in the form of papers, books, reports, theses and other such texts. Clearly, these repositories contain a huge amount of information about important biological concepts such as protein structure and function, cancer-causing genes and several others. Thus, there is great utility in mining these repositories and retrieving useful information, as shown in Fig. 8.
Fig. 8 Biomedical literature

Multiple data types: With a plethora of data being generated by a wide spectrum of proteomics experiments, it may be hypothesized that sometimes what cannot be discovered from one source of information may become obvious when multiple sources are analyzed simultaneously. This intuition has been made concrete by Kemmeren and Holstege [2003], who have suggested the following distinct advantages achieved by integrating functional genomics data:
o Usually, individual biological data sets provide information about complementary biological processes, such as gene expression and protein interaction networks. Thus, combining them provides a global picture of the biological phenomena in which a set of genes is involved.
o Often, data quality varies between different types of data, as well as within different sources of data of the same type. For instance, studies have shown significant variations in quality between different protein interaction data sets [Deng et al. 2003]. Thus, combining several data sources/types improves the quality of the overall data set, since the errors in one data set may be corrected by another.
o The most important advantage of the integrative approach is that, since only conclusions valid over a set of data types are accepted, the predictions made by this approach are usually more confident than those made on the basis of individual data sets.

This completes the overview of the existing data types. We now turn to our work, whose objective is to assign un-annotated “protein pairs” to different functional groups. We therefore focus on the existing computational techniques that use protein-protein interaction (PPI) data to predict protein function. Protein function can be predicted using the neighborhood property, which suggests that, in the PPI network, the neighbors of a particular protein have similar functions. Schwikowski et al. [1] propose a neighborhood-counting method that assigns k functions to a protein by identifying the k most frequent functional labels among its interacting partners. It is simple and effective, but the full topology of the network is not considered and no confidence scores are assigned to the annotations. In the chi-square method, Hishigaki et al.
[2] assign k functions to a protein with the k largest chi-square scores. For a protein P, each function f is assigned a score $\frac{(n_f - e_f)^2}{e_f}$, where $n_f$ is the number of proteins in the n-neighborhood of P that have the function f, and $e_f$ is the expectation of this number based on the frequency of f among all proteins in the network. Chen et al. [3] extend this neighborhood property to higher levels in the network. They estimate the functional similarity between a protein and its level-1 and level-2 neighbors, and their algorithm assigns a weight to each of these neighbors based on its estimated functional similarity.

Many graph algorithms have also been applied to functional analysis. Vazquez et al. [4] assign proteins to functions so as to maximize the connectivity among proteins assigned the same function. They map this problem into an optimization problem, solved using simulated annealing, that maximizes the number of edges connecting proteins (un-annotated or previously annotated) assigned the same function. Karaoz et al. [5] apply a similar approach to a collection of PPI data and gene expression data. They construct a distinct network for each function in GO. For a particular function f, the state of each annotated protein v equals +1 if v has function f and -1 if v has a different function. Nabieva et al. [6] propose a flow-based approach to predict protein function from the protein interaction network. Considering both the local and global properties of the graph, this approach assigns functions to un-annotated proteins based on the amount of flow they receive during simulation, where each annotated protein is a source of functional flow. Deng et al. [7] propose an approach employing the theory of Markov random fields, in which they estimate the posterior probability of a protein of interest.
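As a rough illustration of the two neighborhood-based scores described above, the following Python sketch computes the neighborhood-counting ranking and the chi-square score $(n_f - e_f)^2 / e_f$ for a single protein. The function names, the toy labels `A` and `B`, and the background frequencies are hypothetical and are not taken from the cited papers; $e_f$ is assumed here to be the neighborhood size multiplied by the network-wide frequency of f.

```python
from collections import Counter


def neighborhood_counting(neighbor_functions, k):
    """Return the k most frequent functional labels among the neighbors."""
    counts = Counter(f for funcs in neighbor_functions for f in funcs)
    return [f for f, _ in counts.most_common(k)]


def chi_square_scores(neighbor_functions, background_freq):
    """Score each function f by (n_f - e_f)^2 / e_f.

    n_f: number of neighbors annotated with f.
    e_f: expected number, assumed to be (neighborhood size) * (frequency of f).
    """
    n = len(neighbor_functions)
    observed = Counter(f for funcs in neighbor_functions for f in funcs)
    scores = {}
    for f, freq in background_freq.items():
        e_f = n * freq
        if e_f > 0:
            scores[f] = (observed[f] - e_f) ** 2 / e_f
    return scores


# Toy neighborhood of one protein: each entry is the label set of one neighbor.
neighbor_functions = [{"A"}, {"A"}, {"B"}, {"A", "B"}]
# Hypothetical fraction of all proteins in the network carrying each label.
background = {"A": 0.5, "B": 0.25}

top = neighborhood_counting(neighbor_functions, k=1)
scores = chi_square_scores(neighbor_functions, background)
print(top)     # ['A']
print(scores)  # {'A': 0.5, 'B': 1.0}
```

In this toy case neighborhood counting ranks A first, while the chi-square score ranks B first because B is more over-represented relative to its background frequency.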
Letovsky and Kasif [8] use loopy belief propagation with the assumption of a binomial model for the local neighbors of proteins annotated with a given term. Similarly, Wu et al. [9] propose a related probabilistic model to annotate the functions of unknown proteins in PPI networks based on the structure of the PPI network. Joshi et al. [10] develop a new integrated probabilistic method for
cellular function by combining information from protein-protein interactions, protein complexes, microarray gene expression profiles and annotations of known proteins through an integrative statistical model. In the work of Samanta et al. [11], a network-based statistical algorithm is proposed, which assumes that if two proteins share a significantly larger number of common interacting partners, they share a common functionality. Another approach is UVCLUSTER, proposed by Arnau et al. [12], which is based on bi-clustering and iteratively explores distance datasets.

Among graph clustering approaches, an early method is Molecular Complex Detection (MCODE) by Bader and Hogue [13], in which dense regions are detected according to given parameters. Altaf-Ul-Amin et al. [14] also use a clustering approach. It starts from a single node in a graph, and clusters are gradually grown until the similarity of every added node within a cluster and the density of the cluster reach a certain limit. Spirin and Mirny [15] use a graph clustering approach in which they detect modules that are densely connected within themselves but sparsely connected with the rest of the network, based on super-paramagnetic clustering and a Monte Carlo algorithm. Przulj et al. [16] use a graph-theoretic approach in which clusters are identified using Leda’s routine components and then analyzed by the Highly Connected Subgraphs (HCS) algorithm. Later, King et al. [17] partition networks into clusters using a cost function, applying the Restricted Neighborhood Search Clustering (RNSC) algorithm. Clusters are filtered according to their size, density and functional homogeneity. Krogan et al. [18] use the Markov clustering algorithm to predict protein function.

II. PRESENT WORK

o Motivation: Many approaches based on the protein-protein interaction (PPI) network have been discussed in the previous section. After studying various papers, we observe that very few studies on PPI networks have considered protein pairs and the interconnections within their PPI networks. This observation has encouraged us to work on the PPI network and to predict the function of unannotated protein pairs using a generic approach, which is discussed in the following sections.

o Dataset: In this work, the protein-protein interaction data of yeast (Saccharomyces cerevisiae) is collected from ftp://ftpmips.gsf.de/yeast/PPI/; it contains 15613 genetic and physical interactions. Self-interactions are discarded. A set of 12487 unique binary interactions involving 4648 proteins is taken as data. In our proposed method, 15 functional groups are considered: cell cycle control (O1), cell polarity (O2), cell wall organization and biogenesis (O3), chromatin chromosome structure (O4), co-immuno-precipitation (O5), co-purification (O6), DNA repair (O7), lipid metabolism (O8), nuclear-cytoplasmic transport (O9), pol II transcription (O10), protein folding (O11), protein modification (O12), protein synthesis (O13), small molecule transport (O14) and vesicular transport (O15). For each functional group, 90% of the protein pairs are taken as training samples and the rest (2-8%) among them are considered as test samples.

o Basic terminologies:

Protein interaction network: Protein-protein interactions occur when two or more proteins bind together, often to carry out their biological function. Many of the most important molecular processes in the cell, such as DNA replication, are carried out by large
molecular machines that are built from a large number of protein components organized by their protein-protein interactions. These protein interactions form a network-like structure known as a protein interaction network. Here, the protein interaction network is represented as a graph $G_P$ which consists of a set of vertices (nodes) V connected by edges (links) E. Thus $G_P = (V, E)$. Each protein is represented as a node and the interconnections between proteins are represented by edges.

Sub graph: A graph $G'_P$ is a sub graph of a graph $G_P$ if the vertex set of $G'_P$ is a subset of the vertex set of $G_P$ and the edge set of $G'_P$ is a subset of the edge set of $G_P$. That is, if $G'_P = (V', E')$ and $G_P = (V, E)$, then $G'_P$ is called a sub graph of $G_P$ if $V' \subseteq V$ and $E' \subseteq E$. $G'_P$ may be defined as the set $K \cup U$, where K represents the set of annotated protein pairs and U represents the set of un-annotated protein pairs.

Level-1 neighbors: In $G'_P$, the directly connected neighbors of a particular vertex are called its level-1 neighbors.

o Proposed Work: The proposed work is to deduce the PPI network of each individual protein belonging to an unannotated protein pair chosen from the data set described earlier, then to identify the common interactions between these deduced PPI networks, and thereby to predict the function of the unannotated protein pair using a Generic Approach and estimate its success rate.

o Method: Given $G'_P$, a sub graph of the protein interaction network consisting of protein pairs as nodes, each associated with an element of the set $O = \{O_1, O_2, O_3, \ldots, O_{15}\}$, where $O_i$ represents a particular functional group, this method maps the elements of the set of un-annotated protein pairs U to elements of the set O.
The steps of this method are as follows:
Step 1: Take any protein pair as an element from set U.
Step 2: Deduce the PPI network of each protein belonging to the protein pair selected in Step 1.
Step 3: Find the common interacting pairs between the PPI networks deduced in Step 2.
Step 4: Count the number of occurrences $S_i$ (i = 1, ..., 15) of each functional group of the set $O = \{O_1, O_2, O_3, \ldots, O_{15}\}$ among the common interacting pairs found in Step 3.
Step 5: Assign the $O_i$ of set O corresponding to $\max(S_i)$ (i = 1, ..., 15) to the unannotated protein pair considered in Step 1.

o Illustration of Method-I with an example: An un-annotated protein pair YAL011w-YDL181w is taken from our test dataset U, and is shown in yellow in Fig. 9. From $G_P$, $G'_{YAL011w}$ is taken, where its level-1 neighbors are YDR146c, YCR033w, YDR181c, YDL080c and YDR269w. Similarly, $G'_{YDL181w}$ is taken, where its level-1 neighbors are YPL078c, YPL240c, YBR118w and YER148w. Two functional groups (i.e., DNA repair and cell polarity) are involved at level 1, as shown in Fig. 9.
Fig. 9 Sub-graph $G'_P$ of protein pair YAL011w-YDL181w and its level-1 neighbors

Then the common interacting pairs between $G'_{YAL011w}$ and $G'_{YDL181w}$ are considered. In Fig. 9, there exists only one common interacting pair, YDL080c-YPL078c, which is marked in green. From our dataset, the protein pair YDL080c-YPL078c belongs to the functional group DNA repair (O7). The number of occurrences of each functional group among the common interacting pairs is then listed, and the functional group with the highest number of occurrences is assigned as the functional group of the unannotated protein pair. Since, in Fig. 9, there exists one interacting pair of O7, we assign O7 to the unannotated protein pair YAL011w-YDL181w.

Fig. 10 Sub-graph $G'_P$ of protein pair YMR236w-YHR099w and its level-1 neighbors
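The five steps and the YAL011w-YDL181w example above can be sketched in Python as follows. This is only one possible reading of the method: a "common interacting pair" is taken to be an annotated interaction that links a level-1 neighbor of the first protein to a level-1 neighbor of the second, and ties between functional groups are broken by label name, which the paper does not specify. The function names and the data layout are hypothetical. The interactions are those listed for Fig. 9, and only the annotation of YDL080c-YPL078c (O7) is stated in the text, so the other edges are left unannotated here.

```python
from collections import Counter


def edge(a, b):
    """Undirected interaction between proteins a and b."""
    return frozenset((a, b))


def level1_neighbors(edges, protein):
    """Return the proteins directly connected to `protein` (Step 2)."""
    neighbors = set()
    for e in edges:
        if protein in e:
            neighbors |= e - {protein}
    return neighbors


def predict_pair_function(edges, annotation, pair):
    """Predict the functional group of an unannotated pair (Steps 2-5)."""
    first, second = pair
    near_first = level1_neighbors(edges, first) - {second}
    near_second = level1_neighbors(edges, second) - {first}

    # Steps 3-4: count functional groups of annotated interactions that
    # join the two neighborhoods (assumed meaning of "common pair").
    counts = Counter()
    for x in near_first:
        for y in near_second:
            candidate = edge(x, y)
            if candidate in edges and candidate in annotation:
                counts[annotation[candidate]] += 1

    if not counts:
        return None  # no common interacting pair, so no prediction
    # Step 5: most frequent group; ties broken by label name (assumption).
    return max(sorted(counts), key=counts.get)


# Interactions named in the worked example (Fig. 9).
edges = {
    edge("YAL011w", n)
    for n in ("YDR146c", "YCR033w", "YDR181c", "YDL080c", "YDR269w")
} | {
    edge("YDL181w", n)
    for n in ("YPL078c", "YPL240c", "YBR118w", "YER148w")
} | {edge("YDL080c", "YPL078c")}

# Only this annotation is given in the text.
annotation = {edge("YDL080c", "YPL078c"): "O7"}

predicted = predict_pair_function(edges, annotation, ("YAL011w", "YDL181w"))
print(predicted)  # O7
```

Returning `None` when no common interacting pair exists is a design choice of this sketch; the paper does not say how such pairs are handled.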
Another example of a sub graph obtained in our work is shown above in Fig. 10; the method for predicting the function of YMR236w-YHR099w is the same as described earlier. In our work, we select unannotated protein pairs and predict their functional group using the Generic Approach, as shown in TABLE-I. By counting the matched and unmatched predictions, we obtain the success rate, or probability of success, as shown in TABLE-II.

TABLE - I (˟ marks an unmatched prediction)

No. Unannotated protein pair  Original function              Predicted function                     Result
1   YNL250w|YKL101w           Cell cycle control             Cell cycle control
2   YBR023c|YER111c           Cell cycle control             Cell cycle control
3   YPL174c|YLR210w           Mitosis                        Mitosis
4   YLR229c|YPL161c           Two hybrid                     Two hybrid
5   YBR023c|YLR370c           Cell polarity                  Cell polarity
6   YNL233w|YCR009c           Cell polarity                  Cell wall organization and biogenesis  ˟
7   YBL061c|YLR342w           Cell polarity                  Cell polarity
8   YFR036w|YLR127c           Coimmunoprecipitation          Coimmunoprecipitation
9   YDR108w|YML077w           Coimmunoprecipitation          Coimmunoprecipitation
10  YFR002w|YGR119c           Two hybrid                     Two hybrid
11  YBL014c|YML043c           Coimmunoprecipitation          Affinity purification                  ˟
12  YBR193c|YOL135c           Coimmunoprecipitation          Coimmunoprecipitation
13  YBL084c|YDR118w           Coimmunoprecipitation          Coimmunoprecipitation
14  YDR145w|YGR252w           Copurification                 Copurification
15  YHR099w|YOL148c           Copurification                 Copurification
16  YHR099w|YMR236w           Copurification                 Copurification
17  YGL112c|YHR099w           Copurification                 Copurification
18  YBR081c|YDR392w           Copurification                 Copurification
19  YGL097w|YIL063c           Copurification                 Copurification
20  YGL097w|YIL063c           Synthetic lethal               Synthetic lethal
21  YDR145w|YDR176w           Copurification                 Copurification
22  YDR145w|YLR055c           Copurification                 Copurification
23  YNL273w|YGL163c           DNA repair                     DNA repair
24  YCL061c|YMR190c           DNA repair                     DNA repair
25  YKL113c|YDR369c           DNA repair                     DNA repair
26  YGR078c|YFR019w           Lipid metabolism               Lipid metabolism
27  YBR023c|YFR019w           Lipid metabolism               Lipid metabolism
28  YCL061c|YAR002w           Nuclear-cytoplasmic transport  Nuclear-cytoplasmic transport
29  YLR418c|YLR384c           Pol II transcription           Pol II transcription
30  YLR418c|YJR140c           Pol II transcription           Pol II transcription
31  YPR135w|YGL244w           Pol II transcription           Pol II transcription
32  YPR135w|YHR200w           Pol II transcription           Pol II transcription
33  YOR070c|YJR032w           Protein folding                Protein folding
34  YDR420w|YDR245w           Protein modification           Protein modification
35  YLR418c|YDR363w-a         Vesicular transport            Vesicular transport
36  YLR039c|YLR360w           Vesicular transport            Vesicular transport

TABLE - II

Total no. of unannotated protein pairs  Matched  Unmatched  Success rate (%)
36                                      34       2          94.4
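As a quick arithmetic check of TABLE-II, the success rate can be computed as the number of matched pairs divided by the total number of unannotated pairs, expressed as a percentage. The function name `success_rate` is hypothetical; the counts are those reported in the table.

```python
def success_rate(matched, total):
    """Percentage of unannotated protein pairs whose function was matched."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matched / total


total_pairs = 36   # unannotated protein pairs in TABLE-I
matched = 34       # predictions equal to the original function
unmatched = total_pairs - matched

rate = success_rate(matched, total_pairs)
print(unmatched)        # 2
print(f"{rate:.1f}")    # 94.4
```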
III. RESULTS & DISCUSSION

The above method is evaluated by the success rate, which is defined as

$$\text{Success rate} = \frac{\text{number of protein pair functions predicted correctly}}{\text{total number of unannotated protein pairs}}$$

In our work, we predict the functions of protein pairs using the algorithm of the Generic Approach and estimate the success rate for the 15 functional groups considered. The probability of success for six of these functional groups (co-purification (O6), co-immuno-precipitation (O5), pol II transcription (O10), vesicular transport (O15), DNA repair (O7) and cell polarity (O2)) is shown in tabular and pictorial form in TABLE-III and Fig. 12, respectively.

TABLE - III

Functional group  Number of unannotated protein pairs  Number of matched protein pairs  Probability of success
O6                8                                    8                                1
O5                5                                    4                                0.8
O10               4                                    4                                1
O15               2                                    2                                1
O2                3                                    2                                0.66
O7                3                                    3                                1

Fig. 12 Pictorial representation of success rate for six functional groups (chart series: number of unannotated protein pairs, number of matched protein pairs, probability of success)

Our proposed work adds an extra dimension to existing graph-theoretic methods, as it computes the functions of unannotated protein pairs, instead of single proteins, considering level-1 neighbors. We expect the performance of the generic approach to improve if a larger interaction network and level-2 neighbors are considered. In future, our aim is to work with more functional groups and with different organisms.
REFERENCES

[1] B. Schwikowski, P. Uetz and S. Fields, A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261, 2000.
[2] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami, and T. Takagi, Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531, 2001.
[3] J. Chen, W. Hsu, M. L. Lee, and S. K. Ng, Labeling network motifs in protein interactomes for protein function prediction. Proc 23rd International Conference on Data Engineering (ICDE), 546-555, 2007.
[4] Vazquez, “Global Protein Function Prediction from Protein-Protein Interaction Networks,” Nature Biotechnology, vol. 21, pp. 697-700, June 2003.
[5] U. Karaoz, T. M. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor, and S. Kasif, Whole-genome annotation by using evidence integration in functional-linkage.
[6] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, M. Singh, Whole proteome prediction of protein functions via graph-theoretic analysis of interaction maps. Bioinformatics 21 (Suppl 1): i302-i310, 2005.
[7] M. Deng, Inferring domain-domain interactions from protein-protein interactions. Genome Res. 12(10): 1540-8, 2002.
[8] S. Letovsky, S. Kasif, Predicting protein function from protein-protein interaction data: a probabilistic approach. Bioinformatics 19 (Suppl 1): i197-i204, 2003.
[9] D. D. Wu, X. Hu, An efficient approach to detect a protein community from a seed. 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB2005), La Jolla CA, USA: IEEE, pp. 135-141, 2005.
[10] Vazquez, “Global Protein Function Prediction from Protein-Protein Interaction Networks,” Nature Biotechnology, vol. 21, pp. 697-700, June 2003.
[11] M. P. Samanta, S. Liang, Predicting protein functions from redundancies in large scale protein interaction networks. Proc Natl Acad Sci USA 100: 12579-12583, 2003.
[12] V. Arnau, S. Mars, I. Marin, Iterative cluster analysis of protein interaction data. Bioinformatics 21: 364-378, 2005.
[13] G. D. Bader, C. W. Hogue, An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2, 2003.
[14] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, S. Kanaya, Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 7: 207, 2006.
[15] V. Spirin, L. A. Mirny, Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 100: 12123-12128, 2003.
[16] A. D. King, N. Przulj, I. Jurisica, Protein complex prediction via cost-based clustering. Bioinformatics 20: 3013-3020, 2004.
[17] S. Asthana, O. D. King, F. D. Gibbons, F. P. Roth, Predicting protein complex membership using probabilistic network reliability. Genome Res 14: 1170-1175, 2004.
[18] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637-643, 2006.
[19] Deepalakshmi. R and Jothi Venkateswaran C, “A Survey on Mining Methods for Protein Sequence Analysis: An Aerial View”, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 28-34, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.