SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
Protein Function Prediction
 by Integrating Different
 Data Sources
Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic
            Temple University, Philadelphia, USA




            AFP/CAFA, Vienna, Austria, July 15-16th, 2011
Introduction

 Protein function annotation - key challenge in
 post-genomic era
 Experimental annotation accurate, but slow and
 expensive
 Large amount of information available
 Data mining techniques can help when dealing
 with available, large-scale data sets
Our approach

 Integrate information from different sources to
 predict gene functions
 We consider following data sources:
   Protein sequence similarity
   Protein-protein interaction
   Gene expression
 Hypothesis: Including information from various
 sources results in better predictor performance
Methodology

 We use weighted k-Nearest Neighbor algorithm
 to calculate likelihood that protein p has function f
 score( p, f ) =      ∑ sim( p, p' ) ⋅ I ( f ∈ functions ( p' ))
                   p '∈       ( p)
                          k


 Simple to implement, yet competitive when
 compared to SVM
 Different sim(p, p’) can be obtained with different
 data sources
Methodology - Cont’d
 We calculated different scores using different
 sources (sequence similarity, PPI, gene
 expression) for each (p, f) pair
 Total of J gene expressions, resulted in J+2
 scores that are combined:

   score( p, f ) = w SEQ ⋅ score SEQ ( p, f ) +
                   w PPI ⋅ score PPI ( p, f ) +
                      J
                   ∑ j =1 j
                         ( w EXP ⋅ score EXP ( p, f ))
                                         j
Integrating different scores
   How to find weights wSEQ, wPPI, wjEXP?
   We considered several methods:
      Assigning the same weights to all scores
      Weight optimization by likelihood maximization
      Weight optimization by large margin approaches
   Also considered enhancing similarity scheme
   using approach from Pandev et al.*

* “Incorporating functional inter-relationships into protein function
    prediction algorithms”, BMC Bioinformatics (2009)
Max-margin approach
 Define the following optimization problem:
   Given n genes and m scores, and f(x, y) is an m x 1
   vector of scores for gene x and function y, solve:
       1
  min || w || 2 +C ∑ ∑ ξ i ( y, y )
  w ,ξ 2
                   i y∈Yi , y∈Yi

  s.t. w Τ ( f ( xi , y ) − f ( xi , y )) ≥ 1 − ξ i ( y, y ), ∀i, y ∈ Yi , y ∈ Yi
  ξ i ( y, y ) ≥ 0, ∀i, y ∈ Yi , y ∈ Yi

   where w is an m x 1 weight vector learned during
   training, and C is a regularization parameter
Experimental setup

 We focused on function prediction for human
 proteins
 Data sources:
  Sequence similarity scores for all pairs of CAFA
  proteins
  Gene expressions - 392 Affymetrix GPL96 Platform
  microarray data sets from GEO
  PPI - Physical interactions between human proteins
  listed in OPHID database
Experimental setup - Cont’d

  8,714 annotated human proteins in CAFA
  training set
  Out of those 8,714, total of 2,869 proteins
  covered by all three data sources
  For evaluation, only GO functions annotated by
  more than 10 proteins are considered
    This resulted in 240 MF and 1,123 BP GO terms
  Neighborhood size fixed to 20
Score averaging scheme

 None of the considered approaches worked
 significantly and consistently better than simple
 averaging
 As a result, we give the same weight to all 3
 data sources:

                wSEQ = wPPI = 1/3
                wj EXP= 1/(3J)
Results (average AUC)
   ver. 1 - neighbors found among only 2,869 overlapping human
   proteins
   ver. 2 - neighbors found among all 8,714 human proteins
   ver. 3 - neighbors found among all 36,924 CAFA training proteins

Data source                          MF terms            BP terms
Microarray data                       0.6442              0.6279
PPI data                              0.6283              0.6671
Protein Sequence data, ver. 1         0.7636              0.6642
Protein Sequence data, ver. 2         0.7896              0.6921
Protein Sequence data, ver. 3         0.8396              0.7537
Integrating 3 data sources, ver. 1    0.8134              0.7468
Integrating 3 data sources, ver. 2    0.8494              0.7939
Integrating 3 data sources, ver. 3    0.8788              0.8165
Discussion

 Several important conclusions arise:
   Gene expression is more useful for MF, while PPI is
   more useful for BP prediction
   Sequence similarity data is superior to both gene
   expression and PPI data
   It is beneficial to transfer functions to human proteins
   from their orthologues
   Integration of data sources improves AUC significantly
   for both MF and BP terms
Results - Cont’d
  Comparison of AUC of sequence similarity
  scores (ver. 3) and integrated scores (ver. 3) for
  each GO term
Conclusion

 Some sources are more beneficial for BP, while
 some for MF terms prediction
 Integration of different sources improves function
 prediction significantly
 Exploring new integration techniques could lead
 to even better results
Thank you!
                 Questions?


Nemanja Djuric supported by travel grant
from US Department of Energy

Weitere ähnliche Inhalte

Ähnlich wie Afp cafa djuric

ANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfDrGRevathy
 
Project Presentation
Project PresentationProject Presentation
Project Presentationbutest
 
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...Enrico Glaab
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...Sara Alvarez
 
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors ApproachProtein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors ApproachGualberto Asencio Cortés
 
Proteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsProteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsLars Juhl Jensen
 
Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...Alexander Decker
 
Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Dimitris Papadopoulos
 
A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...IJECEIAES
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine LearningIRJET Journal
 
Stock market analysis using ga and neural network
Stock market analysis using ga and neural networkStock market analysis using ga and neural network
Stock market analysis using ga and neural networkAmr Abd El Latief
 
working_example_poster
working_example_posterworking_example_poster
working_example_posterHuikun Zhang
 
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...Keiji Takamoto
 
Quantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To BrowserQuantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To BrowserNeil Swainston
 
Chronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningChronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningIJCSIS Research Publications
 

Ähnlich wie Afp cafa djuric (20)

ANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdf
 
Project Presentation
Project PresentationProject Presentation
Project Presentation
 
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
 
Proteomics
ProteomicsProteomics
Proteomics
 
SecondaryStructurePredictionReport
SecondaryStructurePredictionReportSecondaryStructurePredictionReport
SecondaryStructurePredictionReport
 
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors ApproachProtein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors Approach
 
Proteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsProteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data sets
 
Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...
 
Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis)
 
A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
 
Stock market analysis using ga and neural network
Stock market analysis using ga and neural networkStock market analysis using ga and neural network
Stock market analysis using ga and neural network
 
working_example_poster
working_example_posterworking_example_poster
working_example_poster
 
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
 
Poster_JOBIM_v4.2
Poster_JOBIM_v4.2Poster_JOBIM_v4.2
Poster_JOBIM_v4.2
 
B017410916
B017410916B017410916
B017410916
 
B017410916
B017410916B017410916
B017410916
 
Quantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To BrowserQuantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To Browser
 
Chronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningChronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine Learning
 

Mehr von Iddo

What can Community Challenges do for You?
What can Community Challenges do for You?What can Community Challenges do for You?
What can Community Challenges do for You?Iddo
 
Surviving Scientific Presentations
Surviving Scientific PresentationsSurviving Scientific Presentations
Surviving Scientific PresentationsIddo
 
Friedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nrFriedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nrIddo
 
The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...Iddo
 
Why Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongWhy Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongIddo
 
Tracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in BacteriaTracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in BacteriaIddo
 
Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...Iddo
 
Friedberg lab-overview-grad-students
Friedberg lab-overview-grad-studentsFriedberg lab-overview-grad-students
Friedberg lab-overview-grad-studentsIddo
 
Understanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low OutputUnderstanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low OutputIddo
 
Random Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in ScienceRandom Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in ScienceIddo
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryIddo
 
Convergent divergent
Convergent divergentConvergent divergent
Convergent divergentIddo
 
Some US Science Funding sources
Some US Science Funding sourcesSome US Science Funding sources
Some US Science Funding sourcesIddo
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013Iddo
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Iddo
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Iddo
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012Iddo
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011Iddo
 
Vienna afp2011
Vienna afp2011Vienna afp2011
Vienna afp2011Iddo
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacaoIddo
 

Mehr von Iddo (20)

What can Community Challenges do for You?
What can Community Challenges do for You?What can Community Challenges do for You?
What can Community Challenges do for You?
 
Surviving Scientific Presentations
Surviving Scientific PresentationsSurviving Scientific Presentations
Surviving Scientific Presentations
 
Friedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nrFriedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nr
 
The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...
 
Why Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongWhy Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is Wrong
 
Tracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in BacteriaTracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in Bacteria
 
Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...
 
Friedberg lab-overview-grad-students
Friedberg lab-overview-grad-studentsFriedberg lab-overview-grad-students
Friedberg lab-overview-grad-students
 
Understanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low OutputUnderstanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low Output
 
Random Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in ScienceRandom Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in Science
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin Discovery
 
Convergent divergent
Convergent divergentConvergent divergent
Convergent divergent
 
Some US Science Funding sources
Some US Science Funding sourcesSome US Science Funding sources
Some US Science Funding sources
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011
 
Vienna afp2011
Vienna afp2011Vienna afp2011
Vienna afp2011
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacao
 

Kürzlich hochgeladen

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 

Kürzlich hochgeladen (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

Afp cafa djuric

  • 1. Protein Function Prediction by Integrating Different Data Sources Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic Temple University, Philadelphia, USA AFP/CAFA, Vienna, Austria, July 15-16th, 2011
  • 2. Introduction Protein function annotation - key challenge in post-genomic era Experimental annotation accurate, but slow and expensive Large amount of information available Data mining techniques can help when dealing with available, large-scale data sets
  • 3. Our approach Integrate information from different sources to predict gene functions We consider following data sources: Protein sequence similarity Protein-protein interaction Gene expression Hypothesis: Including information from various sources results in better predictor performance
  • 4. Methodology We use weighted k-Nearest Neighbor algorithm to calculate likelihood that protein p has function f score( p, f ) = ∑ sim( p, p' ) ⋅ I ( f ∈ functions ( p' )) p '∈ ( p) k Simple to implement, yet competitive when compared to SVM Different sim(p, p’) can be obtained with different data sources
  • 5. Methodology - Cont’d We calculated different scores using different sources (sequence similarity, PPI, gene expression) for each (p, f) pair Total of J gene expressions, resulted in J+2 scores that are combined: score( p, f ) = w SEQ ⋅ score SEQ ( p, f ) + w PPI ⋅ score PPI ( p, f ) + J ∑ j =1 j ( w EXP ⋅ score EXP ( p, f )) j
  • 6. Integrating different scores How to find weights wSEQ, wPPI, wjEXP? We considered several methods: Assigning the same weights to all scores Weight optimization by likelihood maximization Weight optimization by large margin approaches Also considered enhancing similarity scheme using approach from Pandev et al.* * “Incorporating functional inter-relationships into protein function prediction algorithms”, BMC Bioinformatics (2009)
  • 7. Max-margin approach Define the following optimization problem: Given n genes and m scores, and f(x, y) is an m x 1 vector of scores for gene x and function y, solve: 1 min || w || 2 +C ∑ ∑ ξ i ( y, y ) w ,ξ 2 i y∈Yi , y∈Yi s.t. w Τ ( f ( xi , y ) − f ( xi , y )) ≥ 1 − ξ i ( y, y ), ∀i, y ∈ Yi , y ∈ Yi ξ i ( y, y ) ≥ 0, ∀i, y ∈ Yi , y ∈ Yi where w is an m x 1 weight vector learned during training, and C is a regularization parameter
  • 8. Experimental setup We focused on function prediction for human proteins Data sources: Sequence similarity scores for all pairs of CAFA proteins Gene expressions - 392 Affymetrix GPL96 Platform microarray data sets from GEO PPI - Physical interactions between human proteins listed in OPHID database
  • 9. Experimental setup - Cont’d 8,714 annotated human proteins in CAFA training set Out of those 8,714, total of 2,869 proteins covered by all three data sources For evaluation, only GO functions annotated by more than 10 proteins are considered This resulted in 240 MF and 1,123 BP GO terms Neighborhood size fixed to 20
  • 10. Score averaging scheme None of the considered approaches worked significantly and consistently better than simple averaging As a result, we give the same weight to all 3 data sources: wSEQ = wPPI = 1/3 wj EXP= 1/(3J)
  • 11. Results (average AUC) ver. 1 - neighbors found among only 2,869 overlapping human proteins ver. 2 - neighbors found among all 8,714 human proteins ver. 3 - neighbors found among all 36,924 CAFA training proteins Data source MF terms BP terms Microarray data 0.6442 0.6279 PPI data 0.6283 0.6671 Protein Sequence data, ver. 1 0.7636 0.6642 Protein Sequence data, ver. 2 0.7896 0.6921 Protein Sequence data, ver. 3 0.8396 0.7537 Integrating 3 data sources, ver. 1 0.8134 0.7468 Integrating 3 data sources, ver. 2 0.8494 0.7939 Integrating 3 data sources, ver. 3 0.8788 0.8165
  • 12. Discussion Several important conclusions arise: Gene expression is more useful for MF, while PPI is more useful for BP prediction Sequence similarity data is superior to both gene expression and PPI data It is beneficial to transfer functions to human proteins from their orthologues Integration of data sources improves AUC significantly for both MF and BP terms
  • 13. Results - Cont’d Comparison of AUC of sequence similarity scores (ver. 3) and integrated scores (ver. 3) for each GO term
  • 14. Conclusion Some sources are more beneficial for BP, while some for MF terms prediction Integration of different sources improves function prediction significantly Exploring new integration techniques could lead to even better results
  • 15. Thank you! Questions? Nemanja Djuric supported by travel grant from US Department of Energy