Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar
July 26, 2005

Background and Motivation
Data sources shown: InterPro, MIPS, Swissprot

INDUS (INtelligent Data Understanding System)
Goal: knowledge discovery from large, distributed, semantically heterogeneous data

Outline

Semantically Heterogeneous Data
Data sources need to be made self-describing by specifying the relevant metadata.

D1:
| Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number |
|---|---|---|---|---|
| Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS … | TPR TPR_REGION TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase |
| P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS … | RGS PROT_KIN_DOM PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase |

D2:
| Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat |
|---|---|---|---|---|---|
| P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN … | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.) |
| P32589 | SSE1 | STPFGLDLGN NNSVLAVARN … | 692 | HSP70 | 16.01 protein binding |

Meta Data
Schema for protein data in D1:
Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: Motifs
EC Number: EC Hierarchy

Attribute value hierarchy
An attribute value hierarchy (AVH) is an ontology that imposes a partial order on the values of an attribute of the data.
Example: MIPS Funcat hierarchy
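The partial order of an AVH can be sketched as a tree of child-to-parent links. The snippet below is a minimal illustration, not the INDUS representation: the codes 16.01 and 16.19.01 come from the slides, while the intermediate nodes "16" and "16.19" and the names PARENT, ancestors and is_a are assumptions made for the example.

```python
# Hypothetical fragment of the MIPS Funcat hierarchy as child -> parent links.
# "16" and "16.19" are assumed intermediate nodes, inferred from the dotted codes.
PARENT = {
    "16": None,            # root of this fragment
    "16.01": "16",         # protein binding
    "16.19": "16",
    "16.19.01": "16.19",   # cyclic nucleotide binding (cAMP, cGMP, etc.)
}

def ancestors(value):
    """Return the chain from `value` up to the root, `value` included."""
    chain = []
    while value is not None:
        chain.append(value)
        value = PARENT[value]
    return chain

def is_a(specific, general):
    """True if `specific` equals `general` or lies below it in the hierarchy."""
    return general in ancestors(specific)

print(is_a("16.19.01", "16"))    # True
print(is_a("16.01", "16.19"))    # False
```

Two values that sit on different branches, such as 16.01 and 16.19.01, are not comparable, which is what makes the order partial rather than total.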

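Making data sources self-describing: ontology-extended data source
Ontology-extended data source = Data + Schema + Ontology

Schema for D2:
Accession Number: MIPS ID
Gene: Gene ID
Length: Positive Integer
Prosite Motifs: Motifs
MIPS Funcat: MIPS Hierarchy

Data in D2:
| Accession Number | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat |
|---|---|---|---|---|---|
| P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.) |
| P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding |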

User view
Components shown: data sources of interest (MIPS, Swissprot), user schema, user ontology.

User schema:
PID: Swissprot ID
Source: Species String
Protein: AA String
Structural Class: SCOP
GO Function: GO Hierarchy

Mappings

Mappings at schema level

Schema of D1:
Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: AA String
EC Number: EC Hierarchy

Schema of D2:
Accession No AN: MIPS ID
Gene: Gene ID
AA Sequence: AA String
Length: Pos Integer
MIPS Funcat: MIPS Hierarchy
Pfam Motifs: Motifs

Schema of DU (user view):
PID: Swissprot ID
Protein: AA String
GO Function: GO Hierarchy
Source: Species String

Mappings:
Protein ID : D1 ≡ PID : DU
Accession Number AN : D2 ≡ PID : DU
Protein Sequence : D1 ≡ AA Composition : DU
AA Sequence : D2 ≡ AA Composition : DU
EC Number : D1 ≡ GO Function : DU
MIPS Funcat : D2 ≡ GO Function : DU
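The schema-level mappings amount to renaming source attributes into the user schema. The following is a minimal sketch under that reading; SCHEMA_MAPPINGS and to_user_view are hypothetical names, not part of INDUS, and the attribute names are copied from the slide.

```python
# Attribute-level mappings from each source schema to the user schema DU,
# as listed on the slide. The dictionary layout itself is an assumption.
SCHEMA_MAPPINGS = {
    "D1": {
        "Protein ID": "PID",
        "Protein Sequence": "AA Composition",
        "EC Number": "GO Function",
    },
    "D2": {
        "Accession Number AN": "PID",
        "AA Sequence": "AA Composition",
        "MIPS Funcat": "GO Function",
    },
}

def to_user_view(source, record):
    """Rename the mapped attributes of `record`; unmapped attributes are dropped."""
    mapping = SCHEMA_MAPPINGS[source]
    return {mapping[attr]: value for attr, value in record.items() if attr in mapping}

d1_record = {
    "Protein ID": "P35626",
    "Protein Name": "Beta-adrenergic receptor kinase 2",
    "EC Number": "2.7.1.126",
}
print(to_user_view("D1", d1_record))
```

Only attribute names are translated here. The value 2.7.1.126 is still an EC number, so expressing it as a GO term needs a separate mapping between the two hierarchies.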

Mappings at ontology level (between the D1 ontology and the user ontology DU)
EC 2.7.1.126 : D1 ≡ GO 0047696 : DU
EC 2.7.1 : D1 ≥ GO 0047696 : DU
EC 2.7.1.126 : D1 ≤ GO 0004672 : DU
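One way to use such ontology-level mappings is to decide under which user-ontology terms an instance labelled with a source term may be counted. The sketch below assumes the three constraints are read as "equivalent to" (≡), "more general than" (≥) and "more specific than" (≤); that reading, the prefix-based EC hierarchy, and the names CONSTRAINTS, ec_ancestors and user_terms_for are assumptions for illustration.

```python
# Constraints between EC numbers (source ontology) and GO terms (user ontology).
CONSTRAINTS = [
    ("2.7.1.126", "≡", "GO 0047696"),
    ("2.7.1",     "≥", "GO 0047696"),
    ("2.7.1.126", "≤", "GO 0004672"),
]

def ec_ancestors(ec):
    """EC numbers form a hierarchy by prefix: 2.7.1.126 -> 2.7.1 -> 2.7 -> 2."""
    parts = ec.split(".")
    return [".".join(parts[:n]) for n in range(len(parts), 0, -1)]

def user_terms_for(ec):
    """User terms that an instance labelled `ec` can safely be counted under.

    A constraint applies if its source term is `ec` or an ancestor of `ec`
    and the source term is equivalent to or more specific than the user term.
    A "more general than" constraint supports no such conclusion.
    """
    nodes = set(ec_ancestors(ec))
    return sorted(user for src, rel, user in CONSTRAINTS
                  if src in nodes and rel in ("≡", "≤"))

print(user_terms_for("2.7.1.126"))  # ['GO 0004672', 'GO 0047696']
print(user_terms_for("2.7.1.1"))    # []
```

The second call uses another EC number under 2.7.1 purely as an example: the only constraint that reaches it is the "more general than" one, so nothing can be concluded about its GO term.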

Integration ontology

Sample Query

Learning classifiers from data
Standard learning algorithms assume centralized access to data.
Learning: Data (labeled examples) -> Learner -> Classifier (hypothesis)
Classification: Unlabeled examples -> Classifier -> Class

Human and yeast protein training data

| PID | Source | Structural Classes | Sequence | GO Function |
|---|---|---|---|---|
| P35626 | Human | Mainly beta, Few Secondary Structures | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity |
| Q12797 | Human | Not Known | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate |
| Q01574 | Yeast | Mainly alpha | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding |
| P39708 | Yeast | Mainly alpha, Alpha beta | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding |

Rows: examples / instances / cases. Columns PID through Sequence: attributes / features / variables. Column GO Function: class / label.

Probabilistic models for protein function classification

| PID | Sequence | GO Function |
|---|---|---|
| P35626 | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity |
| Q12797 | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate |
| Q01574 | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding |
| P39708 | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding |

Most probable class c(S) of a sequence S:
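The slide's own formula did not survive extraction. As a hedged reminder of the standard form, and not necessarily the one shown on the slide, a Bayes classifier picks the class with the highest posterior probability. Under the Naïve Bayes assumption that the residues s_1 … s_n of S are independent given the class, this becomes:

```latex
c(S) = \arg\max_{c_j} P(c_j \mid S)
     = \arg\max_{c_j} P(c_j)\, P(S \mid c_j)
     = \arg\max_{c_j} P(c_j) \prod_{i=1}^{n} P(s_i \mid c_j)
```

The symbols s_i and n are introduced here for illustration; c_j follows the notation used on the Naïve Bayes slide.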

Learning classifiers from data revisited
Learning = Information extraction + Hypothesis generation
Information extraction = Sufficient statistics gathering
Statistical query formulation: the learner, holding a partial hypothesis h_i, sends the query s(D, h_i -> h_i+1) to the data D and receives the answer s(D, h_i -> h_i+1).
Hypothesis generation: h_i+1 ← R(h_i, s(D, h_i -> h_i+1))

Sufficient statistics for learning classifiers

Naïve Bayes learning as information gathering and hypothesis generation
Naïve Bayes class: sufficient statistics are count(AminoAcid, Class) and count(Class).
For each amino acid a_i and each class c_j, the learner asks the query answering engine for Counts(a_i | c_j) and Counts(c_j) over the data.
From the returned counts it computes P(c_j) and P(a_i | c_j).
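The two steps on this slide, gathering counts and then turning them into probabilities, can be sketched as follows. This is an illustrative implementation, not INDUS code. Laplace (add-one) smoothing and the function names are assumptions, and the training sequences are the 20-residue fragments shown on the training-data slide.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def gather_counts(examples):
    """Information extraction: count(class) and count(amino acid, class)."""
    class_counts = Counter()
    aa_counts = defaultdict(Counter)
    for sequence, label in examples:
        class_counts[label] += 1
        aa_counts[label].update(sequence)  # adds one count per residue
    return class_counts, aa_counts

def classify(sequence, class_counts, aa_counts):
    """Hypothesis use: argmax over classes of log P(c) + sum of log P(a | c)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        # Add-one smoothing (an assumption) avoids log(0) for unseen residues.
        denom = sum(aa_counts[label].values()) + len(ALPHABET)
        score = math.log(n / total)
        for aa in sequence:
            score += math.log((aa_counts[label][aa] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("MADLEAVLADVSYLMAMEKS", "GO 0047696"),
    ("MAQRKNAKSSGNSSSSGSGS", "GO 0004597"),
    ("STPFGLDLGNNNSVLAVARN", "GO 0005515"),
    ("VSSLPKESQAELQLFQNEIN", "GO 0016208"),
]
class_counts, aa_counts = gather_counts(training)
print(classify("MADLEAVLAD", class_counts, aa_counts))
```

The classifier never touches the raw examples after gather_counts returns, which is the point of the slide: the counts alone are sufficient.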

Learning classifiers from distributed data
Learning = Information extraction from distributed data + Hypothesis generation
Statistical query formulation: the learner, holding a partial hypothesis h_i, issues the query s(D, h_i -> h_i+1).
Query decomposition: the query answering engine splits the query into sub-queries q_1, q_2, …, q_K, one per data source D_1, D_2, …, D_K.
Answer composition: the engine combines the sub-answers into the answer s(D, h_i -> h_i+1).
Hypothesis generation: h_i+1 ← R(h_i, s(D, h_i -> h_i+1))
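For count statistics, decomposition and composition are simple when each source holds different examples described by the same attributes (horizontal fragmentation, an assumption made here): every source answers the count query locally and the answers are added. A minimal sketch with hypothetical function names:

```python
from collections import Counter

def answer_count_query(source):
    """Local answer q_k: count(class) and count(amino acid, class) at one source."""
    stats = Counter()
    for sequence, label in source:
        stats[("class", label)] += 1
        for aa in sequence:
            stats[(aa, label)] += 1
    return stats

def compose(answers):
    """Answer composition: counts from disjoint sources simply add up."""
    total = Counter()
    for answer in answers:
        total.update(answer)  # Counter.update adds counts, it does not overwrite
    return total

# Two toy sources; the labels are GO terms from the training-data slide.
d1 = [("MADLEAVLAD", "GO 0047696"), ("MAQRKNAKSS", "GO 0004597")]
d2 = [("STPFGLDLGN", "GO 0005515"), ("VSSLPKESQA", "GO 0016208")]

distributed = compose([answer_count_query(d1), answer_count_query(d2)])
centralized = answer_count_query(d1 + d2)
print(distributed == centralized)  # True
```

Because the composed counts equal the counts over the pooled data, a learner that only needs these statistics produces the same hypothesis as it would with centralized access, without moving the raw data.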

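Learning classifiers from semantically heterogeneous data sources
Data sources with their ontologies: (D_1, O_1), (D_2, O_2), …, (D_K, O_K)
Ontology O, with M(O_1 … O_K, O): mappings from O_1 … O_K to O
Statistical query formulation: the learner, holding a partial hypothesis h_i, issues the query s(D, h_i), expressed in terms of O.
Query decomposition: the query is translated through the mappings into sub-queries q_1, q_2, …, q_K, one per data source.
Answer composition: the sub-answers are combined into the answer s(D, h_i).
Hypothesis generation: h_i+1 ← R(h_i, s(D, h_i))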
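With heterogeneous sources, each local answer has to be translated into the ontology O before the answers are composed. The sketch below illustrates this for class counts. Only the mapping of EC 2.7.1.126 to GO 0047696 appears in the slides; the other three entries in MAPPINGS, and the function name, are hypothetical and chosen only to make the example run.

```python
from collections import Counter

# Term-level mappings from each source ontology O_k to the user ontology O.
# Only the first D1 entry is taken from the slides; the rest are illustrative.
MAPPINGS = {
    "D1": {"EC 2.7.1.126": "GO 0047696", "EC 1.14.11.16": "GO 0004597"},
    "D2": {"16.19.01": "GO 0016208", "16.01": "GO 0005515"},
}

def class_counts_in_user_terms(sources):
    """Answer count(class) over all sources, expressed in user-ontology terms.

    `sources` maps a source name to the list of class labels it stores, each
    written in that source's own ontology. Labels without a mapping are skipped.
    """
    total = Counter()
    for name, labels in sources.items():
        mapping = MAPPINGS[name]
        for local_label in labels:
            user_term = mapping.get(local_label)
            if user_term is not None:
                total[user_term] += 1
    return total

sources = {
    "D1": ["EC 2.7.1.126", "EC 2.7.1.126", "EC 1.14.11.16"],
    "D2": ["16.01", "16.19.01", "16.01"],
}
print(class_counts_in_user_terms(sources))
```

Skipping unmapped labels is one possible policy; a fuller treatment would also have to handle source terms that are only more general or more specific than a user term.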

Ontology-based information integration in INDUS

Capabilities of INDUS

INDUS Tools

INDUS Users: Domain Ontologists

INDUS Users: Data Providers

INDUS Users: Domain Experts

INDUS Users: Domain Scientists

INDUS

Related work

Work in progress

http://www.cild.iastate.edu/software/indus.html
