Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources
1. Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs, and Vasant Honavar. July 26, 2005.
4. INDUS (INtelligent Data Understanding System). Goal: knowledge discovery from large, distributed, semantically heterogeneous data.
6. Semantically Heterogeneous Data
Data sources need to be made self-describing by specifying the relevant metadata.
D1 (columns: Protein ID; Protein Name; Protein Sequence; Prosite Motifs; EC Number)
  Q12797; Aspartyl/asparaginyl beta-hydroxylase; MAQRKNAKSS GNSSSSGSGS ...; TPR TPR_REGION TPR; 1.14.11.16 Peptide-aspartate beta-dioxygenase
  P35626; Beta-adrenergic receptor kinase 2; MADLEAVLAD VSYLMAMEKS ...; RGS PROT_KIN_DOM PH_DOMAIN; 2.7.1.126 Beta-adrenergic receptor kinase
D2 (columns: Accession Number AN; Gene; AA Sequence; Length; Pfam Domains; MIPS Funcat)
  P07278; BCY1; VSSLPKESQA ELQLFQNEIN ...; 415; RIIa; 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
  P32589; SSE1; STPFGLDLGN NNSVLAVARN ...; 692; HSP70; 16.01 protein binding
8. Attribute value hierarchy: An attribute value hierarchy (AVH) is a partial-order ontology over the values of the attributes of the data. Example: the MIPS Funcat hierarchy.
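A minimal sketch of an attribute value hierarchy as a partial order, assuming the hierarchy is a tree stored as child-to-parent links. The class name `AVH`, its methods, and the intermediate codes `16` and `16.19` are hypothetical illustrations; only `16.01` and `16.19.01` appear in the slides.

```python
class AVH:
    """Attribute value hierarchy stored as child -> parent links (assumed to be a tree)."""

    def __init__(self, parent):
        self.parent = dict(parent)

    def ancestors(self, value):
        """Return the values above `value` in the hierarchy, nearest first."""
        chain = []
        while value in self.parent:
            value = self.parent[value]
            chain.append(value)
        return chain

    def is_a(self, value, other):
        """True if `value` equals `other` or lies below it in the partial order."""
        return value == other or other in self.ancestors(value)


# Hypothetical fragment: the parent links are assumed, not taken from the slides.
funcat = AVH({"16.01": "16", "16.19": "16", "16.19.01": "16.19"})

print(funcat.is_a("16.19.01", "16"))     # True: 16.19.01 lies below 16
print(funcat.is_a("16.01", "16.19.01"))  # False: the two values are not comparable
```

The second call shows why the order is only partial: neither value is above the other.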
9. Making data sources self-describing
Ontology-extended data source = Data + Schema + Ontology
Schema: MIPS Funcat: MIPS Hierarchy; Prosite Motifs: Motifs; Length: Positive Integer; Gene: Gene ID; Accession Number: MIPS ID
Data:
  P07278; BCY1; VSSLPKESQA ELQLFQNEIN; 415; RIIa; 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
  P32589; SSE1; STPFGLDLGN NNSVLAVARN; 692; HSP70; 16.01 protein binding
12. Mappings at schema level
D1: Protein ID (SwissProt ID); Protein Name (String); Protein Sequence (AA String); Prosite Motifs (AA String); EC Number (EC Hierarchy)
D2: Accession No AN (MIPS ID); Gene (Gene ID); AA Sequence (AA String); Length (Pos Integer); MIPS Funcat (MIPS Hierarchy); Pfam Motifs (Motifs)
DU: PID (SwissProt ID); Protein (AA String); GO Function (GO Hierarchy); Source (Species String)
13. Mappings at schema level
Protein ID:D1 maps to PID:DU
Accession Number AN:D2 maps to PID:DU
D1: Protein ID (SwissProt ID); Protein Name (String); Protein Sequence (AA String); Prosite Motifs (AA String); EC Number (EC Hierarchy)
D2: Accession No AN (MIPS ID); Gene (Gene Set); AA Sequence (AA String); Length (Pos Integer); MIPS Funcat (MIPS Hierarchy); Pfam Motifs (Motifs)
DU: PID (SwissProt ID); Protein (AA String); GO Function (GO Hierarchy); Source (Species String)
14. Mappings at schema level
Protein ID:D1 maps to PID:DU
Accession Number AN:D2 maps to PID:DU
Protein Sequence:D1 maps to AA Composition:DU
AA Sequence:D2 maps to AA Composition:DU
D1: Protein ID (SwissProt ID); Protein Name (String); Protein Sequence (AA String); Prosite Motifs (AA String); EC Number (EC Hierarchy)
D2: Accession No AN (MIPS ID); Gene (Gene ID); AA Sequence (AA String); Length (Pos Integer); MIPS Funcat (MIPS Hierarchy); Pfam Motifs (Motifs)
DU: PID (SwissProt ID); Protein (AA String); GO Function (GO Hierarchy); Source (Species String)
15. Mappings at schema level
Protein ID:D1 maps to PID:DU
Accession Number AN:D2 maps to PID:DU
Protein Sequence:D1 maps to AA Composition:DU
AA Sequence:D2 maps to AA Composition:DU
EC Number:D1 maps to GO Function:DU
MIPS Funcat:D2 maps to GO Function:DU
D1: Protein ID (SwissProt ID); Protein Name (String); Protein Sequence (AA String); Prosite Motifs (AA String); EC Number (EC Hierarchy)
D2: Accession No AN (MIPS ID); Gene (Gene ID); AA Sequence (AA String); Length (Pos Integer); MIPS Funcat (MIPS Hierarchy); Pfam Motifs (Motifs)
DU: PID (SwissProt ID); Protein (AA String); GO Function (GO Hierarchy); Source (Species String)
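The attribute mappings on this slide can be sketched as a lookup that renames each source's attributes into the user view DU. This is an illustration under assumptions: the dictionary layout and the names `SCHEMA_MAPPINGS` and `to_user_view` are hypothetical, and the sketch only renames attributes. Translating the values themselves (for example, an EC number into a GO term) would need value-level mappings between the ontologies, which are not shown here.

```python
# Attribute-name mappings as listed on the slide; the dict layout is an assumption.
SCHEMA_MAPPINGS = {
    "D1": {
        "Protein ID": "PID",
        "Protein Sequence": "AA Composition",
        "EC Number": "GO Function",
    },
    "D2": {
        "Accession Number AN": "PID",
        "AA Sequence": "AA Composition",
        "MIPS Funcat": "GO Function",
    },
}


def to_user_view(source, record):
    """Rename a record's attributes to DU names; unmapped attributes are dropped.

    Values are passed through unchanged (no value-level translation).
    """
    mapping = SCHEMA_MAPPINGS[source]
    return {mapping[name]: value for name, value in record.items() if name in mapping}


d2_row = {"Accession Number AN": "P07278", "Gene": "BCY1", "MIPS Funcat": "16.19.01"}
print(to_user_view("D2", d2_row))
# {'PID': 'P07278', 'GO Function': '16.19.01'}
```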
23. Learning classifiers from data
Standard learning algorithms assume centralized access to data.
Learning: Data (labeled examples) -> Learner -> Classifier (hypothesis)
Classification: Unlabeled examples -> Classifier -> Class
24. Human and yeast protein training data
Columns: PID; Source; Structural Classes; Sequence; GO Function
  P39708; Yeast; Mainly alpha, Alpha beta; VSSLPKESQA ELQLFQNEIN; GO 0016208: AMP binding
  Q01574; Yeast; Mainly alpha; STPFGLDLGN NNSVLAVARN; GO 0005515: protein binding
  Q12797; Human; Not Known; MAQRKNAKSS GNSSSSGSGS; GO 0004597: peptide-aspartate
  P35626; Human; Mainly beta, Few Secondary Structures; MADLEAVLAD VSYLMAMEKS; GO 0047696: beta-adrenergic-receptor kinase activity
Slide annotations: Attributes/Features/Variables; Class/Label; Examples/Instances/Cases
26. Learning classifiers from data revisited
Learning = information extraction + hypothesis generation
Information extraction = sufficient statistics gathering
[Diagram: the Learner holds a partial hypothesis h_i. Statistical query formulation sends the query s(D, h_i -> h_{i+1}) to the data D, which returns the answer s(D, h_i -> h_{i+1}). Hypothesis generation then computes h_{i+1} <- R(h_i, s(D, h_i -> h_{i+1})).]
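The loop on this slide can be sketched with a deliberately simple stand-in learner. The example below fits a mean and variance in two rounds, which is not a method from the slides; it was chosen only because the second query depends on the current hypothesis, mirroring s(D, h_i -> h_{i+1}). The function names are hypothetical.

```python
def formulate_query(h):
    """Statistical query formulation: pick the statistic needed to refine h."""
    if "mean" not in h:
        return lambda data: (len(data), sum(data))
    mean = h["mean"]  # the second query depends on the current hypothesis
    return lambda data: sum((x - mean) ** 2 for x in data)


def refine(h, s):
    """Hypothesis generation: h_{i+1} <- R(h_i, s)."""
    if "mean" not in h:
        n, total = s
        return {"n": n, "mean": total / n}
    return {**h, "variance": s / h["n"]}


def learn(data):
    h = {}
    while "variance" not in h:
        query = formulate_query(h)
        s = query(data)  # the only step that touches the data
        h = refine(h, s)
    return h


print(learn([1.0, 2.0, 3.0, 4.0]))
# {'n': 4, 'mean': 2.5, 'variance': 1.25}
```

Because the learner sees only the statistics, the query could be answered anywhere the data lives without changing `refine`.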
28. Naïve Bayes learning as information gathering and hypothesis generation
Sufficient statistics for the Naïve Bayes class: count(AminoAcid, Class) and count(Class)
[Diagram: for each a_i and each c_j, the learner asks the query answering engine for counts over the data. The engine returns Counts(a_i | c_j) and Counts(c_j), from which the learner computes P(c_j) and P(a_i | c_j).]
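A minimal sketch of this split, assuming each example is a (sequence, class label) pair and each sequence is treated as a bag of amino acids. The function names and the toy data are hypothetical, and the Laplace smoothing in `estimate` is an added assumption that the slide does not mention.

```python
from collections import Counter


def gather_counts(examples):
    """Information gathering: count(Class) and count(AminoAcid, Class)."""
    class_counts = Counter()
    pair_counts = Counter()
    for sequence, label in examples:
        class_counts[label] += 1
        for amino_acid in sequence:
            pair_counts[(amino_acid, label)] += 1
    return class_counts, pair_counts


def estimate(class_counts, pair_counts, alphabet):
    """Hypothesis generation: turn the counts into P(c_j) and P(a_i | c_j)."""
    total = sum(class_counts.values())
    prior = {c: n / total for c, n in class_counts.items()}
    likelihood = {}
    for c in class_counts:
        in_class = sum(pair_counts[(a, c)] for a in alphabet)
        for a in alphabet:
            # Laplace smoothing: an assumption, not stated on the slide.
            likelihood[(a, c)] = (pair_counts[(a, c)] + 1) / (in_class + len(alphabet))
    return prior, likelihood


# Toy data over a two-letter alphabet, for illustration only.
toy = [("AAB", "x"), ("AB", "y")]
prior, likelihood = estimate(*gather_counts(toy), alphabet="AB")
print(prior)
print(likelihood[("A", "x")])
```

Only `gather_counts` reads the examples; `estimate` works from the counts alone.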
29. Learning classifiers from distributed data
Information extraction from distributed data + hypothesis generation
[Diagram: the Learner holds a partial hypothesis h_i. Statistical query formulation sends the query s(D, h_i -> h_{i+1}) to the query answering engine. Query decomposition splits it into sub-queries q_1, q_2, ..., q_K for the data sources D_1, D_2, ..., D_K, and answer composition combines their results into the answer s(D, h_i -> h_{i+1}). Hypothesis generation then computes h_{i+1} <- R(h_i, s(D, h_i -> h_{i+1})).]
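For count-based statistics, decomposition and composition can be sketched as follows. The sketch assumes the data is split by rows (each source holds different examples with the same attributes), so composition is plain addition. The function names and toy data are hypothetical.

```python
from collections import Counter


def local_counts(examples):
    """Sub-query q_k, run at one source: class counts and (amino acid, class) counts."""
    counts = Counter()
    for sequence, label in examples:
        counts[("class", label)] += 1
        for amino_acid in sequence:
            counts[(amino_acid, label)] += 1
    return counts


def answer_query(sources):
    """Query decomposition and answer composition for additive count statistics."""
    combined = Counter()
    for examples in sources:                     # one sub-query per source D_k
        combined.update(local_counts(examples))  # Counter.update adds the counts
    return combined


# Toy sources, for illustration only.
d1 = [("AAB", "x")]
d2 = [("AB", "y"), ("BB", "x")]

# The composed answer equals the answer computed over the pooled data.
print(answer_query([d1, d2]) == local_counts(d1 + d2))  # True
```

The final check is the property that matters here: the learner receives the same statistics it would have obtained with centralized access to the data.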
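30. Learning classifiers from semantically heterogeneous data sources
[Diagram: the Learner holds a partial hypothesis h_i and poses queries in terms of the user ontology O. Statistical query formulation sends the query s(D, h_i) to the query answering engine. Query decomposition splits it into sub-queries q_1, q_2, ..., q_K for the ontology-extended sources (D_1, O_1), (D_2, O_2), ..., (D_K, O_K), using the mappings M(O_1...O_K, O) from O_1 ... O_K to O. Answer composition returns the answer s(D, h_i), and hypothesis generation computes h_{i+1} <- R(h_i, s(D, h_i)).]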
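A sketch of how mappings into the user ontology O might be applied while answering a count query. The value pairs in `MAPPINGS` are inferred from the example rows in the earlier slides and are not stated there as mappings, so treat them as assumptions. The function names are hypothetical, and translating labels at each source is one possible design, not necessarily the one INDUS uses.

```python
from collections import Counter

# Assumed value-level mappings M(O_k, O) from source terms to GO terms.
MAPPINGS = {
    "D1": {"1.14.11.16": "GO 0004597", "2.7.1.126": "GO 0047696"},
    "D2": {"16.19.01": "GO 0016208", "16.01": "GO 0005515"},
}


def class_counts(source, labels, mappings):
    """Count class labels at one source after translating them into O."""
    to_user = mappings[source]
    return Counter(to_user[label] for label in labels)


def answer(sources, mappings):
    """Compose per-source counts, all expressed in the user ontology O."""
    total = Counter()
    for source, labels in sources.items():
        total.update(class_counts(source, labels, mappings))
    return total


# Toy label columns, for illustration only.
sources = {"D1": ["1.14.11.16", "2.7.1.126"], "D2": ["16.19.01", "16.01", "16.01"]}
result = answer(sources, MAPPINGS)
print(result["GO 0005515"])  # 2
```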
INDUS: a federated, query-centric approach to the problem of knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.
Learning algorithms that can be decomposed into information gathering (obtained by answering queries) and hypothesis generation can be easily linked to INDUS.
INDUS makes possible the exchange of data and findings between scientists or institutions working on related problems (e.g., bioinformatics).
Design that is tailored for predictive model building using machine learning algorithms from distributed, semantically heterogeneous, autonomous data sources