12. Our approach – Mine the literature
Literature: still the largest and most popular source of knowledge.
Hypothesis: the semantic profiles of entities and events can be extracted from the domain literature.
22. Learn bioinformatics terms from literature
Term classification-driven approach:
1) get a corpus
2) get all terms
3) get seed examples
4) find relevant ones using term profiling and comparison to seed examples
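Step 4 above can be sketched as follows. This is a minimal illustration, not the method used in the work: it assumes each term profile is a bag of context words, uses cosine similarity as the comparison measure, and keeps a candidate when it is close enough to at least one seed. The function names, the threshold and the toy profiles are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_terms(candidates, seeds, threshold=0.5):
    """Keep candidates whose profile is close to at least one seed profile.

    The threshold value is an assumption chosen for this toy example.
    """
    relevant = {}
    for term, profile in candidates.items():
        best = max((cosine(profile, s) for s in seeds.values()), default=0.0)
        if best >= threshold:
            relevant[term] = best
    return relevant


# Hypothetical toy profiles: context word -> co-occurrence count.
seeds = {"ClustalW": Counter({"align": 3, "sequence": 4, "program": 2})}
candidates = {
    "Kalign": Counter({"align": 2, "sequence": 3, "program": 1}),
    "patient": Counter({"treat": 3, "hospital": 2}),
}

relevant = classify_terms(candidates, seeds)
print(sorted(relevant))
```

In this toy run only the candidate that shares context words with the seed survives the threshold; a candidate with no shared context scores zero.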
26. Contextual Profile
Sentence: Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts
Verb Profile: produce
Noun Profile: genscan, program, list, transcript
Left Pattern (LP), Class-Level (LP1): <Term>, produce, <NP>, of
Right Pattern (RP), Class-Level (RP1): of, <NP>
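The left and right class-level patterns on this slide can be reproduced with a small sketch. It assumes the sentence has already been chunked and lemmatised by hand (no tagger is run here), that the profiled term is "nucleotide FASTAs" (inferred from the patterns, not stated on the slide), and that terms and noun phrases are generalised to <Term> and <NP> while verbs and prepositions are kept as words. The chunk labels and function names are hypothetical.

```python
# Hand-chunked, lemmatised version of the slide's sentence (an assumption:
# in practice a tagger/chunker would produce this). The modal "can" is dropped.
CHUNKS = [
    ("Genscan program node", "TERM"),
    ("produce", "VERB"),
    ("a list", "NP"),
    ("of", "PREP"),
    ("nucleotide FASTAs", "TARGET"),  # the term being profiled (assumed)
    ("of", "PREP"),
    ("predicted transcripts", "NP"),
]

CLASS_LABELS = {"TERM": "<Term>", "NP": "<NP>"}


def class_level(chunk):
    """Generalise terms and noun phrases to class labels; keep other words."""
    text, label = chunk
    return CLASS_LABELS.get(label, text)


def context_patterns(chunks):
    """Return (left_pattern, right_pattern) around the TARGET chunk."""
    idx = next(i for i, (_, label) in enumerate(chunks) if label == "TARGET")
    left = [class_level(c) for c in chunks[:idx]]
    right = [class_level(c) for c in chunks[idx + 1:]]
    return left, right


left_pattern, right_pattern = context_patterns(CHUNKS)
print(left_pattern)
print(right_pattern)
```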
29. Statistics about textual corpus (full-text articles)
# of documents: 2,691
# of distinct candidate terms: 113,280
# of candidate term occurrences: 533,418
# of distinct sentences: 294,614
# of distinct context noun stems: ~79,000
# of distinct context verb stems: ~2,500
30. The Bioinformatics Controlled Vocabulary (number of terms)
ATR (C-Value), total number of candidate terms: 113,280
Terms with lexical similarity to resource terms: 95,437
Terms with context noun similarity to resource terms: 103,104
Terms with context verb similarity to resource terms: 73,478
Terms with context pattern similarity to resource terms: 21,182
Terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307
31. 2nd Module: Mining Semantic Descriptions from Literature
46. Information Extraction
Input sentence: “Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences”
SC instance (resource): Matrix Global Alignment Tool MatGAT
SC: Application
Task: Generate
Predicted input: DNA or protein sequences
Predicted output: similarity/identity matrices
Descriptors: similarity/identity matrices, DNA or protein sequences
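The extraction on this slide can be illustrated with a deliberately simple sketch. It assumes a single surface pattern, "<resource> <task verb> <output> for <input>", and a tiny hand-written verb list; the actual system presumably relies on richer linguistic analysis. The pattern, the verb table and the field names are hypothetical.

```python
import re

# Hypothetical mapping from inflected task verbs to their lemma.
TASK_VERBS = {"generates": "generate", "produces": "produce", "computes": "compute"}

# One illustrative surface pattern: <resource> <verb> <output> for <input>
PATTERN = re.compile(
    r"^(?P<resource>.+?)\s+(?P<verb>" + "|".join(TASK_VERBS) + r")\s+"
    r"(?P<output>.+?)\s+for\s+(?P<input>.+?)\.?$"
)


def extract(sentence):
    """Return a dict describing the resource, or None if the pattern fails."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return {
        "resource": match.group("resource"),
        "task": TASK_VERBS[match.group("verb")],
        "predicted_input": match.group("input"),
        "predicted_output": match.group("output"),
        "descriptors": [match.group("output"), match.group("input")],
    }


sentence = (
    "Matrix Global Alignment Tool MatGAT generates similarity/identity "
    "matrices for DNA or protein sequences"
)
description = extract(sentence)
print(description)
```

A sentence that does not fit the pattern simply yields None, which is why a real system would need many patterns or a parser-based approach.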
50. Example – GeneClass: Functional Content
Predicate (Task): predict
Subject: GeneClass Algorithm
Functional Description: predicting differential gene expression
Input/Output: starts with a candidate set of motifs μ
52. Evaluation of semantic profiles
Evaluated for their capability to be used for semantic description of a given bioinformatics resource: (0) irrelevant, (1) partially useful, (2) useful
HeatMapper: The HeatMapper tool has already proven to be very useful in several studies
Kalign: To compare Kalign to other MSA programs, the following test sets were used.
Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program
54. 3rd Module: Mining Semantic Networks from Literature
57. What Next? (Proposed in BioHackathon2010)
Source sentences:
“Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method” [PMC1973088].
“We also used the CLUSTALW program for multialignment as a control process” [PMC434493].
Nodes: Resource1, Resource2, Resource3 (Phylogenetic Tree, ClustalW Program, Multialignment)
Node types: #Data, #Task
RDF Store:
Phylogenetic Tree, Generated by, ClustalW Program
ClustalW Program, Is used for, Multialignment
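The two relations on this slide can be written down as subject-predicate-object triples. The sketch below uses plain Python tuples rather than a real RDF library, and the namespace, the predicate names (generatedBy, isUsedFor) and the helper functions are all hypothetical; it only shows the shape of the data that an RDF store would hold.

```python
BASE = "http://example.org/bioresource/"  # hypothetical namespace


def uri(label):
    """Turn a label into an N-Triples style URI reference."""
    return "<" + BASE + label.replace(" ", "_") + ">"


# Relations read off the two source sentences on the slide.
triples = {
    ("Phylogenetic Tree", "generatedBy", "ClustalW Program"),
    ("ClustalW Program", "isUsedFor", "Multialignment"),
}


def to_ntriples(triples):
    """Serialise the triples, one statement per line, in sorted order."""
    return "\n".join(f"{uri(s)} {uri(p)} {uri(o)} ." for s, p, o in sorted(triples))


def objects(triples, subject, predicate):
    """Answer a simple query: what does `subject` relate to via `predicate`?"""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


print(to_ntriples(triples))
print(objects(triples, "ClustalW Program", "isUsedFor"))
```

Linking resources this way lets a query start from a task such as multialignment and walk to the programs used for it and the data they generate.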
Mention that this example is taken from the myGrid project.
The volume of knowledge generated in different research domains keeps increasing, with new concepts and terms being added continuously. Automated methods are therefore required to distil information, extract facts, discover implicit links and generate hypotheses relevant to users’ needs. Automatic acquisition of knowledge from unstructured text typically starts with the identification of terminology relevant to a specific domain, topic or task. Terms provide a means of communication, and it is the terms and their relationships that convey knowledge, in scientific articles in particular (Krauthammer and Nenadic 2004). Terms are usually organised structurally, not only to help information retrieval and extraction, but also to facilitate the smooth expansion of terminology, where newly discovered terms and concepts are integrated into an existing taxonomy.