Literature Based Framework for Semantic Descriptions of e-Science resources

Who am I

e-Science Perspective

e-Science Resources

Semantic Web

Bioinformatics e-Resources

Semantic Descriptions of Bioinformatics e-Resources

BioCatalogue: beta version at http://beta.biocatalogue.org/. Launch: June 2009 at ISMB.

Semantic Descriptions in Bioinformatics Domain

Our approach – Mine the literature
Literature: still the largest and most popular source of knowledge.
Hypothesis: the semantic profiles of entities and events can be extracted from the domain literature.

Example: Semantically Annotated Web Service
Annotations combine textual descriptions and ontological mappings.

The rest of the talk

1st Module: Building Controlled Vocabulary from Literature

Terminology Building

Controlled Vocabulary Building – a challenging task

Controlled Vocabulary Building – Solution

Building controlled vocabulary from literature

Term classification driven approach: learn bioinformatics terms from literature
1) Get a corpus
2) Get all terms
3) Get seed examples
4) Find relevant ones using term profiling and comparison to seed examples
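Step 4 above can be sketched as a profile comparison. This is a minimal illustration, not the method used in the talk: the slides do not name the similarity measure, so Jaccard overlap, the `threshold` value, and the function names `jaccard` and `select_relevant_terms` are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard overlap between two feature sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_relevant_terms(candidate_profiles, seed_profiles, threshold=0.3):
    """Keep candidates whose profile is close to at least one seed profile.

    Both arguments map a term to a set of profile features. The threshold
    is an illustrative value, not one taken from the slides.
    """
    relevant = {}
    for term, profile in candidate_profiles.items():
        best = max(
            (jaccard(profile, seed) for seed in seed_profiles.values()),
            default=0.0,
        )
        if best >= threshold:
            relevant[term] = best
    return relevant


# Toy data, invented for illustration only.
seeds = {"sequence alignment": {"sequence", "alignment", "align"}}
candidates = {
    "protein sequence alignment": {"protein", "sequence", "alignment"},
    "coffee break": {"coffee", "break"},
}
selected = select_relevant_terms(candidates, seeds)
print(selected)
```

In this toy run only the first candidate shares features with the seed, so it is the only term kept.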

Bioinformatics terminology

Lexical Profile
Term (t): Lexical Profile LP(t)
protein: (1) protein
protein sequence: (1) protein (2) sequence (3) protein sequence
protein sequence alignment:
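The two complete rows above are consistent with a lexical profile that lists every contiguous word sub-sequence of the term, shortest first. The sketch below encodes that reading; it is an assumption inferred from the examples, and the function name `lexical_profile` is hypothetical.

```python
def lexical_profile(term):
    """Return all contiguous word sub-sequences of a term, shortest first."""
    words = term.split()
    profile = []
    for n in range(1, len(words) + 1):          # sub-sequence length
        for i in range(len(words) - n + 1):     # start position
            profile.append(" ".join(words[i:i + n]))
    return profile


print(lexical_profile("protein"))
print(lexical_profile("protein sequence"))
```

Two terms can then be compared by how many profile entries they share, which is one plausible way to obtain the lexical similarity reported later in the talk.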

Contextual Profile
Sentence: "Genscan program node can produce a list of nucleotide FASTAs of predicted transcripts"
Verb profile: produce
Noun profile: genscan, program, list, transcript
Left pattern (LP), class-level (LP1): <Term>, produce, <NP>, of
Right pattern (RP), class-level (RP1): of, <NP>

Statistics about textual corpus (full-text articles)
Number of documents: 2,691
Number of distinct candidate terms: 113,280
Number of candidate term occurrences: 533,418
Number of distinct sentences: 294,614
Number of distinct context noun stems: ~79,000
Number of distinct context verb stems: ~2,500

The Bioinformatics Controlled Vocabulary (number of terms)
ATR (C-Value), total number of candidate terms: 113,280
Terms with lexical similarity to resource terms: 95,437
Terms with context noun similarity to resource terms: 103,104
Terms with context verb similarity to resource terms: 73,478
Terms with context pattern similarity to resource terms: 21,182
Terms with combined contextual similarity (Nouns ∪ Verbs ∪ Patterns): 98,307

2nd Module: Mining Semantic Descriptions from Literature

Semantic classes – myGrid Ontology

Semantic classes identification
Semantic class: typical terminological heads
Application: application, tool, service, software, system, program
Algorithm: algorithm, method, approach, procedure, analysis, alignment
Data: data, record, report, sequence, structure
Data Resource: resource, database, dataset, repository
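The head-word table above lends itself to a simple lookup. The sketch below assumes that the terminological head is the last word of the term, which is a simplification introduced for illustration (a resource name such as "Matrix Global Alignment Tool MatGAT" ends with the acronym rather than the head); `HEADS` and `semantic_class` are hypothetical names.

```python
# Head words copied from the slide; the lookup logic is an assumption.
HEADS = {
    "Application": ["application", "tool", "service", "software", "system", "program"],
    "Algorithm": ["algorithm", "method", "approach", "procedure", "analysis", "alignment"],
    "Data": ["data", "record", "report", "sequence", "structure"],
    "Data Resource": ["resource", "database", "dataset", "repository"],
}

# Invert the table so each head word points at its semantic class.
HEAD_TO_CLASS = {head: cls for cls, heads in HEADS.items() for head in heads}


def semantic_class(term):
    """Return the semantic class of a term based on its last word, or None."""
    words = term.lower().split()
    if not words:
        return None
    return HEAD_TO_CLASS.get(words[-1])


print(semantic_class("Matrix Global Alignment Tool"))
print(semantic_class("protein sequence"))
print(semantic_class("phylogenetic tree"))
```

Terms whose head is not in the table, or is plural, fall through to `None`; a fuller version would need stemming and head detection.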

Resource mentions

Semantic classes and instances

Extraction/functional rules
Original sentence: "Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences"
With the resource mention replaced by its class tag: "Term_App generates similarity/identity matrices for DNA or protein sequences"

Extraction/functional rules: functions and associated verbs
Generic functionality / task specification: applied, access, achieve, align, allow, based, developed, implemented, present, provide, used, is a, called
Inputs, outputs: accept, applied, create, provide, query, retrieve, starts with, take, used, generate
Comparison: outperform, perform, compare
Implementation technique, programming language: implement(ed)
Composition, subtasks: contain(ed), construct(ed), generate(d)
Availability: available
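The verb table can be read as a many-to-many mapping, since verbs such as "provide" or "used" appear under more than one function. A minimal lookup sketch follows; the verb lists are copied from the slide with the "(ed)" and "(d)" variants expanded, while the data structure and the name `functions_for_verb` are assumptions.

```python
# Verb lists taken from the slide; optional suffixes are spelled out.
FUNCTION_VERBS = {
    "Generic functionality / task specification": [
        "applied", "access", "achieve", "align", "allow", "based", "developed",
        "implemented", "present", "provide", "used", "is a", "called",
    ],
    "Inputs, outputs": [
        "accept", "applied", "create", "provide", "query", "retrieve",
        "starts with", "take", "used", "generate",
    ],
    "Comparison": ["outperform", "perform", "compare"],
    "Implementation technique, programming language": ["implement", "implemented"],
    "Composition, subtasks": [
        "contain", "contained", "construct", "constructed", "generate", "generated",
    ],
    "Availability": ["available"],
}


def functions_for_verb(verb):
    """Return every function category whose verb list contains the verb."""
    verb = verb.lower()
    return [name for name, verbs in FUNCTION_VERBS.items() if verb in verbs]


print(functions_for_verb("provide"))
print(functions_for_verb("outperform"))
```

Because a verb can signal several functions, a real extraction rule would also need the surrounding pattern to pick one; this lookup only narrows the candidates.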

Information Extraction
Input sentence: "Matrix Global Alignment Tool MatGAT generates similarity/identity matrices for DNA or protein sequences"
SC instance (resource): Matrix Global Alignment Tool MatGAT
SC: Application
Task: Generate
Predicted input: DNA or protein sequences
Predicted output: similarity/identity matrices
Descriptors: similarity/identity matrices, DNA or protein sequences
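The extraction result above can be reproduced for this one sentence with a single hand-written pattern. The sketch is only an illustration of the idea: the regular expression, the fixed verb "generates", and the function name `extract_generate` are assumptions, and the talk's actual rules are not shown in the slides.

```python
import re

# One illustrative rule: "<resource> generates <output> for <input>".
GENERATE_RULE = re.compile(
    r"^(?P<resource>.+?)\s+generates\s+(?P<output>.+?)\s+for\s+(?P<input>.+?)\.?$"
)


def extract_generate(sentence):
    """Apply the single 'generates ... for ...' rule; return None if it fails."""
    match = GENERATE_RULE.match(sentence.strip())
    if match is None:
        return None
    output, input_ = match.group("output"), match.group("input")
    return {
        "resource": match.group("resource"),
        "task": "generate",
        "predicted_input": input_,
        "predicted_output": output,
        "descriptors": [output, input_],
    }


sentence = (
    "Matrix Global Alignment Tool MatGAT generates "
    "similarity/identity matrices for DNA or protein sequences"
)
result = extract_generate(sentence)
print(result)
```

Assigning the semantic class (Application in the slide) is a separate step and is left out of this rule.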

Experiments
Semantic class: total number of instances
Algorithm: 5,722
Application: 2,076
Data: 2,662
Data Resource: 1,992
Total: 12,452

Example – GeneClass
1) Descriptors, with frequency of co-occurrence:
motif data: 4
differential gene expression: 3
reliable predictive model: 2
genome-wide protein-DNA binding data: 2
transcriptional gene regulation: 2
gene expression data: 1
2) MyGrid terms: BIND
3) Related resources: Robust GeneClass Algorithm

Example – GeneClass: functional content
Predicate (task): predict
Subject: GeneClass Algorithm
Functional description: predicting differential gene expression
Input/output: starts with a candidate set of motifs μ

Evaluation of semantic profiles
Evaluated for their capability to be used for the semantic description of a given bioinformatics resource: (0) irrelevant, (1) partially useful, (2) useful.
Examples:
HeatMapper: "The HeatMapper tool has already proven to be very useful in several studies."
Kalign: "To compare Kalign to other MSA programs, the following test sets were used."
Cognitor: "To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program."

Evaluation of semantic profiles
Quality comparison of various components of resource description profiles from the two experiments

3rd Module: Mining Semantic Networks from Literature

What next?

What next? (Proposed in BioHackathon2010)
"Phylogenetic trees are then generated by the ClustalW program by the neighbour-joining method" [PMC1973088].
"We also used the CLUSTALW program for multialignment as a control process" [PMC434493].
[Diagram: Resource1, Resource2 and Resource3 linked to an RDF store. Nodes: Phylogenetic Tree (#Data), ClustalW Program, Multialignment (#Task). Edges: Phylogenetic Tree "generated by" ClustalW Program; ClustalW Program "is used for" Multialignment.]
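The two example sentences can be turned into statements for such a store. The sketch below keeps them as plain Python tuples with a provenance field rather than using an RDF library; the identifiers, the predicate names `generatedBy` and `isUsedFor`, and the helper `objects` are all hypothetical, and a real store would use full IRIs.

```python
# (subject, predicate, object, source article); names are illustrative only.
STATEMENTS = [
    ("PhylogeneticTree", "rdf:type", "Data", None),
    ("Multialignment", "rdf:type", "Task", None),
    ("PhylogeneticTree", "generatedBy", "ClustalWProgram", "PMC1973088"),
    ("ClustalWProgram", "isUsedFor", "Multialignment", "PMC434493"),
]


def objects(store, subject, predicate):
    """Return all objects linked to a subject through a given predicate."""
    return [o for s, p, o, _ in store if s == subject and p == predicate]


def sources(store, subject, predicate, obj):
    """Return the articles that support one specific statement."""
    return [
        src for s, p, o, src in store
        if (s, p, o) == (subject, predicate, obj) and src is not None
    ]


print(objects(STATEMENTS, "PhylogeneticTree", "generatedBy"))
print(sources(STATEMENTS, "ClustalWProgram", "isUsedFor", "Multialignment"))
```

Keeping the source article with each statement lets a query trace a link between two resources back to the sentence it was mined from.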

Conclusion
