1. Modelling and computing the quality of information in e-science. Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK); Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK). http://www.qurator.org. Roma, 3/4/07
7. Correctness in biology: examples
Data type | Creation process | Correctness
Uniprot protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives
Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample
10. Example: protein identification
Diagram labels: "Wet lab" experiment; protein identification algorithm; data output: protein hitlist; protein function prediction.
Correct entry: true positive.
Evidence:
Mass coverage (MC) measures the amount of protein sequence matched.
Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum.
ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting.
This evidence is independent of the algorithm / software package. It is readily available and inexpensive to obtain.
11. Correctness of protein identification
Estimator function (computes a score rather than a probability):
PMF score = (HR × 100) + MC + (ELDP × 10)
Prediction performance, comparing 3 models: ROC curve (true positives vs false positives)
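The estimator on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Qurator implementation: the function names `pmf_score` and `roc_points`, the sample values, and the choice to plot rates (rather than raw counts) on the ROC axes are all assumptions.

```python
def pmf_score(hit_ratio, mass_coverage, eldp):
    """PMF score = (HR x 100) + MC + (ELDP x 10), as given on the slide."""
    return hit_ratio * 100 + mass_coverage + eldp * 10


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per
    distinct score threshold, from the highest threshold to the lowest.
    labels[i] is True when hit i is a correct identification.
    Assumes at least one correct and one incorrect hit."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical evidence values (HR, MC, ELDP) for four hits.
hits = [(0.5, 30, 2), (0.4, 30, 1), (0.2, 30, 1), (0.1, 20, 1)]
labels = [True, True, False, False]
scores = [pmf_score(hr, mc, eldp) for hr, mc, eldp in hits]
curve = roc_points(scores, labels)
```

Each point on `curve` corresponds to accepting every hit whose score is at or above one threshold, which is how a score (as opposed to a probability) can still be compared against other models on a ROC plot.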
14. Layered definition of Quality
Data sources (DB, DB, Env)
Annotation functions: produce quality evidence annotations
Quality Assertion functions (QA): custom quality knowledge
Quality Views (QV): definition of acceptability regions
Side labels on the diagram: long-lived, reusable; commodities; expert-defined; dynamic; user controlled
24. Reference (semantic) model
Data sources (DB, DB, Env)
Annotation functions: produce quality evidence annotations
Quality Assertion functions (QA): custom quality knowledge
Quality Views (QV): definition of acceptability regions
Common Semantic Model (IQ Ontology)
25. A semantic model for quality concepts
Quality "upper ontology" (OWL)
Quality evidence types
Evidence meta-data model (RDF)
Evidence annotations are class instances
26. Main taxonomies and properties
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
Class restriction: MassCoverage is-evidence-for . ImprintHitEntry
Class restriction: PIScoreClassifier assertion-based-on-evidence . HitScore
Class restriction: PIScoreClassifier assertion-based-on-evidence . MassCoverage
27. The ontology-driven user interface
Detecting inconsistencies: no annotators for this Evidence type
Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
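The two inconsistency checks named on this slide can be illustrated as follows. This is a rough sketch under stated assumptions: plain dictionary lookups stand in for ontology reasoning, and the function `check_view`, its parameters, and the annotator name `imprint_annotator` are hypothetical. Only the class names `PIScoreClassifier`, `HitScore` and `MassCoverage` come from the slides.

```python
def check_view(required_evidence, annotators, declared_evidence):
    """Return human-readable inconsistency messages for a quality view.

    required_evidence: Quality Assertion name -> evidence types it needs
    annotators:        evidence type -> annotation functions able to supply it
    declared_evidence: evidence types the view declares
    """
    problems = []
    # Check 1: every declared evidence type needs at least one annotator.
    for evidence in declared_evidence:
        if not annotators.get(evidence):
            problems.append(f"no annotators for evidence type {evidence}")
    # Check 2: every Quality Assertion needs all of its inputs declared.
    for assertion, needed in required_evidence.items():
        missing = sorted(set(needed) - set(declared_evidence))
        if missing:
            problems.append(
                f"unsatisfied input requirements for {assertion}: "
                + ", ".join(missing)
            )
    return problems


# Mirrors the class restrictions on PIScoreClassifier.
required = {"PIScoreClassifier": ["HitScore", "MassCoverage"]}
available = {"MassCoverage": ["imprint_annotator"]}  # hypothetical annotator

incomplete = check_view(required, available, ["MassCoverage"])
unannotated = check_view(required, available, ["MassCoverage", "HitScore"])
```

In the first call the view omits `HitScore`, so the assertion's inputs are unsatisfied; in the second the view declares `HitScore` but nothing can supply it.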
From traditional DQ to the biologist’s problem of defining quality based on data semantics
Data produced for the first time. Mention the evolution of experimental techniques. Its production is not streamlined. There is no agreement on how to define its quality.
Searching for “nuggets of quality knowledge”
Here is the compilation model for mapping bound views to a sub-workflow
Embedding the sub-flow requires a deployment descriptor: adapters between the host flow and the quality sub-flow, plus data and control links between host flow tasks and quality flow tasks.
Activated during execution of the quality sub-flow – blocks the workflow for the duration of the interaction
Our quality view specification language allows users to define abstract quality processes. Evidence types are ontology classes. Evidence values are class individuals, which are represented by variables. These variables are bound to values at runtime; the values themselves are either fetched from a repository of persistent annotations or computed on demand by annotation functions. In our use cases, we have found examples of both. This process step abstracts away from the issue of annotation lifetime. Assertions are computed by services, which are represented by ontology classes, too. The tagName is the single output of the service (one for each input data item). Finally, the action step contains the condition/action pairs. Here, conditions are expressed on the variables introduced earlier, which define the scope. The semantics of the action step is that the expression is evaluated for each data item and the corresponding action is taken, e.g. the item is sent to a specific channel.
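The three steps described in this note (evidence binding, assertion, action) can be sketched as a small Python function. This is an illustration only, not the quality view language or its compiled workflow: `run_quality_view`, its parameters, the tag name `ScoreClass`, the channel names and the fallback channel for items that match no condition are all assumptions.

```python
def run_quality_view(items, repository, annotation_functions,
                     assertion, tag_name, actions,
                     default_channel="rejected"):
    """Route each data item to a channel.

    repository:           (item, evidence type) -> persistent annotation value
    annotation_functions: evidence type -> function computing it on demand
    assertion:            function from bound variables to a tag value
    actions:              ordered (condition, channel) pairs
    """
    channels = {}
    for item in items:
        env = {}
        # Evidence step: fetch a stored annotation, else compute on demand.
        for evidence_type, compute in annotation_functions.items():
            stored = repository.get((item, evidence_type))
            env[evidence_type] = stored if stored is not None else compute(item)
        # Assertion step: a single output per item, bound to tag_name.
        env[tag_name] = assertion(env)
        # Action step: the first condition that holds selects the channel.
        for condition, channel in actions:
            if condition(env):
                channels.setdefault(channel, []).append(item)
                break
        else:
            channels.setdefault(default_channel, []).append(item)
    return channels


# Hypothetical inputs: one stored annotation, the rest computed on demand.
repository = {("P1", "MassCoverage"): 40}
annotation_functions = {
    "MassCoverage": lambda item: 10,
    "HitRatio": lambda item: 0.5,
}
classify = lambda env: (
    "high" if env["HitRatio"] * 100 + env["MassCoverage"] >= 80 else "low"
)
actions = [(lambda env: env["ScoreClass"] == "high", "accepted")]

result = run_quality_view(["P1", "P2"], repository, annotation_functions,
                          classify, "ScoreClass", actions)
```

Because the evidence step hides whether a value was stored or freshly computed, the assertion and action steps do not need to know anything about annotation lifetime.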
Benefits of this model: ability to share definitions within a community; consistency checking through reasoning (cite previous papers?); flexibility.
From right to left: data / knowledge layer; framework services; quality views management; targeted compiler(s).