Invited talk @Roma La Sapienza, April '07

Modelling and computing the quality of information in e-science
Paolo Missier, Suzanne Embury, Mark Greenwood, School of Computer Science, University of Manchester, UK
Alun Preece, Binling Jin, Department of Computing Science, University of Aberdeen, UK
http://www.qurator.org
Roma, 3/4/07

Quality of data
Data quality control in the data management practice

Common quality issues

Taxonomy for data quality dimensions

Our motivation: quality in public e-science data
Problem: using third-party data of unknown quality may result in misleading scientific conclusions
Data sources shown: GenBank, UniProt, EnsEMBL, Entrez, dbSNP

Some quality issues in biology
Each of these issues calls for a separate testing procedure, which makes them difficult to generalize

Correctness in biology - examples
Data type | Creation process | Correctness
Uniprot protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives
Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample

Defining quality in e-science is challenging
"Quality": personal criteria for data acceptability

Research goals
Elicit "nuggets" of latent quality knowledge from the experts

Example: protein identification
Pipeline: "wet lab" experiment → protein identification algorithm → data output (protein hitlist) → protein function prediction
Correct entry → true positive
Evidence:
Mass coverage (MC) measures the amount of protein sequence matched
Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / SW package. It is readily available and inexpensive to obtain.

Correctness of protein identification
Estimator function (computes a score rather than a probability): PMF score = (HR x 100) + MC + (ELDP x 10)
Prediction performance, comparing 3 models: ROC curve (true positives vs false positives)
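The estimator and the ROC comparison on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record, the function names and the sample values are hypothetical, and the slides do not give the units or ranges of HR, MC and ELDP. Only the scoring formula itself is taken from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hit:
    """One entry of a protein hitlist with its quality evidence (hypothetical record)."""
    protein_id: str
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(hit: Hit) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), as given on the slide."""
    return hit.hit_ratio * 100 + hit.mass_coverage + hit.eldp * 10


def roc_point(scores: List[float], labels: List[bool], threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) when accepting scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return fpr, tpr


# Made-up sample data, for illustration only.
hits = [Hit("P1", 0.5, 30.0, 2.0), Hit("P2", 0.25, 5.0, 0.0)]
scores = [pmf_score(h) for h in hits]
labels = [True, False]  # True marks a correct entry (true positive)
point = roc_point(scores, labels, threshold=50.0)
```

Sweeping `threshold` over the observed scores yields the points of one ROC curve; repeating this for each candidate model gives the kind of comparison the slide refers to.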

Quality process components
Pipeline: "wet lab" experiment → protein identification algorithm → data output (protein hitlist) → quality filtering → protein function prediction
Goal: to automatically add the additional filtering step in a principled way
Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)

Quality Assertions
Figure: quality-equivalent regions A, B, C, D
Actions are associated with regions, e.g. accept/reject, but possibly more

Layered definition of Quality
Layers: data sources (DB, Env); annotation functions; Quality Assertion functions (QA); Quality Views (QV): definition of acceptability regions
Other labels in the figure: quality evidence annotations; custom quality knowledge; long-lived, reusable; commodities; expert-defined; dynamic; user controlled

Abstract Quality Views

Computable quality views as commodities
Abstract quality views become an executable quality process through binding and compilation
Qurator architectural framework

Quality hypotheses discovery and testing
Figure labels: quality model definition; quality model; abstract quality view; compilation; execution on test data; performance assessment; targeted compilation; target-specific quality component (x3); deployment (x3); quality-enhanced user environment (x3)

Experimental quality
Discovery and validation: "nuggets of quality knowledge"
Figure labels: Quality View; model testing; test datasets; embedding quality views and flow-through testing

Execution model for Quality views
Host workflow: D → D'
Figure labels: host workflow; abstract Quality view; QV compiler; embedded quality workflow; Quality view on D'; Qurator quality framework; services registry; services implementation
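One way to picture the embedding is as function composition: the host workflow step maps D to D', and the compiled quality view is applied to D' before the rest of the workflow continues. The sketch below assumes this reading of the figure. All names (`embed_quality_view`, `identify`, `score_filter`, `report`) and the threshold are hypothetical and do not come from the Qurator framework or from Taverna.

```python
from typing import Callable, List

Hitlist = List[dict]


def embed_quality_view(
    host_step: Callable[[Hitlist], Hitlist],
    quality_view: Callable[[Hitlist], Hitlist],
    downstream: Callable[[Hitlist], List[str]],
) -> Callable[[Hitlist], List[str]]:
    """Return a workflow in which quality_view runs on the host step's output D'."""
    def embedded(d: Hitlist) -> List[str]:
        d_prime = host_step(d)          # host workflow: D -> D'
        filtered = quality_view(d_prime)  # quality view on D'
        return downstream(filtered)
    return embedded


# Toy stand-ins for the workflow steps (illustrative only).
def identify(d: Hitlist) -> Hitlist:
    return [dict(h, identified=True) for h in d]


def score_filter(d_prime: Hitlist) -> Hitlist:
    return [h for h in d_prime if h["score"] >= 50.0]  # hypothetical threshold


def report(hits: Hitlist) -> List[str]:
    return [h["id"] for h in hits]


workflow = embed_quality_view(identify, score_filter, report)
result = workflow([{"id": "P1", "score": 100.0}, {"id": "P2", "score": 30.0}])
```

Keeping the quality view as a separate callable means the host steps stay unchanged, which matches the idea of an embedding point in an existing workflow.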

Example: original proteomics workflow Taverna workflow Quality flow embedding point

Example: embedded quality workflow

Interactive conditions / actions

Generic quality process pattern
Collect evidence: fetch persistent annotations (persistent evidence); compute on-the-fly annotations
<variables> <var variableName="Coverage" evidence="q:Coverage"/> <var variableName="PeptidesCount" evidence="q:PeptidesCount"/> </variables>
Compute assertions (classifiers)
<QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"
Evaluate conditions
Execute actions
<action> <filter> <condition> ScoreClass in {"q:high", "q:mid"} and Coverage > 12 </condition> </filter> </action>
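The four steps of the pattern (collect evidence, compute assertions, evaluate conditions, execute actions) can be sketched as a small Python loop. The condition is the one shown in the XML fragment. Everything else is an assumption made for illustration: the classifier thresholds, the `q:low` class, the `HitScore` evidence name as classifier input, and the data layout are hypothetical stand-ins, not the behaviour of the actual `PIScoreClassifier` service.

```python
from typing import Any, Dict, List

Evidence = Dict[str, Any]  # evidence name -> value for one data item


def classify_score(hit_score: float) -> str:
    """Hypothetical stand-in for the score classifier; thresholds are invented."""
    if hit_score >= 80.0:
        return "q:high"
    if hit_score >= 40.0:
        return "q:mid"
    return "q:low"


def collect_evidence(item: dict, persistent: Dict[str, Evidence]) -> Evidence:
    """Step 1: fetch persistent annotations, then add on-the-fly ones."""
    env: Evidence = dict(persistent.get(item["id"], {}))
    env.setdefault("PeptidesCount", len(item.get("peptides", [])))
    return env


def condition(env: Evidence) -> bool:
    """Step 3: ScoreClass in {"q:high", "q:mid"} and Coverage > 12."""
    return env["ScoreClass"] in {"q:high", "q:mid"} and env["Coverage"] > 12


def run_quality_process(items: List[dict], persistent: Dict[str, Evidence]) -> List[str]:
    kept: List[str] = []
    for item in items:
        env = collect_evidence(item, persistent)
        env["ScoreClass"] = classify_score(env["HitScore"])  # step 2: compute assertion
        if condition(env):                                   # step 3: evaluate condition
            kept.append(item["id"])                          # step 4: filter action
    return kept


# Made-up data, for illustration only.
items = [
    {"id": "hit1", "peptides": ["a", "b", "c"]},
    {"id": "hit2", "peptides": ["a"]},
    {"id": "hit3", "peptides": ["a", "b"]},
]
persistent = {
    "hit1": {"HitScore": 90.0, "Coverage": 25.0},
    "hit2": {"HitScore": 50.0, "Coverage": 8.0},
    "hit3": {"HitScore": 10.0, "Coverage": 40.0},
}
kept = run_quality_process(items, persistent)
```

Here the only action is a filter, as in the fragment. The slides note elsewhere that actions can go beyond accept/reject, which would replace the single `kept` list with one handler per class.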

Reference (semantic) model
Layers: data sources (DB, Env); annotation functions; Quality Assertion functions (QA); Quality Views (QV): definition of acceptability regions
Other labels in the figure: quality evidence annotations; custom quality knowledge
Common Semantic Model (IQ Ontology)

A semantic model for quality concepts
Quality "upper ontology" (OWL)
Quality evidence types
Evidence meta-data model (RDF): evidence annotations are class instances

Main taxonomies and properties
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
Class restriction: MassCoverage is restricted, through is-evidence-for, to ImprintHitEntry
Class restriction: PIScoreClassifier is restricted, through assertion-based-on-evidence, to HitScore and to MassCoverage

The ontology-driven user interface
Detecting inconsistencies: no annotators for this Evidence type
Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
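The two inconsistency checks named on this slide can be sketched with plain Python sets, without an OWL reasoner. This is an assumption-laden simplification: the function and the `ImprintAnnotator` name are hypothetical, and the real interface presumably derives these facts from the ontology. The evidence requirements for `PIScoreClassifier` follow the class restrictions listed on the previous slide.

```python
from typing import Dict, List, Set, Tuple


def find_inconsistencies(
    assertion_inputs: Dict[str, Set[str]],
    annotators: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (evidence types with no annotator, unsatisfied inputs per quality assertion).

    assertion_inputs: quality assertion -> evidence types it is based on
    annotators:       annotation function -> evidence types it can provide
    """
    provided: Set[str] = set().union(*annotators.values())
    needed: Set[str] = set().union(*assertion_inputs.values())
    no_annotator = sorted(needed - provided)
    unsatisfied = {
        qa: sorted(required - provided)
        for qa, required in assertion_inputs.items()
        if required - provided
    }
    return no_annotator, unsatisfied


# PIScoreClassifier's inputs follow the slides; the annotator is hypothetical.
assertion_inputs = {"PIScoreClassifier": {"HitScore", "MassCoverage"}}
annotators = {"ImprintAnnotator": {"HitScore"}}

no_annotator, unsatisfied = find_inconsistencies(assertion_inputs, annotators)
```

With this made-up configuration, `MassCoverage` is reported as having no annotator, and `PIScoreClassifier` is reported with that evidence type as an unsatisfied input.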

Quality-aware query processing

Research issues

Summary
Publications: http://www.qurator.org
Qurator is registered with OMII-UK
