5. IBM Watson delivered ‘unsafe and inaccurate’ cancer recommendations
https://www.massdevice.com/report-ibm-watson-delivered-unsafe-and-inaccurate-cancer-recommendations/
In one example, a patient was recommended a drug that could lead to severe
or fatal hemorrhage while he was already dealing with severe bleeding due to
his condition. In another example, a Florida doctor who reviewed the system
told the company that the technology is "a piece of shit."
Consequences of Poor Data Quality - Inaccuracy
8. Definitions
• Data Quality: commonly conceived as a multi-dimensional construct, with a popular definition being ‘fitness for use’*.
• Dimension: a characteristic of a dataset.
• Metric (or indicator): a procedure for measuring an information quality dimension.
*Juran et al., The Quality Control Handbook, 1974
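To make the dimension/metric distinction concrete, here is a minimal sketch of one common metric, completeness, computed as the fraction of required fields that are actually filled. The field names and sample records are invented for illustration.

```python
def completeness(records, fields):
    """Completeness metric: fraction of required fields that are non-empty
    across all records. Returns 1.0 for an empty dataset by convention."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields
                 if r.get(f) not in (None, ""))
    return filled / total if total else 1.0

# Hypothetical patient records with missing values
patients = [
    {"name": "Ada",  "dob": "1815-12-10", "diagnosis": ""},
    {"name": "Alan", "dob": None,         "diagnosis": "C34"},
]
print(completeness(patients, ["name", "dob", "diagnosis"]))  # 4 of 6 fields filled ≈ 0.667
```

Here "completeness" is the dimension, and the function is one possible metric for it; other metrics (e.g., weighting mandatory fields more heavily) measure the same dimension differently.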
9. Data Quality Assessment Goals
Fix data quality issues in given sets of (semantic) data
Such quality issues may
• be in source datasets (e.g., inaccurate or wrong data items, outdated data items)
• result from imperfections of a data integration process (e.g., data items that have
been incorrectly linked with each other)
• reveal themselves only after the data integration (e.g., duplicates, inconsistencies)
Data cleaning may be relevant both for original datasets before combining/integrating
them and for datasets resulting from an integration.
Source: http://www.ida.liu.se/research/semanticweb/events/SemDataMgmtTutorial-Part7-Cleaning.pdf
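One of the issues named above, duplicates that reveal themselves only after integration, can be detected with a simple normalized-key grouping. This is a sketch under assumed record shapes, not a full entity-resolution pipeline:

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group records sharing a normalized key -- candidate duplicates
    that often only surface once several sources are merged."""
    groups = defaultdict(list)
    for r in records:
        # Normalize: stringify, trim whitespace, lowercase
        key = tuple(str(r.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(r)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical records merged from two sources
merged = [
    {"name": "IBM Watson",  "city": "Armonk"},
    {"name": "ibm watson ", "city": "Armonk"},
    {"name": "DBpedia",     "city": "Leipzig"},
]
print(find_duplicates(merged, ["name"]))  # one group of two near-duplicates
```

Real deduplication of semantic data would additionally compare literals fuzzily and follow owl:sameAs links, but the principle is the same.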
12. Open Quality Dimensions - Accessibility
Availability - extent to which data (or some portion of it) is present, obtainable and
ready for use
Metrics:
• accessibility of the SPARQL endpoint and the server
• dereferenceability of the URI
Is there access information for resources provided?
Is information available that can help to discover/search datasets?
Is there information about format, size or update frequency of the resources?
Can the described resources be retrieved by an agent?
Also relates to Metadata Quality!
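The dereferenceability metric can be sketched with a plain HTTP check: request the URI with an RDF-friendly Accept header and classify the response status. This is a minimal illustration using only the standard library; production tools would also follow content negotiation and redirect chains.

```python
import urllib.request
import urllib.error

def deref_status(uri, timeout=10):
    """Attempt to dereference a URI, asking for RDF serializations.
    Returns the HTTP status code, or None if the server is unreachable."""
    req = urllib.request.Request(
        uri,
        headers={"Accept": "application/rdf+xml, text/turtle"},
        method="HEAD",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code          # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None            # endpoint/server not accessible at all

def is_dereferenceable(status):
    """A URI counts as dereferenceable if the final response is 2xx."""
    return status is not None and 200 <= status < 300
```

Usage would be e.g. `is_dereferenceable(deref_status("http://dbpedia.org/resource/Berlin"))`; the same probe against a SPARQL endpoint URL doubles as a crude endpoint-availability check.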
13. Open Quality Dimensions - Interlinking
Interlinking - degree to which entities that represent the same
concept are linked to each other, be it within or between two or
more data sources
Metrics:
• detection of the existence and usage of external URIs (target
dataset)
• detection of all local in-links or back-links: all triples from a
dataset that have the resource’s URI as the object
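Both interlinking metrics reduce to simple filters over the triples: external out-links are triples whose object URI lies outside the dataset's own domain, and in-links are triples with a given resource URI as object. A sketch, with an invented `ex.org` dataset:

```python
from urllib.parse import urlparse

def external_links(triples, local_domain):
    """Triples whose object is a URI outside the dataset's own domain --
    evidence of interlinking with external target datasets."""
    return [t for t in triples
            if t[2].startswith("http") and urlparse(t[2]).netloc != local_domain]

def in_links(triples, resource_uri):
    """All triples that have the given resource URI as their object
    (local in-links / back-links)."""
    return [t for t in triples if t[2] == resource_uri]

# Hypothetical triples as (subject, predicate, object)
triples = [
    ("http://ex.org/Berlin",  "owl:sameAs", "http://dbpedia.org/resource/Berlin"),
    ("http://ex.org/Berlin",  "ex:country", "http://ex.org/Germany"),
    ("http://ex.org/Potsdam", "ex:near",    "http://ex.org/Berlin"),
]
print(len(external_links(triples, "ex.org")))          # 1 (the DBpedia link)
print(len(in_links(triples, "http://ex.org/Berlin")))  # 1 (the ex:near triple)
```

On a real dataset these filters would typically be expressed as SPARQL queries rather than Python list comprehensions.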
14. Open Quality Dimensions - Contextual
Trustworthiness - defined as the degree to which the information is accepted to
be correct, true, real and credible.
Metrics:
• annotating triples with provenance data and using the provenance history to
evaluate the trustworthiness of facts
- FAIR provenance and metadata, provenance ontologies (PROV-O, HCLS)
• opinion-based methods, which use trust annotations made by several
individuals
• specification of the license of the dataset or resource
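The opinion-based metric can be sketched as aggregating per-fact trust ratings from several individuals; here simply as an average over ratings in [0, 1]. The fact label and ratings are invented, and real systems weight raters by their own reliability:

```python
def opinion_trust(annotations):
    """Opinion-based trust score for one fact: the mean of trust ratings
    in [0, 1] given by several individuals. None if nobody rated it."""
    return sum(annotations) / len(annotations) if annotations else None

# Hypothetical trust annotations from three raters
ratings = {"dbpedia:Berlin populationTotal": [1.0, 0.8, 0.9]}
for fact, votes in ratings.items():
    print(fact, round(opinion_trust(votes), 2))  # 0.9
```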
15. Open Data Quality Dimensions - Open Data
Is the specified format and license information suitable to
classify a dataset as open?
Metrics:
• Is the file format based on an open standard?
• Can the file format be considered as machine readable?
• Does the used license conform to the open definition?
Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres, 2015. Automated Quality
Assessment of Metadata across Open Data Portals. ACM Journal of Data and
Information Quality. DOI: http://dx.doi.org/10.1145/0000000.0000000
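The three checks above can be sketched as lookups against whitelists of open formats and Open Definition-conformant licenses. The lists below are illustrative stand-ins, not the curated lists a real portal assessment would use:

```python
# Assumed whitelists -- real assessments use curated lists, e.g. the
# Open Definition's register of conformant licenses.
OPEN_FORMATS = {"csv", "json", "xml", "rdf", "ttl"}      # open, machine-readable
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "ODbL-1.0"}     # Open Definition conformant

def is_open(fmt, license_id):
    """Classify a dataset as open if both its file format and its license
    appear in the (assumed) whitelists above."""
    return fmt.lower() in OPEN_FORMATS and license_id in OPEN_LICENSES

print(is_open("CSV", "CC-BY-4.0"))  # True
print(is_open("xls", "CC-BY-4.0"))  # False: proprietary format
```

Machine readability is approximated here by format membership; a finer-grained metric would distinguish, say, CSV (structured) from PDF (open standard but not machine-readable as data).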
18. Crowdsourcing Quality Assessment
Acosta, M.; Zaveri, A.; Simperl, E.; Kontokostas, D.; Flöck, F. & Lehmann, J. (2016),
'Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study', Semantic Web Journal.
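A core step in crowdsourced quality assessment is aggregating several workers' judgments about the same triple; the simplest aggregation is a majority vote. This sketch is a generic illustration of that step, not the specific pipeline of the cited study:

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate crowd judgments for one triple by majority vote.
    Ties return None, flagging the triple for expert review."""
    counts = Counter(judgments).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

print(majority_vote(["correct", "correct", "incorrect"]))  # correct
print(majority_vote(["correct", "incorrect"]))             # None (tie)
```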