This document discusses using Wikidata as a central repository for chemistry data currently found in Wikipedia infoboxes. It notes issues with the current approach and outlines Wikidata's data model and features that make it suitable for this purpose. As an example, it describes how gene Wiki info boxes have been migrated to Wikidata. It provides guidance on resolving issues with isomers and outlines efforts to improve data quality for chemical compounds in Wikidata.
1. Drug and chemical compound items in Wikidata
as a data source for Wikipedia infoboxes
https://commons.wikimedia.org/wiki/File:Wikidata-logo-en.svg
Sebastian Burgstaller-Muehlbacher, PhD
User:Sebotic
Twitter: @sebotic
2. Contents
● The Problem
● Introduction to Wikidata
– Data model
– References
– Values/data types
● Gene Wiki Info Boxes - An Example solution
● Chemistry Data in Wikidata
– Issues with the data
– Community cleanup
– Migration of Info Boxes to Wikidata
3. The Problem (with chemistry data)
● Wikipedia has ~300 different language projects
● Currently, chemistry data resides in info box parameters
– Data are not reusable between language projects
– Data are not machine readable
– Data are hard to update automatically
– Data cannot be reused for other purposes, e.g. science.
6. Wikidata items
● Two types of entities
– Properties (Pxxxx):
● Describe the nature of a data value
● Different data types
● 2,900 different properties in Wikidata
– Data items (Qxxxx):
● A set of claims or statements
● Consist of property value pairs
● 20 million items in Wikidata
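The relationship between items, properties, claims and statements can be sketched in a few lines of Python. This is only an illustrative model, assuming Python 3.7+; the class and field names (`Claim`, `Statement`, `Item`, `values_of`) are hypothetical and are not part of any Wikidata library, and the Q-id in the example is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Claim:
    """A property (Pxxxx) paired with a value, plus optional qualifiers."""
    property_id: str
    value: str
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Statement:
    """A claim together with the references that support it."""
    claim: Claim
    references: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class Item:
    """A data item (Qxxxx): labels plus a set of statements."""
    qid: str
    labels: Dict[str, str] = field(default_factory=dict)
    statements: List[Statement] = field(default_factory=list)

    def values_of(self, property_id: str) -> List[str]:
        """Return all values stored on this item for one property."""
        return [s.claim.value for s in self.statements
                if s.claim.property_id == property_id]


# Hypothetical item carrying one InChI key statement (P235 is assumed
# here to be the InChIKey property; the Q-id is a placeholder).
example_item = Item(
    qid="Q0000",
    labels={"en": "example compound"},
    statements=[Statement(Claim("P235", "UEJJHQNACJXSKW-UHFFFAOYSA-N"))],
)
```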
10. Wikidata Data types
● The current Wikidata data types:
– String
– WDItemID
– External ID
– MonolingualText
– Property
– Quantity
– Time
– Url
– GlobeCoordinate
– CommonsMedia
– Mathematical formula
11. Unique Features of Wikidata
● Completely free, even for commercial usage (CC0).
● Granular: Single values with references.
● Anybody can contribute.
● Extensive item history.
● A repository for data on all domains of knowledge.
● Full integration with the semantic web.
● Essentially: A giant graph of knowledge.
15. Issues with chemical data in the Wiki space
● Incorrect identifiers in info boxes or on Wikidata items
● Incorrect chemical properties
● Incorrect labels, aliases
● Incorrect isomeric forms of the compound
● Mixture of different isomeric forms
17. How to solve Isomerism issues?
● Make sure that the structures in Wikidata and Wikipedia are correct
and consistent:
– Use the InChI (International Chemical Identifier) or InChI key to determine
which isomer a certain article or WD item is actually talking about.
18. What are InChIs
● IUPAC InChI (International Chemical Identifier).
● Describes the structure of a chemical compound or substance.
● Freely usable.
● Can be computed from e.g. SMILES or MOL format.
● Do not need to be assigned by an organization.
19. What are InChI keys
● The SHA-256 hashed version of an InChI
● Makes chemicals searchable on the Web
● Makes chemicals easily comparable
● Short, unique
● Example: UEJJHQNACJXSKW-UHFFFAOYSA-N
– First block (14 letters) encodes the skeleton (connectivity)
– Second block (first 8 letters) encodes stereochemistry and isotopes
– Last letter encodes the number of protons (charge)
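The block structure described above can be checked mechanically. The following is a minimal sketch that splits a standard-format InChI key into its parts; the function and field names are hypothetical. It assumes the usual layout in which the second block carries 8 hash letters followed by a standard/non-standard flag and a version letter.

```python
import re
from typing import NamedTuple

# Layout: 14 letters, hyphen, 8 letters + flag + version, hyphen, 1 letter.
INCHI_KEY_PATTERN = re.compile(
    r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$"
)


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of the connectivity (first block)
    stereo: str        # hash of stereochemistry and isotopes (8 letters)
    is_standard: bool  # True if the flag letter is 'S' (standard InChI)
    version: str       # InChI version letter ('A' for version 1)
    protonation: str   # last letter; 'N' means neutral


def split_inchi_key(key: str) -> InChIKeyParts:
    """Split an InChI key into its blocks, or raise ValueError."""
    match = INCHI_KEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError("not a well-formed InChI key: %r" % key)
    skeleton, stereo, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, stereo, flag == "S", version, protonation)


parts = split_inchi_key("UEJJHQNACJXSKW-UHFFFAOYSA-N")
```

A check like this only confirms that a key is well formed. It cannot tell whether the key belongs to the compound an article describes.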
20. How to solve Isomerism issues?
● Make sure that the structures in Wikidata and Wikipedia are correct
and consistent:
– Use the InChI (International Chemical Identifier) or InChI key to determine
which isomer a certain article or WD item is actually talking about.
– Minimum requirement: Correct, unique InChI key on item.
– Best case: Make sure all structural identifiers are correct (isomeric
SMILES, canonical SMILES, InChI or InChI key).
– A minimum of a correct InChI key allows for the rest of the chemical
compound item to be populated by (our) bots.
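One way to spot a possible isomer mismatch between an info box and a Wikidata item is to compare their InChI keys block by block. The sketch below is an illustration only and not the method used by the bots mentioned above; `compare_inchi_keys` is a hypothetical helper, and the second and third keys in the example use made-up stereo blocks.

```python
def compare_inchi_keys(key_a: str, key_b: str) -> str:
    """Classify how two InChI keys relate to each other.

    Returns "identical", "same skeleton" (first block matches, so the
    keys likely describe different isomers or charge states of one
    skeleton), or "different".
    """
    blocks_a = key_a.strip().upper().split("-")
    blocks_b = key_b.strip().upper().split("-")
    if len(blocks_a) != 3 or len(blocks_b) != 3:
        raise ValueError("InChI keys must consist of three blocks")
    if blocks_a == blocks_b:
        return "identical"
    if blocks_a[0] == blocks_b[0]:
        return "same skeleton"
    return "different"


# The first key is the one shown on the InChI key slide; the other two
# are placeholders invented for this example.
infobox_key = "UEJJHQNACJXSKW-UHFFFAOYSA-N"
wikidata_key = "UEJJHQNACJXSKW-ABCDEFGHSA-N"
unrelated_key = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"

result = compare_inchi_keys(infobox_key, wikidata_key)
```

A "same skeleton" result is the case that needs human attention: the article and the item agree on connectivity but not on stereochemistry.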
21. What has been accomplished so far?
● Discussion on WikiProject Chemistry:
https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wiki
– General consensus that info boxes should use Wikidata
– Wikidata needs to improve on data quality
● Of the 17,000 original chemical compound Wikidata items, 16,000
have been validated around an InChI key.
● More chemical data has been imported, so it is readily
available for new Wikipedia articles or correction of existing ones.
22. Things that need your attention
● I generated a list of items at Wikidata WikiProject Chemistry which
need human intervention.
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Annota
Please have a look at those and unify the stereochemistry and
identifiers around one unique InChI key!
23. Data maintenance in Wikidata
● Our bots are written in Python (2.7 and 3.x compatible).
● Python bots keep Wikidata in sync with authoritative data
sources (PubChem, ChemSpider, ChEBI, ChEMBL).
● Bots are run according to data release cycles of authoritative
data sources.
● Mechanisms in place for detection of inconsistencies.
● Contributions of other Wikidata users are being accounted for,
based on references.
24. Wikidata API and query endpoints
● Three ways to access data:
– Wikidata API allows read, write and full text search.
(www.wikidata.org/w/api.php)
– REST endpoint for fast, direct data access.
(queryr.wmflabs.org/)
– Wikidata query service (WDQS) as a SPARQL endpoint for complex
queries.
(query.wikidata.org/)
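As a rough illustration of the API and SPARQL access paths listed above, the following sketch builds request URLs using only the Python 3 standard library. It assumes that `wbgetentities` is the API module used for reading an item, that the SPARQL endpoint lives at `query.wikidata.org/sparql`, and that P235 is the InChIKey property; these details should be verified against the current Wikidata documentation. The helper names and the User-Agent string are hypothetical.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"
SPARQL_URL = "https://query.wikidata.org/sparql"  # assumed endpoint path


def entity_url(qid: str) -> str:
    """URL that asks the Wikidata API for one item as JSON."""
    params = {"action": "wbgetentities", "ids": qid, "format": "json"}
    return API_URL + "?" + urlencode(params)


def inchi_key_query(inchi_key: str) -> str:
    """SPARQL query for items carrying the given InChI key (P235 assumed)."""
    # Validate first so arbitrary text is never pasted into the query.
    if not re.fullmatch(r"[A-Z]{14}-[A-Z]{10}-[A-Z]", inchi_key):
        raise ValueError("not a well-formed InChI key: %r" % inchi_key)
    return 'SELECT ?item WHERE {{ ?item wdt:P235 "{}" . }}'.format(inchi_key)


def sparql_url(query: str) -> str:
    """URL that submits a query to the SPARQL endpoint, asking for JSON."""
    return SPARQL_URL + "?" + urlencode({"query": query, "format": "json"})


def fetch_json(url: str) -> dict:
    """Perform the HTTP GET; requires network access, so not called here."""
    request = Request(url, headers={"User-Agent": "chem-infobox-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)


query = inchi_key_query("UEJJHQNACJXSKW-UHFFFAOYSA-N")
url = sparql_url(query)
```

Calling `fetch_json(url)` would then return the matching items, if any.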
25. Acknowledgments
Andrew Su
Benjamin Good
Tim Putman
Julia Turner
Gregg Stupp
(TSRI)
Gang Fu
Evan Bolton
(NIH, PubChem)
Andra Waagmeester
(Micelio.be)
Elvira Mitraka
Lynn Schriml
(Disease Ontology, U Baltimore)
Editor's notes
-Labels, descriptions, aliases in different languages
-Diverse Properties
-Sitelinks
-Properties must be proposed and approved by the community
-Data items can be edited by any Wikidata user and are the true data stores.
Claim: Property with value + optional qualifiers
Statement: A claim with its references
-Many queries to the Wikidata API make the bot slow and might make Wikimedia people/administrators unhappy.
-Calling wbeditentity ensures that all data is either written completely or not at all, so if the connection or bot breaks, no harm is done.
-No new items will be created and then left unpopulated.
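The single-call write described in these notes can be sketched as follows: all statements for an item are collected first and then sent in one `wbeditentity` request. This only builds the request parameters and does not send them. The helper names are hypothetical, the token is a placeholder that a real bot would obtain after logging in, and the Q-id is assumed to be a sandbox item; the JSON layout follows the Wikibase claim format as far as it is known here and should be checked against the API documentation.

```python
import json
from typing import Dict, List


def string_claim(property_id: str, value: str) -> dict:
    """Build one statement with a plain string value in Wikibase JSON."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {"value": value, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }


def build_edit_request(qid: str, claims: List[dict],
                       csrf_token: str) -> Dict[str, str]:
    """Bundle every claim into the parameters of ONE wbeditentity call.

    Because everything travels in a single request, the item is either
    updated completely or left untouched.
    """
    return {
        "action": "wbeditentity",
        "id": qid,
        "data": json.dumps({"claims": claims}),
        "token": csrf_token,
        "format": "json",
    }


# Placeholder values: "Q4115189" is assumed to be a sandbox item and the
# token string is not a real token.
claims = [string_claim("P235", "UEJJHQNACJXSKW-UHFFFAOYSA-N")]
params = build_edit_request("Q4115189", claims, "PLACEHOLDER-TOKEN")
```

Sending `params` as an HTTP POST to the API would then perform the write in one step, which also keeps the number of API calls per item low.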
Single value refs/nano publications
Revisions/data releases
The SPARQL endpoint allows complex and also federated queries on the full WD content.
REST and SPARQL are still in beta mode.