1. Wikidata & Heritage Data
Where do we stand? What’s next?
Lausanne, 14 September 2017
Sijie Dai, Captain Alving – Prix de Lausanne 2010. Photo by Inisheer, CC BY-SA (Wikimedia Commons)
Unless otherwise noted, the content of this presentation is made available under the CC BY 4.0 license.
2. ▶ The aim of this project is to coordinate, facilitate and promote
the ingestion of cultural heritage related data into
Wikidata, to facilitate the cleansing and enhancement of this
data and to promote its use across Wikipedia, its sister
projects and beyond.
▶ It is our vision to establish Wikidata as a central hub for data
integration, data enhancement, and data management in
the heritage domain.
Aim and Vision (WikiProject Cultural Heritage)
3. ▶ Establish Wikidata as a database that covers the entire world’s
cultural heritage.
▶ Establish Wikidata as a central hub that interlinks GLAM collections
around the world and provides links to bibliographic, genealogical,
scientific and other collections of information; create the ultimate
authority file.
▶ Foster truly multilingual and global collaboration among people
from various backgrounds.
▶ Leverage synergies between institutions, reduce duplicate work.
▶ Encourage debate in the community by highlighting and
interrogating differences in perspective.
▶ Provide a single source of data for some of the most popular web
sites and apps, including Wikipedia infoboxes and lists.
Vision (Blog posts: Stinson et al. 2016; Thornton / Cochrane 2016; Poulter 2017)
7. ▶ Wikidata needs to be explained to institutions in view of data
donations.
• Lack of awareness of the importance of open licenses in
databases
• Fears of loss of control related to publishing data under CC-0
• What can institutions gain from their involvement in Wikidata?
▶ Community members need assistance with scraping data from
websites.
▶ Present coverage is biased; it is highest for Western Europe and
North America; how to get access to data from other world regions?
How To Get Access to Freely Licensed Data?
8. ▶ http://make.opendata.ch/wiki/data:glam_ch
• Personnalités Vaudoises (BCUL)
• Swiss Photography Metadata (Büro für Fotografiegeschichte)
• Artist data from the SIKART Lexicon on art in Switzerland (SIK-ISEA)
• Metadata of the Historical Dictionary of Switzerland (HLS)
• PCP Inventory (Federal Office for Civil Protection)
• Inventory of Historical Monuments (Canton of Zurich)
• Inventory of Historical Monuments (City of Zurich)
• Inventory of classified Gardens and Parks (City of Zurich)
• Art in the Urban Space (City of Zurich)
• Swiss GLAM Inventory (OpenGLAM)
• Inventory of Research Libraries in Switzerland (Swissbib)
• ISplus Swiss (G)LAM Inventory (Swiss National Library)
• Schauspielhaus Zürich Repertoire of Theatre and other Productions, 1938–1968
• Swiss Theatre Metadata (Swiss Theatre Collection)
• Plazi TreatmentBank (repository of the world's species) (Plazi.org)
• Historical Statistics of Switzerland (University of Zurich)
Data Provision – Which Datasets are Useful?
13. ▶ Coping with the Bazaar:
• Sometimes changes to property definitions are too easily made by
volunteers
• There is a rigorous process for creating new properties, but not for
changing definitions of properties or creating new classes
• No master language; how to keep translations of definitions in sync?
• Sometimes different approaches are used to model the same thing.
▶ What are good design principles?
• Re-usability of properties across various domains
• Select high priority areas first, do not try to solve everything overnight for
the entire cultural heritage domain
• …
▶ Finding a balance between:
• The expressive power of an ontology
• Its practicability when it comes to large scale use by many people
• Its queryability (usability from the perspective of data users)
Challenges Related to Ontology Development (2/2)
14. ▶ Mapping Between Data Models
• Getting an overview of appropriate properties and classes can be a
time-consuming exercise.
• Creating new properties requires community agreement and may involve
lengthy discussions and compromises.
• There is still a lot of work to be done in the area of typologies and
thesauri [Example]
▶ Matching Items / Disambiguation
• There are tools like Mix’n’Match and OpenRefine to support this, but it
remains a major challenge, esp. with datasets which haven’t resolved this
issue internally.
▶ Incorrect / Incoherent Data on Wikidata
• Many data ingestion projects require cleaning up existing data.
▶ Repeated Ingestion / Updates
• How to approach the historicization of data?
• How to set up processes to regularly update data?
Challenges Related to Data Ingestion
N.B.: We are not filling a void or starting from scratch, but contributing to an
existing ecosystem of data, data models, and community members!
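To make the matching and disambiguation challenge concrete, here is a minimal sketch of label-based candidate matching using only the Python standard library. It is not how Mix'n'Match or OpenRefine work internally; the function names, the 0.85 threshold, and the candidate IDs and labels are all illustrative assumptions. In practice the candidates would come from a Wikidata query, and every match would still need human review.

```python
import difflib
import unicodedata


def normalize(label):
    """Lowercase, strip accents and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def match_candidates(name, candidates, threshold=0.85):
    """Return (item_id, label, score) tuples at or above threshold, best first.

    `candidates` maps an item ID to its label. The threshold is an
    arbitrary starting point, not a recommended value.
    """
    target = normalize(name)
    scored = []
    for item_id, label in candidates.items():
        score = difflib.SequenceMatcher(None, target, normalize(label)).ratio()
        if score >= threshold:
            scored.append((item_id, label, round(score, 2)))
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Placeholder IDs and labels, not real Wikidata items.
candidates = {
    "Q_A": "Musée de l'Élysée",
    "Q_B": "Musée olympique",
    "Q_C": "Musee de l'Elysee Lausanne",
}
matches = match_candidates("Musee de l'Elysee", candidates)
for item_id, label, score in matches:
    print(item_id, label, score)
```

String similarity alone cannot tell apart two institutions with the same name, which is why datasets that have not resolved duplicates internally remain hard to match.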
17. ▶ Establishing and Documenting Data Quality
• Getting rid of duplicates
• Dealing with incorrect and inconsistent data
• How to monitor data quality and data completeness?
▶ Building a Network of Trust
• Linking all statements to a reliable source
• In the future: “Signed Statements”
▶ Data Exchange Between Wikidata and Primary Databases
• Data synchronization: How to keep data mutually up to date?
• How to make it easier for GLAM employees to follow
changes/improvements to their data on Wikidata?
Challenges Related to Data Maintenance
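The synchronization question can be illustrated with a small sketch that compares two snapshots of the same dataset, for example an institution's export and the corresponding data retrieved from Wikidata. The snapshot structure (a dictionary of records keyed by a local ID), the function name, and the sample records are assumptions made for illustration; this is not an existing tool.

```python
def diff_snapshots(old, new):
    """Compare two {record_id: {field: value}} snapshots.

    Returns the IDs that were added or removed, plus the changed fields
    of records present in both snapshots as (old_value, new_value) pairs.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for record_id in sorted(set(old) & set(new)):
        all_fields = sorted(set(old[record_id]) | set(new[record_id]))
        fields = {
            field: (old[record_id].get(field), new[record_id].get(field))
            for field in all_fields
            if old[record_id].get(field) != new[record_id].get(field)
        }
        if fields:
            changed[record_id] = fields
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical records keyed by a local identifier.
old = {
    "inst-1": {"name": "Stadtarchiv", "city": "Bern"},
    "inst-2": {"name": "Kunsthaus", "city": "Zürich"},
}
new = {
    "inst-1": {"name": "Stadtarchiv Bern", "city": "Bern"},
    "inst-3": {"name": "Fotostiftung", "city": "Winterthur"},
}
report = diff_snapshots(old, new)
print(report)
```

A report like this only shows that two sources disagree; deciding which side is correct, and in which direction to propagate the change, remains an editorial decision.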
18. ▶ Chicken and Egg Problem:
• Data usage drives data quality & completeness
• Data quality & completeness are prerequisites of data use
Challenges Related to Data Use
20. ▶ Linking Wikidata with other databases
• Map existing standards from the GLAM sector to Wikidata
• Merge data imported from Wikipedia with data from reliable databases
▶ In what areas is Wikidata supposed to…
• serve as the master database (referencing sources other than databases)?
• hold data imported from reliable databases?
• link to authoritative databases (without holding the actual data)?
▶ How should GLAMs organize their relationship with Wikidata?
• Provide mutual links?
• Ingest part or all of their data into Wikidata?
• Synchronize part or all of their data with Wikidata?
• Use Wikidata as their main database?
Wikidata and the Wider Data Landscape
21. ▶ How to improve guidelines, community structures, reporting etc. in
order to be able to involve more GLAM personnel in Wikidata?
▶ How best to foster a shared data modelling practice in various
areas? (Need for more modelling showcases, coordination, etc.)
▶ Need for training and tools (to facilitate the accomplishment of
certain tasks).
▶ The evolving tools landscape constitutes a challenge when
establishing processes and working with guidelines.
▶ https://www.wikidata.org/wiki/Wikidata:WikiProject_Cultural_heritage
▶ Wikidata + GLAM Facebook Group
Community & Collaboration
22. Useful Tools
▶ Example: Tools I used for the ingest of the Swiss GLAM
Inventory:
• Microsoft Excel / OpenOffice Calc
• Wikidata Query Service
• OpenRefine
• Reconcile-csv
• Listeria
• QuickStatements
• Microsoft Word / Excel (mail merge)
• Hatnote: «Listen to Wikipedia»
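As an illustration of the last step in such a tool chain, the sketch below turns spreadsheet-style rows into tab-separated commands in the QuickStatements version 1 text format (`CREATE`, then `LAST` lines for the newly created item). It is not the workflow actually used for the Swiss GLAM Inventory; the function name, the row layout, and the sample row are assumptions, and the property and item IDs passed in should be verified on Wikidata before any real batch is run.

```python
def to_quickstatements(rows, instance_of, country):
    """Build QuickStatements V1 commands that create one item per row.

    Each row is a dict with "label" and "lang" keys (an assumed layout).
    Labels containing double quotes are not handled in this sketch.
    """
    lines = []
    for row in rows:
        lines.append("CREATE")
        # Label in the given language, e.g. Len for an English label.
        lines.append(f'LAST\tL{row["lang"]}\t"{row["label"]}"')
        # P31 = instance of, P17 = country.
        lines.append(f"LAST\tP31\t{instance_of}")
        lines.append(f"LAST\tP17\t{country}")
    return "\n".join(lines)


# Hypothetical row; Q33506 (museum) and Q39 (Switzerland) are used as
# illustrative values and should be double-checked before use.
rows = [{"label": "Example Museum", "lang": "en"}]
batch = to_quickstatements(rows, instance_of="Q33506", country="Q39")
print(batch)
```

Before creating items this way, the rows should first be matched against existing Wikidata items to avoid introducing duplicates.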
23. ▶ Diff tools to help tracking changes in datasets on Wikidata and to
synchronize with external databases
▶ Statistics tools (data completeness; data use)
▶ Data visualization tools (beyond what the Query Service can already
do)
▶ Data tracking tools (data completeness; see how data evolves)
▶ Improved version of the QuickStatements tool (see feature
requests)
▶ Customizable forms for manual data entry
Tools – Wishlist
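A data completeness statistic of the kind wished for above could, in its simplest form, look like the following sketch. It assumes the items have already been retrieved (for example via the Wikidata Query Service) and are represented as dictionaries mapping property IDs to lists of values; that representation, the function name, and the sample items are illustrative assumptions.

```python
def completeness(items, properties):
    """Return, per property, the share of items with at least one value."""
    total = len(items)
    if total == 0:
        return {prop: 0.0 for prop in properties}
    return {
        prop: sum(1 for item in items if item.get(prop)) / total
        for prop in properties
    }


# Hypothetical items; P625 = coordinate location, P856 = official website.
items = [
    {"P625": ["46.52,6.63"], "P856": ["https://example.org"]},
    {"P625": ["47.37,8.54"]},
    {"P625": ["46.95,7.45"], "P856": []},
    {},
]
stats = completeness(items, ["P625", "P856"])
print(stats)
```

Tracking these shares over time, rather than at a single point, would show how a dataset evolves after ingestion.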
24. Thank You for Your Attention!
Contact
Beat Estermann
Bern University of Applied Sciences
beat.estermann@bfh.ch
+41 31 848 34 38