Key fields in the Humanities, such as History and Musicology, are central to the major transformation brought about by the Digital Humanities (DH). A fundamental question in DH is how humanities datasets can be represented digitally, in such a way that machines can process them, understand their meaning, facilitate their inquiry, and exchange them on the Web. In this talk, I will argue that humanities scholars and computer scientists should interact further, by surveying our current work using Semantic Web technology to represent DH objects in Quantitative History and Symbolic Music. Importantly, I will also argue that the technical knowledge gap between the Semantic Web community and many of its application domains, DH among them, is currently too wide, and thus these domains face problems in accessing and consuming semantically-enabled humanities data. To address these problems, I will demo our current work on automatic Linked Data API construction (heavily inspired by work done at KMi), historical statistics preprocessing and publishing, and music linkage on the Web.
2. REFINING STATISTICAL DATA ON THE WEB
3. Vrije Universiteit Amsterdam
ME
• Postdoc researcher at VU University Amsterdam, Knowledge Representation & Reasoning
• Interfaces between the Digital Humanities and the Semantic Web
• Enabling intelligent data preprocessing and universal access to Linked Data
• Ontologies, Linked Data, Data Integration, APIs, repeatability, provenance
4. WHY THIS TALK?
• What is Digital Humanities?
• “to study human culture in a more scientific way”
• Albert: “doing humanities is exactly equal to doing science”
• Repeatability
• Hypothesis testing
• Pragmatic, clean, idealized
• Jacky: “doing humanities is completely different from doing science”
• Interpretative approach, relativistic
• Give value to argumentation and vagueness instead of truth
• Focus on the questions we do ask
• https://storify.com/ingorohlfing/overly-honest-methods-in-science
• Is doing humanities exactly equal to doing science?
5. QUANTITATIVE HISTORY ON THE SEMANTIC WEB
7. DATA PREPARATION
Present data = high volume
Historical data = high variety
Multiple legacy (tabular) formats
Diverse identity, unity, rigidity and dependence
Preparing them to gain knowledge is expensive
Manual data munging
Hardly reproducible
12. DATA MODELS: CSVW + RDF DATA CUBE
Semi-automatic
Generic
Domain independent
Microdata = CSVW [COW]
Macrodata = RDF Data Cube [QBer] [TabLinker]
Credits to Rinke Hoekstra
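To make the macrodata model above more concrete, here is a minimal sketch of turning one row of a tabular file into an RDF Data Cube observation. It is not how COW, QBer or TabLinker work internally; the `http://example.org/census/` namespace, the column names and the `row_to_observation` helper are assumptions made for illustration, and literal escaping is ignored.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/census/"  # hypothetical namespace, not from the talk


def row_to_observation(row_id, row, dataset="ds1"):
    """Return N-Triples lines describing one table row as a qb:Observation."""
    obs = f"<{EX}obs/{row_id}>"
    triples = [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{EX}{dataset}> .",
    ]
    # Each column becomes a (hypothetical) dimension property with a plain literal.
    for column, value in row.items():
        triples.append(f'{obs} <{EX}dimension/{column}> "{value}" .')
    return triples


# Toy table standing in for a legacy census file.
table = "municipality,occupation,population\nAmsterdam,baker,120\n"
rows = list(csv.DictReader(io.StringIO(table)))
triples = row_to_observation(0, rows[0])
print("\n".join(triples))
```

A real conversion would also need to map column values to shared code lists rather than plain strings.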
13. LSD DIMENSIONS
http://lsd-dimensions.org/
Index of statistical dimensions and associated concept schemes on the Web
17. New code lists
• HISCO
http://historyofwork.iisg.nl/ Credits to Richard Zijdeman
18. New code lists
• Gemeentegeschiedenis.nl
http://www.gemeentegeschiedenis.nl/ Credits to Ivo Zandhuis
19. New code lists
http://licr.io/ Credits to Ashkan Ashkpour
20. Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies
21. Changing schemas over time
Do Linked Data vocabularies evolve in a predictable way?
1. Performant models can be learned from past vocabulary versions
2. Can be used to pinpoint resources susceptible to change, radical changes
3. Their predictive power depends on the smoothness of changes between versions
4. 39.8% of the LOD vocabularies are highly predictable
Snapshot data is insufficient
Fine-grained, commit-based stories (ALIGNED, PERICLES, later in this presentation)
22. Dubious data quality
Linked Edit Rules enable sharing knowledge to assess quality of Linked Statistical Data
How to communicate the results of LER for further processing?
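As an illustration of what an edit rule checks, the sketch below applies two consistency constraints to statistical observations and collects the violations in a report that another tool could consume. This is a plain-Python stand-in, not the Linked Edit Rules implementation; the rule names, field names and `check_edit_rules` helper are assumptions.

```python
def check_edit_rules(observation, rules):
    """Return the names of the rules that the observation violates."""
    return [name for name, rule in rules.items() if not rule(observation)]


# Hypothetical edit rules: each maps a name to a predicate over one observation.
rules = {
    "non_negative_population": lambda o: o["population"] >= 0,
    "total_is_sum": lambda o: o["population"] == o["men"] + o["women"],
}

good = {"population": 120, "men": 58, "women": 62}
bad = {"population": 120, "men": 70, "women": 62}

# The report is a simple dict so that results can be passed on for further processing.
report = {
    "good": check_edit_rules(good, rules),
    "bad": check_edit_rules(bad, rules),
}
print(report)
```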
23. A BASIC WEB SYSTEMS COMMUNICATION TOOLKIT
1. Endpoint location is volatile
Names encapsulate semantics of operations → Should be meaningless, just as email addresses
HTTP : http://example.org/canihasdata
2. Consensus on data semantics is necessary
Simple object exchange format + 15 years of Web ontology development to semantically describe data
JSON-LD : [{ "@id": "eg:Albert", "rdf:type": [{ "@id": "foaf:Person" }]}]
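Because JSON-LD is ordinary JSON, any client can read the snippet above with a standard JSON parser. The sketch below does so with Python's `json` module and looks up the types of one node; the `types_of` helper is illustrative only, and no prefix expansion is attempted since the slide gives no `@context`.

```python
import json

# The JSON-LD document from the slide, parsed as plain JSON.
doc = json.loads('[{ "@id": "eg:Albert", "rdf:type": [{ "@id": "foaf:Person" }]}]')


def types_of(nodes, node_id):
    """Return the identifiers listed under rdf:type for the node with the given @id."""
    for node in nodes:
        if node.get("@id") == node_id:
            return [t["@id"] for t in node.get("rdf:type", [])]
    return []


albert_types = types_of(doc, "eg:Albert")
print(albert_types)
```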
26. SCRY (http://scry.rocks/)
• Custom data processing in SPARQL needed in multiple domains
• SPARQL cannot be extended in a standard-compliant way
PREFIX : <http://scry.rocks/example/>
PREFIX scry: <http://scry.rocks/>
PREFIX impute: <http://scry.rocks/math/impute?>
PREFIX mean: <http://scry.rocks/math/mean?>
PREFIX sd: <http://scry.rocks/math/stdev?>
PREFIX qb: <http://purl.org/linked-data/cube#>

# Find observations that lack a value for some dimension or measure
SELECT ?obs ?dim ?imputed_val WHERE {
  ?obs a qb:Observation .
  VALUES ?dim_type { qb:DimensionProperty qb:MeasureProperty }
  ?dim a ?dim_type .
  FILTER NOT EXISTS { ?obs ?dim ?val . }
  ?other_obs ?dim ?other_val .
  # Delegate the imputation to the SCRY endpoint
  SERVICE <http://sparql.scry.rocks/> {
    SELECT ?imputed_val {
      GRAPH ?g1 { impute:v scry:input ?other_val ;
                           scry:output ?imputed_val . }
    }
  }
}
Delegation of non-standard function to secondary endpoint
Credits to Bas Stringer
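The query above delegates imputation to a secondary endpoint. As a rough idea of the kind of custom processing being delegated, here is a local sketch of mean imputation over observations. The slide does not say which procedure SCRY applies, so the use of the mean, the `impute_missing` helper and the field names are all assumptions.

```python
from statistics import mean


def impute_missing(observations, dimension):
    """Fill missing values of one dimension with the mean of the known values."""
    known = [o[dimension] for o in observations if o.get(dimension) is not None]
    fill = mean(known)
    return [
        {**o, dimension: o[dimension] if o.get(dimension) is not None else fill}
        for o in observations
    ]


# Toy observations; the second one lacks a value.
observations = [{"population": 10}, {"population": None}, {"population": 20}]
imputed = impute_missing(observations, "population")
print(imputed)
```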
27. GITHUB AS A HUB OF SPARQL QUERIES
One .rq file per SPARQL query
Good support of query curation processes
> Versioning
> Branching
> Clone-pull-push
Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable (raw.githubusercontent.com)
28. http://sparql2git.amp.ops.labs.vu.nl/
29. Cousin of BASIL in a SALAD
Same basic principle: 1 SPARQL query = 1 API operation
Automatically builds Swagger spec and UI from SPARQL
But:
External query management
Organization of SPARQL queries in the GitHub repo matches organization of the API
Thin layer – nothing stored server-side
Maps
> GitHub API
> Swagger spec
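The "1 SPARQL query = 1 API operation" principle can be sketched as a function that reads a query, collects its parameter variables, and emits a minimal Swagger-style path entry. This is not grlc's code: it assumes that parameters are the variables prefixed with `?_`, and the `query_to_operation` helper and the example query are hypothetical.

```python
import re


def query_to_operation(name, sparql):
    """Build a minimal Swagger-style path entry from one SPARQL query."""
    # Assumed convention: variables written as ?_name are API parameters.
    params = sorted(set(re.findall(r"\?_(\w+)", sparql)))
    return {
        f"/{name}": {
            "get": {
                "parameters": [
                    {"name": p, "in": "query", "required": True, "type": "string"}
                    for p in params
                ]
            }
        }
    }


# Hypothetical query stored as one .rq file in a repository.
sparql = """
SELECT ?obs WHERE {
  ?obs <http://example.org/occupation> ?_occupation ;
       <http://example.org/year> ?_year .
}
"""
operation = query_to_operation("observations", sparql)
param_names = [p["name"] for p in operation["/observations"]["get"]["parameters"]]
print(param_names)
```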
32. THE GRLC SERVICE
Assuming your repo is at https://github.com/:owner/:repo and your grlc instance at :host,
> http://:host/:owner/:repo/spec returns the JSON swagger spec
> http://:host/:owner/:repo/api-docs returns the swagger UI
> http://:host/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls operation with specified parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
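The URL patterns listed above can be written down as small helper functions. This is a sketch of the address scheme only, with no network calls; the host, owner, repository and operation names in the usage lines are made up, and the helper names are not part of grlc.

```python
from urllib.parse import urlencode


def grlc_urls(host, owner, repo):
    """Return the fixed URLs for one repository, following the patterns above."""
    base = f"http://{host}/{owner}/{repo}"
    return {
        "spec": f"{base}/spec",
        "api_docs": f"{base}/api-docs",
        "github_api": f"https://api.github.com/repos/{owner}/{repo}",
    }


def operation_url(host, owner, repo, operation, **params):
    """Return the URL that calls one operation with the given parameter values."""
    url = f"http://{host}/{owner}/{repo}/{operation}"
    return f"{url}?{urlencode(params)}" if params else url


def raw_query_url(owner, repo, filename, branch="master"):
    """Return the raw GitHub URL from which a .rq query file is dereferenced."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{filename}"


# Hypothetical host, owner and repository names.
urls = grlc_urls("grlc.example.org", "alice", "queries")
call = operation_url("grlc.example.org", "alice", "queries", "houses", year="1889")
raw = raw_query_url("alice", "queries", "houses.rq")
print(urls["spec"], call, raw, sep="\n")
```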
34. EVALUATION – USE CASES
CEDAR: Access to census data for historians
> Hides SPARQL
> Allows them to fill query parameters through forms
> Co-existence of SPARQL and non-SPARQL clients
CLARIAH - Born Under a Bad Sign: Do prenatal and early-life conditions have an impact on socioeconomic and health outcomes later in life? (uses 1891 Canada and Sweden Linked Census Data)
> Reduction of coupling between SPARQL libs and R
> Shorter R code – input stream as CSV
35. WHAT DOES A KNOWLEDGE GRAPH SOUND LIKE? (OR: MUSIC IS A GRAPH)
38. REPRESENTING MUSIC IN RDF
Besides how awesome this is…
The jam became global (i.e. de-referenceable URIs from anywhere) rather than local
> But any video stream would have been more accurate (for humans)
The jam became machine readable
> But not all of it
Digital music as Linked Data?
39. WEEKEND EXPERIMENT
Music representation format which is 100% digital
> (i.e. leaving nothing to analog signals or actual instruments)
MIDI (Musical Instrument Digital Interface)
> Universal synthesizer interface
> Roland (I. Kakehashi), Yamaha, Korg, Kawai (1981)
> Digital, fine-grained representation of musical tracks and events
> Wide range of controllers and instruments
43. LOSSLESS CONVERSION & STREAMING
midi2rdf
> Any MIDI to an RDF MIDI-compatible representation
> All MIDI resources get URIs (can link, be linked, de-referenced)
> Music is a graph!
rdf2midi
> Round trip conversion
> Lossless!
> MIDI ⊆ RDF
playrdf.sh
> RDF files can be ‘played’ without data loss
> Demo
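The round trip idea can be illustrated with a toy model: MIDI-like events are flattened into triples, one subject URI per event, and rebuilt from those triples without losing anything. This is not the midi2rdf or rdf2midi code; the `http://example.org/midi/` namespace, the event fields and both helper functions are assumptions for illustration.

```python
BASE = "http://example.org/midi/"  # hypothetical namespace, not the one used by midi2rdf


def events_to_triples(events):
    """Give every event a URI and emit one (subject, predicate, object) triple per field."""
    triples = []
    for i, event in enumerate(events):
        subject = f"{BASE}event/{i}"
        for key, value in event.items():
            triples.append((subject, f"{BASE}{key}", value))
    return triples


def triples_to_events(triples):
    """Rebuild the event list from the triples, ordered by the index in the subject URI."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p[len(BASE):]] = o
    ordered = sorted(by_subject, key=lambda s: int(s.rsplit("/", 1)[1]))
    return [by_subject[s] for s in ordered]


# Toy events standing in for a parsed MIDI track.
events = [
    {"type": "note_on", "note": 60, "velocity": 64, "tick": 0},
    {"type": "note_off", "note": 60, "velocity": 0, "tick": 480},
]
triples = events_to_triples(events)
restored = triples_to_events(triples)
print(restored == events)
```

In this toy model the conversion is lossless because every field of every event becomes exactly one triple and the event order is encoded in the subject URI.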
47. CONCLUSIONS
Semantic Web and Digital Humanities: to science, or not to science?
Data preparation = 80% of work
Linked Data based solutions
> Repeatable (non-disposable) research
> Statistical dimension & codelist enrichment
> Git for versioning and provenance
> Distributed notifications
> Universal Web APIs for modular Linked Data consuming applications
MIDI Linked Data
> Your ideas & contribs most welcome
About those statistics of Stairway to Heaven…
48. PUBLICATIONS
> Albert Meroño-Peñuela. “Humanists And Scientists: More Alike Than Different”. eHumanities Magazine, number 7, February 2016.
> Albert Meroño-Peñuela, Rinke Hoekstra. “grlc Makes GitHub Taste Like Linked Data APIs”. SALAD 2016 – Services and Applications over Linked Data APIs and Data. International workshop, ESWC 2016, May 29th, Heraklion, Crete, Greece (2016).
> Rinke Hoekstra, Albert Meroño-Peñuela, Kathrin Dentler, Auke Rijpma, Richard Zijdeman, Ivo Zandhuis. “An Ecosystem for Linked Humanities Data”. In: Proceedings of the 1st Workshop on Humanities in the SEmantic web (WHiSE 2016). ESWC 2016, May 29th, Heraklion, Crete, Greece (2016).
> Albert Meroño-Peñuela, Rinke Hoekstra. “The Song Remains the Same: Lossless Conversion and Streaming of MIDI to RDF and Back”. In: 13th Extended Semantic Web Conference (ESWC 2016), posters and demos track. May 29th – June 2nd, Heraklion, Crete, Greece (2016).
> Albert Meroño-Peñuela. “Refining Statistical Data on the Web”. Vrije Universiteit Amsterdam (2016).
> Albert Meroño-Peñuela, Christophe Guéret, Stefan Schlobach. “Linked Edit Rules: A Web Friendly Way of Checking Quality of RDF Data Cubes”. Proceedings of the 3rd International Workshop on Semantic Statistics (SemStats 2015), ISWC 2015, Bethlehem, PA, USA (2015).
> Bas Stringer, Albert Meroño-Peñuela, Antonis Loizou, Sanne Abeln, Jaap Heringa. “To SCRY Linked Data: Extending SPARQL the Easy Way”. Diversity++ workshop, ISWC 2015, Bethlehem, PA, USA (2015).
> Albert Meroño-Peñuela, Ashkan Ashkpour, Marieke van Erp, Kees Mandemakers, Leen Breure, Andrea Scharnhorst, Stefan Schlobach, Frank van Harmelen. “Semantic Technologies for Historical Research: A Survey”. Semantic Web – Interoperability, Usability, Applicability, 6(6), pp. 539–564. IOS Press (2015).
> Albert Meroño-Peñuela, Ashkan Ashkpour, Christophe Guéret, Stefan Schlobach. “CEDAR: The Dutch Historical Censuses as Linked Open Data”. Semantic Web – Interoperability, Usability, Applicability, 8(2), pp. 297–310. IOS Press (2015).
49. THANK YOU!
@albertmeronyo
DATALEGEND.NET
CLARIAH.NL