The Statistics of Stairway to Heaven: A Semantic Story About Digital Humanities

THE STATISTICS OF
STAIRWAY TO HEAVEN
Albert Meroño-Peñuela (and many others), KMi, March 9th 2017

REFINING STATISTICAL DATA ON THE WEB

Vrije Universiteit Amsterdam
3
ME
• Postdoc researcher at VU University Amsterdam,
Knowledge Representation & Reasoning
• Interfaces between the Digital Humanities and the
Semantic Web
• Enabling intelligent data preprocessing and universal
access to Linked Data
• Ontologies, Linked Data, Data Integration, APIs,
repeatability, provenance

4
WHY THIS TALK?
• What is Digital Humanities?
• “to study human culture in a more scientific way”
• Albert: “doing humanities is exactly equal to doing
science”
• Repeatability
• Hypothesis testing
• Pragmatic, clean, idealized
• Jacky: “doing humanities is completely different to
doing science”
• Interpretative approach, relativistic
• Give value to argumentation and vagueness instead of truth
• Focus on the questions we do ask
• https://storify.com/ingorohlfing/overly-honest-methods-in-science
• Is doing humanities exactly equal to doing science?

QUANTITATIVE HISTORY
ON THE SEMANTIC WEB

6
THE (HISTORICAL) KNOWLEDGE DISCOVERY
PROCESS
Volume / Variety

7
DATA PREPARATION
Present data = high volume
Historical data = high variety
• Multiple legacy (tabular) formats
• Diverse identity, unity, rigidity and dependence
Preparing them to gain knowledge is expensive
• Manual data munging
• Hardly reproducible

8
DATA PREPARATION
This ‘data preparation’ step can take 60% to 80% of the total work

We do this repeatedly for the same datasets!

10
CEDAR / CLARIAH
?
1795
1830
1889
1930
1971

11
TOWARDS 5-STAR HISTORICAL STATISTICAL DATA
>4 years ago
4 years ago

12
DATA MODELS: CSVW + RDF DATA CUBE
Semi-automatic
Generic
Domain independent
Microdata = CSVW [COW]
Macrodata = RDF Data Cube [QBer] [TabLinker]
Credits to Rinke Hoekstra
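
To make the macrodata side concrete, here is a minimal sketch of turning tabular census rows into RDF Data Cube observations, serialized as N-Triples. It is not the output of COW, QBer or TabLinker; the `http://example.org/census/` namespace, the column names and the sample rows are all assumptions made up for illustration.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"
# Hypothetical namespace, used for this example only.
EX = "http://example.org/census/"


def row_to_triples(row_id, row):
    """Turn one CSV row into N-Triples lines describing a qb:Observation."""
    obs = f"<{EX}obs/{row_id}>"
    lines = [f"{obs} <{RDF_TYPE}> <{QB}Observation> ."]
    # Dimensions: cell values become URIs so they can later link to code lists.
    for dim in ("municipality", "occupation"):
        lines.append(f"{obs} <{EX}dim/{dim}> <{EX}code/{dim}/{row[dim]}> .")
    # Measure: the counted population as a typed literal.
    count = int(row["population"])
    lines.append(f'{obs} <{EX}measure/population> "{count}"^^<{XSD_INT}> .')
    return lines


def csv_to_ntriples(text):
    """Convert a whole CSV document (with header row) to N-Triples lines."""
    reader = csv.DictReader(io.StringIO(text))
    out = []
    for i, row in enumerate(reader):
        out.extend(row_to_triples(i, row))
    return out


# Made-up sample data.
SAMPLE = "municipality,occupation,population\nAmsterdam,baker,120\nUtrecht,smith,45\n"
triples = csv_to_ntriples(SAMPLE)
```

Each row yields one observation with two dimension triples and one measure triple; printing `"\n".join(triples)` gives a file any N-Triples parser should accept.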

LSD DIMENSIONS
http://lsd-dimensions.org/
Index of statistical dimensions and associated concept schemes on
the Web

New code lists
• HISCO
http://historyofwork.iisg.nl/ Credits to Richard Zijdeman

New code lists
• Gemeentegeschiedenis.nl
http://www.gemeentegeschiedenis.nl/ Credits to Ivo Zandhuis

New code lists
http://licr.io/ Credits to Ashkan Ashkpour

Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies

3. Changing schemas over time
Do Linked Data vocabularies evolve in a predictable way?
1. Performant models can be learned from past
vocabulary versions
2. Can be used to pinpoint resources susceptible
to change, radical changes
3. Their predictive power depends on the
smoothness of changes between versions
4. 39.8% of the LOD vocabularies are highly
predictable
Snapshot data is insufficient
Fine-grained, commit-based stories (ALIGNED,
PERICLES, later in this presentation)

4. Dubious data quality
Linked Edit Rules enable sharing knowledge to assess quality of
Linked Statistical Data
How to communicate the results of LER for
further processing?
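
As a rough illustration of what an edit rule checks, the sketch below applies one consistency constraint (men + women = total) to a few observations and packs the violations into a plain report. This is not the Linked Edit Rules format itself; the rule, the field names and the sample observations are assumptions chosen for the example.

```python
def check_total_rule(observations):
    """Return the ids of observations where men + women != total."""
    violations = []
    for obs in observations:
        if obs["men"] + obs["women"] != obs["total"]:
            violations.append(obs["id"])
    return violations


def build_report(rule_name, observations):
    """Summarize one rule check as a JSON-serializable dict."""
    failed = check_total_rule(observations)
    return {
        "rule": rule_name,
        "checked": len(observations),
        "violations": failed,
        "passed": not failed,
    }


# Made-up observations; obs2 is inconsistent on purpose.
OBSERVATIONS = [
    {"id": "obs1", "men": 10, "women": 12, "total": 22},
    {"id": "obs2", "men": 5, "women": 7, "total": 13},
    {"id": "obs3", "men": 0, "women": 3, "total": 3},
]
report = build_report("men-plus-women-equals-total", OBSERVATIONS)
```

A report like this is the kind of result that still has to be communicated to other systems for further processing.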

23
A BASIC WEB SYSTEMS COMMUNICATION TOOLKIT
1. Endpoint location is volatile
Names encapsulate semantics of operations → Should be
meaningless, just as email addresses
HTTP : http://example.org/canihasdata
2. Consensus on data semantics is necessary
Simple object exchange format + 15 years of Web ontology
development to semantically describe data
JSON-LD : [{ "@id": "eg:Albert",
"rdf:type": [{ "@id": "foaf:Person" }]}]

24
LINKED DATA NOTIFICATIONS
https://www.w3.org/TR/ldn/
Thanks to Sarven Capadisli
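
A minimal sketch of the sender side of Linked Data Notifications, assuming the target advertises its inbox in an HTTP `Link` header with the relation `http://www.w3.org/ns/ldp#inbox`. The inbox URL is a made-up example, the payload reuses the small JSON-LD object from the toolkit slide, and the `Link` parsing is deliberately simplified. The request is only built, not sent.

```python
import json
import re
import urllib.request

LDP_INBOX = "http://www.w3.org/ns/ldp#inbox"


def find_inbox(link_header):
    """Return the inbox URL advertised in a Link header value, or None.

    Simplified parsing: expects entries shaped like <url>; rel="...".
    """
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="([^"]*)"', link_header):
        url, rels = match.groups()
        if LDP_INBOX in rels.split():
            return url
    return None


def build_notification_request(inbox_url, payload):
    """Build (but do not send) a POST of a JSON-LD payload to an inbox."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        inbox_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/ld+json"},
    )


# Hypothetical Link header value, as a target resource might return it.
link = '<http://example.org/inbox/>; rel="http://www.w3.org/ns/ldp#inbox"'
payload = [{"@id": "eg:Albert", "rdf:type": [{"@id": "foaf:Person"}]}]
request = build_notification_request(find_inbox(link), payload)
```

Passing `request` to `urllib.request.urlopen` would deliver the notification to a real inbox; a receiver such as pyldn would then be expected to store it and list it for consumers.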

25
IMPLEMENTATIONS
http://pyldn.amp.ops.labs.vu.nl/
https://github.com/albertmeronyo/pyldn/

http://scry.rocks/
• Custom data processing in
SPARQL needed in multiple
domains
• SPARQL cannot be extended
in a standard-compliant way
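PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX : <http://scry.rocks/example/>
PREFIX scry: <http://scry.rocks/>
PREFIX impute: <http://scry.rocks/math/impute?>
PREFIX mean: <http://scry.rocks/math/mean?>
PREFIX sd: <http://scry.rocks/math/stdev?>
# Find observations lacking a value for some dimension or measure,
# and ask the SCRY endpoint for an imputed value.
SELECT ?obs ?dim ?imputed_val WHERE {
  ?obs a qb:Observation .
  # VALUES replaces the original "a qb:DimensionProperty|qb:MeasureProperty",
  # since a path alternative is not allowed in object position.
  VALUES ?prop_type { qb:DimensionProperty qb:MeasureProperty }
  ?dim a ?prop_type .
  FILTER NOT EXISTS { ?obs ?dim ?val . }
  ?other_obs ?dim ?other_val .
  SERVICE <http://sparql.scry.rocks/> {
    SELECT ?imputed_val {
      GRAPH ?g1 { impute:v scry:input ?other_val ;
                           scry:output ?imputed_val . }
    }
  }
}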
Delegation of non-standard function to secondary endpoint
Credits to Bas Stringer

• One .rq file per SPARQL query
• Good support of query curation processes
> Versioning
> Branching
> Clone-pull-push
• Web-friendly features!
> One URI per query
> Uniquely identifiable
> De-referenceable (raw.githubusercontent.com)
GITHUB AS A HUB OF
SPARQL QUERIES

http://sparql2git.amp.ops.labs.vu.nl/

• Cousin of BASIL in a SALAD
• Same basic principle: 1 SPARQL query = 1 API operation
• Automatically builds Swagger spec and UI from SPARQL
But:
• External query management
• Organization of SPARQL queries in the GitHub repo matches organization of the API
• Thin layer – nothing stored server-side
• Maps
> GitHub API
> Swagger spec

30
MAPPING GITHUB AND SWAGGER

31
SPARQL DECORATOR SYNTAX

32
THE GRLC SERVICE
• Assuming your repo is at https://github.com/:owner/:repo and your grlc instance at :host,
> http://:host/:owner/:repo/spec returns the JSON Swagger spec
> http://:host/:owner/:repo/api-docs returns the Swagger UI
> http://:host/:owner/:repo/:operation?p_1=v_1...p_n=v_n calls operation with the specified parameter values
> Uses BASIL’s SPARQL variable name convention for query parameters
• Sends requests to
> https://api.github.com/repos/:owner/:repo to look for SPARQL queries and their decorators
> https://raw.githubusercontent.com/:owner/:repo/master/file.rq to dereference queries, get the SPARQL, and parse it
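
The URL patterns above can be sketched as a few helper functions. This is only an illustration of the routing scheme described on the slide, not grlc's own code; the host, owner, repo, operation and parameter names in the usage lines are hypothetical.

```python
from urllib.parse import urlencode


def grlc_urls(host, owner, repo):
    """URLs a grlc instance exposes for one GitHub repository."""
    base = f"http://{host}/{owner}/{repo}"
    return {"spec": f"{base}/spec", "api_docs": f"{base}/api-docs"}


def operation_url(host, owner, repo, operation, **params):
    """URL that calls one operation (one .rq file) with parameter values."""
    url = f"http://{host}/{owner}/{repo}/{operation}"
    if params:
        url += "?" + urlencode(params)
    return url


def github_sources(owner, repo, filename):
    """GitHub URLs consulted to list queries and to fetch one query file."""
    return {
        "listing": f"https://api.github.com/repos/{owner}/{repo}",
        "raw_query": (
            f"https://raw.githubusercontent.com/{owner}/{repo}/master/{filename}"
        ),
    }


# Hypothetical host, owner, repo and operation.
urls = grlc_urls("grlc.example.org", "someone", "queries")
call = operation_url("grlc.example.org", "someone", "queries", "houses", year="1889")
sources = github_sources("someone", "queries", "houses.rq")
```

Because nothing is stored server-side, every one of these URLs can be derived from the repository coordinates alone.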

33
SPICED-UP SWAGGER UI

34
EVALUATION – USE CASES
• CEDAR: Access to census data for historians
> Hides SPARQL
> Allows them to fill query parameters through forms
> Co-existence of SPARQL and non-SPARQL clients
• CLARIAH - Born Under a Bad Sign: Do prenatal and early-life conditions have an impact on socioeconomic and health outcomes later in life? (uses 1891 Canada and Sweden Linked Census Data)
> Reduction of coupling between SPARQL libs and R
> Shorter R code – input stream as CSV

WHAT DOES A KNOWLEDGE
GRAPH SOUND LIKE?
(OR: MUSIC IS A GRAPH)

• Maybe not that different…
36
SEMANTIC WEB AND THE HUMANITIES

37
ISWC 2013 JAM SESSION
Jam’s “metadata”

Besides how awesome this is…
• The jam became global (i.e. de-referenceable URIs from anywhere) rather than local
> But any video stream would have been more accurate (for humans)
• The jam became machine readable
> But not all of it
• Digital music as Linked Data?
38
REPRESENTING MUSIC IN RDF

• Music representation format which is 100% digital
> (i.e. leaving nothing to analog signals → actual instruments)
• MIDI (Musical Instrument Digital Interface)
> Universal synthesizer interface
> Roland (I. Kakehashi), Yamaha, Korg, Kawai (1981)
> Digital, fine-grained representation of musical tracks and events
> Wide range of controllers and instruments
39
WEEKEND EXPERIMENT

40
MIDI

41
MIDI2RDF
https://github.com/albertmeronyo/midi2rdf

• Music representation format which is
> 100% digital (i.e. leaving nothing to analog signals)
• MIDI (Musical Instrument Digital Interface)
> Universal synthesizer interface
> Roland (I. Kakehashi), Yamaha, Korg, Kawai (1981)
> Digital, fine-grained representation of musical events
> Wide range of controllers and instruments
42
WEEKEND EXPERIMENT

• midi2rdf
> Any MIDI to an RDF MIDI-compatible representation
> All MIDI resources get URIs (can link, be linked, de-referenced)
> Music is a graph!
• rdf2midi
> Round trip conversion
> Lossless!
> MIDI ⊆ RDF
• playrdf.sh
> RDF files can be ‘played’ without data loss
> Demo
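
The round-trip idea can be sketched with plain Python data structures: every event gets its own URI, its fields become triples, and the events can be rebuilt from the triples in any order. This is not the midi2rdf implementation or its vocabulary; the `http://example.org/midi/` namespace, the `index` property and the event fields are assumptions for illustration.

```python
# Hypothetical namespace, used for this example only.
MID = "http://example.org/midi/"


def events_to_triples(track_id, events):
    """Represent a list of MIDI-like events as (subject, predicate, object)."""
    triples = []
    for i, event in enumerate(events):
        subject = f"{MID}track/{track_id}/event/{i}"
        # Keep the position explicitly: a graph has no inherent order.
        triples.append((subject, MID + "index", i))
        for key, value in event.items():
            triples.append((subject, MID + key, value))
    return triples


def triples_to_events(triples):
    """Rebuild the ordered event list from the triples."""
    by_subject = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, {})[predicate[len(MID):]] = obj
    ordered = sorted(by_subject.values(), key=lambda e: e["index"])
    return [{k: v for k, v in e.items() if k != "index"} for e in ordered]


# Made-up events: one middle C, a quarter note long at 480 ticks per beat.
EVENTS = [
    {"type": "note_on", "tick": 0, "pitch": 60, "velocity": 90},
    {"type": "note_off", "tick": 480, "pitch": 60, "velocity": 0},
]
graph = events_to_triples("t0", EVENTS)
restored = triples_to_events(graph)
```

Storing the event index as a triple is the design choice that keeps the conversion lossless, since the order of events matters in music but not in a set of triples.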
43
LOSSLESS CONVERSION & STREAMING

http://midi-ld.github.io/
45
MIDI LINKED DATA
6B
triples!!!!!11one

46
MIDI LINKED DATA

47
CONCLUSIONS
• Semantic Web and Digital Humanities: to science, or not to science?
• Data preparation = up to 80% of work
• Linked Data based solutions
> Repeatable (non-disposable) research
> Statistical dimension & codelist enrichment
> Git for versioning and provenance
> Distributed notifications
> Universal Web APIs for modular Linked Data consuming applications
• MIDI Linked Data
> Your ideas & contribs most welcome
• About those statistics of Stairway to Heaven…

> Albert Meroño-Peñuela. “Humanists And Scientists: More Alike Than Different”. eHumanities Magazine,
number 7, February 2016 (HTML)
> Albert Meroño-Peñuela, Rinke Hoekstra. “grlc Makes GitHub Taste Like Linked Data APIs”. SALAD 2016 —
Services and Applications over Linked Data APIs and Data. International workshop, ESWC 2016, May 29th,
Heraklion, Crete, Greece (2016). (PDF)
> Rinke Hoekstra, Albert Meroño-Peñuela, Kathrin Dentler, Auke Rijpma, Richard Zijdeman, Ivo Zandhuis. “An
Ecosystem for Linked Humanities Data”. In: Proceedings of the 1st Workshop on Humanities in the SEmantic
web (WHiSE 2016). ESWC 2016, May 29th, Heraklion, Crete, Greece (2016). (PDF)
> Albert Meroño-Peñuela, Rinke Hoekstra. “The Song Remains the Same: Lossless Conversion and Streaming of
MIDI to RDF and Back”. In: 13th Extended Semantic Web Conference (ESWC 2016), posters and demos track.
May 29th — June 2nd, Heraklion, Crete, Greece (2016). (PDF)
> Albert Meroño-Peñuela. “Refining Statistical Data on the Web”. Vrije Universiteit Amsterdam (2016) (Amazon)
(VU-DARE)
> Albert Meroño-Peñuela, Christophe Guéret, Stefan Schlobach. “Linked Edit Rules: A Web Friendly Way of
Checking Quality of RDF Data Cubes”. Proceedings of the 3rd International Workshop on Semantic Statistics
(SemStats 2015), ISWC 2015, Bethlehem, PA, USA (2015). (PDF)
> Bas Stringer, Albert Meroño-Peñuela, Antonis Loizou, Sanne Abeln, Jaap Heringa. “To SCRY Linked Data:
Extending SPARQL the Easy Way”. Diversity++ workshop, ISWC 2015, Bethlehem, PA, USA (2015). (PDF)
> Albert Meroño-Peñuela, Ashkan Ashkpour, Marieke van Erp, Kees Mandemakers, Leen Breure, Andrea
Scharnhorst, Stefan Schlobach, Frank van Harmelen. “Semantic Technologies for Historical Research: A
Survey”. Semantic Web — Interoperability, Usability, Applicability, 6(6), pp. 539–564. IOS Press (2015).
> Albert Meroño-Peñuela, Ashkan Ashkpour, Christophe Guéret, Stefan Schlobach. “CEDAR: The Dutch
Historical Censuses as Linked Open Data”. Semantic Web — Interoperability, Usability, Applicability, 8(2), pp.
297–310. IOS Press (2015).
48
PUBLICATIONS

THANK YOU!
@albertmeronyo
DATALEGEND.NET
CLARIAH.NL
49
