Building a knowledge graph of the Belgian War Press

Presentation by Brecht Van de Vyvere at Open Belgium 2017.

Published in: Data & Analytics

  1. Let’s talk Linked Data session • Open Belgium 2017 • Brecht Van de Vyvere | @brechtvdv • Building a knowledge graph of the Belgian War Press
  2. Can I easily link historic newspapers with other data sources?
  3. Agenda • hetarchief.be • Knowledge graph • 5-star Open Data plan • Adding context • Linked Data as a Service • Future work
  4. Dataset
  5. hetarchief.be “News from the Great War” • Newspapers 1914-1918 • 10+ content partners • Early 2015: site launched • Functionality • Search by keyword • Map with place of publication • Collections • 1k titles, 55k newspapers, 300k pages
  6. Human-readable interface
  7. Policy • 1. Metadata: no restrictions → CC0 • 2. OCR, documents: pictures, short stories… • Uncertain copyright status • No license or “terms of use” that minimises restrictions for re-use • Disclaimer
  8. hetarchief.be • One of the biggest databases online • No raw data? • Title • Description → OCR from ALTO • Date created • Owner • IDs (carrier, Abraham, VIAA) • URL image
  9. 5-star Open Data plan
  10. First 3 stars • Open license • Structured • Non-proprietary • Diagram labels: VIAA DB, VIAA API, IDs, Metadata, NodeJS → github.com/viaacode/hetarchief2lod, Transform, CSV
  11. Step 4: URIs for everything • Map VIAA’s internal ID to a URI: http://data.viaa.be/noid/{id} • Use ontologies • BBC → Creative Work Ontology • schema.org • Hydra → collections
  12. Knowledge graph • Semantic network • Concepts • Relations • Linked Data • URIs • RDF
  13. Example triples (diagram) • <http://data.viaa.be/noid/example> <http://www.bbc.co.uk/ontologies/creativework#tag> <http://dbpedia.org/page/Albert_I_of_Belgium> • <http://dbpedia.org/page/Albert_I_of_Belgium> rdf:type <http://xmlns.com/foaf/0.1/Person>
  14. 5-star: link to other sources • ABRAHAM: catalogue of newspapers in Belgium • <http://data.viaa.be/noid/tm71v5c76q_191510XX> owl:sameAs <http://anet.be/record/abraham/opacbnc/c:bnc:26>
  15. Basic information triples • Subject: http://data.viaa.be/noid/tm71v5c76q_191510XX • rdf:type cwork:CreativeWork • cwork:title “L’illustration” • cwork:dateCreated “1915-10-XX” • cwork:content “On dit que c'est notre imagination et….” • schema:copyrightHolder UGENT • schema:inLanguage en
  16. Hydra collection (diagram) • Nodes: http://data.viaa.be/noid/tm71v5c76q, http://data.viaa.be/noid/tm71v5c76q_191804xx_0001, http://data.viaa.be/noid/tm71v5c76q_191804xx_0002, http://data.viaa.be/noid/tm71v5c76q_191804xx_0003 • Edge labels: memberOf, first, last, previous/next, first/last • totalItems: 3
  17. Problems • Node.js limited to 1.7 GB memory • OCR too big • Turtle file: 475 MB max (32k newspapers) • Compressed to HDT: 388 MB • Basic triples with HDT: 54k newspapers → 8.2 MB
  18. Adding context
  19. Connect with other data sources • Cf. Europeana, delpher.nl, lab.kbresearch.nl
  20. Stanford NER • 4 types: Location, Organisation, Person and Other • Train your model: golden corpus • Write code that fits your needs • SPARQL query that matches strings • REPERTOIRE des COMMUNES et des PRINCIPAUX HAMEAUX de la ci-devant Belgique • Difficult to find cultural APIs (cf. InFlandersField list of names, Abraham catalogue)
  21. DBpedia Spotlight • Proof of concept • Models for all languages (nl, en, fr, de) • Diagram labels: NL/FR/EN/DE trained model, DBpedia matcher, Stanford NER
  22. Results? • Filter on OCR quality, e.g. <90% confidence in ALTO • Wrong time period, e.g. GeoNames • Standard models, should be trained • Use DBpedia knowledge later to filter “impossible” tags
  23. DBpedia Spotlight • Running your own endpoint is easy: java -Xmx8G -jar dbpedia-spotlight-0.7.1.jar nl http://localhost:2223/nl/rest • Or with Docker: docker build -f Dockerfile -t dutch_spotlight . • docker run -i -p 2223:80 dutch_spotlight spotlight.sh
  24. Linked Data as a Service • Allow federated queries • Low server cost • Be reliable • Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web
  25. Linked Data Fragments querying • VIAA is part of the family! • Diagram: your browser runs a client-side algorithm and GETs fragments from http://data.viaa.be/ldf, https://query.wikidata.org/bigdata/ldf, http://data.linkeddatafragments.org/linkedgeodata and http://data.linkeddatafragments.org/dbpedia2014
  26. Demo time!
  27. Demo • Retrieve all newspaper titles: SELECT DISTINCT ?title WHERE { ?paper <http://www.bbc.co.uk/ontologies/creativework#title> ?title }
  28. Demo • Retrieve more info from corresponding DBpedia URI: SELECT ?label ?comment WHERE { <http://data.viaa.be/noid/2z12n51476_19141120_0001> <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag . ?db owl:sameAs ?tag . ?db rdfs:label ?label . ?db rdfs:comment ?comment }
  29. Battle of the Somme • Pages that mention military leaders from the Battle of the Somme, plus thumbnail: SELECT ?paper ?o ?thumbnail WHERE { <http://dbpedia.org/resource/Battle_of_the_Somme> <http://dbpedia.org/ontology/commander> ?o . ?paper <http://www.bbc.co.uk/ontologies/creativework#tag> ?ctag . ?o owl:sameAs ?ctag . ?o <http://dbpedia.org/ontology/thumbnail> ?thumbnail . }
  30. Front painters • Semi-automatic generation of collections, e.g. about front painters: SELECT ?newspaper ?artist ?tag ?hetarchief WHERE { ?artist dc:subject <http://dbpedia.org/resource/Category:Belgian_war_artists> . ?artist owl:sameAs ?tag . ?newspaper <http://www.bbc.co.uk/ontologies/creativework#tag> ?tag . ?newspaper <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?hetarchief }
  31. Conclusion • Extra search method for our researchers • NER versus OCR: enhanced findability • Adding extra information (cf. Abraham) requires effort; we need more TPF interfaces
  32. Future work • Dereferenceable URIs • http://data.viaa.be/noid/{id} • Content negotiation • HTML • JSON • RDF • Save location with OLR • Suggestions are welcome!
  33. Q&A • Brecht Van de Vyvere | @brechtvdv
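
The "URIs for everything" and "Basic information triples" slides can be illustrated with a small sketch. This is not the code from github.com/viaacode/hetarchief2lod (which the slides describe as a NodeJS transform); it is a minimal Python illustration that assumes a metadata record arrives as a dictionary. The keys `id`, `title`, `date_created` and `same_as`, and the helper names `mint_uri`, `literal` and `basic_triples`, are hypothetical. The URI pattern, the `creativework#` namespace and the example values are taken from the slides. The sketch also assumes the internal ID is already URI-safe.

```python
from typing import Dict, List

# Namespaces as they appear on the slides.
BASE = "http://data.viaa.be/noid/"
CWORK = "http://www.bbc.co.uk/ontologies/creativework#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def mint_uri(viaa_id: str) -> str:
    """Map an internal ID to a URI (assumes the ID is URI-safe)."""
    return BASE + viaa_id


def literal(value: str) -> str:
    """Serialise a plain string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def basic_triples(record: Dict[str, str]) -> List[str]:
    """Build N-Triples lines for one record (hypothetical record keys)."""
    subject = f"<{mint_uri(record['id'])}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{CWORK}CreativeWork> .",
        f"{subject} <{CWORK}title> {literal(record['title'])} .",
        f"{subject} <{CWORK}dateCreated> {literal(record['date_created'])} .",
    ]
    # Optional fifth-star link to an external catalogue such as Abraham.
    if record.get("same_as"):
        triples.append(f"{subject} <{OWL_SAMEAS}> <{record['same_as']}> .")
    return triples


if __name__ == "__main__":
    example = {
        "id": "tm71v5c76q_191510XX",
        "title": "L'illustration",
        "date_created": "1915-10-XX",
        "same_as": "http://anet.be/record/abraham/opacbnc/c:bnc:26",
    }
    for line in basic_triples(example):
        print(line)
```

Leaving the OCR text (`cwork:content`) out of this sketch mirrors the observation on the "Problems" slide that the basic triples stay small while the OCR dominates the file size.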