LDIF Lightning Talk

LDIF translates heterogeneous Linked Data from the Web into a clean, local target representation while keeping track of data provenance.

  1. LDIF: Linked Data Integration Framework
  2. LINKED DATA CHALLENGES
     • Data sources that overlap in content may:
       • use a wide range of different RDF vocabularies
       • use different identifiers for the same real-world entity
       • provide conflicting values for the same properties
     • Implications:
       • Queries are usually hand-crafted against individual sources (no different than an API)
       • Improvised or manual merging of entities
     • Integrating public datasets with internal databases poses the same problems
  3. LDIF
     • LDIF homogenizes Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance:
       1. Collect data: managed download and update
       2. Translate data into a single target vocabulary
       3. Resolve identifier aliases into local target URIs
       4. Cleanse data, resolving the conflicting values
       5. Output
     • Open source (Apache License, Version 2.0)
     • Collaboration between Freie Universität Berlin and mes|semantics
  4. LDIF PIPELINE: 1 Collect data
     • Pipeline steps: 1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output
     • Supported data sources:
       • RDF dumps (various formats)
       • SPARQL endpoints
       • Crawling Linked Data
  5. LDIF PIPELINE: 2 Translate data
     • Sources use a wide range of different RDF vocabularies
     • [Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City]
     • Mappings expressed in RDF (Turtle)
     • Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
     • Complex mappings with SPARQL expressivity
     • Transformation functions
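The class mapping shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea of translating several source vocabularies into one target class; it is not R2R syntax, and the `translate` function, the tuple representation of triples and the `ex:` prefix are assumptions made for the example.

```python
# Illustrative only: a plain lookup table, not the R2R mapping language.
TARGET_CLASS = "local:City"
SOURCE_CLASSES = {"dbpedia-owl:City", "schema:Place", "fb:location.citytown"}
RDF_TYPE = "rdf:type"


def translate(triples):
    """Rewrite rdf:type statements that use a source class to the target class."""
    out = []
    for s, p, o in triples:
        if p == RDF_TYPE and o in SOURCE_CLASSES:
            o = TARGET_CLASS
        out.append((s, p, o))
    return out


# Hypothetical input triples using prefixed names as plain strings.
source = [
    ("ex:Berlin", "rdf:type", "dbpedia-owl:City"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
]
translated = translate(source)
```

Only the type statement is rewritten; triples that do not match a mapping pass through unchanged.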
  6. LDIF PIPELINE: 3 Resolve identities
     • Sources use different identifiers for the same entity
     • [Figure: several places named Berlin (Berlin, Germany; Berlin, CT; Berlin, MD; Berlin, NJ; Berlin, MA); Silk links the entries located at 52° 31′ N, 13° 24′ E as the same Berlin]
     • Profiles expressed in XML
     • Supports various comparators and transformations
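The matching idea in the figure (same name, nearby coordinates) might be sketched as follows. This is a simplified stand-in for a link specification, not Silk's XML profile format or its comparators; the record layout, the `same_entity` and `resolve` helpers, the tolerance, the source URIs and the approximate coordinates are all assumptions for illustration. For brevity the sketch maps aliases to the first URI seen rather than minting a local target URI.

```python
def same_entity(a, b, tolerance_deg=0.1):
    """Treat two records as the same place if labels match and coordinates are close."""
    same_label = a["label"].strip().lower() == b["label"].strip().lower()
    close = (
        abs(a["lat"] - b["lat"]) <= tolerance_deg
        and abs(a["lon"] - b["lon"]) <= tolerance_deg
    )
    return same_label and close


def resolve(records):
    """Map every source URI to the URI of the first matching record seen."""
    representatives = []
    alias = {}
    for rec in records:
        for rep in representatives:
            if same_entity(rec, rep):
                alias[rec["uri"]] = rep["uri"]
                break
        else:
            # No match found: this record becomes its own representative.
            representatives.append(rec)
            alias[rec["uri"]] = rec["uri"]
    return alias


# Hypothetical records; coordinates are approximate and for illustration only.
records = [
    {"uri": "srcA:Berlin", "label": "Berlin", "lat": 52.5167, "lon": 13.4},
    {"uri": "srcB:berlin-de", "label": "berlin", "lat": 52.52, "lon": 13.405},
    {"uri": "srcC:Berlin_CT", "label": "Berlin", "lat": 41.62, "lon": -72.75},
]
alias = resolve(records)
```

The label alone is not enough to tell the Berlins apart, which is why the sketch combines it with a second comparator on the coordinates.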
  7. LDIF PIPELINE: 4 Cleanse data
     • Sources provide different values for the same property
     • [Figure: one source states "Berlin population is 3.4M", another "Berlin population is 3.5M", each with a star quality rating; Sieve outputs "Berlin population is 3.5M"]
     • Profiles expressed in XML
     • Supports various quality assessment policies and conflict resolution methods
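One possible conflict resolution method, keeping the value from the best-rated source, can be sketched as below. It is not Sieve's configuration format or API; the `fuse` function, the graph names and the numeric scores are hypothetical, chosen so that the result matches the 3.5M output shown in the figure.

```python
def fuse(candidates):
    """Return the value whose source has the highest quality score."""
    return max(candidates, key=lambda c: c["score"])["value"]


# Hypothetical quality scores standing in for the star ratings on the slide.
population = [
    {"value": 3_400_000, "graph": "srcA", "score": 2},
    {"value": 3_500_000, "graph": "srcB", "score": 3},
]
fused = fuse(population)
```

Other policies, such as averaging or preferring the most recent value, would replace only the body of `fuse`.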
  8. LDIF PIPELINE: 5 Output
     • Output options:
       • N-Quads
       • N-Triples
       • SPARQL Update Stream
     • Provenance tracking using Named Graphs
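To show how a named graph can carry provenance in N-Quads output, here is a minimal serialization sketch. It is a simplification, not LDIF's writer: the `term` and `to_nquad` helpers and the example.org URIs are assumptions, and only plain string literals with basic escaping are handled.

```python
def term(value):
    """Format an IRI as <...>; anything else as a plain string literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_nquad(s, p, o, graph):
    """Serialize one statement; the fourth term names the graph it came from."""
    return f"{term(s)} {term(p)} {term(o)} <{graph}> ."


# Hypothetical statement and source graph.
line = to_nquad(
    "http://example.org/Berlin",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "Berlin",
    "http://example.org/graph/sourceA",
)
```

Because every statement keeps its graph name, a consumer can still tell which source contributed a value after integration.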
  9. LDIF ARCHITECTURE
     [Figure: layered architecture diagram]
     • Application Layer: Application Code; SPARQL or RDF API
     • Data Access, Integration and Storage Layer (LDIF): Web Data Access Module, Data Translation Module, Identity Resolution Module, Data Quality and Fusion Module, Integrated Web Data
     • Web of Data: HTTP
     • Publication Layer: RDFa, RDF/XML, LD Wrappers; Database A, Database B, CMS
  10. LDIF VERSIONS
      • In-memory
        • keeps all intermediate results in memory
        • fast, but scalability limited by local RAM
      • RDF Store (TDB)
        • stores intermediate results in a Jena TDB RDF store
        • can process more data than In-memory but doesn't scale
      • Cluster (Hadoop)
        • scales by parallelizing work across multiple machines using Hadoop
        • can process a virtually unlimited amount of data
  11. THANK YOU
      • Website: http://ldif.wbsg.de
      • Google group: http://bit.ly/ldifgroup
      • Supported in part by
        • Vulcan Inc. as part of its Project Halo
        • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943)