SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE
SAN FRANCISCO, JUNE 5, 2012




                   HOW TO
            INTEGRATE LINKED DATA
            INTO YOUR APPLICATION

                                                                          LDIF Team:
                                             Andreas Schultz, Freie Universität Berlin
                                                     Andrea Matteini, mes|semantics
                                                Robert Isele, Freie Universität Berlin
                                            Pablo N. Mendes, Freie Universität Berlin
                                                     Christian Becker, mes|semantics
                                              Christian Bizer, Freie Universität Berlin
                                                            With contributions by:
               Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.


                           WHAT IS LINKED DATA?

•   Raw data (RDF)
•   Accessible on the web
•   Data can link to other data sources

[Figure: five data sources, labeled A to E, each containing Things, with data links connecting the sources.]




•   Benefits: Ease of access and re-use; enables discovery
•   One API for all data sources?
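
The three properties above (raw RDF data, web-accessible, linked across sources) can be illustrated with a minimal sketch. It is not taken from the slides: the example.org URIs, the `SOURCES` table, and the `describe` and `follow_links` helpers are all hypothetical, and a plain in-memory dictionary stands in for real HTTP lookups. Only the `rdfs:label` and `owl:sameAs` property URIs are real vocabulary terms.

```python
# Real vocabulary terms used as predicates.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical data sources, keyed by the base URI they publish under.
# Each source holds RDF-style (subject, predicate, object) triples.
SOURCES = {
    "http://a.example.org/": [
        ("http://a.example.org/thing/1", RDFS_LABEL, "Example Thing"),
        # A data link: this Thing in source A points to a Thing in source B.
        ("http://a.example.org/thing/1", OWL_SAME_AS, "http://b.example.org/thing/7"),
    ],
    "http://b.example.org/": [
        ("http://b.example.org/thing/7", RDFS_LABEL, "Same Thing, described by B"),
    ],
}


def describe(uri):
    """Return all triples about `uri` from the source that publishes it.

    Stands in for dereferencing the URI over HTTP (an assumption of this sketch).
    """
    for base, triples in SOURCES.items():
        if uri.startswith(base):
            return [t for t in triples if t[0] == uri]
    return []


def follow_links(uri, link_predicate=OWL_SAME_AS):
    """Collect triples about `uri` and about every Thing it links to."""
    seen, queue, result = set(), [uri], []
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        for s, p, o in describe(current):
            result.append((s, p, o))
            if p == link_predicate:
                queue.append(o)  # cross into the linked data source
    return result


if __name__ == "__main__":
    for triple in follow_links("http://a.example.org/thing/1"):
        print(triple)
```

The same `describe` call works for every source, which is the sense in which the "one API" question arises: a client only needs URIs and a uniform way to look them up.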


                          LINKING OPEN DATA CLOUD
[Figure: the Linking Open Data cloud diagram. Datasets shown include DBpedia, Freebase, GeoNames, MusicBrainz, BBC Music, Last.FM, New York Times, DBLP, LCSH, VIAF, Europeana, and several data.gov.uk datasets. Legend categories include Media, Geographic, Publications, User-generated content, and Government.]
                                                             Exams                                                                        Euro-                                                                                    dbpedia                                                     data-                                                     (RKB
                                                                           London                                                                                                                                                                                        TCM                                                                                                    ACM
                                                                                                                                           stat                                                                                      lite                                                      open-                                                   Explorer)                                                                                            NVD
                                                                           Gazette                                                        (FUB)                                                                                                                          Gene                                                                                                                                                       IBM
                                                  Traffic                                                                                                     Geo                                                                                                                              ac-uk
                                                 Scotland                                      TWC LOGD                Eurostat                                                                                                                       Daily               DIT
                                                                                                                                                             Linked                                                                                                                                                                    UN/
                               Data                                                                                                                                             UMBEL                                                                 Med                                                                ERA
                                                                                                                                                              Data                                                                                                                                                                   LOCODE                                                                                                    DEPLOY
                               Gov.ie                         CORDIS                                                                                                                             YAGO                                                                                                                                                                                                           New-
                                                                                                                                                                                                                      lingvoj                                                           Disea-
                                                               (RKB                                                                                                                                                                                                                     some               SIDER                                                                         RAE2001                castle                                        LOCAH
                                                             Explorer)                                                                   Linked                                                                                                                                                                                                   Eurécom


         Cross-domain
                                            CORDIS                                                                                                                                                                                                                      Drug                                                                                                                                                       Roma
                                                                                                               Eurostat               Sensor Data                                                                                                                                                                                                              CiteSeer
                                             (FUB)                                                            (Ontology                                                                                                                                                 Bank
                                                                               GovTrack                                                (Kno.e.sis)                                    Open                                                                                                                                        Pfam                                                                                                         Course-
                                                                                                               Central)                                          riese                                                           Enipedia
                                                                                                                                                                                       Cyc              Lexvo                                      LinkedCT                                                                                                                                                                                     ware
                                                                Linked                                                                                                                                                                                                                                          PDB
                                                                                                                                                                                                                                                                                       UniProt                                                         VIVO
                                         EURES                 EDGAR                                                                                                                                                                                                                                                                                                                                         dotAC
                                                                                                 US SEC                                                                                                                                                                                                                                               Indiana                ePrints                                                IEEE
                                                              (Ontology                                                                                                                                             totl.net
                                                                                               (rdfabout)
                                                               Central)                                                                                                         WordNet                                                                                                                                                                                                                                                             RISKS


          Life sciences
                                                                                                                                                                                 (VUA)                                                        Taxono               UniProt
                                                                                                                   US Census               EUNIS              Twarql                                                                                                                                                             HGNC
                                                                            Semantic                                                                                                               Cornetto                                                        (Bio2RDF)
                                                                                                                   (rdfabout)                                                                                                                   my                                                                                                                   VIVO
                                                  FTS                         XBRL                                                                                                                                                                                                         PRO-            ProDom                                 STITCH            Cornell                LAAS
                                                                                                                                                                                                                                                                                           SITE                                                                                                                        KISTI                NSF
                                                              Scotland
                                                                Geo-                        GeoWord                                                                                                                       LODE
                                                               graphy                         Net                                                                                WordNet           WordNet                                                                                                                                                                                            JISC
                                                                                                                                                                                  (W3C)              (RKB
                                                                                                                                       Climbing
                                                                                                                                                             Linked                                                                                       Affy-                                                                                                KEGG
                                                                                                                     SMC                                                                           Explorer)                              SISVU                                                                                            Pub                                    VIVO UF
                                                                                Piedmont                                                                    GeoData                                                                                       metrix                                                                                               Drug
                                                                                                                                                                                                                                                                                                                                                                                                                             ECCO-
                                                                Finnish                                            Journals                                                                                                                                                 PubMed                    Gene                SGD             Chem
                                                                                Accomo-               El                                                                                                                                                                                                                                                                                                                      TCP
                                                                Munici-                                                                                                                                            AGROV                                                                             Ontology
                                                                                 dations                                                                                         Alpine                                                                                                                                                                                                             bible
                                                                palities                           Viajero                                                                                                          OC
                                                                                                                                                                                  Ski                                                                                                                                                                                                             ontology
                                                                                                   Tourism                                                                                                                                                                                                                                                                 KEGG
                                                                                                                                                                                 Austria                                                                                                                                                                                                                             PBAC
                                                                                                                    Ocean                                                                      GEMET                                                                                                                                                                      Enzyme
                                                                                                                                                    Metoffice                                                                                    ChEMBL
                                                                                       Italian                     Drilling                                                                                                                                        OMIM                                                                                KEGG
                                                                                                                                                     Weather                                                    Open
                                                                                        public                     Codices            AEMET                                                                                      Linked                                                                         MGI                                   Pathway
                                                                                                                                                                                                                Data
                                                                                       schools                                                      Forecasts                                                                     Open                                                  InterPro                                    GeneID                                                       KEGG
                                                                                                                                                                             EARTh                             Thesau-
                                                                                                     Turismo
                                                                                                                                                                                                                 rus             Colors                                                                                                                                                         Reaction
                                                                                                        de
                                                                                                    Zaragoza                                                                                Product                                                Smart                                                                                                                           KEGG
                                                                                                                                                              Weather                         DB                                                    Link                                                                  Medi                                                     Glycan
                                                                                                                           Janus                              Stations                                    Product                                                                                                         Care                                       KEGG
                                                                                                                            AMP                                                                                                                                 UniParc             UniRef              UniSTS
                                                                                                                                                                                                           Types                Italian
                                                                                                                                                                                                                                                                                                                                                    Homolo           Com-
                                                                                                                                                Yahoo!                       Airports                                          Museums                                                                                                                               pound
                                                                                                                                                                                                          Ontology                          Google
                                                                                                                                                                                                                                                                                                                                                     Gene
                                                                                                                                                 Geo                                                                                          Art
                                                                                                                                                Planet        National                                                                                                                                                               Chem2
                                                                                                                                                                                                                                            wrapper
                                                                                                                                                               Radio-                                                                                                                                                               Bio2RDF
                                                                                                                                                              activity                                                                                                                                                UniPath
                                                                                                                                                                 JP                        Sears                Open                                            Linked                               OGOLOD            way
                                                                                                                                                                                                               Corpo-           Amster-                                          Reactome
                                                                                                                                                                                                                                 dam              medu-          Open
                                                                                                                                                                                                                rates                                          Numbers
                                                                                                                                                                                                                                Museum            cator




      http://lod-cloud.net                                                                                                                                                                                                                                                                                                                                       As of September 2011
|


            TYPES OF LINKED DATA

•   Open, Public Data (LOD Cloud)
•   Linked Enterprise Data
•   Commercial Linked Data (very soon?)




... AND WHAT YOU CAN DO WITH THEM
•   Provide interfaces on top of them
•   Augment your website
•   Integrate them into your application logic
•   Create specialized data marts
|


AUGMENT YOUR WEBSITE: BBC

       BBC online properties make intensive use of
       data from Wikipedia and MusicBrainz
|


                   DATA MARTS: NEUROWIKI

•   NeuroWiki creates views
    over gene, drug and
    disease data from four
    RDF data sources
•   Provides navigation and
    composition tools for
    accessing and mining the
    data
|


          APPLICATION LOGIC: IBM WATSON




                                              http://www.flickr.com/photos/ibm_media/

•   IBM Watson makes use of Linked Data sources such as DBpedia
|




        4 STEPS TO
LINKED DATA INTEGRATION
|


                                     STEP #1:
                                ACCESS LINKED DATA
•   Linked Data is published via dereferenceable HTTP URIs, SPARQL endpoints and RDF dumps
                          Access Methods                           Decision Factors
Architecture              HTTP Dereferencing  SPARQL  Dump import  Recency  Speed / Scalability                               Reliability  Complexity
On-The-Fly Dereferencing  X                                        High     Low                                               Low          High
Query Federation                              X                    High     Decreases exponentially as new sources are added  Low          Moderate with SPARQL 1.1 SERVICE clause
Crawling and Caching      X                   X       X            Depends  High                                              High         High

                                       Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)



•   Live access allows quick prototyping and limited production use
•   As data sets grow in size and more data sources are added, a
    crawling/caching architecture often becomes necessary
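
A minimal sketch of on-the-fly dereferencing, assuming the data source supports HTTP content negotiation. The function names, the Accept header value and the DBpedia URI in the usage note are illustrative choices, not part of the slides.

```python
from typing import Tuple
from urllib.request import Request, urlopen

# Assumed preference order: Turtle first, RDF/XML as a fallback.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_dereference_request(uri: str) -> Request:
    """Prepare an HTTP GET that asks for RDF instead of an HTML page."""
    return Request(uri, headers={"Accept": RDF_ACCEPT})


def dereference(uri: str, timeout: float = 10.0) -> Tuple[str, bytes]:
    """Fetch the description of a resource; returns (content type, raw body)."""
    with urlopen(build_dereference_request(uri), timeout=timeout) as response:
        return response.headers.get("Content-Type", ""), response.read()
```

Calling `dereference("http://dbpedia.org/resource/Berlin")` would issue one live request per resource, which is why this approach suits prototyping better than large-scale production use.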
|


                            STEP #1:
                       ACCESS LINKED DATA
Implementations:
•   On-the-fly dereferencing
    •   LDspider, SQUIN, Semantic Web Client library
•   Query federation
    •   SPARQL 1.1 SERVICE clause
•   Crawling and Caching
    •   Triplestore import script
    •   Public caches (e.g. Sindice, OpenLink LOD endpoint)
    •   LDIF
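
Query federation with the SPARQL 1.1 SERVICE clause can be sketched as follows. This only assembles the query text; the helper name, the triple patterns and the choice of the public DBpedia endpoint are assumptions made for illustration.

```python
PREFIXES = (
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)


def build_federated_query(remote_endpoint: str,
                          local_pattern: str,
                          remote_pattern: str) -> str:
    """Combine a local graph pattern with one evaluated at a remote endpoint."""
    return (
        f"{PREFIXES}"
        "SELECT * WHERE {\n"
        f"  {local_pattern}\n"
        # The SERVICE block is sent to the remote endpoint by the query engine.
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        "}\n"
    )


query = build_federated_query(
    "http://dbpedia.org/sparql",
    "?item owl:sameAs ?dbpediaResource .",
    "?dbpediaResource rdfs:label ?label .",
)
```

The resulting string would be submitted to a local SPARQL 1.1 engine, which then contacts the remote endpoint itself; every additional SERVICE block adds another remote round trip to the query.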
|


                                                                                   STEP #2:
                                                                            NORMALIZE VOCABULARIES
   Data sources that overlap in content use a wide range of vocabularies.

•   Over 60 % of all LOD sources use proprietary vocabularies
•   It’s up to the data consumer to normalize the vocabularies
•   Enterprise: Need to translate between internal and external vocabularies

[Figure: tag cloud of vocabularies, including mpeg7, swrc, po, dcam, bib, tl, wot, rdfg, txncompass, metalex, doap, dc, wdrs, admingeo, vann, api, org, sawsdl, sdmx, geospecies, qb, xml, rev, vu-wordnet, umbel, uniprot, http, scovo, void, tag, dbp, bio, ore, dbo, gr, dbpedia, event, time, xsd, frbr, geonames, cc, sioc, vcard, mo, bibo, akt, xhtml, foaf, skos, geo]

         Most widely used vocabularies in the LOD cloud (08/10/2011)
Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
|


                             STEP #2:
                      NORMALIZE VOCABULARIES
Approaches to Schema Mapping:
•   Hand-crafting queries against individual sources – no different than an API
    OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
    OPTIONAL { ?ow dbp-prop:location ?loc. ?loc rdf:type umbel-sc:City ; ot:preferredLabel ?city_db_loc }
    OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }

                               Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php
•   Ontology Representation Languages: OWL, RDFS
•   Rules: SWRL, RIF
•   Query Languages
    •   SPARQL CONSTRUCT clause
    •   TopQuadrant SPARQLMotion
    •   Mosto
    •   R2R (part of LDIF)
|


                             STEP #2:
                      NORMALIZE VOCABULARIES
Using SPARQL:
• Rename a class
    CONSTRUCT {
      ?s a mo:MusicArtist
    } WHERE {
      ?s a dbpedia-owl:MusicalArtist
    }


•   Value transformation
    CONSTRUCT {
      ?s movie:runtime ?runtimeInMinutes .
    } WHERE {
      ?s dbpedia-owl:runtime ?runtime .
      # Assumes dbpedia-owl:runtime is given in seconds
      BIND(?runtime / 60 AS ?runtimeInMinutes)
    }


•   Create URI from literal
    CONSTRUCT {
      ?s diseasome:omim ?omimuri .
      ?omimuri dc:identifier ?identifier .
    } WHERE {
      ?s dbpedia-owl:omim ?omim .
      BIND(IRI(concat("http://bio2rdf.org/omim:", ?omim)) AS ?omimuri)
      BIND(concat("omim:", ?omim) AS ?identifier)
    }

                                                                         Slide credits: Andreas Schultz
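
For readers who do not have a SPARQL engine at hand, the three kinds of mapping shown above (renaming a class, transforming a value, creating a URI from a literal) can be sketched over plain Python tuples. This is an illustration only, not R2R or any LDIF component; the prefixed names are treated as opaque strings, and the runtime conversion assumes the source value is in seconds.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, object]  # (subject, predicate, object)

RDF_TYPE = "rdf:type"


def translate(triples: Iterable[Triple]) -> List[Triple]:
    """Map source-vocabulary triples to the target vocabulary."""
    out: List[Triple] = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == "dbpedia-owl:MusicalArtist":
            # Rename a class.
            out.append((s, RDF_TYPE, "mo:MusicArtist"))
        elif p == "dbpedia-owl:runtime":
            # Value transformation: seconds to minutes (assumed source unit).
            out.append((s, "movie:runtime", float(o) / 60))
        elif p == "dbpedia-owl:omim":
            # Create a URI from a literal, plus an identifier literal.
            omim_uri = f"http://bio2rdf.org/omim:{o}"
            out.append((s, "diseasome:omim", omim_uri))
            out.append((omim_uri, "dc:identifier", f"omim:{o}"))
    return out
```

Triples that match none of the rules are dropped, which mirrors how a CONSTRUCT query only emits what its template describes.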
|


                                               STEP #3:
                                          RESOLVE IDENTIFIERS
  Data sources that overlap in content use different identifiers for the
  same real-world entity.

•   Most LOD sources only provide owl:sameAs links to one other data source
•   It’s up to the data consumer to generate additional links
•   Enterprise: Need to link both internal and external resources

     Linked data sets     Sources
     1                    98
     2                    62
     3                    38
     4                    19
     5                    5
     6 - 10               17
     > 10                 27

                    Number of linked data sets per source (08/10/2011)
Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
|


                             STEP #3:
                        RESOLVE IDENTIFIERS
Approaches to Identity Resolution:
•   Improvised or manual merging
•   Rule-based approaches:
    •   SILK (part of LDIF)
    •   LIMES

[Figure: "Union Square" at 37°47′N 122°24′W is compared against the candidates "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco"; result: Union Sq. = Union Sq., San Francisco]
|


                                    STEP #4:
                                 FILTER DATA
Data sources that overlap in content provide data that is conflicting and of
varying quality.
•   Data sources have...
    •   ... different knowledge levels, views or intents
    •   ... wrong, biased, inconsistent or outdated information
•   Approaches:
    •   Import data into distinct Named Graphs; query them separately
        using the SPARQL GRAPH clause
    •   Sieve (part of LDIF)
|


    LDIF – LINKED DATA INTEGRATION FRAMEWORK
Integrates Linked Data from multiple sources into a clean, local target
representation while keeping track of data provenance

           1   Collect data: Managed download and update

           2   Translate data into a single target vocabulary

           3   Resolve identifier aliases into local target URIs

     NEW   4   Cleanse data: resolve conflicting values

           5   Output

•   Follows the Crawling and Caching Architecture Pattern
•   Open source (Apache License, Version 2.0)
•   Collaboration between Freie Universität Berlin and mes|semantics
|


                         LDIF PIPELINE

Step 1: Collect data

Supported data sources:
•   RDF dumps (all common formats)
•   SPARQL Endpoints
•   Crawling Linked Data via HTTP
|


                             LDIF PIPELINE

Step 2: Translate data

Sources use a wide range of different RDF vocabularies

[Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City]

•   Simple mappings using OWL / RDFS statements
    (x rdfs:subClassOf y)
•   Complex mappings with SPARQL expressivity
•   Built-in transformation function library (XPath)
|


                                          LDIF PIPELINE

Step 3: Resolve identities

Sources use different identifiers for the same entity

[Figure: Silk compares "Union Square" at 37°47′N 122°24′W against the candidates "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco"; result: Union Sq. = Union Sq., San Francisco]

•   Automated link creation based on Link Specifications
•   Supports various comparators and transformations
    (string similarity, basic arithmetic, time, geographical
    distance)
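
The idea of combining comparators can be sketched as follows: two records are linked only if their labels are similar and their coordinates are close. This is a hand-written illustration, not Silk's Link Specification Language or API; the function names, thresholds and coordinates are assumptions chosen for the example.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres (haversine formula)."""
    radius = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def is_same_entity(a: Dict, b: Dict, min_similarity: float = 0.5, max_km: float = 2.0) -> bool:
    """Link rule: labels must be similar AND locations must be close."""
    return (
        label_similarity(a["label"], b["label"]) >= min_similarity
        and distance_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
    )


def find_matches(source: Dict, candidates: List[Dict]) -> List[str]:
    """Return the labels of all candidates the rule links to the source."""
    return [c["label"] for c in candidates if is_same_entity(source, c)]


# Approximate, illustrative coordinates.
SOURCE = {"label": "Union Square", "lat": 37.7833, "lon": -122.4}
CANDIDATES = [
    {"label": "Union Sq., New York", "lat": 40.7359, "lon": -73.9911},
    {"label": "Union Sq., Seattle", "lat": 47.6100, "lon": -122.3330},
    {"label": "Union Sq., San Francisco", "lat": 37.7880, "lon": -122.4075},
]
```

All three candidate labels are similar to the source label, so the geographic comparator is what disambiguates them in find_matches(SOURCE, CANDIDATES).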
|


                           LDIF PIPELINE
Step 4: Cleanse data

Sources provide different values for the same property

[Figure: Sieve fuses two conflicting statements, "San Francisco population is 0.7M" (rated two stars) and "San Francisco population is 0.8M" (rated three stars), into "San Francisco population is 0.8M"]

1. Quality Assessment – assign quality scores to Named
   Graphs (by time, by source preference, thresholds)
2. Data Fusion – resolve conflicting property values
   (according to quality scores, frequency, averages)
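
The two stages can be sketched in a few lines: first score each Named Graph, then pick among conflicting values according to those scores. This is a simplified illustration, not Sieve's configuration format or API; the linear recency score and the "keep the best-scored value" fusion are just one assumed combination of the options listed above.

```python
from datetime import date
from typing import Dict


def recency_score(last_updated: date, today: date, max_age_days: int = 365) -> float:
    """Quality assessment by time: 1.0 if updated today, 0.0 at max_age_days or older."""
    age_days = (today - last_updated).days
    return min(1.0, max(0.0, 1.0 - age_days / max_age_days))


def fuse_keep_best(values_by_graph: Dict[str, object], scores: Dict[str, float]) -> object:
    """Data fusion: keep the value from the graph with the highest quality score."""
    best_graph = max(values_by_graph, key=lambda graph: scores.get(graph, 0.0))
    return values_by_graph[best_graph]


# Hypothetical graphs and update dates for the population example.
TODAY = date(2012, 6, 5)
LAST_UPDATED = {"graph:sourceA": date(2010, 1, 1), "graph:sourceB": date(2012, 3, 1)}
POPULATION = {"graph:sourceA": 700_000, "graph:sourceB": 800_000}

SCORES = {graph: recency_score(updated, TODAY) for graph, updated in LAST_UPDATED.items()}
FUSED_POPULATION = fuse_keep_best(POPULATION, SCORES)
```

Other fusion strategies mentioned on the slide, such as taking the most frequent value or an average, would replace fuse_keep_best while reusing the same scores.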
|


                         LDIF PIPELINE

Step 5: Output

Output options:
•   N-Quads
•   N-Triples
•   SPARQL Update Stream

•   Provenance tracking using Named Graphs
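
N-Quads carries provenance by adding a fourth term, the graph name, to every statement. The following is a minimal serialization sketch for IRIs and plain string literals only; the helper names and example IRIs are assumptions, and it does not reflect LDIF's own writer.

```python
def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Quads string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_nquad(subject: str, predicate: str, obj: str, graph: str,
             obj_is_literal: bool = False) -> str:
    """Serialize one statement as an N-Quads line; the graph IRI records the source."""
    obj_term = f'"{escape_literal(obj)}"' if obj_is_literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} <{graph}> ."


# Hypothetical IRIs for illustration.
LINE = to_nquad(
    "http://example.org/San_Francisco",
    "http://example.org/population",
    "0.8M",
    "http://example.org/graph/sourceB",
    obj_is_literal=True,
)
```

Because each line names its graph, a consumer can later filter or re-score statements by source without re-running the pipeline.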
|


                    LDIF ARCHITECTURE

Application Layer:
•   Application Code, accessing the integrated data via SPARQL or RDF API

Data Access, Integration and Storage Layer:
•   LDIF, consisting of:
    •   Web Data Access Module
    •   Data Translation Module
    •   Identity Resolution Module
    •   Data Quality and Fusion Module
•   Output: Integrated Web Data
•   Accesses the Web of Data via HTTP

Publication Layer:
•   LD Wrapper over Database A (HTTP)
•   LD Wrapper over Database B (HTTP)
•   RDFa from CMS (HTTP)
•   RDF/XML
|


                                   VERSIONS
•   In-memory
    •   fast, but scalability limited by local RAM
•   RDF Store (TDB)
    •   stores intermediate results in a Jena TDB RDF store
    •   can process more data than the In-memory version, but does not scale beyond a single machine
•   Cluster (Hadoop)
    •   scales by parallelizing work across multiple machines using Hadoop
    •   can process a virtually unlimited amount of data
    •   ready for Amazon Elastic MapReduce
|


       BENCHMARKS
KEGG GENES VS. UNIPROT (CLUSTER)

            300M TRIPLES




            3.6B TRIPLES
|




Q&A
|




                                 THANKS!
•   Early adopters wanted!

•   Website: http://bit.ly/ldifweb
•   Google Group: http://bit.ly/ldifgroup
•   http://mes-semantics.com


•   Supported in part by
    •   Vulcan Inc. as part of its Project Halo
    •   EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data
        (Grant No. 257943)
•   Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz

How to integrate Linked Data into your application

  • 1. SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE | SAN FRANCISCO, JUNE 5, 2012 HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION LDIF Team: Andreas Schultz, Freie Universität Berlin Andrea Matteini, mes|semantics Robert Isele, Freie Universität Berlin Pablo N. Mendes, Freie Universität Berlin Christian Becker, mes|semantics Christian Bizer, Freie Universität Berlin With contributions by: Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.
  • 2. | WHAT IS LINKED DATA?
      • Raw data (RDF)
      • Accessible on the web
      • Data can link to other data sources
      [Figure: data sources A, B, C, D and E, each containing Things, connected to one another by data links]
      • Benefits: Ease of access and re-use; enables discovery
      • One API for all data sources?
  • 3. | LINKING OPEN DATA CLOUD [Figure: Linking Open Data cloud diagram of interlinked data sets, grouped into Media, Geographic, Publications, User-generated content, Government, Cross-domain and Life sciences] Source: http://lod-cloud.net (as of September 2011)
  • 4. | TYPES OF LINKED DATA
      • Open, Public Data (LOD Cloud)
      • Linked Enterprise Data
      • Commercial Linked Data (very soon?)
      ... AND WHAT YOU CAN DO WITH THEM
      • Provide interfaces on top of them
      • Augment your website
      • Integrate them into your application logic
      • Create specialized data marts
  • 5. | AUGMENT YOUR WEBSITE: BBC BBC online properties make intensive use of data from Wikipedia and MusicBrainz
  • 6. | DATA MARTS: NEUROWIKI • NeuroWiki creates views for genes, drugs and diseases data from four RDF data sources • Provides navigation and composition tools for accessing and mining the data
  • 7. | APPLICATION LOGIC: IBM WATSON http://www.flickr.com/photos/ibm_media/ • IBM Watson makes use of Linked Data sources such as DBpedia
  • 8. | 4 STEPS TO LINKED DATA INTEGRATION
  • 9. | STEP #1: ACCESS LINKED DATA
      • Linked Data is published via HTTP, SPARQL endpoints, RDF dumps

      Columns 2 to 4 are access methods; columns 5 to 8 are decision factors.

      Architecture             | HTTP Dereferencing | Dump import | SPARQL | Recency | Speed / Scalability | Reliability                                      | Complexity
      -------------------------+--------------------+-------------+--------+---------+---------------------+--------------------------------------------------+-----------------------------------------
      On-The-Fly Dereferencing | X                  |             |        | High    | Low                 | Low                                              | High
      Query Federation         |                    |             | X      | High    | Low                 | Decreases exponentially as new sources are added | Moderate with SPARQL 1.1 SERVICE clause
      Crawling and Caching     | X                  | X           | X      | Depends | High                | High                                             | High

      Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
      • Live access allows quick prototyping and limited production use
      • As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary
  • 10. | STEP #1: ACCESS LINKED DATA Implementations: • On-the-fly dereferencing • LDspider, SQUIN, Semantic Web Client library • Query federation • SPARQL 1.1 SERVICE clause • Crawling and Caching • Triplestore import script • Public caches (e.g. Sindice, OpenLink LOD endpoint) • LDIF
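On-the-fly dereferencing boils down to an HTTP GET on the resource URI with an Accept header that asks for RDF instead of HTML. The sketch below only builds such a request with the Python standard library and does not send it. The function name, the media-type preferences and the example DBpedia URI are illustrative assumptions, not something the slides prescribe.

```python
from urllib.request import Request

# RDF media types to ask for; the order and q-values are assumptions
# chosen for illustration, not a recommendation from the slides.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9, application/n-triples;q=0.8"


def build_dereference_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": RDF_ACCEPT}, method="GET")


# Hypothetical example resource.
request = build_dereference_request("http://dbpedia.org/resource/Berlin")

# To actually fetch the data one would call, for example:
#   urllib.request.urlopen(request, timeout=10).read()
# and hand the bytes to an RDF parser of choice.
```

Every dereferenced URI costs one round trip, which is why this approach suits prototyping better than bulk integration.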
  • 11. | STEP #2: NORMALIZE VOCABULARIES
      Data sources that overlap in content use a wide range of vocabularies.
      • Over 60 % of all LOD sources use proprietary vocabularies
      • It’s up to the data consumer to normalize the vocabularies
      • Enterprise: Need to translate between internal and external vocabularies
      [Figure: tag cloud of the most widely used vocabularies in the LOD cloud (08/10/2011): mpeg7 swrc po dcam bib tl wot rdfg txncompass metalex doap dc wdrs admingeo vann api org sawsdl sdmx geospecies qb xml rev vu-wordnet umbel uniprot http scovo void tag dbp bio ore dbo gr dbpedia event time xsd frbr geonames cc sioc foaf vcard mo bibo akt xhtml skos geo]
      Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
  • 12. | STEP #2: NORMALIZE VOCABULARIES Approaches to Schema Mapping: • Hand-crafting queries against individual sources – no different than an API OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } . OPTIONAL { ?ow dbp-prop:location ?loc. ?loc rdf:type umbel-sc:City ; ot:preferredLabel ?city_db_loc } OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] } Source: http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php • Ontology Representation Languages: OWL, RDFS • Rules: SWRL, RIF • Query Languages • SPARQL CONSTRUCT clause • TopQuadrant SPARQLMotion • Mosto • R2R (part of LDIF)
  • 13. | STEP #2: NORMALIZE VOCABULARIES
      Using SPARQL:
      • Rename a class
          CONSTRUCT { ?s a mo:MusicArtist }
          WHERE { ?s a dbpedia-owl:MusicalArtist }
      • Value transformation (dbpedia-owl:runtime is given in seconds, so minutes are obtained by dividing by 60)
          CONSTRUCT { ?s movie:runtime ?runtimeInMinutes . }
          WHERE { ?s dbpedia-owl:runtime ?runtime .
                  BIND(?runtime / 60 AS ?runtimeInMinutes) }
      • Create URI from literal
          CONSTRUCT { ?s diseasome:omim ?omimuri .
                      ?omimuri dc:identifier ?identifier . }
          WHERE { ?s dbpedia-owl:omim ?omim .
                  BIND(IRI(CONCAT("http://bio2rdf.org/omim:", STR(?omim))) AS ?omimuri)
                  BIND(CONCAT("omim:", STR(?omim)) AS ?identifier) }
      Slide credits: Andreas Schultz
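For readers who want to see the three mapping patterns outside of SPARQL, here is a minimal Python sketch that applies them to triples held as plain tuples. It is not how R2R or a SPARQL engine works internally. The function name and the prefixed-name strings are assumptions for illustration, and the runtime rule assumes the source value is in seconds.

```python
RDF_TYPE = "rdf:type"


def normalize(triples):
    """Translate source-vocabulary triples into the target vocabulary."""
    out = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == "dbpedia-owl:MusicalArtist":
            # Pattern 1: rename a class.
            out.append((s, RDF_TYPE, "mo:MusicArtist"))
        elif p == "dbpedia-owl:runtime":
            # Pattern 2: value transformation (assumed seconds -> minutes).
            out.append((s, "movie:runtime", o / 60))
        elif p == "dbpedia-owl:omim":
            # Pattern 3: mint a URI from a literal and keep the identifier.
            omim_uri = f"<http://bio2rdf.org/omim:{o}>"
            out.append((s, "diseasome:omim", omim_uri))
            out.append((omim_uri, "dc:identifier", f"omim:{o}"))
        else:
            # Anything without a mapping rule passes through unchanged.
            out.append((s, p, o))
    return out


# Hypothetical input triples.
source = [
    ("ex:artist1", RDF_TYPE, "dbpedia-owl:MusicalArtist"),
    ("ex:movie1", "dbpedia-owl:runtime", 5400),
    ("ex:disease1", "dbpedia-owl:omim", "104300"),
]
target = normalize(source)
```

Keeping one rule per source term makes it easy to see which parts of a source vocabulary are still unmapped.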
  • 14. | STEP #3: RESOLVE IDENTIFIERS
      Data sources that overlap in content use different identifiers for the same real-world entity.
      • Most LOD sources only provide owl:sameAs links to one other data source
      • It’s up to the data consumer to generate additional links
      • Enterprise: Need to link both internal and external resources
      [Figure: bar chart, number of linked data sets per source (08/10/2011)]
          1 linked data set:        98 sources
          2 linked data sets:       62 sources
          3 linked data sets:       38 sources
          4 linked data sets:       19 sources
          5 linked data sets:        5 sources
          6 - 10 linked data sets:  17 sources
          > 10 linked data sets:    27 sources
      Source: FU Berlin / DERI; http://www4.wiwiss.fu-berlin.de/lodcloud/state/
  • 15. | STEP #3: RESOLVE IDENTIFIERS
      Approaches to Identity Resolution:
      • Improvised or manual merging
      • Rule-based approaches:
          • SILK (part of LDIF)
          • LIMES
      [Figure: the candidates "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco (37°47′N 122°24′W)" are compared; with the rule "Union Sq. = Union Square", the match is "Union Sq., San Francisco (37°47′N 122°24′W)"]
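A rule-based matcher of the kind pictured on this slide can be sketched as two checks: the labels agree after normalization, and the coordinates lie close together. The code below is a toy illustration in plain Python, not the Silk or LIMES rule language. The function names, the 1 km threshold, the record layout and the approximate coordinates are all assumptions.

```python
import math
import re


def normalize_label(label: str) -> str:
    """Lower-case, expand the abbreviation 'sq.' and collapse whitespace."""
    text = label.lower()
    text = re.sub(r"\bsq\.?(?=\s|,|$)", "square", text)
    return re.sub(r"\s+", " ", text).strip()


def distance_km(a, b) -> float:
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def same_entity(x, y, max_km: float = 1.0) -> bool:
    """Match when labels agree after normalization and places are close."""
    return (normalize_label(x["label"]) == normalize_label(y["label"])
            and distance_km(x["coords"], y["coords"]) <= max_km)


# Hypothetical records; coordinates are rough values for illustration only.
candidates = {
    "ex:UnionSquareNY": {"label": "Union Square", "coords": (40.7359, -73.9911)},
    "ex:UnionSquareSeattle": {"label": "Union Square", "coords": (47.6100, -122.3364)},
    "ex:UnionSquareSF": {"label": "Union Square", "coords": (37.7880, -122.4075)},
}
query = {"label": "Union Sq.", "coords": (37.7879, -122.4074)}

matches = [uri for uri, rec in candidates.items() if same_entity(query, rec)]
```

The label check alone would link all three candidates; the distance check is what separates San Francisco from New York and Seattle.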
  • 16. | STEP #4: FILTER DATA Data sources that overlap in content provide data that is conflicting and of varying quality. • Data sources have... • ... different knowledge levels, views or intents • ... wrong, biased, inconsistent or outdated information • Approaches: • Import data into distinct Named Graphs; query them separately using the SPARQL GRAPH clause • Sieve (part of LDIF)
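The Named Graph approach can be pictured as keeping each source's statements in its own bucket and asking every bucket the same question. The sketch below mimics that with plain Python dictionaries; it is a conceptual stand-in for a quad store and a SPARQL GRAPH query, and all names and values in it are made up for illustration.

```python
from collections import defaultdict


def load_into_named_graphs(quads):
    """Group (subject, predicate, object, graph) statements by graph name."""
    graphs = defaultdict(list)
    for subject, predicate, obj, graph in quads:
        graphs[graph].append((subject, predicate, obj))
    return dict(graphs)


def values_per_graph(graphs, subject, predicate):
    """Roughly: SELECT ?g ?o WHERE { GRAPH ?g { <subject> <predicate> ?o } }"""
    result = {}
    for graph, triples in graphs.items():
        values = [o for s, p, o in triples if s == subject and p == predicate]
        if values:
            result[graph] = values
    return result


# Hypothetical statements from two sources that disagree on one value.
quads = [
    ("ex:thing1", "ex:size", 10, "graph:sourceA"),
    ("ex:thing1", "ex:size", 12, "graph:sourceB"),
    ("ex:thing1", "rdfs:label", "Thing one", "graph:sourceA"),
]
graphs = load_into_named_graphs(quads)
per_source = values_per_graph(graphs, "ex:thing1", "ex:size")
```

Because the conflicting values stay attributed to their source, the application can decide later which source to trust.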
  • 17. | LDIF – LINKED DATA INTEGRATION FRAMEWORK
      Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance
      1 Collect data: managed download and update
      2 Translate data into a single target vocabulary
      3 Resolve identifier aliases into local target URIs
      4 Cleanse data, resolving conflicting values (NEW)
      5 Output
      • Follows the Crawling and Caching Architecture Pattern
      • Open source (Apache License, Version 2.0)
      • Collaboration between Freie Universität Berlin and mes|semantics
  • 18. | LDIF PIPELINE (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output)
      Step 1, Collect data. Supported data sources:
      • RDF dumps (all common formats)
      • SPARQL Endpoints
      • Crawling Linked Data via HTTP
  • 19. | LDIF PIPELINE (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output)
      Step 2, Translate data (R2R): Sources use a wide range of different RDF vocabularies
      [Figure: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City]
      • Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
      • Complex mappings with SPARQL expressivity
      • Built-in transformation function library (XPath)
  • 20. | LDIF PIPELINE (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output)
      Step 3, Resolve identities (Silk): Sources use different identifiers for the same entity
      [Figure: the candidates "Union Sq., New York", "Union Sq., Seattle" and "Union Sq., San Francisco (37°47′N 122°24′W)" are compared; with the rule "Union Sq. = Union Square", Silk links to "Union Sq., San Francisco (37°47′N 122°24′W)"]
      • Automated link creation based on Link Specifications
      • Supports various comparators and transformations (string similarity, basic arithmetic, time, geographical distance)
  • 21. | LDIF PIPELINE (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output)
      Step 4, Cleanse data (Sieve): Sources provide different values for the same property
      [Figure: two sources state "San Francisco population is 0.7M" and "San Francisco population is 0.8M" and carry different star ratings (★★ and ★★★); Sieve outputs "San Francisco population is 0.8M"]
      1. Quality Assessment – assign quality scores to Named Graphs (by time, by source preference, thresholds)
      2. Data Fusion – resolve conflicting property values (according to quality scores, frequency, averages)
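The two Sieve stages can be illustrated with a small Python sketch: a time-based scoring function assigns each Named Graph a quality score, and a fusion function keeps the value from the best-scored graph. This is a toy model under stated assumptions, not Sieve's configuration format or API. The function names, the linear decay over 365 days and the update dates are invented; only the 0.7M and 0.8M population claims come from the slide.

```python
from datetime import date


def recency_score(last_updated: date, today: date, max_age_days: int = 365) -> float:
    """Score in [0, 1]: 1.0 for data updated today, 0.0 at max_age_days or older."""
    age = (today - last_updated).days
    return min(1.0, max(0.0, 1.0 - age / max_age_days))


def fuse(claims, scores):
    """Keep, for each (subject, property), the value from the best-scored graph."""
    best = {}
    for subject, prop, value, graph in claims:
        key = (subject, prop)
        if key not in best or scores[graph] > scores[best[key][1]]:
            best[key] = (value, graph)
    return {key: value for key, (value, _graph) in best.items()}


# Stage 1: quality assessment per Named Graph (dates are hypothetical).
today = date(2012, 6, 5)
scores = {
    "graph:sourceA": recency_score(date(2011, 9, 1), today),
    "graph:sourceB": recency_score(date(2012, 5, 1), today),
}

# Stage 2: data fusion over conflicting claims.
claims = [
    ("ex:SanFrancisco", "ex:population", 700_000, "graph:sourceA"),
    ("ex:SanFrancisco", "ex:population", 800_000, "graph:sourceB"),
]
fused = fuse(claims, scores)
```

Other fusion policies named on the slide, such as picking the most frequent value or averaging, would replace the comparison inside `fuse`.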
  • 22. | LDIF PIPELINE (1 Collect data, 2 Translate data, 3 Resolve identities, 4 Cleanse data, 5 Output)
      Step 5, Output. Output options:
      • N-Quads
      • N-Triples
      • SPARQL Update Stream
      • Provenance tracking using Named Graphs
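N-Quads carries provenance by adding a fourth term, the graph name, to every statement. The following sketch serializes one statement as an N-Quads line; it covers only IRIs and plain string literals with basic escaping, which is a simplification. The function name and the example IRIs are assumptions, not LDIF's output code.

```python
def nquad(subject: str, predicate: str, obj: str, graph: str,
          obj_is_iri: bool = False) -> str:
    """Serialize one statement as an N-Quads line; the 4th term names the graph."""
    if obj_is_iri:
        obj_term = f"<{obj}>"
    else:
        # Escape backslash first, then quotes and line breaks.
        escaped = (obj.replace("\\", "\\\\")
                      .replace('"', '\\"')
                      .replace("\n", "\\n")
                      .replace("\r", "\\r"))
        obj_term = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {obj_term} <{graph}> ."


# Hypothetical statement whose graph IRI records which source it came from.
line = nquad(
    "http://example.org/sf",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "San Francisco",
    "http://example.org/graph/sourceA",
)
```

Dropping the fourth term yields the corresponding N-Triples line, at the cost of losing the link back to the source.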
  • 23. | LDIF ARCHITECTURE
      [Figure: layered architecture diagram]
      • Application Layer: Application Code, accessing the integrated data via SPARQL or RDF API
      • Data Access, Integration and Storage Layer: LDIF (Web Data Access Module, Data Translation Module, Identity Resolution Module, Data Quality and Fusion Module) and the Integrated Web Data store
      • Web of Data: accessed via HTTP
      • Publication Layer: RDFa, LD Wrapper, LD Wrapper, RDF/XML; on top of Database A, Database B, CMS
  • 24. | VERSIONS • In-memory • fast, but scalability limited by local RAM • RDF Store (TDB) • stores intermediate results in a Jena TDB RDF store • can process more data than In-memory but doesn't scale • Cluster (Hadoop) • scales by parallelizing work across multiple machines using Hadoop • can process a virtually unlimited amount of data • ready for Amazon Elastic MapReduce
  • 25. | BENCHMARKS KEGG GENES VS. UNIPROT (CLUSTER) 300M TRIPLES 3.6B TRIPLES
  • 26. | Q&A
  • 27. | THANKS! • Early adopters wanted! • Website: http://bit.ly/ldifweb • Google Group: http://bit.ly/ldifgroup • http://mes-semantics.com • Supported in part by • Vulcan Inc. as part of its Project Halo • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943) • Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz