Industrialized Linked Data

Dave Reynolds, Epimorphics Ltd
@der42
Context: public sector Linked Data
Linked Data journey ...

• explore
   - what is linked data?
   - what use is it for us?
      - self-describing: carries semantics with it; annotate and explain; data in context ...
      - integration: comparable; slice and dice; web API ...
• what's involved?
Linked Data journey ...

• explore → pilot

   pilot: data → model → convert → publish → apply

Photo of The Thinker © dSeneste.dk @ Flickr, CC BY
Linked Data journey ...

• explore → pilot → routine?

Great pilot, but ...
• can we reduce the time and cost?
• how do we handle changes and updates?
• how can we make the published data easier to use?

How do we make Linked Data "business as usual"?
Example case study: Environment Agency
• monitoring of bathing water quality
• static pilot
• live pilot
   - historic annual assessments
   - weekly assessments
• operational system
   - additional data feeds
   - live update
   - integrated API
   - data explorer
From pilot to practice
• reduce modelling costs   (dive 1)
   - patterns
   - reuse
• handling change and update
   - patterns
   - publication process
• automation
   - conversion
   - publication
• embed in the business process
   - use internally as well as externally
   - publish once, use many
   - data platform
Reduce costs - modelling
1. Don't do it (sketched below)
   - map source data into isomorphic RDF, synthesize URIs
   - loses some of the value proposition
2. Reuse existing ontologies intact or mix-and-match
   - best solution when available
   - W3C GLD work on vocabularies – people, organizations, datasets ...
3. Reusable vocabulary patterns
   - example: Data Cube plus reference URI sets
   - adaptable to a broad range of data – environmental, statistical, financial ...
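To make option 1 concrete, here is a minimal Turtle sketch of an isomorphic mapping; every name in it is synthesized for illustration and none comes from the deck:

   # hypothetical isomorphic mapping of one CSV row: a resource per row,
   # a property per column, URIs synthesized from feed and column names
   @prefix eg: <http://example.org/feed/bathing-waters#> .

   eg:row-1
     eg:EUBWID2             "ukk1202-36000" ;
     eg:description_english "Clevedon Beach" .

Cheap to produce, but the result carries none of the shared semantics that make the data self-describing.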
Reusable patterns: Data cube
• Much public sector data has regularities
   - sets of measures
      - observations, forecasts, budgets, assessments, estimates ...
   - organized along some dimensions
      - region, agency, time, category, cost centre ...
   - interpreted according to attributes
      - units, multipliers, status

[Diagrams: sample measure values (quality classifications such as excellent, good, poor; counts and concentrations), then a cube of spend figures ($12k/$15k/$25k, $8k/$9k/$11k, $120k/$130k/$180k) laid out along objective code, cost centre and time dimensions, with provisional/final status attributes]
Data cube vocabulary
[Diagram: the W3C Data Cube vocabulary – data sets, structure definitions, dimensions, measures and attributes]
Data cube pattern
• Pattern, not a fixed ontology
   - customize by selecting measures, dimensions and attributes (sketched below)
   - originated in the publishing of statistics
   - applied to environmental measurements, weather forecasts, budgets and spend, quality assessments, regional demographics ...
• Supports reuse
   - widely reusable URI sets – geography, time periods, agencies, units
   - organization-wide sets
   - modelling often only requires small increments on top of the core pattern and reusable components
• opens the door for reusable visualization tools
• standardization through W3C GLD
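As a minimal sketch of the pattern (all eg: names and values are illustrative, not from the deck), one cell of the spend cube above becomes a single observation:

   # one cell of the spend cube as a Data Cube observation
   @prefix qb:  <http://purl.org/linked-data/cube#> .
   @prefix eg:  <http://example.org/def/spend#> .
   @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

   eg:obs-cc25-2011 a qb:Observation ;
     qb:dataSet    eg:spend-dataset ;                             # the cube it belongs to
     eg:refPeriod  <http://reference.data.gov.uk/id/year/2011> ;  # dimension: time
     eg:costCentre eg:cost-centre-25 ;                            # dimension
     eg:spend      "12000"^^xsd:decimal ;                         # measure
     eg:status     eg:provisional .                               # attribute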
Application to case study
• Data Cubes for water quality measurement
   - in-season weekly assessments
   - end-of-season annual assessments
• dimensions:
   - time intervals – UK reference time service
   - location – reference URI set for bathing waters and sample points
• cubes can reuse these dimensions
   - just need to define the specific measures
From pilot to practice
• reduce modelling costs
   - patterns
   - reuse
• handling change and update   (dive 2)
   - patterns
   - publication process
• automation
   - conversion
   - publication
• embed in the business process
   - use internally as well as externally
   - publish once, use many
   - data platform
Handling change
• critical challenge
   - most initial pilots choose a snapshot dataset
      - and go stale, fast
   - understanding the nature of data updates, and how to handle them, is critical to scaling successfully to business as usual
• types of change
   - new data relating to a different time period
   - corrections to data
   - entities change
      - properties
      - identity
Modelling change
1. Individual data items relate to a new time period
Pattern: n-ary relation
• observation resource relates value to time period and other context
• use Data Cube dimensions for this, as sketched in Turtle below

[Diagram: Clevedon Beach, http://environment.data.gov.uk/id/bathing-water/ukk1202-36000, linked (bwq:bathingWater) to three assessments: bwq:sampleYear http://reference.data.gov.uk/id/year/2009, bwq:classification Higher; year 2010, classification Minimum; year 2011, classification Higher]
History or latest?
• latest is non-monotonic but helpful for many practical uses
   - materialize (SPARQL Update), implement in query, or implement in the API
• choice whether to keep history as well
   - water quality vs. weather forecasts
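Rendered as Turtle, the diagram above comes out roughly as follows; the bwq: namespace URI is an assumption and the assessments are shown as blank nodes for brevity:

   # n-ary relation: each assessment is a resource carrying its value
   # plus its context (bathing water, year) as Data Cube dimensions
   @prefix bwq: <http://environment.data.gov.uk/def/bathing-water-quality/> .

   _:assessment2010
     bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
     bwq:sampleYear   <http://reference.data.gov.uk/id/year/2010> ;
     bwq:classification bwq:Minimum .

   _:assessment2011
     bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
     bwq:sampleYear   <http://reference.data.gov.uk/id/year/2011> ;
     bwq:classification bwq:Higher .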
Modelling change
2. Corrections
• patterns
   - silent change (!)
   - explicit replacement (sketched below)
      - the API level hides replaced values, but SPARQL query can still retrieve and trace them
   - explicit change event

[Diagram: Clevedon Beach's 2011 assessment: the withdrawn value (classification: Minimum; status: replaced; reason: reanalysis) is linked by dct:isReplacedBy / dct:replaces to the new value (classification: Higher); an analysis event records ev:before, ev:after, ev:occuredOn and ev:agent]
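A Turtle sketch of the explicit-replacement pattern; the eg: assessment URIs and the status/reason bookkeeping properties are illustrative (bwq: and eg: prefixes as in the earlier sketches):

   # explicit replacement: the old value stays in the store, marked as
   # replaced, so SPARQL can still trace it; the API serves only the new one
   @prefix dct: <http://purl.org/dc/terms/> .

   eg:assessment-2011-r1                       # the withdrawn value
     bwq:classification bwq:Minimum ;
     eg:status eg:replaced ;                   # illustrative bookkeeping
     eg:reason "reanalysis" ;
     dct:isReplacedBy eg:assessment-2011-r2 .

   eg:assessment-2011-r2                       # the corrected value
     bwq:classification bwq:Higher ;
     dct:replaces eg:assessment-2011-r1 .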
Modelling change
3. Mutation
• infrequent change of properties; essential identity remains
   - e.g. renaming a school, adding another building
   - routine accesses see the property value, not a function of time
• patterns
   - in-place update
   - named graphs (see the TriG sketch below)
      - current graph + graphs for each previous state + meta-graph
   - explicit versioning with open periods
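For the named-graphs variant, the store layout might look like this TriG sketch; the graph names and the meta-graph properties are hypothetical (prefixes elided, as in the deck's own examples):

   # named-graph pattern: current state, one graph per previous state,
   # and a meta-graph recording how the graphs relate (names illustrative)
   eg:current    { eg:clevedon rdfs:label "Clevedon Sands" . }
   eg:state-2003 { eg:clevedon rdfs:label "Clevedon Beach" . }
   eg:meta {
     eg:current    eg:supersedes eg:state-2003 .       # illustrative property
     eg:state-2003 dct:valid     eg:interval-2003-2011 .
   }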
Modelling change
3. Mutation
explicit versioning with open periods (sketched in Turtle below)

[Diagram: an endurant entity with dct:hasVersion links to two versions: "Clevedon Beach", dct:valid interval with time:intervalStarts 2003 and time:intervalFinishes 2011, and "Clevedon Sands", dct:valid interval with time:intervalStarts 2011 and no finish]

• find the right version by querying on the validity interval
• simplify use through
   - a non-monotonic "latest value" link
   - an API that applies the query filters automatically
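The diagram translates into Turtle roughly as below, keeping the property names shown on the slide; the eg: URIs are illustrative:

   # open-period versioning: the endurant links to dated versions; the
   # current version's validity interval has a start but no finish
   @prefix time: <http://www.w3.org/2006/time#> .

   eg:clevedon                                  # the endurant entity
     dct:hasVersion eg:clevedon-v1, eg:clevedon-v2 .

   eg:clevedon-v1
     rdfs:label "Clevedon Beach" ;
     dct:valid [ time:intervalStarts   eg:year-2003 ;
                 time:intervalFinishes eg:year-2011 ] .

   eg:clevedon-v2
     rdfs:label "Clevedon Sands" ;
     dct:valid [ time:intervalStarts eg:year-2011 ] .   # open period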
Application to case study
• weekly and annual samples
   - use the Data Cube pattern (n-ary relation)
• withdrawn samples
   - replacement pattern (no explicit change event)
   - Data Cube slice for "latest valid assessment"
      - generated by a SPARQL Update query
   - API gives easy access to the latest valid values
   - linked-data link following or raw SPARQL queries allow drilling into the changes
• changes to bathing water profile
   - versioning pattern
   - bathing water entity points to the latest profile (SPARQL Update again)
From pilot to practice
• reduce modelling costs
   - patterns
   - reuse
• handling change and update
   - patterns
   - publication process
• automation   (dive 3)
   - conversion
   - publication
• embed in the business process
   - use internally as well as externally
   - publish once, use many
   - data platform
Automation
Transform and publish data feed increments
• transformation engine service
• reusable mappings, low cost to adapt to new feeds
• linking to reference data
• publication service that supports non-monotonic changes

[Diagram: data increments (CSV) feed a transform service driven by reusable xform specs; a reconciliation service links values against reference data; output flows to a publication service and on to replicated publication servers]
Transformation service
• declarative specification of the transform
   - a single service supports a range of transformations
   - easy to adapt a transformation to new feeds and modelling changes
• R2RML – RDB to RDF Mapping Language
   - specifies mappings from database tables to RDF triples
   - W3C Candidate Recommendation
• D2RML
   - R2RML extension that treats a CSV feed as a database table
Small D2RML example

# declare the CSV feed as a data source
:dataSource a dr:CSVDataSource ;
  rdfs:label "dataSource" .

# mint a URI for each row and type it as a bathing water
:bathingWaterTermMap a dr:SubjectMap ;
  dr:template "http://environment.data.gov.uk/id/bathing-water/{EUBWID2}" ;
  dr:class def-bw:BathingWater .

# map columns of each row onto properties of that subject
:bathingWaterMap
  dr:logicalTable :dataSource ;
  dr:subjectMap   :bathingWaterTermMap ;

  dr:predicateObjectMap [
    dr:predicate rdfs:label ;
    dr:objectMap [ dr:column "description_english" ; dr:language "en" ] ] ;

  dr:predicateObjectMap [
    dr:predicate def-bw:eubwidNotation ;
    dr:objectMap [ dr:column "EUBWID2" ; dr:datatype def-bw:eubwid ] ] .
Using patterns
• verbosity is a problem and increases the cost of reuse
• extend D2RML to support modelling patterns
• Data Cube
   - specify the mapping to an observation, with its measures and dimensions
   - engine generates the Data Set and Data Structure Definition automatically
D2RML cube map example

:dataCubeMap a dr:DataCubeMap ;
    rr:logicalTable "dataSource" ;
    dr:datasetIRI "http://example.org/datacube1"^^xsd:anyURI ;
    dr:dsdIRI "http://example.org/myDsd"^^xsd:anyURI ;

    # instances will automatically link to the base Data Set
    dr:observationMap [
      rr:subjectMap [
        rr:termType rr:IRI ;
        rr:template "http://example.org/observation/{PLACE}/{DATE}" ] ;
      # implies an entry in the Data Structure Definition,
      # which is auto-generated
      rr:componentMap [
        dr:componentType qb:measure ;
        rr:predicate aq:concentration ;
        # defines how the measure value is to be represented
        rr:objectMap [ rr:column "NO2" ; rr:datatype xsd:decimal ; ]
      ] ;
      ...
But what about linking?
• connect observations to reference data
   - a core value of linked data
• R2RML has Term Maps to create values
   - constants and templates
• extend to allow maps based on other data sources (sketched below)
   - Lookup map
      - look up a resource in a store, fetch a predicate
   - Reconcile
      - specify a lookup in a remote service
      - uses the Google Refine reconciliation API
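A hypothetical sketch of what such an extension could look like inside a mapping; dr:LookupMap and its properties are invented here for illustration and are not part of R2RML or of the D2RML examples above:

   # hypothetical lookup term map: resolve a CSV code to a reference-data
   # URI by matching, instead of templating a URI directly
   dr:predicateObjectMap [
     dr:predicate def-bw:samplingPoint ;
     dr:objectMap [
       a dr:LookupMap ;                       # invented class name
       dr:lookupSource :referenceDataStore ;  # a configured store to search
       dr:lookupOn     skos:notation ;        # predicate to match ...
       dr:column       "SAMPLE_POINT_ID"      # ... against this column's value
     ] ] ;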
Automation
Transform and publish data feed increments
• transformation engine service ✓
• reusable mappings, low cost to adapt to new feeds ✓
• linking to reference data ✓
• publication service that supports non-monotonic changes

[Diagram as on the earlier Automation slide: CSV increments → transform service (xform specs, reconciliation against reference data) → publication service → replicated publication servers]
Publication service
• goals
   - cope with the non-monotonic effects of change representation
   - so that replication is robust and cheap (=> make it idempotent)
• solution
   - SPARQL Update
   - publish the transformed increment as a simple INSERT DATA
   - then run a SPARQL Update script for the non-monotonic links
      - dct:isReplacedBy links
      - latest-value slices
Sample update script

# first clear the existing latest-assessment links ...
DELETE {
  ?bw bwq:latestComplianceAssessment ?o .
}
WHERE {
  ?bw bwq:latestComplianceAssessment ?o .
} ;

# ... then link each bathing water to its observation in the most
# recent compliance-by-year slice
INSERT {
  ?bw bwq:latestComplianceAssessment ?o .
}
WHERE {
  {
    # the slice whose year no other slice exceeds, i.e. the latest
    ?slice a bwq:ComplianceByYearSlice ;
           bwq:sampleYear [ interval:ordinalYear ?year ] .
    OPTIONAL {
      ?slice2 a bwq:ComplianceByYearSlice ;
              bwq:sampleYear [ interval:ordinalYear ?year2 ] .
      FILTER (?year2 > ?year)
    }
    FILTER ( !bound(?slice2) )
  }
  ?slice qb:observation ?o .
  ?o bwq:bathingWater ?bw .
}
Automation
Transform and publish data feed increments
• transformation engine service ✓
• reusable mappings, low cost to adapt to new feeds ✓
• linking to reference data ✓
• publication service that supports non-monotonic changes ✓

[Diagram as on the earlier Automation slide: CSV increments → transform service (xform specs, reconciliation against reference data) → publication service → replicated publication servers]
Application to case study
• Update server
   - transforms based on scripts (earlier scripting utility)
   - linking to reference data
   - distributed publication via SPARQL Update
   - extensible range of data sets
      - annual assessments
      - in-season assessments
      - bathing water profiles
      - features (e.g. pollution sources)
      - reference data
From pilot to practice
• reduce modelling costs
   - patterns
   - reuse
• handling change and update
   - patterns
   - publication process
• automation
   - conversion
   - publication
• embed in the business process   (dive 4)
   - use internally as well as externally
   - publish once, use many
   - data platform
Embed in business process
• embedding is critical to ensure the data is kept up to date
• which in turn needs usage
=> lower the barrier to use

[Diagram: two feedback loops. Vicious circle: data not used → hard to justify investment → data goes stale → data not used. Virtuous circle: rich, up-to-date data → internal and external use → investment → rich, up-to-date data]
Lowering the barrier to use
• simple REST APIs
   - use the Linked Data API specification
   - rich query without learning SPARQL
   - easy consumption as JSON, XML
   - gets developers used to the data and the data model

[Diagram: transform service feeding a publication service, fronted by an LD API layer]
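As a usage illustration (the exact URL pattern here is hypothetical): the LD API front end maps a simple parameterized request such as GET http://environment.data.gov.uk/doc/bathing-water.json?_pageSize=10 onto a SPARQL query against the publication service, returning plain JSON or XML so that consumers never have to write SPARQL themselves.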
Application to case study
• embedded in the process for weekly/daily updates
• infrastructure to automate conversion and publishing
• API plus extensive developer documentation
• third-party and in-house applications built over the API

• publish once, use many
• information products as applications over a data platform, usable externally as well as internally
The next stage
• grow the range of data publications and uses
• a growing range of reference data and datasets brings new challenges
   - discover reference terms and models to reuse
   - discover datasets to use for an application
   - discover models and links between sets
• needs a coordination or registry service
• a story for another day ...
Conclusions
• illustrated how public sector users of linked data are moving from static pilots to operational systems
• the keys are:
   - reduce modelling costs through patterns and reuse
   - design for continuous update
   - automate publication using declarative mappings and SPARQL Update
   - lower the barrier to use through API design and documentation
   - embed in the organization's process so the data is used and useful
Acknowledgements
Only possible thanks to many smart colleagues: Stuart Williams, Andy Seaborne, Ian Dickinson, Brian McBride, Chris Dollin; plus Alex Coley and team from the Environment Agency.
