Building a semantic integration framework to support a federated query environment in 5 steps

Philip Ashworth, UCB Celltech
Dean Allemang, TopQuadrant

[Photo: Nele, living with lupus]
Data Integration… Why?

 The scope and knowledge of life sciences expands every day

 Every day we make new discoveries by experimenting (in the lab)

 Data is generated in the lab in large quantities, complementing the vast
 growth of external data

 It is too difficult and time-consuming for the user to bring data together

 Therefore we don't often make use of the data we already have to
 make new discoveries
Data Integration… Problems


[Diagram: how data integration typically evolves. First, a single application DB handles registration and query. Then multiple application DBs are integrated into project DBs that applications query. Finally, application DBs feed a warehouse DB that serves project marts.]
Data Integration… Problems

 Demand for DI increases every day

 Data doesn't evolve into a larger, more beneficial platform
 • Where is the long-term benefit?
 • We are driving ourselves around in circles

 We are just creating more data silos
 • Limited scope for reuse

 Slow and difficult to modify or enhance

 High maintenance
 • Multiple systems create more and more overhead
Data Integration… Thoughts




 Data integration is clearly evolving

 But it is not fulfilling our needs

 If we identify those needs… can we see what we should be doing?
Data Integration… Needs


[Diagram: the needs: all data for all projects; accessible data; true integration; aligned concepts; data with context; a variety of sources.]
Data Integration… There is a way!

 The Linked Open Data Cloud: connected and linked data with context

 Created by a community

 A valuable resource that will only grow, and something we can learn from!

 Significant scientific content, and significant linking hubs appearing
Data Integration… Starting an Evolutionary Leap




 No one internally really knows about this

 We can't just rip and replace old systems

 We have to do some groundwork
Linked Data… The Quest


 Technology Projects
 • Emphasis on semantic web principles




 Scientific Projects
 • Data Integration
 • Data Visualisation (mash-ups)
Linked Data… The Quest




[Diagram: both quests turn out to be highly promiscuous and repetitive.]
Linked Data


 New approach

 Develop a POC semantic data integration framework
 • Easy to configure
 • Supports all projects
 • Builds an environment for the future
The Idea



[Diagram: the layered architecture. From bottom to top: Data Sources (an RDBMS such as Oracle, PostgreSQL or MySQL behind a SPARQL endpoint; a native RDF triple store; MS Excel/TXT/Doc files behind a SPARQL endpoint); an RDF layer; the Semantic Integration Framework (knowledge collation, concept mapping, distributed query, result inference, aggregation); REST services as an abstraction layer; business process / workflow automation; and Applications, with a PURL server alongside. Moving up the stack, ease of development increases and the knowledge of semantic technologies required decreases.]
RDF
Step 1: Data Sources

 Expose data as RDF through SPARQL endpoints
 Internal data sources
 • D2R SPARQL endpoints on RDBMS databases (see the query sketch after the
   diagram below)
  • Each modelled using the local concepts it represents
  • Don't worry about the larger concept picture
 • Virtuoso RDF triple store (open source) to host RDF data created from
   spreadsheets
 • TopBraid Ensemble & SPARQLMotion/SPIN scripts to convert static data to
   RDF

[Diagram: internal sources: an RDBMS exposed via a D2R SPARQL endpoint, and a Virtuoso triple store exposing its own SPARQL endpoint.]
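As a flavour of what these endpoints enable, a minimal query sketch in Python with SPARQLWrapper. The endpoint URL and the db1 vocabulary are hypothetical; D2R generates its own mapping vocabulary per database:

```python
# Minimal sketch: query a (hypothetical) D2R endpoint for users named "phil".
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://db1.example.ucb.com/sparql")  # hypothetical URL
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX db1: <http://db1.example.ucb.com/vocab/>
    SELECT ?user ?name WHERE {
        ?user a db1:User ;
              db1:name ?name .
        FILTER(CONTAINS(LCASE(?name), "phil"))
    }
""")
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["user"]["value"], row["name"]["value"])
```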
RDF
Step 1: Data Sources

External data sources
• SPARQL endpoints in the LOD cloud from Bio2RDF, LODD and others
• Some stability, access and quality issues exist within these sources
• Created an Amazon cloud server to host stable environments
• Bio2RDF sources downloaded, stored and modified
• Virtuoso (open source) used as the triple store




[Diagram: the UCB data cloud (MOC, NBE, NBE WH, NBE Mart, LDAP, ITrack, Premier, Abysis, PEP, IDAC, WKW, SEQ, PMT, UCB PDB) linked into the Linked Open Data cloud (Bio2RDF: ChEBI, PDB, GeneID, KEGG drug/glycan/compound; SIDER; Diseasome; UniProt; EC).]
Step 2: Integration Framework

 Why?
 •   In Linked Open Data, links within a source are manually created
 •   To navigate the cloud you either
      • Learn the network, or
      • Discover the network as you go (unguided)

 •   Nothing understands the total connectivity of the concepts available to you
      • Difficult to know where to start
      • No idea whether a start point will lead you to the information you are
        looking for or might be interested in
      • Can't query the cloud for specific information

 The Integration Framework will resolve these issues
 •   It will model the models to understand the connectivity

 You shouldn't have to know where to look for data
Step 2: Integration Framework



[Diagram: the layered architecture again, annotated with what the framework must do: understand UCB concepts; understand how UCB concepts fit with source concepts; understand links across sources; understand the data sources (concepts, access, properties); be easy to wire up; automate some tasks; and be accessible via services.]
Step 2: Integration Framework                                Sem Int Framework

 Integration Framework
 • A data source, concept and property registry
 • An ontology that utilises
     • VoID (enhanced) to capture data source information (endpoints)
     • SKOS to link local ontologies with UCB concepts
       • UCB:Person -> db1:user, db2:employee, db3:actor

 Built using the TopBraid Suite
 • Ontology development (TopBraid Composer)
 • SPARQLMotion scripts to provide some automation
   • Creation of ontologies from endpoints and D2R mappings
   • Configuration assistance
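To make the registry concrete, a minimal sketch of the VoID + SKOS ontology built with rdflib. All URIs, and the choice of skos:narrowMatch for the concept mapping, are illustrative assumptions; the actual UCB ontology is not shown in the deck:

```python
# A minimal sketch of the registry, built with rdflib.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

UCB  = Namespace("http://example.ucb.com/concept/")    # hypothetical
DB1  = Namespace("http://db1.example.ucb.com/vocab/")  # hypothetical
VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
g.bind("skos", SKOS)
g.bind("void", VOID)

# VoID: register DB1 as a dataset and record its SPARQL endpoint
db1 = URIRef("http://example.ucb.com/dataset/db1")
g.add((db1, RDF.type, VOID.Dataset))
g.add((db1, VOID.sparqlEndpoint, URIRef("http://db1.example.ucb.com/sparql")))

# SKOS: map the enterprise concept UCB:Person onto the source-local db1:User
g.add((UCB.Person, RDF.type, SKOS.Concept))
g.add((UCB.Person, SKOS.narrowMatch, DB1.User))

print(g.serialize(format="turtle"))
```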
Step 2: Integration Framework                                Sem Int Framework

[Diagram: the registry for a single source. The UCB Concept Ontology (SKOS) maps UCB:Person to DB1:User, UCB:Antibody to DB1:Antibody, and UCB:Project to DB1:Project, while the Dataset Ontology (VoID) describes DB1 itself.]
Step 2: Integration Framework                                Sem Int Framework

[Diagram: one concept across several sources. UCB:Person maps to DB1:User, DB2:Person, DB3:Employee and DB3:Contact; the Dataset Ontology (VoID) describes DB1, DB2 and DB3.]
Step 2: Integration Framework                                Sem Int Framework

[Diagram: linksets added. VoID linksets Person_DB1_DB2 and Person_DB1_DB3 record which pairs of datasets hold linked Person resources, alongside the SKOS mappings from UCB:Person to DB1:User, DB2:Person, DB3:Employee and DB3:Contact.]
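The linksets themselves are plain VoID. A hedged sketch of what Person_DB1_DB2 might look like, built with rdflib; the registry namespace and the use of owl:sameAs as the link predicate are assumptions (the deck does not say which predicate links the Person resources):

```python
# Sketch of a VoID linkset declaration like Person_DB1_DB2.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, OWL

VOID = Namespace("http://rdfs.org/ns/void#")
REG  = Namespace("http://example.ucb.com/registry/")  # hypothetical

g = Graph()
ls = REG.Person_DB1_DB2
g.add((ls, RDF.type, VOID.Linkset))
g.add((ls, VOID.subjectsTarget, URIRef("http://example.ucb.com/dataset/db1")))
g.add((ls, VOID.objectsTarget,  URIRef("http://example.ucb.com/dataset/db2")))
g.add((ls, VOID.linkPredicate,  OWL.sameAs))  # assumed link predicate
```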
Step 2: Integration Framework                                Sem Int Framework

[Diagram: the full picture. Many UCB concepts in the SKOS ontology map onto many dataset-local concepts described by the VoID ontology.]
Step 3: Rest Services                                   Rest Services


 Rest Services
 • The interaction point for applications

 • Expose simple, generic access to the Integration Framework

 • Remove the complexity of the framework and of how to ask questions of it
   • You don't need to know how to make it work

 • You don't need to know anything about the datasets, or the concepts and
   properties held within them

 • Just ask simple questions in the UCB language
   • Tell me about UCB:Person "ashworth"

 • Built using SPARQLMotion/SPIN and exposed in the TopBraid Live enterprise
   server

 • Two simple yet very effective services were created (sketched below)
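Illustrative only: how an application might call the two services from Python. The host, paths, parameter names and response shapes are hypothetical, since the deck does not specify the service contracts:

```python
# Hypothetical client for the two REST services.
import requests

BASE = "http://tbl.example.ucb.com/services"  # hypothetical TopBraid Live host

# Keyword Search: find resources matching a term for a given UCB concept
hits = requests.get(f"{BASE}/keywordSearch",
                    params={"concept": "UCB:Person", "q": "ashworth"}).json()

# Get Info: retrieve everything known about one returned resource
info = requests.get(f"{BASE}/getInfo",
                    params={"resource": hits[0]["uri"]}).json()
print(info)
```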
Step 3: Rest Services                                                            Rest Services


[Diagram: the "Keyword Search" service in action. A user asks: Find UCB:Person "phil". The service asks the UCB Concept Ontology (SKOS) for the sub-types of UCB:Person, asks the Dataset Ontology (VoID) which datasets hold those sub-types (and whether the linksets can tell us anything), then searches DB1:User, DB2:Person, DB3:Employee and DB3:Contact. It returns the resources for "phil": ldap:U0xx10x, itrack:101, moc:scordisp etc.]
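Putting the registry to work, a hedged sketch of the fan-out the diagram describes. It simplifies the real VoID modelling to a hypothetical reg:endpoint property pointing from each local concept to its SPARQL endpoint:

```python
# Sketch (not the production code) of the keyword-search fan-out.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS
from SPARQLWrapper import SPARQLWrapper, JSON

UCB = Namespace("http://example.ucb.com/concept/")   # hypothetical
REG = Namespace("http://example.ucb.com/registry/")  # hypothetical

registry = Graph().parse("registry.ttl")  # the VoID + SKOS registry from Step 2

def keyword_search(concept, term):
    hits = []
    # 1. SKOS: find the source-local sub-types of the UCB concept
    for local in registry.objects(concept, SKOS.narrowMatch):
        # 2. Registry: find the SPARQL endpoint serving that local concept
        for ep in registry.objects(local, REG.endpoint):  # assumed property
            sparql = SPARQLWrapper(str(ep))
            sparql.setReturnFormat(JSON)
            sparql.setQuery(f"""
                SELECT ?r WHERE {{
                    ?r a <{local}> ; ?p ?v .
                    FILTER(isLiteral(?v) &&
                           CONTAINS(LCASE(STR(?v)), "{term.lower()}"))
                }}""")
            for b in sparql.query().convert()["results"]["bindings"]:
                hits.append(b["r"]["value"])
    return hits

print(keyword_search(UCB.Person, "phil"))
```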
Step 3: Rest Services                                                        Rest Services


[Diagram: the "Get Info" service in action. A user asks: Tell me about moc:scordisp. The service asks the UCB Concept Ontology (SKOS) for the super-types of all resources, asks the Dataset Ontology (VoID) to tell it everything about this resource, then retrieves DB1:U0xx10x, DB2:scordisp and DB3:philscordis, and replies: here is everything I know about it.]
Step 4: Building an Application 1                        Applications




 Data exploration environment
 • Search concepts
 • Display data
 • Allow link following
 • Deals with any concept defined in the UCB SKOS vocabulary
 • Uses the two framework services described previously

 • Deployed in TopBraid Ensemble – Live
Step 4: Data Exploration                  Applications




[Screenshot: a search over the UCB concepts is submitted to the "Keyword Search" service.]
Step 4: Data Exploration                 Applications




[Screenshot: results displayed. The index shows that inference is already taking place.]
Step 4: Data Exploration                Applications




[Screenshot: dragging an instance to the basket initiates a "Get Info" service call.]
Step 4: Data Exploration                     Applications




[Screenshot: selecting an instance displays its data per source.]
Step 4: Data Exploration                         Applications




[Screenshot: links to other data items.]
Step 4: Data Exploration           Applications




[Screenshot: sparse data is displayed; submitting the instance to the "Get Info" service retrieves more.]
Step 4: Data Exploration            Applications




[Screenshot: more detailed information.]
Step 4: Data Exploration                    Applications




[Screenshot: he has another interaction. Let's explore.]
Step 4: Data Exploration   Applications
Step 4: Data Exploration                          Applications




[Screenshot: data cached as we navigated the Concept Explorer can now be investigated.]
Step 4: Data Exploration                                          Applications




[Screenshot: internal and external data integrated. A keyword search on the Structure concept pulls data from internal and external sources; after the detailed information is retrieved, a second structure is identified without a keyword search and added to the basket.]
Step 4: Data Exploration   Applications
Step 4: Building an Application 2                      Applications




 Federated data gathering & marting
 • Data marting without the warehouse
 • A new Mart REST service
   • SPARQLMotion/SPIN scripts
   • Dump_UCB:Antibody
 • Still uses the framework to integrate data
   • On-the-fly data integration
   • Gathers RDF from the data sources
 • Dumps the results into tables
 • Data consumed by traditional query tools
 • Not particularly designed for this aspect… (slow)
   • But it works! (a sketch of the idea follows)
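A sketch of the idea under stated assumptions: a hypothetical dump service flattens the gathered RDF to rows, which are loaded into a relational table for traditional query tools. The service URL and the row shape are illustrative:

```python
# Sketch of "data marting without the warehouse".
import sqlite3
import requests

# Hypothetical dump service; assumed to return rows like
# [{"uri": ..., "label": ..., "source": ...}, ...]
rows = requests.get("http://tbl.example.ucb.com/services/dump",
                    params={"concept": "UCB:Antibody"}).json()

con = sqlite3.connect("antibody_mart.db")
con.execute("CREATE TABLE IF NOT EXISTS antibody (uri TEXT, label TEXT, source TEXT)")
con.executemany("INSERT INTO antibody VALUES (:uri, :label, :source)", rows)
con.commit()
```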
Step 4: Building an Application 3                           Applications




 Knowledge Base creation
 • Gathering information can be a time-consuming exercise
  • But it is vital for projects to have
  • Different individuals have different ideas
    • Relevance, sources, presentation, etc.
 • A Knowledge Base provides consistency for
  • The data gathered
  • The data sources used
  • The data presentation
 • ROI
  • A 150-fold increase in efficiency
    • 6 minutes compared to more than 16 hours (spread over several weeks)
  • Information available to all at a central access point
Step 4: Knowledge Base                                       Applications




[Diagram: a Knowledge Base request. "Tell me about the protein with Gene ID X", where the user wants literature references, sequences, descriptions, structures and so on. An App Service orchestrates the "Keyword Search" and "Get Info" services over the Semantic Integration Framework and its data sources.]
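A hedged sketch of how such an App Service might compose the two Step 3 services into one knowledge-base record; the URLs, parameters and response shapes are the same hypothetical ones used in the earlier sketch:

```python
# Hypothetical App Service composition of Keyword Search and Get Info.
import requests

BASE = "http://tbl.example.ucb.com/services"  # hypothetical TopBraid Live host

def knowledge_base_entry(gene_id):
    record = {}
    hits = requests.get(f"{BASE}/keywordSearch",
                        params={"concept": "UCB:Protein", "q": gene_id}).json()
    for hit in hits:
        info = requests.get(f"{BASE}/getInfo",
                            params={"resource": hit["uri"]}).json()
        # Keep only the facets the Knowledge Base standardises on
        for facet in ("literature", "sequence", "description", "structure"):
            record.setdefault(facet, []).extend(info.get(facet, []))
    return record
```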
Step 4: Knowledge Base   Applications
Step 4: Knowledge Base   Applications
Step 4: Knowledge Base   Applications
Step 4: Knowledge Base   Applications
Step 4: Knowledge Base   Applications
PURL
Step 5: PURL Server

 Removing URL dependencies
 D2R publishes resolvable URLs that are specific to the server hosting them
 A PURL server removes that URL specificity
 It allows each layer of the architecture to be swapped out without all the
 others having to be reconfigured
 • A level of independence / indirection (sketched below)

 Only done on a limited scale
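A minimal sketch of the indirection, assuming nothing beyond the Python standard library: a PURL resolver that redirects stable identifiers to whichever host currently serves the resource. The mapping and hosts are hypothetical:

```python
# Toy PURL resolver: stable prefixes redirect to the current hosting server.
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGETS = {"/purl/db1/": "http://d2r-host-02.example.ucb.com/resource/"}  # hypothetical

class PurlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        for prefix, target in TARGETS.items():
            if self.path.startswith(prefix):
                self.send_response(302)  # redirect to the current host
                self.send_header("Location", target + self.path[len(prefix):])
                self.end_headers()
                return
        self.send_error(404)

HTTPServer(("", 8080), PurlHandler).serve_forever()
```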
Conclusions & Business Value

 We have built an extensible data integration framework

 • Shown that data integration can be an incremental process
  • Started with three datasets; more than 20 a few months later
  • By comparison, the warehouse took 18 months to add two new data sources
  • Adding a new source can take less than a day (the whole process, including
    endpoint creation)
  • Creates an enterprise-wide "data fabric" rather than just one more
    application

 • Datasets connect together the way web pages do
  • Literally click from one dataset to the other
  • Dynamically mash up data from multiple sources
  • Add new sources by describing the connections, not by building a new
    application
Conclusions & Business Value

 We have built a framework that

 • Differs from data integration applications the way the Web
   differs from earlier network technologies (FTP, Archie)
  • The infrastructure allows new entities (pages, databases) to be added
    dynamically
  • Adding connections is as easy as specifying them

 • Provides data for all projects
  • Three very different applications have been demonstrated
  • All are able to use the same framework
  • Reuse
Questions?




