1. Building a semantic integration framework to support a federated query environment in 5 steps
Philip Ashworth UCB Celltech
Dean Allemang TopQuadrant
2. Data Integration… Why?
The scope and knowledge of the life sciences expands every day
Every day we make new discoveries by experimenting (in the lab)
Data is generated in the lab in large quantities, complementing the vast growth externally
It is too difficult and time-consuming for the user to bring data together
Therefore we often don't make use of the data we already have to make new discoveries
3. Data Integration… Problems
[Diagram: the evolution of data integration — applications with their own app DBs (registration and query); DI and query across multiple app DBs; app DBs feeding a project DB; and finally app DBs feeding a warehouse DB with project marts]
4. Data Integration… Problems
Demand for DI increases every day
Data doesn't evolve into a larger, more beneficial platform
• Where is the long-term benefit?
• We are driving ourselves around in circles
We are just creating more data silos
• Limited scope for reuse
Slow and difficult to modify or enhance
High maintenance
• Multiple systems create more and more overhead
5. Data Integration… Thoughts
Data integration is clearly evolving
But it is not fulfilling our needs
If we identify the needs… can we see what we should be doing?
6. Data Integration… Needs
All Data for All Projects
Accessible Data
True Integration
Aligned Concepts
Data has Context
Variety of Sources
7. Data Integration… There is a way!
The Linked Open Data Cloud
Connected and linked data with context
Created by a community
A valuable resource that will only grow!
Something we can learn from!
Significant scientific content
Significant linking hubs appearing
8. Data Integration… Starting an Evolutionary Leap
No one internally really knows about this
We can't just rip and replace the old systems
We have to do some groundwork
9. Linked Data…The Quest
Technology Projects
• Emphasis on semantic web principles
Scientific Projects
• Data Integration
• Data Visualisation (mash-ups)
11. Linked Data
New approach
Develop a POC semantic data integration framework
• Easy to configure
• Supports all projects
• Builds an environment for the future
12. The Idea
[Diagram: the layered architecture. From top to bottom — Applications; Business Process / Workflow Automation; PURL; REST Services (abstraction layer); Semantic Integration Framework (knowledge collation, concept mapping, distributed query, result inference, aggregation); RDF, exposed through native SPARQL endpoints and mapped SPARQL endpoints; Data Sources (RDBMS such as Oracle, Postgres and MySQL; RDF triple store; MS Excel, TXT and Doc files). Moving up the stack, ease of development increases and the knowledge of semantic technologies required decreases]
13. RDF
Step 1: Data Sources
Expose data as RDF through SPARQL endpoints
Internal data sources
• D2R SPARQL endpoints on RDBMS databases
• Each modelled as the local concepts that they represent
• Don't worry about the larger concept picture
• Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets
• TopBraid Ensemble and SPARQLMotion/SPIN scripts to convert static data to RDF
[Diagram: SPARQL endpoints — D2R over the RDBMS, and Virtuoso]
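For illustration, a client query against one of these D2R endpoints is just ordinary SPARQL; the namespace, class and property names below are hypothetical, not UCB's actual ones:

```sparql
# Hypothetical query against a D2R endpoint that exposes an RDBMS
# "users" table as instances of a local db1:User class.
PREFIX db1: <http://example.org/db1/vocab/>

SELECT ?user ?name
WHERE {
  ?user a db1:User ;       # rows of the users table, mapped by D2R
        db1:name ?name .   # a column, mapped to a local property
}
LIMIT 10
```

The point of "local concepts" is visible here: the query speaks the source's own vocabulary, with no knowledge of the wider concept picture required.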
14. RDF
Step 1: Data Sources
External data sources
• SPARQL endpoints in the LOD cloud from Bio2RDF, LODD and others
• Some stability, access and quality issues within these sources
• Created an Amazon cloud server to host stable environments
• Bio2RDF sources downloaded, stored and modified
• Virtuoso (open source) used as the triple store
[Diagram: the UCB Data Cloud — internal sources (MOC, NBE, NBE LDAP, WH Mart, ITrack, PDB Premier, Abysis, IDAC, WKW, SEQ, PMT) linked into the Linked Open Data Cloud via Bio2RDF sources (ChEBI, GeneID, PDB, KEGG (cpd, dr, gl, ec), UniProt, SIDER, PEP, Diseasome)]
15. Step 2: Integration Framework: Why?
• Linked Open Data: links within a source are manually created
• To navigate the cloud you either
• Learn the network
• Discover the network as you go (unguided)
• There is nothing that understands the total connectivity of the concepts available to you
• Difficult to know where to start
• No idea if a start point will lead you to the information you are looking for or might be interested in
• Can't query the cloud for specific information
The Integration Framework will resolve these issues
• It will model the models to understand the connectivity
You shouldn't have to know where to look for data
16. Step 2: Integration Framework
[Diagram: the layered architecture annotated with what the framework must do — understand UCB concepts; understand how UCB concepts fit with source concepts; make links across sources easy to wire up; automate some tasks; understand the data sources (concepts, access, properties); be accessible via services]
17. Step 2: Integration Framework
Integration Framework
• Data source, concept and property registry
• An ontology that utilises
• VoID (enhanced) to capture data source information (endpoints)
• SKOS to link local ontologies with UCB concepts
• UCB:Person -> db1:user, db2:employee, db3:actor
Built using the TopBraid Suite
• Ontology development (TopBraid Composer)
• SPARQLMotion scripts to provide some automation
• Creation of ontologies from endpoints and D2R mappings
• Configuration assistance
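A minimal sketch of what one registry entry could look like, assuming standard VoID and SKOS terms; all URIs, and the choice of skos:closeMatch as the mapping property, are illustrative:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ucb:  <http://example.org/ucb/concepts/> .   # hypothetical
@prefix db1:  <http://example.org/db1/vocab/> .      # hypothetical
@prefix reg:  <http://example.org/registry/> .       # hypothetical

# VoID: where the source lives and which classes it holds
reg:DB1 a void:Dataset ;
    void:sparqlEndpoint <http://example.org/db1/sparql> ;
    void:classPartition [ void:class db1:User ] .

# SKOS: how a local class lines up with the UCB concept
ucb:Person a skos:Concept ;
    skos:closeMatch db1:User .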
18. Step 2: Integration Framework
[Diagram: the UCB Concept Ontology (SKOS) mapped to the Dataset Ontology (VoID) for a single source — UCB:Person to DB1:User, UCB:Antibody to DB1:Antibody, UCB:Project to DB1:Project, all within dataset DB1]
19. Step 2: Integration Framework
[Diagram: the same mapping across several sources — UCB:Person maps to DB1:User, DB2:Person, DB3:Employee and DB3:Contact, in datasets DB1, DB2 and DB3]
20. Step 2: Integration Framework
[Diagram: linksets added between the datasets — Person_DB1_DB2 and Person_DB1_DB3 record which sources hold co-referring instances of UCB:Person (DB1:User, DB2:Person, DB3:Employee, DB3:Contact across DB1, DB2 and DB3)]
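In VoID terms, a linkset such as Person_DB1_DB2 could be recorded along these lines (a sketch; the link predicate and URIs are assumptions, not UCB's actual modelling):

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix reg:  <http://example.org/registry/> .   # hypothetical

# Declares that DB1 and DB2 hold co-referring Person instances,
# connected by owl:sameAs links.
reg:Person_DB1_DB2 a void:Linkset ;
    void:subjectsTarget reg:DB1 ;
    void:objectsTarget  reg:DB2 ;
    void:linkPredicate  owl:sameAs .
```

Recording linksets as data is what lets the framework answer "can the linksets tell us any info?" at query time rather than hard-wiring the joins.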
22. Step 3: REST Services
REST services
• The interaction point for applications
• Expose simple, generic access to the Integration Framework
• Remove the complexity of the framework and of how to ask questions of it
• You don't need to know how to make it work
• You don't need to know anything about the datasets or the concepts and properties held within
• Just ask simple questions in the UCB language
• Tell me about UCB:Person "ashworth"
• Built using SPARQLMotion/SPIN and exposed in the TopBraid Live enterprise server
• Two simple yet very effective services created
23. Step 3: REST Services
[Diagram: the "Keyword Search" service. Request: find UCB:Person "phil". Response: here are the resources for "phil" — ldap:U0xx10x, itrack:101, moc:scordisp, etc. Internally the service asks the registry: tell me the sub-types of UCB:Person; can the linksets tell us any info?; tell me the datasets for the sub-types. It then searches DB1:User in DB1, DB2:Person in DB2, and DB3:Employee and DB3:Contact in DB3]
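The registry lookup behind this fan-out can be pictured as a single query over the VoID and SKOS ontologies; a sketch only, assuming illustrative URIs and a skos:narrower / skos:closeMatch modelling of sub-types and local classes:

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ucb:  <http://example.org/ucb/concepts/>   # hypothetical

# Which local classes realise UCB:Person (or a sub-type of it),
# and which endpoint should be searched for each?
SELECT ?localClass ?endpoint
WHERE {
  ucb:Person skos:narrower* ?subtype .
  ?subtype skos:closeMatch ?localClass .
  ?dataset void:classPartition [ void:class ?localClass ] ;
           void:sparqlEndpoint ?endpoint .
}
```

Each result row is one search the service then issues against a source endpoint.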
24. Step 3: REST Services
[Diagram: the "Get Info" service. Request: tell me about moc:scordisp. Response: here is everything I know about it. Internally the service asks the registry: tell me everything about this resource; tell me the super-types of all resources. It then retrieves DB1:U0xx10x from DB1, DB2:scordisp from DB2 and DB3:philscordis from DB3]
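The fan-out a "Get Info" call performs can be pictured as one federated SPARQL 1.1 query (a sketch only — the endpoint URLs and resource URIs are placeholders, and the actual services are SPARQLMotion scripts rather than a single query):

```sparql
SELECT ?property ?value ?source
WHERE {
  {
    # everything DB1 knows about the co-referring resource
    SERVICE <http://example.org/db1/sparql> {
      <http://example.org/db1/id/U0xx10x> ?property ?value .
    }
    BIND ("DB1" AS ?source)
  }
  UNION
  {
    # everything DB2 knows about it
    SERVICE <http://example.org/db2/sparql> {
      <http://example.org/db2/id/scordisp> ?property ?value .
    }
    BIND ("DB2" AS ?source)
  }
}
```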
25. Step 4: Building an Application 1
Data exploration environment
• Search concepts
• Display data
• Allow link following
• Deals with any concept defined in the UCB SKOS language
• Uses the two framework services mentioned previously
• Deployed in TopBraid Ensemble – Live
26. Step 4: Data Exploration
[Screenshot: UCB concepts; the search is submitted to the "Keyword Search" service]
27. Step 4: Data Exploration
[Screenshot: results displayed; the index shows that inference is already taking place]
28. Step 4: Data Exploration
[Screenshot: dragging an instance to the basket initiates a "Get Info" service call]
29. Step 4: Data Exploration
[Screenshot: selecting an instance displays its data per source]
30. Step 4: Data Exploration
[Screenshot: links to other data items]
31. Step 4: Data Exploration
[Screenshot: sparse data displayed; the instance is submitted to the "Get Info" service]
32. Step 4: Data Exploration
[Screenshot: more detailed information]
33. Step 4: Data Exploration
[Screenshot: he has another interaction; let's explore]
35. Step 4: Data Exploration
[Screenshot: data was cached as we navigated the Concept Explorer and can now be investigated]
36. Step 4: Data Exploration
[Screenshot: integrated internal and external data for the Structure concept. A keyword search pulls data from internal and external data sources; after the detailed information is retrieved, a second Structure is identified without a keyword search and can be added to the basket]
38. Step 4: Building an Application 2
Federated data gathering and marting
• Data marting without the warehouse
• New "Mart" REST service
• SPARQLMotion/SPIN scripts
• Dump_UCB:Antibody
• Still uses the framework to integrate data
• On-the-fly data integration
• Gathers RDF from the data sources
• Dumps it into tables
• Data consumed by traditional query tools
• Not particularly designed for this aspect… (slow)
• But it works!
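One way to picture the gathering step: a CONSTRUCT query pulls everything known about each antibody into a single RDF graph, which is then flattened into mart tables (a sketch under assumed URIs; Dump_UCB:Antibody itself is a SPARQLMotion/SPIN script not shown here):

```sparql
PREFIX ucb: <http://example.org/ucb/concepts/>   # hypothetical

# Gather all properties of every antibody the framework can reach;
# the resulting graph is what gets dumped into tables.
CONSTRUCT { ?antibody ?property ?value }
WHERE {
  ?antibody a ucb:Antibody ;
            ?property ?value .
}
```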
39. Step 4: Building an Application 3
Knowledge base creation
• Gathering information can be a time-consuming exercise
• But it is vital for projects to have
• Different individuals have different ideas
• Relevance, sources, presentation, etc.
• A knowledge base provides consistency for
• The data gathered
• The data sources used
• The data presentation
• ROI
• 150-fold increase in efficiency
• 6 minutes compared to more than 16 hours (spread over several weeks)
• Information available to all at a central access point
40. Step 4: Knowledge Base
[Diagram: "Tell me about the protein with Gene ID X, and I want to know about literature references, sequences, descriptions, structures… etc." The app calls the Keyword Search and Get Info services, which drive the Semantic Integration Framework over the data sources]
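In spirit, the knowledge-base request above amounts to resolving the gene in one source and following the framework's links into the others; a hand-written federated sketch, in which every URI and property name is illustrative rather than taken from the actual sources:

```sparql
PREFIX ucb: <http://example.org/ucb/concepts/>   # hypothetical

SELECT ?protein ?description ?reference
WHERE {
  # resolve the Gene ID in one source...
  SERVICE <http://example.org/geneid/sparql> {
    ?gene ucb:geneId "X" ;
          ucb:encodes ?protein .
  }
  # ...then gather descriptions and literature references elsewhere
  SERVICE <http://example.org/uniprot/sparql> {
    ?protein ucb:description ?description ;
             ucb:citation ?reference .
  }
}
```

The framework's value is that the application never writes such a query itself; it asks the two services in the UCB language and the fan-out happens behind the abstraction layer.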
46. PURL
Step 5: PURL Server
Removing URL dependencies
D2R publishes resolvable URLs that are specific to the server
Removing URL specificity with a PURL server
Allows each layer of the architecture to be replaced without all the others having to be reconfigured
• A level of independence / indirection
Only done on a limited scale
47. Conclusions & Business Value
We have built an extensible data integration framework
• Shown that data integration can be an incremental process
• Started with three datasets; more than 20 a few months later
• Compare: the warehouse took 18 months to add two new data sources
• Adding a new source can take less than a day (the whole process, including endpoint creation)
• Creates an enterprise-wide "data fabric" rather than just one more application
• Datasets connect together the way web pages fit together
• Literally click from one dataset to the other
• Dynamically mash up data from multiple sources
• Add new sources by describing the connections, not by building a new application
48. Conclusions & Business Value
We have built a framework that
• Differs from data integration applications the way the Web differs from earlier network technologies (FTP, Archie)
• The infrastructure allows new entities (pages, databases) to be added dynamically
• Adding connections is as easy as specifying them
• Provides data for all projects
• Three very different applications have been demonstrated
• All are able to use the same framework
• Reuse