1. Building a semantic integration framework to support a federated query environment in 5 steps
Philip Ashworth UCB Celltech
Dean Allemang TopQuadrant
2. Data Integration… Why?
The scope and knowledge of the life sciences expands every day
Every day we make new discoveries by experimenting (in the lab)
Data is generated in the lab in large quantities, complementing the vast growth externally
It is too difficult and time-consuming for the user to bring data together
Therefore we often don't make use of the data we already have to make new discoveries
3. Data Integration… Problems
[Diagram: the evolution of data integration — applications with their own app DBs (registration and query); DI and query across multiple app DBs; app DBs feeding a project DB; and finally app DBs feeding a warehouse DB with project marts]
4. Data Integration… Problems
Demand for DI increases every day
Data doesn't evolve into a larger, more beneficial platform
• Where is the long-term benefit?
• We are driving ourselves around in circles
We are just creating more data silos
• Limited scope for reuse
Slow and difficult to modify or enhance
High maintenance
• Multiple systems create more and more overhead
5. Data Integration… Thoughts
Data integration is clearly evolving
But it is not fulfilling our needs
If we identify the needs… can we see what we should be doing?
6. Data Integration… Needs
All Data for All Projects
Accessible Data
True Integration
Aligned Concepts
Data has Context
Variety of Sources
7. Data Integration… There is a way!
The Linked Open Data Cloud
Connected and linked data with context
Created by a community
A valuable resource that will only grow!
Something we can learn from!
Significant scientific content
Significant linking hubs appearing
8. Data Integration… Starting an Evolutionary Leap
No one internally really knows about this
We can't just rip and replace the old systems
We have to do some groundwork
9. Linked Data…The Quest
Technology Projects
• Emphasis on semantic web principles
Scientific Projects
• Data Integration
• Data Visualisation (mash-ups)
11. Linked Data
New approach
Develop a POC semantic data integration framework
• Easy to configure
• Supports all projects
• Builds an environment for the future
12. The Idea
[Diagram: the layered architecture. From top to bottom — Applications; Business Process / Workflow Automation; PURL; REST Services (abstraction layer); Semantic Integration Framework (knowledge collation, concept mapping, distributed query, result inference, aggregation); RDF, exposed through native SPARQL endpoints and mapped SPARQL endpoints; Data Sources (RDBMS such as Oracle, Postgres and MySQL; RDF triple store; MS Excel, TXT and Doc files). Moving up the stack, ease of development increases and the knowledge of semantic technologies required decreases]
13. RDF
Step 1: Data Sources
Expose data as RDF through SPARQL endpoints
Internal data sources
• D2R SPARQL endpoints on RDBMS databases
• Each modelled as the local concepts that they represent
• Don't worry about the larger concept picture
• Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets
• TopBraid Ensemble and SPARQLMotion/SPIN scripts to convert static data to RDF
[Diagram: SPARQL endpoints — D2R over the RDBMS, and Virtuoso]
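For illustration, a client query against one of these D2R endpoints is just ordinary SPARQL; the namespace, class and property names below are hypothetical, not UCB's actual ones:

```sparql
# Hypothetical query against a D2R endpoint that exposes an RDBMS
# "users" table as instances of a local db1:User class.
PREFIX db1: <http://example.org/db1/vocab/>

SELECT ?user ?name
WHERE {
  ?user a db1:User ;       # rows of the users table, mapped by D2R
        db1:name ?name .   # a column, mapped to a local property
}
LIMIT 10
```

The point of "local concepts" is visible here: the query speaks the source's own vocabulary, with no knowledge of the wider concept picture required.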
14. RDF
Step 1: Data Sources
External data sources
• SPARQL endpoints in the LOD cloud from Bio2RDF, LODD and others
• Some stability, access and quality issues within these sources
• Created an Amazon cloud server to host stable environments
• Bio2RDF sources downloaded, stored and modified
• Virtuoso (open source) used as the triple store
[Diagram: the UCB Data Cloud — internal sources (MOC, NBE, NBE LDAP, WH Mart, ITrack, PDB Premier, Abysis, IDAC, WKW, SEQ, PMT) linked into the Linked Open Data Cloud via Bio2RDF sources (ChEBI, GeneID, PDB, KEGG (cpd, dr, gl, ec), UniProt, SIDER, PEP, Diseasome)]
15. Step 2: Integration Framework: Why?
• Linked Open Data: links within a source are manually created
• To navigate the cloud you either
• Learn the network
• Discover the network as you go (unguided)
• There is nothing that understands the total connectivity of the concepts available to you
• Difficult to know where to start
• No idea if a start point will lead you to the information you are looking for or might be interested in
• Can't query the cloud for specific information
The Integration Framework will resolve these issues
• It will model the models to understand the connectivity
You shouldn't have to know where to look for data
16. Step 2: Integration Framework
[Diagram: the layered architecture annotated with what the framework must do — understand UCB concepts; understand how UCB concepts fit with source concepts; make links across sources easy to wire up; automate some tasks; understand the data sources (concepts, access, properties); be accessible via services]
17. Step 2: Integration Framework
Integration Framework
• Data source, concept and property registry
• An ontology that utilises
• VoID (enhanced) to capture data source information (endpoints)
• SKOS to link local ontologies with UCB concepts
• UCB:Person -> db1:user, db2:employee, db3:actor
Built using the TopBraid Suite
• Ontology development (TopBraid Composer)
• SPARQLMotion scripts to provide some automation
• Creation of ontologies from endpoints and D2R mappings
• Configuration assistance
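A minimal sketch of what one registry entry could look like, assuming standard VoID and SKOS terms; all URIs, and the choice of skos:closeMatch as the mapping property, are illustrative:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ucb:  <http://example.org/ucb/concepts/> .   # hypothetical
@prefix db1:  <http://example.org/db1/vocab/> .      # hypothetical
@prefix reg:  <http://example.org/registry/> .       # hypothetical

# VoID: where the source lives and which classes it holds
reg:DB1 a void:Dataset ;
    void:sparqlEndpoint <http://example.org/db1/sparql> ;
    void:classPartition [ void:class db1:User ] .

# SKOS: how a local class lines up with the UCB concept
ucb:Person a skos:Concept ;
    skos:closeMatch db1:User .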
18. Step 2: Integration Framework
[Diagram: the UCB Concept Ontology (SKOS) mapped to the Dataset Ontology (VoID) for a single source — UCB:Person to DB1:User, UCB:Antibody to DB1:Antibody, UCB:Project to DB1:Project, all within dataset DB1]
19. Step 2: Integration Framework
[Diagram: the same mapping across several sources — UCB:Person maps to DB1:User, DB2:Person, DB3:Employee and DB3:Contact, in datasets DB1, DB2 and DB3]
20. Step 2: Integration Framework
[Diagram: linksets added between the datasets — Person_DB1_DB2 and Person_DB1_DB3 record which sources hold co-referring instances of UCB:Person (DB1:User, DB2:Person, DB3:Employee, DB3:Contact across DB1, DB2 and DB3)]
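In VoID terms, a linkset such as Person_DB1_DB2 could be recorded along these lines (a sketch; the link predicate and URIs are assumptions, not UCB's actual modelling):

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix reg:  <http://example.org/registry/> .   # hypothetical

# Declares that DB1 and DB2 hold co-referring Person instances,
# connected by owl:sameAs links.
reg:Person_DB1_DB2 a void:Linkset ;
    void:subjectsTarget reg:DB1 ;
    void:objectsTarget  reg:DB2 ;
    void:linkPredicate  owl:sameAs .
```

Recording linksets as data is what lets the framework answer "can the linksets tell us any info?" at query time rather than hard-wiring the joins.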
22. Step 3: REST Services
REST services
• The interaction point for applications
• Expose simple, generic access to the Integration Framework
• Remove the complexity of the framework and of how to ask questions of it
• You don't need to know how to make it work
• You don't need to know anything about the datasets or the concepts and properties held within
• Just ask simple questions in the UCB language
• Tell me about UCB:Person "ashworth"
• Built using SPARQLMotion/SPIN and exposed in the TopBraid Live enterprise server
• Two simple yet very effective services created
23. Step 3: REST Services
[Diagram: the "Keyword Search" service. Request: find UCB:Person "phil". Response: here are the resources for "phil" — ldap:U0xx10x, itrack:101, moc:scordisp, etc. Internally the service asks the registry: tell me the sub-types of UCB:Person; can the linksets tell us any info?; tell me the datasets for the sub-types. It then searches DB1:User in DB1, DB2:Person in DB2, and DB3:Employee and DB3:Contact in DB3]
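The registry lookup behind this fan-out can be pictured as a single query over the VoID and SKOS ontologies; a sketch only, assuming illustrative URIs and a skos:narrower / skos:closeMatch modelling of sub-types and local classes:

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ucb:  <http://example.org/ucb/concepts/>   # hypothetical

# Which local classes realise UCB:Person (or a sub-type of it),
# and which endpoint should be searched for each?
SELECT ?localClass ?endpoint
WHERE {
  ucb:Person skos:narrower* ?subtype .
  ?subtype skos:closeMatch ?localClass .
  ?dataset void:classPartition [ void:class ?localClass ] ;
           void:sparqlEndpoint ?endpoint .
}
```

Each result row is one search the service then issues against a source endpoint.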
24. Step 3: REST Services
[Diagram: the "Get Info" service. Request: tell me about moc:scordisp. Response: here is everything I know about it. Internally the service asks the registry: tell me everything about this resource; tell me the super-types of all resources. It then retrieves DB1:U0xx10x from DB1, DB2:scordisp from DB2 and DB3:philscordis from DB3]
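The fan-out a "Get Info" call performs can be pictured as one federated SPARQL 1.1 query (a sketch only — the endpoint URLs and resource URIs are placeholders, and the actual services are SPARQLMotion scripts rather than a single query):

```sparql
SELECT ?property ?value ?source
WHERE {
  {
    # everything DB1 knows about the co-referring resource
    SERVICE <http://example.org/db1/sparql> {
      <http://example.org/db1/id/U0xx10x> ?property ?value .
    }
    BIND ("DB1" AS ?source)
  }
  UNION
  {
    # everything DB2 knows about it
    SERVICE <http://example.org/db2/sparql> {
      <http://example.org/db2/id/scordisp> ?property ?value .
    }
    BIND ("DB2" AS ?source)
  }
}
```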
25. Step 4: Building an Application 1
Data exploration environment
• Search concepts
• Display data
• Allow link following
• Deals with any concept defined in the UCB SKOS language
• Uses the two framework services mentioned previously
• Deployed in TopBraid Ensemble – Live
26. Step 4: Data Exploration
[Screenshot: UCB concepts; the search is submitted to the "Keyword Search" service]
27. Step 4: Data Exploration
[Screenshot: results displayed; the index shows that inference is already taking place]
28. Step 4: Data Exploration
[Screenshot: dragging an instance to the basket initiates a "Get Info" service call]
29. Step 4: Data Exploration
[Screenshot: selecting an instance displays its data per source]
30. Step 4: Data Exploration
[Screenshot: links to other data items]
31. Step 4: Data Exploration
[Screenshot: sparse data displayed; the instance is submitted to the "Get Info" service]
32. Step 4: Data Exploration
[Screenshot: more detailed information]
33. Step 4: Data Exploration
[Screenshot: he has another interaction; let's explore]
35. Step 4: Data Exploration
[Screenshot: data was cached as we navigated the Concept Explorer and can now be investigated]
36. Step 4: Data Exploration
[Screenshot: integrated internal and external data for the Structure concept. A keyword search pulls data from internal and external data sources; after the detailed information is retrieved, a second Structure is identified without a keyword search and can be added to the basket]
38. Step 4: Building an Application 2
Federated data gathering and marting
• Data marting without the warehouse
• New "Mart" REST service
• SPARQLMotion/SPIN scripts
• Dump_UCB:Antibody
• Still uses the framework to integrate data
• On-the-fly data integration
• Gathers RDF from the data sources
• Dumps it into tables
• Data consumed by traditional query tools
• Not particularly designed for this aspect… (slow)
• But it works!
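One way to picture the gathering step: a CONSTRUCT query pulls everything known about each antibody into a single RDF graph, which is then flattened into mart tables (a sketch under assumed URIs; Dump_UCB:Antibody itself is a SPARQLMotion/SPIN script not shown here):

```sparql
PREFIX ucb: <http://example.org/ucb/concepts/>   # hypothetical

# Gather all properties of every antibody the framework can reach;
# the resulting graph is what gets dumped into tables.
CONSTRUCT { ?antibody ?property ?value }
WHERE {
  ?antibody a ucb:Antibody ;
            ?property ?value .
}
```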
39. Step 4: Building an Application 3
Knowledge base creation
• Gathering information can be a time-consuming exercise
• But it is vital for projects to have
• Different individuals have different ideas
• Relevance, sources, presentation, etc.
• A knowledge base provides consistency for
• The data gathered
• The data sources used
• The data presentation
• ROI
• 150-fold increase in efficiency
• 6 minutes compared to more than 16 hours (spread over several weeks)
• Information available to all at a central access point
40. Step 4: Knowledge Base
[Diagram: "Tell me about the protein with Gene ID X, and I want to know about literature references, sequences, descriptions, structures… etc." The app calls the Keyword Search and Get Info services, which drive the Semantic Integration Framework over the data sources]
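In spirit, the knowledge-base request above amounts to resolving the gene in one source and following the framework's links into the others; a hand-written federated sketch, in which every URI and property name is illustrative rather than taken from the actual sources:

```sparql
PREFIX ucb: <http://example.org/ucb/concepts/>   # hypothetical

SELECT ?protein ?description ?reference
WHERE {
  # resolve the Gene ID in one source...
  SERVICE <http://example.org/geneid/sparql> {
    ?gene ucb:geneId "X" ;
          ucb:encodes ?protein .
  }
  # ...then gather descriptions and literature references elsewhere
  SERVICE <http://example.org/uniprot/sparql> {
    ?protein ucb:description ?description ;
             ucb:citation ?reference .
  }
}
```

The framework's value is that the application never writes such a query itself; it asks the two services in the UCB language and the fan-out happens behind the abstraction layer.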
46. PURL
Step 5: PURL Server
Removing URL dependencies
D2R publishes resolvable URLs that are specific to the server
Removing URL specificity with a PURL server
Allows each layer of the architecture to be replaced without all the others having to be reconfigured
• A level of independence / indirection
Only done on a limited scale
47. Conclusions & Business Value
We have built an extensible data integration framework
• Shown that data integration can be an incremental process
• Started with three datasets; more than 20 a few months later
• Compare: the warehouse took 18 months to add two new data sources
• Adding a new source can take less than a day (the whole process, including endpoint creation)
• Creates an enterprise-wide "data fabric" rather than just one more application
• Datasets connect together the way web pages fit together
• Literally click from one dataset to the other
• Dynamically mash up data from multiple sources
• Add new sources by describing the connections, not by building a new application
48. Conclusions & Business Value
We have built a framework that
• Differs from data integration applications the way the Web differs from earlier network technologies (FTP, Archie)
• The infrastructure allows new entities (pages, databases) to be added dynamically
• Adding connections is as easy as specifying them
• Provides data for all projects
• Three very different applications have been demonstrated
• All are able to use the same framework
• Reuse