Dataset Descriptions in Open PHACTS

Alasdair J G Gray
University of Manchester
W3C HCLS Call – 14 January 2013

www.openphacts.org/specs/datadesc/
Authors:
Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble,
Alasdair J. G. Gray, Andra Waagmeester and
Egon L. Willighagen
Public Domain Drug Discovery Data:
Pharma are accessing, processing, storing & re-processing

[Diagram: Literature, Genbank, Patents, PubChem, Databases and Downloads feed Firewalled Databases, followed by Data Integration and Data Analysis. Repeated at each company. Why?]
The Project

The Innovative Medicines Initiative
 • EC funded public-private partnership for pharmaceutical research
 • Focus on key problems
      – Efficacy, Safety, Education & Training, Knowledge Management

The Open PHACTS Project
 • Create a semantic integration hub (“Open Pharmacological Space”)…
 • Delivering services to support on-going drug discovery programs in pharma and public domain
 • Not just another project; leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
 • 13 academic partners, 9 pharmaceutical companies, 6 SMEs
 • Work split into clusters:
      – Technical Build (focus here)
      – Scientific Drive
      – Community & Sustainability
Architecture

[Diagram, top to bottom: User Interfaces & Applications; Linked Data API; Domain Specific Services, Linked Data Cache, Identity Mapping Service, Identity Resolution Service; Data]
Datasets and Links
ChemSpider
 • ChemSpider aggregates data from
   over 400 sources
 • Central integration point for
   chemicals in OPS
 • OPS data covers
      – ChEBI
      – ChEMBL
      – DrugBank
14 January 2013   OPS Dataset Descriptions – A. J. G. Gray   5
What version of ChEMBL? (~Jan 2012)
 • ChemSpider: EBI SDF file
      – ChEMBL 13
 • Data Cache: Chem2Bio2RDF ChEMBL RDF
      – File downloaded May 2011
      – Chem2Bio2RDF metadata webpages:
        ChEMBL 8
      – File: ChEMBL 2
 • Mapping Server: Kasabi ChEMBL RDF file
      – ChEMBL 12
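
The mismatch listed above can be made explicit with a few lines of Python. This is only an illustrative sketch: the version numbers are the ones reported on the slide, while the tuple layout and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Version claims as reported on the slide (state around January 2012).
# Each entry: (OPS component, evidence consulted, claimed ChEMBL version).
claims = [
    ("ChemSpider", "EBI SDF file", 13),
    ("Data Cache", "Chem2Bio2RDF metadata webpages", 8),
    ("Data Cache", "Chem2Bio2RDF file", 2),
    ("Mapping Server", "Kasabi ChEMBL RDF file", 12),
]


def versions_by_component(claims):
    """Group the claimed ChEMBL versions by the component that loaded the data."""
    grouped = defaultdict(set)
    for component, _evidence, version in claims:
        grouped[component].add(version)
    return dict(grouped)


def conflicts(claims):
    """Return the components whose own evidence disagrees about the version."""
    return {
        component: sorted(versions)
        for component, versions in versions_by_component(claims).items()
        if len(versions) > 1
    }


all_versions = sorted({version for _, _, version in claims})
print("Distinct ChEMBL versions in play:", all_versions)
print("Internally inconsistent components:", conflicts(claims))
```

Even this small check needs the version to be recorded somewhere machine-readable, which is the gap the dataset descriptions aim to close.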

For the record
 • OPS currently uses ChEMBL 13
      – RDF generated from EBI database
        dump
      – Published at linkedchemistry.info
          • Credit: Egon Willighagen
 • Soon moving to ChEMBL 15
      – RDF published by EBI


Challenges
 • Datasets available
      – In many versions over time
      – In different formats
      – From many mirrors/registries
 • Files do not carry metadata
 • Registries
      – Can be out-of-date
      – Can contain conflicting information
VoID: Vocabulary of Interlinked Datasets

 • Describes RDF datasets
      – W3C Note: http://www.w3.org/TR/void/
 • Metadata carried with data
      – Directly embedded or
        linked (void:inDataset)
 • Problems
      – Very generic
      – No checklist of requisite fields
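
As a rough illustration of metadata travelling with the data, the sketch below renders a small VoID description as Turtle text using only the Python standard library. The `void:` and `dcterms:` terms are real vocabulary terms; the function name, the `example.org` URIs and the title are hypothetical placeholders, not taken from the Open PHACTS specification.

```python
def void_description(dataset_uri, title, dump_url, data_file_uri):
    """Render a small VoID description as Turtle text.

    All URIs passed in are the caller's choice; nothing here is
    prescribed by VoID beyond the vocabulary terms themselves.
    """
    # Escape backslashes and quotes so the title stays a valid Turtle string.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{safe_title}" ;',
        f"    void:dataDump <{dump_url}> .",
        "",
        "# Link from the data document back to its dataset description.",
        f"<{data_file_uri}> void:inDataset <{dataset_uri}> .",
    ])


# Hypothetical example values.
ttl = void_description(
    dataset_uri="http://example.org/void#chembl13",
    title="ChEMBL 13 RDF",
    dump_url="http://example.org/dumps/chembl13.nt",
    data_file_uri="http://example.org/dumps/chembl13.nt",
)
print(ttl)
```

The last triple is the "linked" option from the slide: the data document points at its description through `void:inDataset` instead of embedding it.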
Provenance Vocabularies
 • Dublin Core Terms
      – Widely used
      – Terms too generic to give proper credit
          • “Date: A point or period of time associated with
            an event in the lifecycle of the resource.”
 • PROV
      – New W3C standard: www.w3.org/2011/prov
      – Generic framework for exchanging data
      – Does not contain required predicates

PAV: Provenance, Authoring and Versioning Vocabulary
 http://code.google.com/p/pav-ontology/wiki/Homepage
 • Easy to understand predicates
      – http://purl.org/pav/
 • Right level of granularity
      – Distinguishes: author/creator/curator
      – Captures source of data:
          • import/derived/accessed
          • version/previousVersion
 • Being aligned with PROV-O
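
To show the granularity these predicates offer, here is a minimal Python sketch that attaches PAV properties to a dataset and expands them to full predicate IRIs. The `pav:` predicate names and the `http://purl.org/pav/` namespace are real; the dataset URI, every value and the helper names are hypothetical and chosen only for illustration.

```python
PAV_NS = "http://purl.org/pav/"

# Hypothetical dataset URI and values; only the pav: predicate names are real.
DATASET = "http://example.org/datasets/chembl-rdf/13"
description = {
    "pav:version": "13",
    "pav:previousVersion": "http://example.org/datasets/chembl-rdf/12",
    "pav:importedFrom": "http://example.org/sources/chembl-database-dump",
    "pav:createdBy": "http://example.org/people/rdf-converter",
    "pav:authoredBy": "http://example.org/orgs/original-data-authors",
    "pav:curatedBy": "http://example.org/people/curator",
}


def expand(curie):
    """Turn a pav: prefixed name into its full predicate IRI."""
    prefix, local_name = curie.split(":", 1)
    if prefix != "pav":
        raise ValueError(f"unexpected prefix: {prefix}")
    return PAV_NS + local_name


def to_triples(subject, description):
    """List (subject, predicate IRI, value) tuples for a description."""
    return [(subject, expand(key), value) for key, value in description.items()]


for s, p, o in to_triples(DATASET, description):
    print(s, p, o)
```

Keeping `pav:createdBy` (who produced this RDF representation) apart from `pav:authoredBy` (who authored the underlying content) is what allows credit to go to the right party.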
Dataset Descriptions in the Open Pharmacological Space




Related Work
 • Registries: DataHub, MIRIAM
      – Do not tie metadata with the data
      – No checklist of attributes
 • BioDBCore
      – Checklist
          • Similar information captured
          • Includes point of contact information
      – Not tied to the data

Realisation of Dataset Descriptions
 • Needs to be incorporated into data
   publishing pipeline
 • Hard for publishers to provide
   conformant descriptions
      – Datasets are complex
      – Evolve over time
      – Seen as yet another burden
 • Validation tool provided
      – http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
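
The idea behind checklist validation can be sketched in a few lines of Python. This is not the Open PHACTS validator or its API: the required-field list below is an assumption made for illustration, and the actual checklist is the one defined in the specification at www.openphacts.org/specs/datadesc/.

```python
# Assumed checklist for illustration only; the real required fields are
# defined in the Open PHACTS specification, not on these slides.
REQUIRED_FIELDS = ("dcterms:title", "dcterms:license", "pav:version")


def missing_fields(description, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty, in checklist order."""
    return [field for field in required if not description.get(field)]


# Hypothetical draft description that forgets the licence.
draft = {"dcterms:title": "ChEMBL RDF", "pav:version": "13"}
problems = missing_fields(draft)
print("conformant" if not problems else f"missing: {problems}")
```

Running such a check inside the publishing pipeline is one way to catch gaps before a description reaches a registry.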

Future Vision
 • Provide rich and accurate
   provenance trail of data
      – Alignment with BioDBCore
          • One standard to rule them all
      – Automatic pipeline from VoID file to
        registries
          • Write once, use many times



Thank you
 A.Gray@cs.man.ac.uk
 www.cs.man.ac.uk/~graya/
 www.openphacts.org





2013 01-14 ops-dataset_descriptions

  • 1. Dataset Descriptions in Open PHACTS. Alasdair J G Gray, University of Manchester. W3C HCLS Call – 14 January 2013. www.openphacts.org/specs/datadesc/ Authors: Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J. G. Gray, Andra Waagmeester and Egon L. Willighagen
  • 2. Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing. [Diagram: Literature, Genbank, Patents, PubChem and other databases feed Downloads, then Firewalled Databases, Data Integration and Data Analysis; repeated at each company.] Why?
  • 3. The Project. The Innovative Medicines Initiative • EC funded public-private partnership for pharmaceutical research • Focus on key problems – Efficacy, Safety, Education & Training, Knowledge Management. The Open PHACTS Project • Create a semantic integration hub (“Open Pharmacological Space”)… • Delivering services to support on-going drug discovery programs in pharma and public domain • Not just another project; leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements • 13 academic partners, 9 pharmaceutical companies, 6 SMEs • Work split into clusters: • Technical Build (focus here) • Scientific Drive • Community & Sustainability
  • 4. Architecture. [Diagram components: User Interfaces & Applications; Linked Data API; Linked Data Cache; Identity Mapping Service; Identity Resolution Service; Domain Specific Data Services.]
  • 6. ChemSpider • ChemSpider aggregates data from over 400 sources • Central integration point for chemicals in OPS • OPS data covers – ChEBI – ChEMBL – DrugBank 14 January 2013 OPS Dataset Descriptions – A. J. G. Gray 5
  • 7. What version of ChEMBL? ~Jan 2012 • ChemSpider: EBI SDF file – ChEMBL 13 • Data Cache: Chem2Bio2RDF ChEMBL RDF – File downloaded May 2011 – Chem2Bio2RDF metadata webpages: ChEMBL 8 – File: ChEMBL 2 • Mapping Server: Kasabi ChEMBL RDF file – ChEMBL 12
  • 8. For the record • OPS currently uses ChEMBL 13 – RDF generated from EBI database dump – Published at linkedchemistry.info • Credit: Egon Willighagen • Soon moving to ChEMBL 15 – RDF published by EBI
  • 9. Challenges • Datasets available – In many versions over time – In different formats – From many mirrors/registries • Files do not carry metadata • Registries – Can be out-of-date – Can contain conflicting information
  • 10. VoID: Vocabulary of Interlinked Datasets • Describes RDF datasets – W3C Note: http://www.w3.org/TR/void/ • Metadata carried with data – Directly embedded or linked (void:inDataset) • Problems – Very generic – No checklist of requisite fields
  • 11. Provenance Vocabularies • Dublin Core Terms – Widely used – Terms too generic to give proper credit • “Date: A point or period of time associated with an event in the lifecycle of the resource.” • PROV – New W3C standard: www.w3.org/2011/prov – Generic framework for exchanging data – Does not contain required predicates
  • 12. PAV: Provenance, Authoring and Versioning Vocabulary http://code.google.com/p/pav-ontology/wiki/Homepage • Easy to understand predicates – http://purl.org/pav/ • Right level of granularity – Distinguishes: author/creator/curator – Captures source of data: • import/derived/accessed • version/previousVersion • Being aligned with PROV-O
  • 13. Dataset Descriptions in the Open Pharmacological Space
  • 14. Related Work • Registries: DataHub, MIRIAM – Do not tie metadata with the data – No checklist of attributes • BioDBCore – Checklist • Similar information captured • Includes point of contact information – Not tied to the data
  • 15. Realisation of Dataset Descriptions • Needs to be incorporated into data publishing pipeline • Hard for publishers to provide conformant descriptions – Datasets are complex – Evolve over time – Seen as yet another burden • Validation tool provided – http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
  • 16. Future Vision • Provide rich and accurate provenance trail of data – Alignment with BioDBCore • One standard to rule them all – Automatic pipeline from VoID file to registries • Write once, use many times
  • 17. Thank you A.Gray@cs.man.ac.uk www.cs.man.ac.uk/~graya/ www.openphacts.org
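
To make the VoID and PAV slides concrete, here is a minimal sketch of what a dataset description combining the two vocabularies might look like, written as a small Python helper that emits Turtle. It is an illustration only, not the Open PHACTS specification: the `example.org` URIs, the helper names (`literal`, `uri`, `describe_dataset`) and the particular choice of predicates are assumptions. The predicate names themselves (`pav:version`, `pav:createdBy`, `pav:derivedFrom`, `pav:previousVersion`) come from the PAV vocabulary at http://purl.org/pav/.

```python
# Namespaces for the vocabularies named on the slides.
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
    "pav": "http://purl.org/pav/",
}


def literal(text):
    """Format a plain string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def uri(address):
    """Format an address as a Turtle IRI reference."""
    return f"<{address}>"


def describe_dataset(subject, properties):
    """Return a Turtle document describing `subject` as a void:Dataset.

    `properties` maps prefixed predicate names to already-formatted objects
    (use `literal` or `uri` to format them).
    """
    prefix_lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
    statements = ["a void:Dataset"] + [f"{p} {o}" for p, o in properties.items()]
    body = " ;\n    ".join(statements)
    return "\n".join(prefix_lines) + f"\n\n{uri(subject)} {body} .\n"


# Hypothetical description of the ChEMBL 13 RDF mentioned on slide 8.
# All example.org URIs are placeholders, not real identifiers.
description = describe_dataset(
    "http://example.org/datasets/chembl13-rdf",
    {
        "dcterms:title": literal("ChEMBL RDF"),
        "pav:version": literal("13"),
        "pav:createdBy": uri("http://example.org/people/rdf-creator"),
        "pav:derivedFrom": uri("http://example.org/datasets/chembl13-dump"),
        "pav:previousVersion": uri("http://example.org/datasets/chembl12-rdf"),
    },
)

if __name__ == "__main__":
    print(description)
```

Because the description travels with the data file, a consumer can read the version and source directly rather than guessing from a registry or a download date, which is the problem slide 7 illustrates.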

Editor's Notes

  1. This is what convinced us that we need metadata in the data files.
  2. Specifies VoID and PAV predicates; MIM checklist.
  3. Open PHACTS: 28 partners (9 pharmaceuticals, 3 biotechs, 1 triplestore firm, 15 academic).
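
The checklist idea in note 2, together with the validation tool mentioned on slide 15, can be sketched as a simple check of a description against a list of required predicates. This is a hedged illustration: the `REQUIRED_PREDICATES` list below is a made-up example, not the actual Open PHACTS checklist, and `missing_fields` is a hypothetical helper rather than part of the validation service.

```python
# Hypothetical checklist; the real list of requisite fields is defined by
# the Open PHACTS dataset description specification, not by this sketch.
REQUIRED_PREDICATES = (
    "dcterms:title",
    "dcterms:license",
    "pav:version",
    "pav:createdBy",
)


def missing_fields(description, required=REQUIRED_PREDICATES):
    """Return the required predicates that are absent or empty in `description`.

    `description` maps prefixed predicate names to their values.
    """
    return [predicate for predicate in required if not description.get(predicate)]


# Example: a description that lacks a licence and a creator.
example = {
    "dcterms:title": "ChEMBL RDF",
    "pav:version": "13",
}

if __name__ == "__main__":
    print(missing_fields(example))
```

A check like this could run inside a data publishing pipeline so that a non-conformant description is caught before the dataset is released.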