SlideShare ist ein Scribd-Unternehmen logo
1 von 54
The Big Data Challenges
Associated with Building a National
Data Repository for Chemistry

Antony Williams
ICIC Meeting, Vienna
October 14th 2013
So what is all this Big Data?
And the World of Chemistry?
And the World of Chemistry?

“The InChIKey indexing has therefore turned
Google into a de-facto open global chemical
information hub by merging links to most
significant sources, including over 50 million
PubChem and ChemSpider records.”
And the World of Chemistry?
RSC’s ChemSpider
>29 million chemicals from >500 sources
…and the world of Openness
Times have changed…
Open Access funder mandates…
Times have changed…
Growth, growth, growth…
Publishers are responding
The world of Open Data…
Open Data are everywhere
• Is Openness and Social Sharing changing
the world?
• The cultural experiments in Open Data and
exchange are almost daily
• Mobile platforms enhance participation
• And then what of Chemistry Data???
Publications-summary of work
• Scientific publications are a summary of work
•
•
•
•

Is all work reported?
How much science is lost to pruning?
What of value sits in notebooks and is lost?
Publications offering access to “real data”?

• How much data is lost?
•
•
•

How many compounds never reported?
How many syntheses fail or succeed?
How many characterization measurements?
About Me…as a Chemist
• I’ve performed a few dozen chemical
syntheses
• I’ve run thousands of analytical spectra
• I’ve generated thousands of NMR assignments
• I’ve probably published <5% of all work
• Most of it has been lost
• But things can be different today….
• But it still needs to be associated with me…
What of non-abstracted data?
• How much data generated in a lab, that COULD
go public, is lost forever?
What of non-abstracted data?
• How much data generated in a lab, that COULD
go public, is lost forever?
• Public Domain reference databases of value?
• Syntheses
• Properties
• Spectra and CIFs
• Images
• Raw data vs. representations of data
ChemSpider
• ChemSpider allowed the community to
participate in linking the internet of chemistry
& crowdsourcing of data
• Successful experiment in terms of building a
central hub for integrated web search
• More people are “users” than “contributors”
• Yet basic feedback and game-play helps
Crowdsourced “Annotations”
• Users can add
•
•
•
•
•
•
•
•

Descriptions, Syntheses and Commentaries
Links to PubMed articles
Links to articles via DOIs
Add spectral data
Add Crystallographic Information Files
Add photos
Add MP3 files
Add Videos
An EPSRC Call

“…the identification of the need for a UK
national service for the provision of a
searchable, electronic chemical database
for the UK academic research community.”
National Chemical Database Service
• Service for UK Academics
• “Prepaid access” integrating commercial
databases and services
• Access to curated data sets
• Provision of prediction algorithms
National Chemical Database Service
National Chemical Database Service
• Service for UK Academics
• “Prepaid access” integrating commercial
databases and services
• Access to curated data sets
• Provision of prediction algorithms
• Ultimate goal is to federate search
• Development of “data repository”
Development of Data Repository
•
•
•
•
•

Data repository should not just be a data
dump – should not be a “big disk”
Searchable, integrated, segregated
repository of data types
Data access including private, shared
embargoed and public
Delivery of derived models from data
Integrated to AltMetrics models
What can drive participation?
• What can drive scientists to participate and
contribute?
•
•
•
•
•
•

Ensuring provenance of their data for reuse
Mandates from funding agencies
Improved systems to ease contribution
Additional contributions to science
Improved publishing processes
Recognition for contributions
AltMetrics
AltMetrics
AltMetrics as Scientist Impact
AltMetrics
Plum Analytics
Plum Analytics
Rewards and Recognition
The First Step badge is
awarded when a user
submits (& has published)
their 1st CSSP article.

Congratulations! Your 1st CSSP
article has been published.
Philosopher Lao Tzu said “A
journey of a thousand miles begins
with a single step”. In the same
way we hope that this will be the
first of many submissions that you
make to CSSP.
AltMetrics Feeds
• For our data repository ensure contribution of
data will feed out to the AltMetrics platforms
• Every data point, every data download, use
and reuse will be associated with the scientist
• Data will be DOI’ed (presently under review)
• Services provided will allow for AltMetrics use
Domain Specific Challenges
• Creating a platform of value not just dumping
•

Searchability, segregation, tagging, use and
reuse, collaboration, low barrier to participation

• Quality of chemistry data at source
•
•
•
•

ensuring chemicals are correct
reactions map and balance as appropriate
file format handling for analytical data types –
binary file formats are proprietary
valid interpretation of data
Domain Specific Challenges
• Quality of data at source
• ensuring chemicals are correct - VALIDATION
• reactions map and balance as appropriate –
VALIDATION and STANDARDIZATION
• file format handling for analytical data types –
binary file formats are proprietary STANDARDIZATION
• valid interpretation of data – VALIDATION and
ANNOTATION
Validating Chemicals
• Community service
for validation and
standardization of
chemicals (CVSP)
• Open rules sets but
standard set based
on FDA substance
registry system
Validating chemicals
J. Brechner, IUPAC
Graphical
Representation of
stereochem.
configurations
Section: ST-1.1.10

DB08128

DB06287
Standardizing Chemicals
Validated Name-Structure
dictionaries for data checking
• Chemical name dictionaries used for:
• Text-mining (publications, patents)
• Linking to other databases – think Biology
• Drug names are incredibly valuable links

• Searching the web
• Names link to structures
Difficult to navigate…
IP?
IP?
What’s the
What’s the
structure?
structure?
Are they in
Are they in
our file?
our file?
What’s
What’s
similar?
similar?
Pharmacology
Pharmacology
data?
data?

What’s the
What’s the
target?
target?

Known
Known
Pathways?
Pathways?
Competitors?
Competitors?
Connections
Connections
to disease?
to disease?

Working On
Working On
Now?
Now?
Expressed in
Expressed in
right cell type?
right cell type?
Inside our Publication Archive
• How much data is in the archive, in the
publications and in the supplementary info?
• How many compounds for ChemSpider?
• How many syntheses for ChemSpider
reactions?
• How many characterization measurements?
• Property Data
• Spectral Data
• Graphs and charts to be used for modeling?
What if we could capture it all?
Digitally Enhancing the RSC Archive
Linking Names to Structures
Semantic Mark-up of Articles
Hosting Reactions
•
•
•

Seed set of over 1 million reactions from patents to
develop validation and standardization routines.
Reactions to be extracted from RSC journal articles,
ESI and reaction databases will be examined
Resulting validation algorithms used at deposition
The challenges of analytical data
• Integration of ChemSpider to analytical
instrumentation vendors already in place
• Agilent, Bruker, Thermo, Waters

• Vendors produce complex proprietary data
formats and standard formats are required
(JCAMP, NetCDF, AniML)
•
•
•
•

ChemSpider already hosts thousands of JCAMP spectra
Support of “assigned spectra” in place
Data validation approaches understood
There are a myriad of analytical data types…
Turning “Figures” Into Data
Community Data Repository
• Automated depositions of data – service-based
deposition, sweep and deposit
• Integrate to Electronic Lab Notebooks as feeds
• High value would be databases of reference
data, but validated by model validation and the
community
• National services feeding the repository –
crystallography, mass spectrometry
E-Lab Notebooks
• Integration between ELNs
and:
• ChemSpider
• ChemSpider Reactions
• Chemistry Data Repository
What do we have in place?
• We are testing a data repository on our assets –
ChemSpider and our archive of publications
• Working with many collaborators to define needs
• Deposition system for deposition of chemical
compounds – hosts >29 million chemicals
• Crowdsourcing curation & annotation platform
• Chemical validation & standardization platform
• Chemical reactions database with >1 million
reactions and presently developing RVSP
• Analytical data handling formats (JCAMP
preferred)
• And lots in development…
The Challenges Ahead
• Chemistry is NOT just nicely defined structures!
• Materials, minerals, attached to beads,
polymers, ambiguous materials

• Domain-specific measurements
• File format standards are limited in application

• Encouraging scientists to free up their data
• AltMetrics, open data mandates, systems

• The data explosion continues
• 4 years ahead to expand capability
The Future
Internet Data

Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals

Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
RSC Open Access Repository

•Imagine applying text-mining to all articles
•Extract all chemicals, syntheses, chemistry
data and link to OA articles
•Provide additional data handling tools
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

Weitere ähnliche Inhalte

Was ist angesagt?

ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectStuart Chalk
 
Building a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistryBuilding a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistryValery Tkachenko
 
Making solubility models with reaxy
Making solubility models with reaxyMaking solubility models with reaxy
Making solubility models with reaxyAnn-Marie Roche
 
The application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platformsThe application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platformsValery Tkachenko
 
Acs collaborative computational technologies for biomedical research an enabl...
Acs collaborative computational technologies for biomedical research an enabl...Acs collaborative computational technologies for biomedical research an enabl...
Acs collaborative computational technologies for biomedical research an enabl...Sean Ekins
 
The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...Valery Tkachenko
 

Was ist angesagt? (20)

Data integration and building a profile for yourself as an online scientist
Data integration and building a profile for yourself as an online scientistData integration and building a profile for yourself as an online scientist
Data integration and building a profile for yourself as an online scientist
 
Cheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural ProductsCheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural Products
 
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
 
Value of the mediawiki platform for providing content to the chemistry community
Value of the mediawiki platform for providing content to the chemistry communityValue of the mediawiki platform for providing content to the chemistry community
Value of the mediawiki platform for providing content to the chemistry community
 
Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...
 
Current initiatives in developing research data repositories at the Royal Soc...
Current initiatives in developing research data repositories at the Royal Soc...Current initiatives in developing research data repositories at the Royal Soc...
Current initiatives in developing research data repositories at the Royal Soc...
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
 
eScience Resources for the Chemistry Community from the Royal Society of Chem...
eScience Resources for the Chemistry Community from the Royal Society of Chem...eScience Resources for the Chemistry Community from the Royal Society of Chem...
eScience Resources for the Chemistry Community from the Royal Society of Chem...
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
 
Open innovation contributions from RSC resulting from the Open Phacts project
Open innovation contributions from RSC resulting from the Open Phacts projectOpen innovation contributions from RSC resulting from the Open Phacts project
Open innovation contributions from RSC resulting from the Open Phacts project
 
Building a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistryBuilding a semantic chemistry platform with the royal society of chemistry
Building a semantic chemistry platform with the royal society of chemistry
 
Hosting a compound centric community resource for chemistry data
Hosting a compound centric community resource for chemistry dataHosting a compound centric community resource for chemistry data
Hosting a compound centric community resource for chemistry data
 
Making solubility models with reaxy
Making solubility models with reaxyMaking solubility models with reaxy
Making solubility models with reaxy
 
The application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platformsThe application of cloud computing to royal society of chemistry data platforms
The application of cloud computing to royal society of chemistry data platforms
 
Acs collaborative computational technologies for biomedical research an enabl...
Acs collaborative computational technologies for biomedical research an enabl...Acs collaborative computational technologies for biomedical research an enabl...
Acs collaborative computational technologies for biomedical research an enabl...
 
The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...The royal society of chemistry and its adoption of semantic web technologies ...
The royal society of chemistry and its adoption of semantic web technologies ...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental scienceUS-EPA Chemicals Dashboard – an integrated data hub for environmental science
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
 
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 

Ähnlich wie Big data challenges associated with building a national data repository for chemistry

ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of ChemistryICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of ChemistryDr. Haxel Consult
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsKen Karapetyan
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineKen Karapetyan
 
Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0Valery Tkachenko
 

Ähnlich wie Big data challenges associated with building a national data repository for chemistry (20)

ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of ChemistryICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry
 
eScience at the Royal Society of Chemistry and our current initiatives
eScience at the Royal Society of Chemistry and our current initiativeseScience at the Royal Society of Chemistry and our current initiatives
eScience at the Royal Society of Chemistry and our current initiatives
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
 
Royal Society of Chemistry projects underpinning open innovation
Royal Society of Chemistry projects underpinning open innovationRoyal Society of Chemistry projects underpinning open innovation
Royal Society of Chemistry projects underpinning open innovation
 
Providing support for JC Bradleys vision of open science using RSC cheminform...
Providing support for JC Bradleys vision of open science using RSC cheminform...Providing support for JC Bradleys vision of open science using RSC cheminform...
Providing support for JC Bradleys vision of open science using RSC cheminform...
 
The expansive reach of ChemSpider as a resource for the chemistry community
The expansive reach of ChemSpider as a resource for the chemistry communityThe expansive reach of ChemSpider as a resource for the chemistry community
The expansive reach of ChemSpider as a resource for the chemistry community
 
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
 
ChemSpider as an integration hub for interlinked chemistry data
ChemSpider as an integration hub for interlinked chemistry dataChemSpider as an integration hub for interlinked chemistry data
ChemSpider as an integration hub for interlinked chemistry data
 
Delivering chemical-associated data via EPA web applications
Delivering chemical-associated data via EPA web applicationsDelivering chemical-associated data via EPA web applications
Delivering chemical-associated data via EPA web applications
 
Delivering on the promise of a chemistry data repository for the world
Delivering on the promise of a chemistry data repository for the worldDelivering on the promise of a chemistry data repository for the world
Delivering on the promise of a chemistry data repository for the world
 
Utilizing Online Databases for the Purpose of Structure Identification – Appr...
Utilizing Online Databases for the Purpose of Structure Identification – Appr...Utilizing Online Databases for the Purpose of Structure Identification – Appr...
Utilizing Online Databases for the Purpose of Structure Identification – Appr...
 
Contributions to the World of eScience from the Royal Society of Chemistry
Contributions to the World of eScience from the Royal Society of ChemistryContributions to the World of eScience from the Royal Society of Chemistry
Contributions to the World of eScience from the Royal Society of Chemistry
 
Challenging cajoling and rewarding the community for their contributions to o...
Challenging cajoling and rewarding the community for their contributions to o...Challenging cajoling and rewarding the community for their contributions to o...
Challenging cajoling and rewarding the community for their contributions to o...
 
Marrying ACDLabs technologies to eScience Projects at the Royal Society of C...
Marrying ACDLabs technologies to eScience Projects at the  Royal Society of C...Marrying ACDLabs technologies to eScience Projects at the  Royal Society of C...
Marrying ACDLabs technologies to eScience Projects at the Royal Society of C...
 
Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0Chemistry Validation and Standardization Platform v2.0
Chemistry Validation and Standardization Platform v2.0
 
Our dire need to mandate data standards and expectations for scientific publi...
Our dire need to mandate data standards and expectations for scientific publi...Our dire need to mandate data standards and expectations for scientific publi...
Our dire need to mandate data standards and expectations for scientific publi...
 
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
 
Sept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the CloudSept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the Cloud
 
US-EPA Cheminformatics Support for Delivering Data Related to Chemicals of E...
US-EPA Cheminformatics Support for Delivering Data Related to Chemicals of E...US-EPA Cheminformatics Support for Delivering Data Related to Chemicals of E...
US-EPA Cheminformatics Support for Delivering Data Related to Chemicals of E...
 

Kürzlich hochgeladen

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Kürzlich hochgeladen (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Big data challenges associated with building a national data repository for chemistry

  • 1. The Big Data Challenges Associated with Building a National Data Repository for Chemistry Antony Williams ICIC Meeting, Vienna October 14th 2013
  • 2. So what is all this Big Data?
  • 3.
  • 4. And the World of Chemistry?
  • 5. And the World of Chemistry? “The InChIKey indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records.”
  • 6. And the World of Chemistry?
  • 7. RSC’s ChemSpider >29 million chemicals from >500 sources
  • 8. …and the world of Openness
  • 9. Times have changed… Open Access funder mandates…
  • 10. Times have changed… Growth, growth, growth…
  • 12. The world of Open Data…
  • 13. Open Data are everywhere • Is Openness and Social Sharing changing the world? • The cultural experiments in Open Data and exchange are almost daily • Mobile platforms enhance participation • And then what of Chemistry Data???
  • 14. Publications-summary of work • Scientific publications are a summary of work • • • • Is all work reported? How much science is lost to pruning? What of value sits in notebooks and is lost? Publications offering access to “real data”? • How much data is lost? • • • How many compounds never reported? How many syntheses fail or succeed? How many characterization measurements?
  • 15. About Me…as a Chemist • I’ve performed a few dozen chemical syntheses • I’ve run thousands of analytical spectra • I’ve generated thousands of NMR assignments • I’ve probably published <5% of all work • Most of it has been lost • But things can be different today…. • But it still needs to be associated with me…
  • 16. What of non-abstracted data? • How much data generated in a lab, that COULD go public, is lost forever?
  • 17. What of non-abstracted data? • How much data generated in a lab, that COULD go public, is lost forever? • Public Domain reference databases of value? • Syntheses • Properties • Spectra and CIFs • Images • Raw data vs. representations of data
  • 18. ChemSpider • ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data • Successful experiment in terms of building a central hub for integrated web search • More people are “users” than “contributors” • Yet basic feedback and game-play helps
  • 19. Crowdsourced “Annotations” • Users can add • • • • • • • • Descriptions, Syntheses and Commentaries Links to PubMed articles Links to articles via DOIs Add spectral data Add Crystallographic Information Files Add photos Add MP3 files Add Videos
  • 20. An EPSRC Call “…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”
  • 21. National Chemical Database Service • Service for UK Academics • “Prepaid access” integrating commercial databases and services • Access to curated data sets • Provision of prediction algorithms
  • 23. National Chemical Database Service • Service for UK Academics • “Prepaid access” integrating commercial databases and services • Access to curated data sets • Provision of prediction algorithms • Ultimate goal is to federate search • Development of “data repository”
  • 24. Development of Data Repository • • • • • Data repository should not just be a data dump – should not be a “big disk” Searchable, integrated, segregated repository of data types Data access including private, shared embargoed and public Delivery of derived models from data Integrated to AltMetrics models
  • 25. What can drive participation? • What can drive scientists to participate and contribute? • • • • • • Ensuring provenance of their data for reuse Mandates from funding agencies Improved systems to ease contribution Additional contributions to science Improved publishing processes Recognition for contributions
  • 32. Rewards and Recognition The First Step badge is awarded when a user submits (& has published) their 1st CSSP article. Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.
  • 33. AltMetrics Feeds • For our data repository ensure contribution of data will feed out to the AltMetrics platforms • Every data point, every data download, use and reuse will be associated with the scientist • Data will be DOI’ed (presently under review) • Services provided will allow for AltMetrics use
  • 34. Domain Specific Challenges • Creating a platform of value not just dumping • Searchability, segregation, tagging, use and reuse, collaboration, low barrier to participation • Quality of chemistry data at source • • • • ensuring chemicals are correct reactions map and balance as appropriate file format handling for analytical data types – binary file formats are proprietary valid interpretation of data
  • 35. Domain Specific Challenges • Quality of data at source • ensuring chemicals are correct - VALIDATION • reactions map and balance as appropriate – VALIDATION and STANDARDIZATION • file format handling for analytical data types – binary file formats are proprietary STANDARDIZATION • valid interpretation of data – VALIDATION and ANNOTATION
  • 36. Validating Chemicals • Community service for validation and standardization of chemicals (CVSP) • Open rules sets but standard set based on FDA substance registry system
  • 37. Validating chemicals J. Brechner, IUPAC Graphical Representation of stereochem. configurations Section: ST-1.1.10 DB08128 DB06287
  • 39. Validated Name-Structure dictionaries for data checking • Chemical name dictionaries used for: • Text-mining (publications, patents) • Linking to other databases – think Biology • Drug names are incredibly valuable links • Searching the web • Names link to structures
  • 40. Difficult to navigate… IP? IP? What’s the What’s the structure? structure? Are they in Are they in our file? our file? What’s What’s similar? similar? Pharmacology Pharmacology data? data? What’s the What’s the target? target? Known Known Pathways? Pathways? Competitors? Competitors? Connections Connections to disease? to disease? Working On Working On Now? Now? Expressed in Expressed in right cell type? right cell type?
  • 41. Inside our Publication Archive • How much data is in the archive, in the publications and in the supplementary info? • How many compounds for ChemSpider? • How many syntheses for ChemSpider reactions? • How many characterization measurements? • Property Data • Spectral Data • Graphs and charts to be used for modeling?
  • 42. What if we could capture it all? Digitally Enhancing the RSC Archive
  • 43. Linking Names to Structures
  • 45. Hosting Reactions • • • Seed set of over 1 million reactions from patents to develop validation and standardization routines. Reactions to be extracted from RSC journal articles, ESI and reaction databases will be examined Resulting validation algorithms used at deposition
  • 46. The challenges of analytical data • Integration of ChemSpider to analytical instrumentation vendors already in place • Agilent, Bruker, Thermo, Waters • Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AniML) • • • • ChemSpider already hosts thousands of JCAMP spectra Support of “assigned spectra” in place Data validation approaches understood There are a myriad of analytical data types…
  • 48. Community Data Repository • Automated depositions of data – service-based deposition, sweep and deposit • Integrate to Electronic Lab Notebooks as feeds • High value would be databases of reference data, but validated by model validation and the community • National services feeding the repository – crystallography, mass spectrometry
  • 49. E-Lab Notebooks • Integration between ELNs and: • ChemSpider • ChemSpider Reactions • Chemistry Data Repository
  • 50. What do we have in place? • We are testing a data repository on our assets – ChemSpider and our archive of publications • Working with many collaborators to define needs • Deposition system for deposition of chemical compounds – hosts >29 million chemicals • Crowdsourcing curation & annotation platform • Chemical validation & standardization platform • Chemical reactions database with >1 million reactions and presently developing RVSP • Analytical data handling formats (JCAMP preferred) • And lots in development…
  • 51. The Challenges Ahead • Chemistry is NOT just nicely defined structures! • Materials, minerals, attached to beads, polymers, ambiguous materials • Domain-specific measurements • File format standards are limited in application • Encouraging scientists to free up their data • AltMetrics, open data mandates, systems • The data explosion continues • 4 years ahead to expand capability
  • 52. The Future Internet Data Small organic molecules Undefined materials Organometallics Nanomaterials Polymers Minerals Particle bound Links to Biologicals Commercial Software Pre-competitive Data Open Science Open Data Publishers Educators Open Databases Chemical Vendors
  • 53. RSC Open Access Repository •Imagine applying text-mining to all articles •Extract all chemicals, syntheses, chemistry data and link to OA articles •Provide additional data handling tools
  • 54. Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams