Semantic Web ❤ Data Science?
Samuel Lampa | @smllmp | pharmb.io | Linked Data Sweden 2018 | Uppsala, April 9
Practical large scale
semantic data handling
with RDFIO and RDF-HDT
… or, in other words:
A deep historic divide ...
Semantic Web
Data Science
“web-focused”
“distributed”
“verbose”
“slow”
“large-scale”
“performance focussed”
“pragmatic”
“academic”
“automated”
Any solution?
Semantic Web vs. Data Science
Data Science = “Be able to experiment with data”
This has not been easy in SemWeb, because of ...
(warning: strong opinions ahead):
● Data distributed across the web rather than stored locally (the original vision)
● Massive technological “re-invention of the wheel”
So, what’s the problem?
● Data science requires:
● “Local” data (for large data)
● Powerful querying
● “Schema-less” is challenging without some starting
point, or some structure (such as re-usable queries)
● SPARQL helps only so much (no re-usable queries)
One solution:
SWI-Prolog – Re-usable rules:
Great support for semweb: www.swi-prolog.org/web
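What “re-usable rules” buys you over one-off queries can be sketched roughly as follows. This is plain Python, not SWI-Prolog, and everything in it is an assumption made for illustration: the `ex:` terms, the toy triples and the function names are hypothetical. Each function plays the part of a named rule, and later rules are built from earlier ones instead of repeating their patterns.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]

# Hypothetical example data; the ex: names are made up for illustration.
TRIPLES: list[Triple] = [
    ("ex:aspirin", "rdf:type", "ex:Compound"),
    ("ex:aspirin", "ex:inhibits", "ex:COX1"),
    ("ex:ibuprofen", "rdf:type", "ex:Compound"),
    ("ex:ibuprofen", "ex:inhibits", "ex:COX1"),
    ("ex:ibuprofen", "ex:inhibits", "ex:COX2"),
    ("ex:COX1", "rdf:type", "ex:Protein"),
    ("ex:COX2", "rdf:type", "ex:Protein"),
]


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield all triples matching the pattern; None acts as a wildcard."""
    for triple in TRIPLES:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def compounds() -> set[str]:
    """Rule: X is a compound if (X rdf:type ex:Compound)."""
    return {s for s, _, _ in match(p="rdf:type", o="ex:Compound")}


def inhibitors_of(target: str) -> set[str]:
    """Rule: X inhibits target if X is a compound and (X ex:inhibits target)."""
    return {s for s, _, _ in match(p="ex:inhibits", o=target)} & compounds()


def shared_inhibitors(target_a: str, target_b: str) -> set[str]:
    """Rule built from another rule: compounds inhibiting both targets."""
    return inhibitors_of(target_a) & inhibitors_of(target_b)


print(sorted(inhibitors_of("ex:COX1")))
print(sorted(shared_inhibitors("ex:COX1", "ex:COX2")))
```

The point of the sketch is only the composition: `shared_inhibitors` re-uses `inhibitors_of`, which re-uses `compounds`, so a change to one definition propagates to everything built on it.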
What we did (1/3):
Willighagen EL, Alvarsson J, Andersson A, Eklund M, Lampa S, Lapins M, Spjuth O, Wikberg JES.
Linking the Resource Description Framework to cheminformatics and proteochemometrics. J Biomed Semantics. 2011;2(Suppl 1):S6. doi: 10.1186/2041-1480-2-S1-S6.
Lampa S. SWI-Prolog as a Semantic Web Tool for semantic querying in Bioclipse: Integration and performance benchmarking. 2010. bit.ly/mscreport
← SWI-Prolog for querying
… Integrated into Bioclipse
Pros / Cons:
+ Powerful querying
+ Easy to integrate into other software
=> Powerful interactive environment
+ Excellent performance
- No support for really large datasets
(exceeding RAM size)
What we did (2/3):
Lampa S, Willighagen E, Kohonen P, King A, Vrandečić D, Grafström R, Spjuth O.
RDFIO: Extending Semantic MediaWiki for interoperable biomedical data management. J Biomed Semantics. 2017;8(35):1-13. doi: 10.1186/s13326-017-0136-y.
Semantic MediaWiki as a collaborative and
interactive platform for playing around with
data, summarizing and visualizing using SMW’s
Ask query language →
Pros / Cons:
+ Collaboration supported
+ Versioned data storage
+ UI generation included in SMW
- Performance concerns
- Lack of expressiveness and power
in the SMW “Ask” query language
Ecosystem today: Totally different
blazegraph.com rdfhdt.org
BlazeGraph
blazegraph.com
Powers query.wikidata.org
+ Fast
+ Easy-to-use, web-based interface
- Requires running process
- Needs importing
- Only SPARQL (No re-usable queries)
RDF-HDT
HDT: Header, Dictionary, Triples
+ Fast
+ Relatively few dependencies
+ Easy to integrate
+ SWI-Prolog support(!)
- Resource-demanding conversion
- Still quite new and “bleeding-edge”
rdfhdt.org
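The three parts in the name (Header, Dictionary, Triples) can be illustrated with a toy sketch. It is not the HDT format itself, which is considerably more elaborate and compact; it only shows the basic idea of storing each long term once in a dictionary and keeping the triples as small integer IDs. The class name, function name and `http://example.org/` terms are hypothetical.

```python
class TermDictionary:
    """Maps each distinct RDF term to a small integer ID and back."""

    def __init__(self) -> None:
        self._id_by_term: dict[str, int] = {}
        self._term_by_id: list[str] = []

    def encode(self, term: str) -> int:
        """Return the ID for a term, assigning a new one on first sight."""
        if term not in self._id_by_term:
            self._id_by_term[term] = len(self._term_by_id)
            self._term_by_id.append(term)
        return self._id_by_term[term]

    def decode(self, term_id: int) -> str:
        """Return the term that was assigned the given ID."""
        return self._term_by_id[term_id]

    def __len__(self) -> int:
        return len(self._term_by_id)


def encode_triples(triples, dictionary: TermDictionary) -> list[tuple[int, int, int]]:
    """Replace every term by its ID and sort the resulting ID triples."""
    return sorted(
        (dictionary.encode(s), dictionary.encode(p), dictionary.encode(o))
        for s, p, o in triples
    )


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical example data.
raw_triples = [
    ("http://example.org/compound/aspirin", RDF_TYPE, "http://example.org/Compound"),
    ("http://example.org/compound/ibuprofen", RDF_TYPE, "http://example.org/Compound"),
    ("http://example.org/compound/aspirin", "http://example.org/inhibits",
     "http://example.org/protein/COX1"),
]

dictionary = TermDictionary()
id_triples = encode_triples(raw_triples, dictionary)

# "Header": metadata about the dataset; "Dictionary": term <-> ID mapping;
# "Triples": the compact ID tuples.
header = {"triples": len(id_triples), "distinct_terms": len(dictionary)}

print(header)
print(id_triples)
print([tuple(dictionary.decode(i) for i in t) for t in id_triples])
```

Because the ID triples are sorted, all triples with the same subject ID end up next to each other, which is what makes indexed lookup possible without scanning everything.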
RDF serializations
Text (XML/Turtle/N3):
- Inefficient (compared to TSV)
- Search requires brute-force scan
(G)Zipped text:
+ Compact
- Search requires decompression AND(!) brute-force scan
RDF-HDT:
+ Compact, binary format
+ Search can leverage indexes to make it fast
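The difference between the three access patterns can be sketched with the Python standard library. This is an illustration under assumptions, not a benchmark and not how RDF-HDT is implemented: the index here is an ordinary in-memory dict built in one pass, the data is four made-up N-Triples lines, and all function names are hypothetical.

```python
import gzip
from collections import defaultdict

# Hypothetical example data in N-Triples syntax.
NT_LINES = [
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
    '<http://example.org/b> <http://example.org/knows> <http://example.org/c> .',
    '<http://example.org/a> <http://example.org/name> "Alice" .',
    '<http://example.org/c> <http://example.org/name> "Carol" .',
]

plain = "\n".join(NT_LINES).encode("utf-8")
zipped = gzip.compress(plain)


def scan_plain(data: bytes, subject: str) -> list[str]:
    """Text serialization: every line has to be inspected."""
    prefix = f"<{subject}> "
    return [line for line in data.decode("utf-8").splitlines()
            if line.startswith(prefix)]


def scan_gzipped(data: bytes, subject: str) -> list[str]:
    """Gzipped text: decompress everything first, then scan every line."""
    return scan_plain(gzip.decompress(data), subject)


def build_subject_index(data: bytes) -> dict[str, list[str]]:
    """One pass over the data to build a subject -> lines index."""
    index: dict[str, list[str]] = defaultdict(list)
    for line in data.decode("utf-8").splitlines():
        subject = line.split(" ", 1)[0].strip("<>")
        index[subject].append(line)
    return dict(index)


def lookup_indexed(index: dict[str, list[str]], subject: str) -> list[str]:
    """Indexed format: jump straight to the matching entries."""
    return index.get(subject, [])


subject_index = build_subject_index(plain)
wanted = "http://example.org/a"

print(scan_plain(plain, wanted))
print(scan_gzipped(zipped, wanted))
print(lookup_indexed(subject_index, wanted))
```

All three calls return the same two lines; what differs is the work done per query. The two scans touch every triple each time, while the indexed lookup pays the cost once when the index is built.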
LOD Laundromat
● The “whole Linked Data cloud” !!!
● Cleaned up and integrated.
● Download in RDF-HDT format
● Or query via “Linked data fragments” or SPARQL
● Play around!
● lodlaundromat.org
● See also: youtu.be/sXJdSfjO1dU
What we did (3/3): urisolve
● Based on data in BlazeGraph or RDF-HDT
● Resolves RDF URIs
● Returns RDF with any triples connected to the URI in question
● Source code: github.com/pharmbio/urisolve
Lapins M, Arvidsson S, Lampa S, Berg A, Schaal W, Alvarsson J, Spjuth O.
A confidence predictor for logD using conformal regression and a support-vector machine. J Cheminform. 2018;10(1):17. doi: 10.1186/s13321-018-0271-1
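The behaviour described in the bullets above (resolve a URI, return the triples connected to it) can be sketched as follows. This is not the urisolve source code. It assumes that “connected” means the URI appears as subject or object, that the data is a small in-memory list rather than BlazeGraph or RDF-HDT, and that all terms are IRIs (no literals); the function names and `http://example.org/` terms are hypothetical.

```python
Triple = tuple[str, str, str]

# Hypothetical example data; IRIs only, no literals.
TRIPLES: list[Triple] = [
    ("http://example.org/compound/aspirin", "http://example.org/inhibits",
     "http://example.org/protein/COX1"),
    ("http://example.org/compound/ibuprofen", "http://example.org/inhibits",
     "http://example.org/protein/COX1"),
    ("http://example.org/compound/ibuprofen", "http://example.org/inhibits",
     "http://example.org/protein/COX2"),
]


def triples_connected_to(uri: str, triples: list[Triple]) -> list[Triple]:
    """Assumption: 'connected' means the URI is the subject or the object."""
    return [t for t in triples if t[0] == uri or t[2] == uri]


def to_ntriples(triples: list[Triple]) -> str:
    """Serialize IRI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def resolve_uri(uri: str) -> str:
    """Return RDF (as N-Triples) describing everything connected to the URI."""
    return to_ntriples(triples_connected_to(uri, TRIPLES))


print(resolve_uri("http://example.org/protein/COX1"))
```

In a real deployment the lookup would be delegated to the backing store; the sketch only shows the shape of the request and the response.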
The future: SWI-Prolog as central point?
SWISH: SWI-Prolog Notebook: swish.swi-prolog.org
… for powerful querying and reasoning, aka “hands-on data science”
“Linked Data is the Semantic Web done right”
– Tim Berners-Lee
tomheath.com/blog/2009/03/linked-data-web-of-data-semantic-web-wtf/
Linked Data
Data Science
