Joint Keynote at Int. Conference on Knowledge Engineering and Semantic Web and Prague Computer Science Seminar, Prague, September 22, 2016
The challenges of Big Data are frequently explained in terms of Volume, Velocity, Variety and Veracity. The large variety of data in organizations results from accessing different information systems with heterogeneous schemata or ontologies. In this talk I will present research efforts that target the management of such broad data.
They include: (i) an integrated development environment for programming with broad data, (ii) a query language that allows for typing of query results, (iii) a typed lambda-calculus based on description logics, and (iv) efficient access to data repositories via schema indices.
Programming with Semantic Broad Data
1. Programming with Semantic Broad Data
Steffen Staab
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany &
Web and Internet Science Group · ECS · University of Southampton, UK
@ststaab
west.uni-koblenz.de
2. The World of Big Data – Volume & Velocity
Genome data
• Up to 200 GB/person
Video data
• Upload 300 hrs/min
Sensor data
• 5000 sensors/jet engine
• 1 Tbit/s
360 TB/disc
https://flic.kr/p/8zuDTm
https://flic.kr/p/59jc2h
3. The World of Big Data – Volume & Velocity (continued)
18 concepts
Noise amplitudes
4. The World of Big Data – Variety
Data models
• Graph data
• Relational
• XML
• RDF
• CSV
• JPEG
• MPEG-1, 2, 4
• DICOM
• PDF
• Excel
• ...
Conceptual models
aka ER schemata
aka logical schemata
aka XML schemata
aka RDFS / OWL ontologies
FOAF, Dublin Core, MARC 21, Unifact, ...
Dozens - Hundreds "∞"
5. The World of Big Data – Variety – 15 years ago
SAP
• In the order of 10,000 ‘concepts’
• Days to find the right column
Medical information system (Lars)
• Treating transplant patients
• Approx. 10,000 concepts
Only my very limited experiences
Big consulting business
6. The World of Big Data – Variety – Today!
Wikidata
• 1,148,230 concepts
• 2515 relations
UMLS
• 1 million concepts
Bioinformatics
• 1000s of public databases
• 35 in Bio2RDF (11 billion triples)
eGov datasets
• 200,000 by Fraunhofer FOKUS
• 20,000 by ODI
Knowledge Graphs
• Ask Google, Microsoft, Samsung, HP, ...
Sensor types
• 330 broad types in Wikipedia
• Tens of thousands
How to write valid, robust programs?
How to find data?
7. How to write a valid, robust program?
18 concepts (https://flic.kr/p/8zuDTm):
SELECT ?x
WHERE
{
  ?x a CONCEPT15
}
1,148,230 concepts (March ’16), 1,166,040 concepts (Sept. ’16):
SELECT ?x
WHERE
{
  ?x a CONCEPT151735
}
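The brittleness hinted at here can be illustrated with a minimal sketch: before a query is run, check whether the concept it names is still part of the schema. The helper name `unknown_concepts` and the toy schema are assumptions made for illustration, and the pattern only recognises the simple `?x a ClassName` form used on the slide.

```python
import re

# Matches the class name in SPARQL patterns of the form "?x a ClassName".
# (?<![?$\w]) keeps variables such as ?a from being read as the keyword "a".
CLASS_PATTERN = re.compile(r"(?<![?$\w])a\s+([A-Za-z_][\w:]*)")


def unknown_concepts(query, known_concepts):
    """Return class names used in the query that the schema does not declare."""
    used = CLASS_PATTERN.findall(query)
    return [name for name in used if name not in known_concepts]


# Hypothetical schema snapshot containing 18 concepts, CONCEPT1 .. CONCEPT18.
schema = {f"CONCEPT{i}" for i in range(1, 19)}

small_query = "SELECT ?x WHERE { ?x a CONCEPT15 }"
large_query = "SELECT ?x WHERE { ?x a CONCEPT151735 }"
```

With 18 concepts such a check is easy to do by hand; with more than a million concepts that change between releases, it has to be automated.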
8. How to approach big data
In the following I am guessing what Axel Polleres might have told you about Enterprise Linked Data
9. Traditional Information Architecture
Business Logics
Structured Data
Unstructured Data
Presentation and Interaction
Characteristics:
• Processes are known
• Data structures are known
• Meaning of data primarily in schema and code
10. Big Data in Today’s Information Architecture
Characteristics:
• Little structure
• Semi-structured data
• Meaning of data of primary importance!
11. Variety Issue 1: Data Models
Data Models:
• Relational
• Tree (XML, ...)
• Document oriented
• Stream
• Array
• Graph-DB
RDF
Graph data model as common denominator
12. Dealing with Issue 1: RDF as Data Model
RDF
Graph data model as common denominator
[Figure: RDF graph in which Bowie is linked to Sarandon by a knows edge and to 8-1-1947 by a bornOn edge]
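As a minimal sketch of the graph data model, each edge of the figure can be written as a (subject, predicate, object) triple. The edge directions (from Bowie to Sarandon and to the birth date) and the helper `match` are assumptions made for illustration.

```python
# Each RDF statement is a (subject, predicate, object) triple.
triples = {
    ("Bowie", "knows", "Sarandon"),
    ("Bowie", "bornOn", "8-1-1947"),
}


def match(pattern, graph):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# Everything the graph states about Bowie.
about_bowie = match(("Bowie", None, None), triples)
```

Any relational row, XML element or document field can be flattened into such triples, which is what makes the graph model usable as a common denominator.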
13. Variety Issue 2: Conceptual Models
Conceptual Models:
• ER
• UML
• ...
RDFS
Ontology as common denominator
14. Variety Issue 2: RDFS as common conceptual meta model
RDFS
for explicit conceptual description
[Figure: the RDF graph from the previous slide (Bowie, Sarandon, 8-1-1947; edges knows and bornOn), extended with type edges to the classes MusicArtist and Actor]
15. Variety Issue 3: System Boundaries
IRIs
for globally unique referencing
[Figure: the same graph with prefixed names: m:Bowie, d:Sarandon, 8-1-1947, f:knows, m:bornOn, and rdf:type edges to m:MusicArtist and d:Actor]
m = http://musicbrainz.org
d = http://dbpedia.org
f = http://xmlns.com/foaf/0.1/
rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#
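How prefixed names turn into globally unique IRIs can be sketched as a simple lookup. The trailing slashes on the m and d namespaces and the helper `expand` are assumptions made for illustration; the rdf entry uses the standard RDF namespace IRI.

```python
# Prefix table from the slide; trailing slashes on m and d are assumed.
PREFIXES = {
    "m": "http://musicbrainz.org/",
    "d": "http://dbpedia.org/",
    "f": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'f:knows' into a full IRI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


knows_iri = expand("f:knows")
```

Because the namespace part is owned by one organisation, two systems can use the local name `Bowie` or `Actor` without clashing.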
16. A Practical Perspective on Broad Data with LITEQ
17. Drosophila: Linked Open Data Cloud
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Dozens of domains
Hundreds of data sources
Thousands of concepts
Millions of entities
Billions of triples
Semantic Broad Data
19. Programming with Linked Data
Tasks of the Programmer
1 Schema exploration
2 Programming code types
3 Programming queries
4 Programming procedures for
• creating,
• manipulating,
• persisting
objects
20. Node Path Query Language Using Autocompletion
Exploration of classes
21. Node Path Query Language Using Autocompletion
Exploration of classes
Exploration of relations
22. Node Path Query Language: Query Formulation
Exploration of classes
Exploration of relations
Querying for instances
Type: set of mo:MusicArtist
No definition or declaration needed
23. Node Path Query Language for Code Development
Exploration of classes
Exploration of relations
Querying for instances
Developing code with queries
All translated into SPARQL queries at
• Development time
• Compile time, for type inference (but also as part of the IDE)
• Run time, for querying again
One language to bind them all
24. Node Path Query Language for Code Development
Exploration of classes
Exploration of relations
Querying for instances
Developing code with queries
Developing code with new classes
All translated into SPARQL queries at
• Development time
• Run time update
• Persistence!
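The idea of translating a class-based query into SPARQL can be illustrated with a minimal sketch. This is not LITEQ's actual compilation scheme; the function `class_extension_query` and its parameters are hypothetical, and prefix declarations are left out.

```python
def class_extension_query(class_name, required_properties=()):
    """Build a SPARQL query selecting all instances of a class.

    required_properties restricts the result to instances that have
    a value for each listed property.
    """
    lines = ["SELECT DISTINCT ?x WHERE {", f"  ?x a {class_name} ."]
    for i, prop in enumerate(required_properties):
        lines.append(f"  ?x {prop} ?v{i} .")
    lines.append("}")
    return "\n".join(lines)


# All music artists, and all music artists that made something.
all_artists = class_extension_query("mo:MusicArtist")
productive_artists = class_extension_query("mo:MusicArtist", ["foaf:made"])
```

The same generated query can serve the IDE for autocompletion, the compiler for typing, and the running program for fetching instances.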
25. Formal NPQL Syntax
Data browsing
Restricting Class Expressions
Evaluating Class Expressions
Navigating from Data to Classes
Navigating from Data to Property Types
URI set
Intensional Queries
Extensional Queries
Navigational Queries
26. NPQL Algebra (Example)
Reversibility can be used to simplify path expressions.
27. Summary on LITEQ
Language Integrated Types, Extensions, and Queries
NPQL (Node Path Query Language)
• Navigational Queries
• Intensional Queries
• Extensional Queries
• Compilation to SPARQL
LITEQ
• Implementation of NPQL as F# Type Provider in Visual Studio
• Autocompletion using NPQL queries
• Automatic typing of extensional query results by intensional queries
28. "That seems to work very well in practice, but how does it work in theory?"
let allArtists = Store.NPQL().``mo:MusicArtist``.Extension
What is implied by such a line...
...for the program?
...for the compiler?
29. A Foundational Perspective on Semantic Broad Data Using λDL
30. Programming with Data from a Knowledge Base
What we want to have: Static Type Checking
But:
• In LITEQ: Queries must receive types
• Number of types in our system very/infinitely large
• Existing type systems expect complete knowledge
Issue in our prototype
31. Related Work
Generic Types
• Everything is a node or an edge
• No type checking!
Only 2nd place in Halo competition
Mapping approaches
• Hibernate
• LITEQ
• ActiveRDF
• Summer / Winter
• ...
Preferred in SemWeb now
Been there, done that
32. Example – and Issues with Mapping
Mapping DL types to PL types is problematic because
1. Mix of nominal (MusicArtist) and structural typing (∃recorded.Song)
2. Schema-less information (influencedBy)
3. Inference (hendrix:MusicArtist)
4. Sheer size of terminology
How to type a query?
33. Example
Code
To be rejected
is not a subtype of
How to type a query?
34. Example
Code
To be accepted
is a
How to type a query?
35. Programming with Data from a Knowledge Base
What we want to have: Static Type Checking
Challenge:
• A programming language that accepts concept expressions as types and can deal with inferences
λDL
36. Core Ideas of λDL
Given
• Atomic types: A = {..., A_i, ...}
• Plus function types: T = {..., A_i, ..., T_i → T_j, ...}
Add elements
• Concept expressions (intensional NPQL queries)
• Instances (extensional NPQL queries)
Add knowledge
• Typing and subtyping derived from knowledge base
37. Description Logics Fragment
Concept-forming expressions   Syntax       Semantics
Top                           ⊤            Δ^I
Bottom                        ⊥            ∅
Concept name                  A            A^I
Intersection                  A ⊓ B        A^I ∩ B^I
Negation                      ¬A           Δ^I \ A^I
Existential restriction       ∃R.C         { a ∈ Δ^I | ∃b: (a,b) ∈ R^I and b ∈ C^I }
Axioms                        Syntax       Semantics
T-Box subclass                C ⊑ D        C^I ⊆ D^I
A-Box concept assertion       a : C        a^I ∈ C^I
A-Box role assertion          (a,b) : R    (a^I, b^I) ∈ R^I
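The semantics of this fragment can be made concrete with a minimal sketch that evaluates concept expressions over a small finite interpretation. The domain, the individual purple_haze and the tuple encoding of expressions are assumptions made for illustration; a real knowledge base would be handled by a DL reasoner, not by evaluating a single model.

```python
# A finite interpretation: a domain plus extensions for concept and role names.
DOMAIN = {"hendrix", "beatles", "purple_haze"}
CONCEPTS = {"Song": {"purple_haze"}, "MusicArtist": {"beatles"}}
ROLES = {"recorded": {("hendrix", "purple_haze")}}


def interpret(expr):
    """Return the set of domain elements denoted by a concept expression.

    Expressions are "top", "bottom", or nested tuples:
    ("name", A), ("and", C, D), ("not", C), ("exists", R, C).
    """
    if expr == "top":
        return set(DOMAIN)
    if expr == "bottom":
        return set()
    op = expr[0]
    if op == "name":
        return set(CONCEPTS.get(expr[1], set()))
    if op == "and":
        return interpret(expr[1]) & interpret(expr[2])
    if op == "not":
        return DOMAIN - interpret(expr[1])
    if op == "exists":
        fillers = interpret(expr[2])
        return {a for (a, b) in ROLES.get(expr[1], set()) if b in fillers}
    raise ValueError(f"unknown expression: {expr!r}")


def satisfies_subclass(c, d):
    """Check whether this interpretation satisfies C ⊑ D, i.e. C^I ⊆ D^I."""
    return interpret(c) <= interpret(d)


recorded_a_song = ("exists", "recorded", ("name", "Song"))
```

In this particular interpretation `recorded_a_song` denotes {hendrix}, so the axiom ∃recorded.Song ⊑ MusicArtist is not satisfied until hendrix is added to the extension of MusicArtist.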
38. λ-Calculus
Universal model of computation
• Abstraction
• Application
Example:
• λf.λx.f (f x)
Evaluation rules
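The example term λf.λx.f (f x), which applies a function twice, can be written directly with Python lambdas as a small illustration of abstraction and application. The names `twice` and `increment` are chosen here for readability.

```python
# Abstraction: λf.λx.f (f x) takes a function f and returns λx.f (f x).
twice = lambda f: lambda x: f(f(x))

increment = lambda n: n + 1

# Application: (twice increment) 0 evaluates to increment(increment(0)).
two = twice(increment)(0)

# Terms can be applied to themselves: twice twice applies f four times.
four = twice(twice)(increment)(0)
```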
40. Core λDL: Evaluation and Typing
Nominal DL-Type
41. Subtyping
many types
Add KB knowledge only when needed for checking application, not proactively
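The idea of consulting the knowledge base only on demand can be sketched for the simplest case, subclass axioms between atomic concepts. The axioms and the function `is_subtype` are assumptions made for illustration; subtyping between full concept expressions would require a DL reasoner.

```python
# Hypothetical T-Box: each pair (C, D) states the axiom C ⊑ D.
SUBCLASS_AXIOMS = {
    ("Guitarist", "MusicArtist"),
    ("MusicArtist", "Artist"),
    ("Artist", "Person"),
    ("Song", "Work"),
}


def is_subtype(sub, sup, axioms=SUBCLASS_AXIOMS):
    """Decide sub ⊑ sup by following only the axioms reachable from sub."""
    seen = set()
    frontier = [sub]
    while frontier:
        current = frontier.pop()
        if current == sup:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Look up direct superclasses only when this concept is visited.
        frontier.extend(d for (c, d) in axioms if c == current)
    return False
```

A type checker would call such a function only when it checks a concrete application, so the full subtype hierarchy never has to be materialised.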
42. Further issues and opportunities in λDL
• Queries return sets
• Concept set type needed
• Set operators needed
• Map, Fold, Element
• Queries may return infinite sets
• No theoretical problem, but lack of well-defined stopping conditions in KBs
• Type dispatch based on inferencing
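Why infinite result sets are unproblematic in theory but need an explicit stopping condition in practice can be sketched with lazy iterators. The generator `natural_numbers` is only a stand-in for a query with an infinite answer set, and `take` is a hypothetical helper.

```python
from functools import reduce
from itertools import count, islice


def natural_numbers():
    """Stand-in for a query whose answer set is infinite."""
    return count(0)


def take(n, results):
    """Explicit stopping condition: consume at most n results."""
    return list(islice(results, n))


# Map stays lazy, so it can be applied to the infinite result.
squares = map(lambda x: x * x, natural_numbers())

# Fold needs a finite input, so a prefix is taken first.
first_five = take(5, squares)
total = reduce(lambda acc, x: acc + x, first_five, 0)
```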
44. Result for λDL
Theorem: A well-typed closed term does not get stuck during evaluation (with common exceptions).
Typing is a safety net, but does not solve the halting problem
(empty list)
46. Present of Broad Data
Broad data
• has grown from 10^4 to 10^6 concepts (plus data)
• continues to grow
– more integration of distributed databases
– more sensors of different types
– more crowdwork
• has not been recognized as a problem of its own, yet
• will lead to
– brittleness
– high maintenance efforts
– loss of opportunities
47. Future of Broad Data
New Methods for Broad data
• Explore
– Understand
• Find
• Relate (see e.g. Linda’s talk today)
• Program
• Maintain
48. Thank you for your attention!
Thanks to my collaborators for this work:
Stefan Scheglmann, Martin Leinberger, Matthias Thimm (WeST, Koblenz)
Evelyne Viegas (Microsoft Research, Redmond)
Ralf Lämmel (SOFTLANG, Koblenz)
Editor's notes
Programming with Semantic Broad Data
Steffen Staab
Abstract: Challenges of Big Data are frequently explained by the technical challenges arising from dealing with Volume, Velocity, Variety, and Veracity. The large variety of data in organisations results from having access to a broad set of different information systems with heterogeneous schemata or ontologies. In this talk I will present research efforts that target the management of such broad data. They include: (i) an integrated development environment for programming with broad data, (ii) a query language that allows for typing of query results, (iii) a typed lambda-calculus based on description logics, and (iv) efficient access to data repositories via schema indices.
CV: Steffen Staab is a professor of Databases and Information Systems at Universität Koblenz-Landau and holds a chair in Computer and Web Science at the University of Southampton. He is interested in managing text and data, and specifically in methods that target the management of explicit data semantics as well as the discovery of implicit text and data semantics.
360 TB / disc
Scientists at the University of Southampton have made a major step forward in the development of digital data storage that is capable of surviving for billions of years.
Using nanostructured glass, scientists from the University’s Optoelectronics Research Centre (ORC) have developed the recording and retrieval processes of five dimensional (5D) digital data by femtosecond laser writing.
The storage allows unprecedented properties including 360 TB/disc data capacity, thermal stability up to 1,000°C and virtually unlimited lifetime at room temperature (13.8 billion years at 190°C), opening a new era of eternal data archiving.
These five thousand sensors create an astounding amount of data, 10 GB/s per engine. That is 1.02 Tbps, or 2.04 Tbps for a typical twin-engine aircraft such as the Airbus A320neo or Boeing 737 MAX. For comparison, a Formula 1 car produces around 1.2 GB/s (12.28 Gbps), and the current batch of P&W plane engines collects data in low megabits, not terabits, per second.
EMBL, Cambridge
Could produce trillions of triples for genome information, but having triples is not that valuable for this task
Rolls-Royce, X-Media project: not very interesting data for the knowledge engineer
Distributed data and responsibilities
Cross-enterprise data services
Ad hoc data (e.g. new sensors)
Semantic Web data is (i) provided by different people in an ad-hoc manner, (ii) distributed, (iii) semi-structured, (iv) (more or less) typed, (v) supposed to be used serendipitously.
Impedance mismatch
The pun on LINQ is intended:
LINQ: Language Integrated Query
LITEQ: Language Integrated Types, Extensions, and Queries
Still to do: synchronize this slide with the earlier ones
"DATA SOURCE" is two words, adjust the layout; black on dark grey is hard to read
The developer is missing from the picture
When we zoom into a data source, we find triples:
+ describing classes (e.g. creature)
+ describing schema information about binary relations (e.g. hasOwner)
+ describing the data itself, e.g. Bob
+ and there are also unique identifiers in the form of URIs, which for today we simply think of as Java package names
What does the programmer have to do then?
Automatic typing is not possible for general queries
Static type checking: better informed interfaces, avoiding run-time errors
(1) Conceptualizations rely on a mixture of nominal (MusicArtist) and structural typing (∃recorded.Song). (2) It is also not uncommon to have a very general or no conceptualization at all, as exemplified by the influencedBy role that expresses that hendrix has been influenced by the beatles. (3) Additional, implicit statements may be derived by logical reasoning, e.g., in our running example hendrix:MusicArtist can be inferred.
Another challenge is not illustrated: (4) In real data sources, the sheer size of potential types may become a problem. It is practically infeasible to explicitly convert all 1,148,230 different concepts of Wikidata into types of a programming language.
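Point (3) can be illustrated with a minimal sketch of how hendrix:MusicArtist might be inferred. The axiom ∃recorded.Song ⊑ MusicArtist, the individual machine_gun and the function `infer` are assumptions made for illustration; the notes above do not spell out the running example's knowledge base.

```python
# Assumed A-Box for the running example.
concept_assertions = {("machine_gun", "Song")}
role_assertions = {
    ("hendrix", "recorded", "machine_gun"),
    ("hendrix", "influencedBy", "beatles"),
}

# Assumed T-Box axiom ∃recorded.Song ⊑ MusicArtist, encoded as (R, C, D).
existential_axioms = [("recorded", "Song", "MusicArtist")]


def infer(concepts, roles, axioms):
    """Apply axioms of the form ∃R.C ⊑ D until nothing new is derived."""
    inferred = set(concepts)
    changed = True
    while changed:
        changed = False
        for role, filler, target in axioms:
            for (a, r, b) in roles:
                if (r == role and (b, filler) in inferred
                        and (a, target) not in inferred):
                    inferred.add((a, target))
                    changed = True
    return inferred


all_assertions = infer(concept_assertions, role_assertions, existential_axioms)
```

Nothing in the data states that hendrix is a MusicArtist, yet a type checker must accept hendrix wherever a MusicArtist is expected. The influencedBy statement about the beatles licenses no such conclusion, because no axiom mentions that role.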
ALCO, which is e.g. part of OWL 2 DL
A Nominal DL-Type
An individual a may be an instance of infinitely many types. The nominal type is a syntactic trick to pick a most specific type
Open issues:
Anonymous entities
Metaprogramming: Queries returning concepts
Plans:
tree-shaped Conjunctive queries
Generics
Changes to the knowledge base