Don’t ask “how”,
Ask “why”!
(with illustrations from the Web of Data)
Frank van Harmelen
Dept. of “Computer Science”
Creative Commons License:
allowed to share & remix,
but must attribute & non-commercial
The Web of Data:
do we actually understand
what we built?
(pssst: our theory has fallen way behind our technology,
we know a lot of “how”
but we don’t know much “why”)
Some expectation management
• Speculation
• Questions
• Hypotheses
If we knew what we
were talking about, it
wouldn’t be called
research
Health Warning:
pretentious
philosophical
introduction
coming up
Computer Science should be like a natural science:
studying objects in the information universe,
and the laws that govern them.
And yes, I believe that the information universe exists and can be studied
Fortunately, I’m in good company
"Computer science is no more about computers
than astronomy is about telescopes”
-- Edsger W. Dijkstra
"we have to think of computation as a principle
and computers (only) as the tool”
-- Peter Denning
"Professor Shih-Fu Chang will receive a doctorate
for his many groundbreaking contributions to our
understanding of the digital universe“
-- Arnold Smeulders
Methodological Manifesto
Computer Science often:
given desired properties
design an object which has those properties
In this talk:
given a (very large & complex) object,
explain its observed properties
Not: “solving a problem”
But: “answering a question”
“The computer is not our object of study,
It’s our observational instrument”
Our object
of study
&
What to
measure
Semantic Web in 4 principles
1. Give all things a name
2. Make a graph of relations between the things
at this point we have (only) a Giant Graph
3. Make sure all names are URIs
at this point we have (only) a Giant Global Graph
4. Add semantics (= predictable inference)
This gives us a Giant Global Knowledge Graph
http://www.youtube.com/watch?v=tBSdYi4EY3s
P3. Make sure all names are URIs
x T
[<x> IsOfType <T>]
different
owners & locations
< analgesic >
P4: Add semantics
Frank Lynda
married-to
• Frank is male
• married-to relates
males to females
• married-to relates
1 male to 1 female
• Lynda = Hazel
lowerbound upperbound
Hazel
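A minimal sketch of "semantics = predictable inference" for the married-to example above. It assumes two toy facts (the second one, Frank married-to Hazel, is read off the figure and is an assumption) and hard-codes the two constraints from the slide; the names FACTS, infer and RESULT are hypothetical and plain strings stand in for URIs.

```python
# Toy facts as (subject, predicate, object) triples. The names are illustrative
# strings, not real URIs; the second married-to fact is an assumption.
FACTS = {
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
}


def infer(facts):
    """Return the facts plus what the two married-to constraints add."""
    derived = set(facts)
    spouse = {}
    for s, p, o in sorted(facts):
        if p != "married-to":
            continue
        # Constraint 1: married-to relates males to females.
        derived.add((s, "type", "Male"))
        derived.add((o, "type", "Female"))
        # Constraint 2: married-to relates 1 male to 1 female,
        # so two different spouse names must denote the same individual.
        if s in spouse and spouse[s] != o:
            derived.add((spouse[s], "sameAs", o))
        spouse.setdefault(s, o)
    return derived


RESULT = infer(FACTS)
```

Both conclusions on the slide (Lynda is female, Lynda = Hazel) end up in RESULT without being stated in FACTS.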
Did we get anywhere?
• Google = meaningful search
• NXP = data integration
• BBC = content re-use
• BestBuy = SEO (RDF-a)
• data.gov = data-publishing
Oracle DB, IBM DB2
Reuters,
New York Times, Guardian
Sears, Kmart, OverStock,
Volkswagen, Renault
GoodRelations ontology,
schema.org
Yahoo, Bing
1 triple
How big is the Semantic Web?
10^7 triples: Suez Canal
Denny Vrandečić – AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS
subsecond querying
10^8 triples: Moon
~10^9 triples: Earth
Size of the current Semantic Web
~10^10 triples: Jupiter
≈ 1 triple per web-page
Observing at different scales
Distances weighted by
number of links
What is this picture telling us?
• single connected component
• Dense clusters with sparse interconnections
• connectivity depends on a few nodes
• the degree distribution
is highly skewed,
• its structure varies
between aggregation levels.
What is this picture telling us?
• Does the meaning of a node
depend on the cluster it appears in?
• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of
low-level and high-level
structure?
Logic?
Measuring what?
• degree distribution, P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of neighbourhood,
intuitive notion of local connectivity
• betweenness centrality:
fraction of all shortest paths that pass through a node,
how essential is the node for global connectivity,
likelihood of being visited on a graphwalk
• closeness centrality
1/average distance to all other nodes
where to start for a graphwalk
• average shortest path length
helps to tune upperbound on graphwalks
• number of (strongly) connected components
measure of coherence
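To make the measures above concrete, here is a minimal standard-library sketch on a hypothetical five-node undirected toy graph (not data from the talk). Closeness is computed over reachable nodes only, which is an assumption needed for disconnected graphs; betweenness centrality is left out for brevity.

```python
from collections import Counter, deque

# Hypothetical toy graph as undirected adjacency sets; "e" is isolated.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}


def bfs_distances(graph, source):
    """Shortest path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree_distribution(graph):
    """P(d(v) = n) for every degree n that occurs."""
    counts = Counter(len(nbrs) for nbrs in graph.values())
    return {d: c / len(graph) for d, c in counts.items()}


def degree_centrality(graph, v):
    """Relative size of the neighbourhood of v."""
    return len(graph[v]) / (len(graph) - 1)


def closeness_centrality(graph, v):
    """1 / average distance to the nodes reachable from v."""
    others = [d for n, d in bfs_distances(graph, v).items() if n != v]
    return len(others) / sum(others) if others else 0.0


def average_shortest_path_length(graph):
    """Mean distance over all ordered pairs of mutually reachable nodes."""
    total = pairs = 0
    for v in graph:
        for n, d in bfs_distances(graph, v).items():
            if n != v:
                total += d
                pairs += 1
    return total / pairs


def connected_components(graph):
    """List of node sets, one per connected component."""
    seen, components = set(), []
    for v in graph:
        if v not in seen:
            comp = set(bfs_distances(graph, v))
            seen |= comp
            components.append(comp)
    return components
```

On real Web of Data graphs the same quantities need decisions the toy graph hides, such as whether edges are treated as directed.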
Measuring When?
2009 and 2014
Real phenomenon or
measurement artefact?
Some first
measurements
&
their difficulties
Christophe Gueret
(European Conference on Complex Systems 2011)
OK, let’s measure
• Billion Triple Challenge 2009
• WoD 2009
• WoD 2010
• BTC aggregated
• SameAs aggregated
Non trivial
decisions
OK, let’s measure
Degree distribution: BTC and BTC aggregated
This suggests a power-law distribution
at different scales
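One crude way to check whether a degree sequence "suggests" a power law is to fit a straight line to the degree histogram on log-log axes. The sketch below does that with ordinary least squares; it is an illustration only, not the method used for the plots in the talk, and a straight-line fit on log-log axes is a weak test of power-law behaviour. The function name loglog_slope and the synthetic data are hypothetical.

```python
import math
from collections import Counter


def loglog_slope(degrees):
    """Least-squares slope of log(count) against log(degree)."""
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic degree sequence with count(d) = 3600 / d^2, so the slope is -2.
SYNTHETIC = [d for d in range(1, 7) for _ in range(3600 // (d * d))]
SLOPE = loglog_slope(SYNTHETIC)
```

A slope that stays roughly constant when the same fit is repeated on the aggregated graph would match the "at different scales" observation.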
OK, let’s measure
• Comparing WoD 2009 & 2010:
increasing powerlaw behaviour.
• top 5 by degree centrality in sameAs-aggregated
Preferential attachment?
Dataset                 SameAs degree centrality
Revyu.com               0.039
Semanticweb.org         0.037
Dbpedia.org             0.027
Data.semanticweb.org    0.019
www.deri.ie             0.017
This guy owns 4 out of these 5!
Interesting socio-technical questions
But what should we measure?
• Treat sameAs nodes as single node?
(semantically yes, pragmatically no?)
• Is (undirected) connectedness meaningful,
instead of (directed) strong connectedness?
(semantically no, pragmatically yes?)
???????
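The first question (treat sameAs nodes as a single node?) changes every graph measure, so it helps to see what collapsing looks like. A minimal sketch, assuming triples are plain tuples and the predicate is spelled "sameAs"; the node names and the helper names find and collapse_sameas are hypothetical.

```python
def find(parent, x):
    """Union-find lookup with path halving; unseen nodes are their own root."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def collapse_sameas(triples):
    """Merge sameAs-linked nodes and rewrite all other triples accordingly."""
    parent = {}
    for s, p, o in triples:
        if p == "sameAs":
            parent[find(parent, s)] = find(parent, o)
    return {
        (find(parent, s), p, find(parent, o))
        for s, p, o in triples
        if p != "sameAs"
    }


# Hypothetical example: two names for one thing, each with its own edge.
TRIPLES = [
    ("a1", "sameAs", "a2"),
    ("a1", "locatedIn", "b"),
    ("a2", "partOf", "c"),
]
MERGED = collapse_sameas(TRIPLES)
```

Before collapsing, a1 and a2 each have one outgoing edge; afterwards the merged node has two, which is exactly why the choice affects degree-based measures.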
And what are “good” values?
• Degree distribution should be powerlaw?
(robust against random decay)
• Local clustering coefficient should be high?
(strongly connected “topics”)
• Betweenness impact of a sameAs-link
should be high?
(adds much extra information)
???????
And here’s another one:
usage of DBPedia types
(Gangemi et al, ISWC2011)
impact on
mapping?
impact on
reasoning?
impact on
storage?
So what?
These observations have impact on design!
LODLaundromat:
a new observatory
for the Web of Data
Wouter Beek Laurens Rietveld
(ISWC 2014)
LOD Laundromat:
clean your dirty triples
• crawl
– from registries (CKAN),
– by chasing URLs,
– users can submit URLs
– users can submit files (DropBox plugin)
• read multiple formats
• clean syntax errors, remove duplicates
• compute meta-data information
• publish triples as JSON API & (meta-data) as SPARQL
• harvest 1B triples/day
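The "clean syntax errors, remove duplicates" step can be sketched as follows. This is not the LOD Laundromat implementation: it assumes input in a simplified N-Triples-like form, uses a deliberately loose regular expression instead of a real parser, and the names TRIPLE and clean are hypothetical.

```python
import re

# Simplified N-Triples line: subject, predicate, object, terminating dot.
TRIPLE = re.compile(
    r'^\s*(<[^<>\s]+>|_:\w+)\s+'   # subject: IRI or blank node
    r'(<[^<>\s]+>)\s+'             # predicate: IRI
    r'(<[^<>\s]+>|_:\w+|"(?:[^"\\]|\\.)*"(?:\^\^<[^<>\s]+>|@[A-Za-z-]+)?)'
    r'\s*\.\s*$'                   # object, then the closing dot
)


def clean(lines):
    """Return (unique well-formed triples in input order, rejected count)."""
    seen, cleaned, rejected = set(), [], 0
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            rejected += 1  # syntax error: drop the line
            continue
        triple = match.groups()
        if triple not in seen:  # duplicate removal
            seen.add(triple)
            cleaned.append(triple)
    return cleaned, rejected


DIRTY = [
    '<http://example.org/a> <http://example.org/p> <http://example.org/b> .',
    '<http://example.org/a>   <http://example.org/p>   <http://example.org/b> .',
    '<http://example.org/a> <http://example.org/p> "hello"@en .',
    '<http://example.org/a> <http://example.org/p> .',
]
CLEANED, REJECTED = clean(DIRTY)
```

The second line differs from the first only in whitespace, so it is removed as a duplicate; the last line has no object and is rejected.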
LOD Laundromat:
• 600.000 RDF files
• 3,345,904,218 unique URLs
• 5,319,790,836 literals
(not counting 6,699,148,542 integers, dates, etc)
• 328Gb of zip’ed RDF
http://lodlaundromat.org
https://www.youtube.com/watch?v=nU2Yh8RXeow
LOTUS:
Text search on LODLaundromat
• Filip Ilievski (ISWC 2016)
• Search 5 billion(!) text strings in
Linked Open Data (0.5Tb)
• From words to linked data
• Fuzzy matching (or precise, or substring, or …)
• http://lotus.lodlaundromat.org
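The three matching modes can be illustrated with a toy in-memory index from literals to URIs. This is only a sketch under that assumption and says nothing about how LOTUS is implemented; INDEX, lookup and the example.org URIs are hypothetical, and fuzzy matching is delegated to difflib from the standard library.

```python
import difflib

# Hypothetical index from text literals to the URIs they describe.
INDEX = {
    "Amsterdam": ["http://example.org/Amsterdam"],
    "Rotterdam": ["http://example.org/Rotterdam"],
    "Utrecht": ["http://example.org/Utrecht"],
}


def lookup(term, index, mode="fuzzy", cutoff=0.8):
    """From a word to linked data: return URIs whose literal matches term."""
    if mode == "precise":
        keys = [k for k in index if k == term]
    elif mode == "substring":
        keys = [k for k in index if term.lower() in k.lower()]
    else:
        keys = difflib.get_close_matches(term, list(index), n=5, cutoff=cutoff)
    return sorted({uri for k in keys for uri in index[k]})
```

For example, lookup("Amsterdm", INDEX) tolerates the typo, while lookup("dam", INDEX, mode="substring") returns both cities ending in "dam".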
Graph structure
as a proxy
for semantics
Laurens Rietveld
(ISWC 2014)
Hotspots in Knowledge Graphs
• Observation:
realistic queries only hit a small part of the data (< 2%)
(DBPedia would need 500k queries to hit < 1%)
• Non-trivial to obtain these numbers
(YASGUI dataset, SWJ2015)
Dataset            Size    #queries   Coverage
DBPedia 3.9        459M    1640       0.003%
Linked Geo Data    289M    81         1.917%
MetaLex            204M    4933       0.016%
Open-BioMed        79M     931        3.100%
Bio2RDF/KEGG       50M     1297       2.013%
SW Dog Food        240K    193        39.438%
Experiment
• Can we predict the popular part of the graph
without knowing the queries?
• Use graph-measures as selection thresholds
– indegree (easy)
– outdegree (easy)
– pagerank (doable, iterative)
– betweenness centrality (hard)
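A minimal sketch of the experiment's idea, using the easiest measure (indegree) as the selection threshold. It is simplified to selecting nodes rather than sampling triples, and the toy triples, the set of nodes "hit by queries", and the function names are all hypothetical.

```python
from collections import Counter


def top_by_indegree(triples, fraction):
    """Select the given fraction of object nodes with the highest indegree."""
    indegree = Counter(o for _, _, o in triples)
    ranked = sorted(indegree, key=lambda n: (-indegree[n], n))
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k])


def recall(selected, queried):
    """Fraction of the nodes touched by queries that the selection contains."""
    return len(selected & queried) / len(queried) if queried else 0.0


# Hypothetical graph with one hub "h", and a hypothetical query log summary.
TRIPLES = [
    ("a", "p", "h"),
    ("b", "p", "h"),
    ("c", "p", "h"),
    ("h", "p", "x"),
    ("a", "p", "y"),
]
QUERIED = {"h", "x"}
SELECTED = top_by_indegree(TRIPLES, 0.34)
```

The query log is used only for evaluation (recall), never for selection, which is the point of the experiment.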
Evaluate
Queries
Structural sampling: results
Why does this
work so
unreasonably
well?
Which
methods
work on
which types
of graphs?
Logic?
It’s not only about the
graph structure:
Exploiting
the choice of URLs
to deal with inconsistency
Zhisheng Huang
(ISWC 2008)
General Idea
s(T,φ,0), s(T,φ,1), s(T,φ,2), …: growing subsets of T
=def
φ is soft-implied by T if it is implied by a consistent subset of T
Which selection function s(T,φ,n)?
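A minimal sketch of the graph-growing idea, under strong simplifying assumptions: the theory is a set of propositional Horn rules, "~" marks negation, and the selection function is plain symbol overlap rather than Google distance. When the selected subset becomes inconsistent the sketch simply gives up, which is cruder than looking for a consistent subset. All names and the toy theory are hypothetical.

```python
def atom(literal):
    """Strip the negation marker from a literal."""
    return literal.lstrip("~")


def atoms(axiom):
    body, head = axiom
    return {atom(head)} | {atom(b) for b in body}


def closure(axioms):
    """Forward chaining over Horn rules given as (body_tuple, head)."""
    known, changed = set(), True
    while changed:
        changed = False
        for body, head in axioms:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known


def consistent(known):
    return not any("~" + lit in known for lit in known)


def select(theory, query, k):
    """s(T, query, k): axioms within k relevance steps of the query."""
    symbols, chosen = {atom(query)}, set()
    for _ in range(k):
        chosen = {ax for ax in theory if atoms(ax) & symbols}
        symbols |= set().union(*(atoms(ax) for ax in chosen))
    return chosen


def soft_implies(theory, query):
    """Grow the selection until the query follows or growing must stop."""
    previous, k = None, 0
    while True:
        k += 1
        subset = select(theory, query, k)
        if subset == previous:
            return "undetermined", k - 1
        known = closure(subset)
        if not consistent(known):
            return "undetermined", k
        if query in known:
            return "accepted", k
        previous = subset


# Hypothetical inconsistent theory: the whole set derives flies and ~flies.
THEORY = {
    ((), "penguin"),
    (("penguin",), "bird"),
    (("bird",), "has_wings"),
    (("has_wings",), "flies"),
    (("penguin",), "swims"),
    (("swims",), "~flies"),
}
```

Here soft_implies(THEORY, "bird") is answered from a consistent subset, while soft_implies(THEORY, "flies") runs into the contradiction and stays undetermined.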
Google distance
\[ NGD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}} \]
where
f(x) is the number of Google hits for x
f(x,y) is the number of Google hits for
the tuple of search items x and y
M is the number of web pages indexed by Google
Compute Google distance between URI’s for numbers and colors
(note: we’re abusing URI’s as words!)
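A minimal sketch of the normalized Google distance, with f(x), f(y), f(x,y) and M passed in as plain numbers. The hit counts in the usage below are made up for illustration; obtaining real counts from a search engine is outside the sketch.

```python
import math


def ngd(fx, fy, fxy, m):
    """Normalized Google distance from hit counts.

    fx, fy: number of hits for x and for y
    fxy:    number of hits for x and y together
    m:      number of indexed pages
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


# Made-up counts: terms that always co-occur are at distance 0.
SAME = ngd(1000, 1000, 1000, 10**6)
# Made-up counts: (3 - 1) / (6 - 2) in base-10 logs, so 0.5 in any base.
HALF = ngd(1000, 100, 10, 10**6)
```

The base of the logarithm cancels in the ratio, so natural logs are as good as any other choice.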
Evaluation:
ask queries over inconsistent datasets
Conclusion:
“Graph-growing” using Google Distance
gives a high quality sound approximation
Ontology          #queries   Unexpected   Intended
MadCow+           2594       0            93%
Communication     6576       0            96%
Transportation    6258       0            99%
Why does this
work so
unreasonably
well?
Google distance
This isn’t
supposed to
work!
URIs are supposed to be meaningless..
Information content
of URI’s?
Steven de Rooij
(ISWC 2016)
Unexplained performance
prompts more experiments
ISWC 2016
Do URL’s encode meaning?
Fraction of datasets with redundancy for types/predicates
at significance level > 0.99
BTW, this is 600.000 datapoints (RDF docs)
Properties
Types
We need a
semantics
that accounts
for this!
Inference as a measure
for information content
Nobody can
predict these
numbers
Exploiting
the
graph structure
for inference
Kathrin Dentler
(SSWS2009)
Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time
Every author of a
paper is a person
Every person is
also an agent
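The four bullets above can be sketched as a small simulation. It assumes each micro-reasoner does a seeded random walk over the nodes, applies its single rule to the triples touching its current node, and that derived facts expire after ttl steps; the predicate names, the toy data and all function names are hypothetical, and this is not the implementation evaluated in the paper.

```python
import random


def author_rule(triple):
    """Every author of a paper is a person."""
    s, p, _ = triple
    return (s, "type", "Person") if p == "authorOf" else None


def person_rule(triple):
    """Every person is also an agent."""
    s, p, o = triple
    return (s, "type", "Agent") if (p, o) == ("type", "Person") else None


def swarm(triples, rules, steps, ttl, seed=0):
    """Run one micro-reasoner per rule; return the derived facts still alive."""
    rng = random.Random(seed)
    base = set(triples)
    derived = {}  # derived triple -> step at which it disappears
    nodes = sorted({s for s, _, _ in base} | {o for _, _, o in base})
    positions = [rng.choice(nodes) for _ in rules]
    for step in range(steps):
        # Deduced facts disappear after some time.
        derived = {t: end for t, end in derived.items() if end > step}
        visible = base | set(derived)
        for i, rule in enumerate(rules):
            here = positions[i]
            local = [t for t in visible if here in (t[0], t[2])]
            for t in local:
                new = rule(t)
                if new is not None and new not in base:
                    derived[new] = step + ttl  # (re)derive and refresh
            # Walk to a random neighbour of the current node.
            neighbours = sorted({t[2] if t[0] == here else t[0] for t in local})
            positions[i] = rng.choice(neighbours) if neighbours else rng.choice(nodes)
    return set(derived)


DATA = [("alice", "authorOf", "paper1"), ("bob", "authorOf", "paper1")]
ALIVE = swarm(DATA, [author_rule, person_rule], steps=50, ttl=10)
```

Whatever the walk does, ALIVE can only contain sound conclusions; which of them are present at a given moment depends on the walk, which is where determinism and completeness are traded away.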
Some early results
• most of the
derivations are
produced
• Lost:
determinism,
completenes
• Gained:
anytime,
coherent,
prioritised
For which
graphs does
this work well
or not?
Closing:
A call to all
Semantic Web
researchers
A gazillion new open questions
don’t just try to build things,
also try to understand things
don’t just ask how,
also ask why

Weitere ähnliche Inhalte

Was ist angesagt?

Translating Ontologies in Real-World Settings
Translating Ontologies in Real-World SettingsTranslating Ontologies in Real-World Settings
Translating Ontologies in Real-World SettingsMauro Dragoni
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesLaura Po
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSaeedeh Shekarpour
 
Presentation1.pdf
Presentation1.pdfPresentation1.pdf
Presentation1.pdfZixunZhou
 
Cognitive Models in Recommender Systems
Cognitive Models in Recommender SystemsCognitive Models in Recommender Systems
Cognitive Models in Recommender SystemsChristoph Trattner
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedJoel Azzopardi
 
Keystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenanceKeystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenancePaolo Missier
 
In search of lost knowledge: joining the dots with Linked Data
In search of lost knowledge: joining the dots with Linked DataIn search of lost knowledge: joining the dots with Linked Data
In search of lost knowledge: joining the dots with Linked Datajonblower
 
Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)Rich Heimann
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationChristoph Trattner
 
Recommending Items in Social Tagging Systems Using Tag and Time Information
Recommending Items in Social Tagging Systems Using Tag and Time InformationRecommending Items in Social Tagging Systems Using Tag and Time Information
Recommending Items in Social Tagging Systems Using Tag and Time InformationChristoph Trattner
 
Semantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and SummarizationSemantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and SummarizationGong Cheng
 
Data-mining the Semantic Web @TCD
Data-mining the Semantic Web @TCDData-mining the Semantic Web @TCD
Data-mining the Semantic Web @TCDFrank Lynam
 
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterElephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterRobert H. McDonald
 

Was ist angesagt? (20)

Translating Ontologies in Real-World Settings
Translating Ontologies in Real-World SettingsTranslating Ontologies in Real-World Settings
Translating Ontologies in Real-World Settings
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
Presentation1.pdf
Presentation1.pdfPresentation1.pdf
Presentation1.pdf
 
Cognitive Models in Recommender Systems
Cognitive Models in Recommender SystemsCognitive Models in Recommender Systems
Cognitive Models in Recommender Systems
 
BDACA1516s2 - Lecture3
BDACA1516s2 - Lecture3BDACA1516s2 - Lecture3
BDACA1516s2 - Lecture3
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
 
Keystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenanceKeystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenance
 
In search of lost knowledge: joining the dots with Linked Data
In search of lost knowledge: joining the dots with Linked DataIn search of lost knowledge: joining the dots with Linked Data
In search of lost knowledge: joining the dots with Linked Data
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
Recommending Items in Social Tagging Systems Using Tag and Time Information
Recommending Items in Social Tagging Systems Using Tag and Time InformationRecommending Items in Social Tagging Systems Using Tag and Time Information
Recommending Items in Social Tagging Systems Using Tag and Time Information
 
BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5
 
Semantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and SummarizationSemantic Data Retrieval: Search, Ranking, and Summarization
Semantic Data Retrieval: Search, Ranking, and Summarization
 
Data-mining the Semantic Web @TCD
Data-mining the Semantic Web @TCDData-mining the Semantic Web @TCD
Data-mining the Semantic Web @TCD
 
BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2
 
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterElephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
 

Ähnlich wie The Web of Data: do we actually understand what we built?

Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Deep Content Learning in Traffic Prediction and Text Classification
Deep Content Learning in Traffic Prediction and Text ClassificationDeep Content Learning in Traffic Prediction and Text Classification
Deep Content Learning in Traffic Prediction and Text ClassificationHPCC Systems
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
 
Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingVrije Universiteit Amsterdam
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semanticsplan4all
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 
Short URLs, Big Fun
Short URLs, Big FunShort URLs, Big Fun
Short URLs, Big FunHilary Mason
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesEugene Dvorkin
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationIan Foster
 
The Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationThe Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationFrank van Harmelen
 
Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big DataPierre De Wilde
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Bertram Ludäscher
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Jonathan Stray
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 

Ähnlich wie The Web of Data: do we actually understand what we built? (20)

Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Deep Content Learning in Traffic Prediction and Text Classification
Deep Content Learning in Traffic Prediction and Text ClassificationDeep Content Learning in Traffic Prediction and Text Classification
Deep Content Learning in Traffic Prediction and Text Classification
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Our World is Socio-technical
Our World is Socio-technicalOur World is Socio-technical
Our World is Socio-technical
 
Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic Programming
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
Short URLs, Big Fun
Short URLs, Big FunShort URLs, Big Fun
Short URLs, Big Fun
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
The Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationThe Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge Representation
 
Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big Data
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 

Mehr von Frank van Harmelen

The K in "neuro-symbolic" stands for "knowledge"
The K in "neuro-symbolic" stands for "knowledge"The K in "neuro-symbolic" stands for "knowledge"
The K in "neuro-symbolic" stands for "knowledge"Frank van Harmelen
 
Adoption of Knowledge Graphs, mid 2022 (incomplete)
Adoption of Knowledge Graphs, mid 2022 (incomplete)Adoption of Knowledge Graphs, mid 2022 (incomplete)
Adoption of Knowledge Graphs, mid 2022 (incomplete)Frank van Harmelen
 
Adoption of Knowledge Graphs, late 2019
Adoption of Knowledge Graphs, late 2019Adoption of Knowledge Graphs, late 2019
Adoption of Knowledge Graphs, late 2019Frank van Harmelen
 
Adoption of Knowledge Graphs, mid 2019
Adoption of Knowledge Graphs, mid 2019Adoption of Knowledge Graphs, mid 2019
Adoption of Knowledge Graphs, mid 2019Frank van Harmelen
 
On the nature of AI, and the relation between symbolic and statistical approa...
On the nature of AI, and the relation between symbolic and statistical approa...On the nature of AI, and the relation between symbolic and statistical approa...
On the nature of AI, and the relation between symbolic and statistical approa...Frank van Harmelen
 
Linked Open Data for Medical Guidelines Interactions
Linked Open Data for Medical  Guidelines InteractionsLinked Open Data for Medical  Guidelines Interactions
Linked Open Data for Medical Guidelines InteractionsFrank van Harmelen
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoFrank van Harmelen
 
Knowledge Engineering rediscovered, Towards Reasoning Patterns for the Semant...
Knowledge Engineering rediscovered, Towards Reasoning Patterns for the Semant...Knowledge Engineering rediscovered, Towards Reasoning Patterns for the Semant...
Knowledge Engineering rediscovered, Towards Reasoning Patterns for the Semant...Frank van Harmelen
 
How the Web can change social science research (including yours)
How the Web can change social science research (including yours)How the Web can change social science research (including yours)
How the Web can change social science research (including yours)Frank van Harmelen
 
4 Popular Fallacies about the Semantic Web
4 Popular Fallacies about the Semantic Web4 Popular Fallacies about the Semantic Web
4 Popular Fallacies about the Semantic WebFrank van Harmelen
 
Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Frank van Harmelen
 
Ontology mapping needs context & approximation
Ontology mapping needs context & approximationOntology mapping needs context & approximation
Ontology mapping needs context & approximationFrank van Harmelen
 
Ontology Mapping - Out Of The Babel Tower
Ontology Mapping - Out Of The Babel TowerOntology Mapping - Out Of The Babel Tower
Ontology Mapping - Out Of The Babel TowerFrank van Harmelen
 
LarKC: the large knowledge collider
LarKC: the large knowledge colliderLarKC: the large knowledge collider
LarKC: the large knowledge colliderFrank van Harmelen
 

Mehr von Frank van Harmelen (20)

The K in "neuro-symbolic" stands for "knowledge"
The K in "neuro-symbolic" stands for "knowledge"The K in "neuro-symbolic" stands for "knowledge"
The K in "neuro-symbolic" stands for "knowledge"
 
Adoption of Knowledge Graphs, mid 2022 (incomplete)
Adoption of Knowledge Graphs, mid 2022 (incomplete)Adoption of Knowledge Graphs, mid 2022 (incomplete)
Adoption of Knowledge Graphs, mid 2022 (incomplete)
 
Adoption of Knowledge Graphs, late 2019
Adoption of Knowledge Graphs, late 2019Adoption of Knowledge Graphs, late 2019
Adoption of Knowledge Graphs, late 2019
 
Adoption of Knowledge Graphs, mid 2019
Adoption of Knowledge Graphs, mid 2019Adoption of Knowledge Graphs, mid 2019
Adoption of Knowledge Graphs, mid 2019
 
Empirical Semantics
Empirical SemanticsEmpirical Semantics
Empirical Semantics
 
On the nature of AI, and the relation between symbolic and statistical approa...
On the nature of AI, and the relation between symbolic and statistical approa...On the nature of AI, and the relation between symbolic and statistical approa...
On the nature of AI, and the relation between symbolic and statistical approa...
 
Linked Open Data for Medical Guidelines Interactions
Linked Open Data for Medical  Guidelines InteractionsLinked Open Data for Medical  Guidelines Interactions
Linked Open Data for Medical Guidelines Interactions
 
Semantic Web questions we couldn't ask 10 years ago
The Web of Data: do we actually understand what we built?

  • 1. Don’t ask “how”, Ask “why”! (with illustrations from the Web of Data) Frank van Harmelen Dept. of “Computer Science” Creative Commons License: allowed to share & remix, but must attribute & non-commercial The Web of Data: do we actually understand what we built? (pssst: our theory has fallen way behind our technology, we know a lot of “how” but we don’t know much “why”)
  • 2. Some expectation management • Speculation • Questions • Hypotheses If we knew what we were talking about, it wouldn’t be called research
  • 4. Computer Science should be like a natural science: studying objects in the information universe, and the laws that govern them. And yes, I believe that the information universe exists and can be studied
  • 5. Fortunately, I’m in good company "Computer science is no more about computers than astronomy is about telescopes” -- Edsger W. Dijkstra "we have to think of computation as a principle and computers (only) as the tool” -- Peter Denning "Professor Shih-Fu Chang will receive a doctorate for his many groundbreaking contributions to our understanding of the digital universe“ -- Arnold Smeulders
  • 6. Methodological Manifesto Computer Science often: given desired properties, design an object which has those properties. In this talk: given a (very large & complex) object, explain its observed properties. Not: “solving a problem” But: “answering a question”
  • 7. “The computer is not our object of study, It’s our observational instrument”
  • 9. Semantic Web in 4 principles 1. Give all things a name 2. Make a graph of relations between the things at this point we have (only) a Giant Graph 3. Make sure all names are URIs at this point we have (only) a Giant Global Graph 4. Add semantics (= predictable inference) This gives us a Giant Global Knowledge Graph http://www.youtube.com/watch?v=tBSdYi4EY3s
  • 10. P3. Make sure all names are URIs x T [<x> IsOfType <T>] different owners & locations < analgesic >
  • 11. P4: Add semantics Frank Lynda married-to • Frank is male • married-to relates males to females • married-to relates 1 male to 1 female • Lynda = Hazel lowerbound upperbound Hazel
  • 12. Did we get anywhere? • Google = meaningful search • NXP = data integration • BBC = content re-use • BestBuy = SEO (RDF-a) • data.gov = data-publishing Oracle DB, IBM DB2 Reuters, New York Times, Guardian Sears, Kmart, OverStock, Volkswagen, Renault GoodRelations ontology, schema.org Yahoo, Bing
  • 13. 1 triple How big is the Semantic Web?
  • 17. 10^7 triples: Suez Canal (Denny Vrandečić, AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS)
  • 18. 10^8 triples: Moon (subsecond querying)
  • 19. ~10^9 triples: Earth
  • 20. Size of the current Semantic Web: ~10^10 triples: Jupiter (≈ 1 triple per web-page)
  • 26. What is this picture telling us? • single connected component • Dense clusters with sparse interconnections • connectivity depends on a few nodes • the degree distribution is highly skewed, • its structure varies between aggregation levels.
  • 27. What is this picture telling us? • Does the meaning of a node depend on the cluster it appears in? • Does path-length correlate with semantic distance? • Are highly connected nodes more certain? • Mutual influence of low-level and high-level structure? Logic?
  • 28. Measuring what? • degree distribution: P(d(v)=n) or P(d(v)>n) • degree centrality: relative size of neighbourhood, an intuitive notion of local connectivity • betweenness centrality: fraction of all shortest paths that pass through a node; how essential the node is for global connectivity; likelihood of being visited on a graph walk • closeness centrality: 1/average distance to all other nodes; where to start a graph walk • average shortest path length: helps to tune the upper bound on graph walks • number of (strongly) connected components: a measure of coherence
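
Several of the measures on slide 28 can be illustrated on a toy graph. The following is a minimal sketch in plain Python, assuming an undirected, unweighted graph stored as an adjacency dictionary; the five-node graph and all function names are made up for illustration and are not taken from the talk. Betweenness centrality is left out to keep the sketch short.

```python
from collections import Counter, deque

def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def degree_distribution(graph):
    """P(d(v) = n): fraction of nodes that have each degree n."""
    counts = Counter(len(nbrs) for nbrs in graph.values())
    return {deg: c / len(graph) for deg, c in sorted(counts.items())}

def degree_centrality(graph):
    """Size of each node's neighbourhood relative to the maximum (n - 1)."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """1 / average distance to all other reachable nodes."""
    result = {}
    for v in graph:
        others = [d for node, d in bfs_distances(graph, v).items() if node != v]
        result[v] = len(others) / sum(others) if others else 0.0
    return result

def average_shortest_path_length(graph):
    """Mean hop distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for v in graph:
        for node, d in bfs_distances(graph, v).items():
            if node != v:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

def connected_components(graph):
    """Number of connected components (undirected connectedness)."""
    seen, count = set(), 0
    for v in graph:
        if v not in seen:
            seen.update(bfs_distances(graph, v))
            count += 1
    return count

# Hypothetical toy graph: one hub, three spokes, and one node hanging off a spoke.
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c"},
}

print(degree_distribution(graph))
print(degree_centrality(graph)["hub"])
print(closeness_centrality(graph)["hub"])
print(average_shortest_path_length(graph))
print(connected_components(graph))
```

On graphs at Web-of-Data scale, the all-pairs BFS used here would be far too slow; it only serves to make the definitions concrete.
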
  • 29. Measuring when? 2009 vs. 2014: real phenomenon or measurement artefact?
  • 30. Some first measurements & their difficulties Christophe Gueret (European Conference on Complex Systems 2011,
  • 31. OK, let’s measure • Billion Triple Challenge 2009 • WoD 2009 • WoD 2010 • BTC aggregated • SameAs aggregated Non trivial decisions
  • 32. OK, let’s measure: degree distribution for BTC and BTC aggregated. This suggests a power-law distribution at different scales
  • 33. OK, let’s measure • Comparing WoD 2009 & 2010: increasing power-law behaviour. • Top 5 by degree centrality in sameAs-aggregated. Preferential attachment? Dataset: sameAs degree centrality. Revyu.com: 0.039; Semanticweb.org: 0.037; Dbpedia.org: 0.027; Data.semanticweb.org: 0.019; www.deri.ie: 0.017. This guy owns 4 out of these 5! Interesting socio-technical questions
  • 34. But what should we measure? • Treat sameAs nodes as a single node? (semantically yes, pragmatically no?) • Is (undirected) connectedness meaningful, instead of (directed) strongly connected? (semantically no, pragmatically yes?) ???????
  • 35. And what are “good” values? • Degree distribution should be powerlaw? (robust against random decay) • Local clustering coefficient should be high? (strongly connected “topics”) • Betweenness impact of a sameAs-link should be high? (adds much extra information) ???????
  • 36. And here’s another one: usage of DBPedia types (Gangemi et al, ISWC2011)
  • 37. impact on mapping? impact on reasoning? impact on storage? So what? These observations have impact on design!
  • 38. LODLaundromat: a new observatory for the Web of Data Wouter Beek Laurens Rietveld (ISWC 2014)
  • 39. LOD Laundromat: clean your dirty triples • crawl – from registries (CKAN), – by chasing URL's, – user can submit URLs – Users can submit files (DropBox plugin) • read multiple formats • clean syntax errors, remove duplicates • compute meta-data information • publish triples as JSON API & (meta-data) as SPARQL • harvest 1B triples/day
  • 40. LOD Laundromat: • 600,000 RDF files • 3,345,904,218 unique URLs • 5,319,790,836 literals (not counting 6,699,148,542 integers, dates, etc) • 328Gb of zipped RDF http://lodlaundromat.org
  • 42. LOTUS: Text search on LODLaundromat • Filip Ilievski (ISWC 2016) • Search 5 billion(!) text strings in Linked Open Data (0.5Tb) • From words to linked data • Fuzzy matching (or precise, or substring, or …) • http://lotus.lodlaundromat.org
  • 43. Graph structure as a proxy for semantics Laurens Rietveld (ISWC 2014)
  • 44. Hotspots in Knowledge Graphs • Observation: realistic queries only hit a small part of the data (< 2%) (DBPedia would need 500k queries to hit < 1%) • Non-trivial to obtain these numbers (YASGUI dataset, SWJ2015) • Dataset (size, #queries, coverage): DBPedia 3.9 (459M, 1640, 0.003%); Linked Geo Data (289M, 81, 1.917%); MetaLex (204M, 4933, 0.016%); Open-BioMed (79M, 931, 3.100%); Bio2RDF/KEGG (50M, 1297, 2.013%); SW Dog Food (240K, 193, 39.438%)
  • 45. Experiment • Can we predict the popular part of the graph without knowing the queries? • Use graph-measures as selection thresholds – indegree (easy) – outdegree (easy) – pagerank (doable, iterative) – betweenness centrality (hard) Evaluate Queries
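
The idea on slide 45 of using graph measures as selection thresholds can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the cited experiment: triples are plain `(subject, predicate, object)` tuples, PageRank is a textbook power iteration, and the `sample` helper, the toy triples and the 20% threshold are all hypothetical.

```python
from collections import defaultdict

def indegree(triples):
    """Number of incoming edges per node (the 'easy' measure)."""
    counts = defaultdict(int)
    for s, _p, o in triples:
        counts[s] += 0  # make sure nodes without incoming edges are listed too
        counts[o] += 1
    return dict(counts)

def pagerank(triples, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (the 'doable, iterative' measure)."""
    out_links = defaultdict(list)
    nodes = set()
    for s, _p, o in triples:
        out_links[s].append(o)
        nodes.update((s, o))
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outgoing edges) is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for s, targets in out_links.items():
            share = damping * rank[s] / len(targets) if targets else 0.0
            for o in targets:
                new_rank[o] += share
        rank = new_rank
    return rank

def sample(triples, score, fraction):
    """Keep triples whose subject or object is among the top `fraction` of nodes."""
    ranked = sorted(score, key=score.get, reverse=True)
    keep = set(ranked[:max(1, int(len(ranked) * fraction))])
    return [t for t in triples if t[0] in keep or t[2] in keep]

# Hypothetical toy data: three nodes point at a hub, which points at one more node.
triples = [
    ("a", "links", "hub"),
    ("b", "links", "hub"),
    ("c", "links", "hub"),
    ("hub", "links", "d"),
    ("e", "links", "a"),
]

by_indegree = sample(triples, indegree(triples), fraction=0.2)
by_pagerank = sample(triples, pagerank(triples), fraction=0.2)
print(len(by_indegree), len(by_pagerank))
```

Evaluating such a sample would then mean running the logged queries against it and comparing the answers with those from the full graph, which is the "Evaluate Queries" step on the slide.
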
  • 46. Structural sampling: results Why does this work so unreasonably well? Which methods work on which types of graphs? Logic?
  • 47. It’s not only about the graph structure: Exploiting the choice of URLs to deal with inconsistency Zhisheng Huang (ISWC 2008)
  • 48. General idea: $s(T,\phi,0) \subseteq s(T,\phi,1) \subseteq s(T,\phi,2) \subseteq \dots$; $T \mathrel{|\!\approx} \phi$ =def $\phi$ is soft-implied by T if it is implied by a consistent subset of T
  • 49. Which selection function $s(T,\phi,n)$? Google distance: $NGD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}$ where f(x) is the number of Google hits for x, f(x,y) is the number of Google hits for the tuple of search items x and y, and M is the number of web pages indexed by Google
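
The Google distance on slide 49 is the normalised Google distance: the larger of the two log hit counts minus the log of the joint hit count, divided by log M minus the smaller of the two log hit counts. A minimal sketch of that arithmetic follows; the function name and the hit counts in the example are invented, and no search engine is actually queried.

```python
import math

def normalized_google_distance(f_x, f_y, f_xy, m):
    """NGD from hit counts f(x), f(y), joint hits f(x,y) and index size M."""
    if min(f_x, f_y, f_xy) <= 0:
        # Terms that never occur (or never co-occur) are treated as infinitely far apart.
        return math.inf
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(m) - min(log_x, log_y))

# Hypothetical hit counts: two terms with 1000 hits each in an index of 10^6 pages.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
rarely_together = normalized_google_distance(1000, 1000, 10, 10**6)
print(always_together, rarely_together)
```

Terms that always co-occur get distance 0, and the distance grows as the joint hit count shrinks relative to the individual counts.
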
  • 50. Compute Google distance between URI’s for numbers and colors (note: we’re abusing URI’s as words!)
  • 51. Evaluation: ask queries over inconsistent datasets. Ontology (#queries, unexpected, intended): MadCow+ (2594, 0, 93%); Communication (6576, 0, 96%); Transportation (6258, 0, 99%). Conclusion: “Graph-growing” using Google Distance gives a high-quality sound approximation. Why does this work so unreasonably well?
  • 53. URIs are supposed to be meaningless..
  • 54. Information content of URI’s? Steven de Rooij (ISWC 2016) Unexplained performance prompts more experiments
  • 55. Do URL’s encode meaning? Fraction of datasets with redundancy for types/predicates at significance level > 0.99 BTW, this is 600,000 datapoints (RDF docs) Properties Types We need a semantics that accounts for this!
  • 56. Inference as a measure for information content Nobody can predict these numbers
  • 58. Inference by walking the graph • Swarm of micro-reasoners • One rule per micro-reasoner • Walk the graph, applying rules when possible • Deduced facts disappear after some time Every author of a paper is a person Every person is also an agent
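
The swarm idea on slide 58 can be sketched as a small simulation. This is a toy under stated assumptions rather than the system described in the talk: each micro-reasoner carries exactly one rule, does a uniform random walk over the triples, and deduced facts expire after a fixed number of steps (`ttl`). The two rules mirror the examples on the slide; the triples, names and parameters are hypothetical.

```python
import random

def author_rule(triple):
    """Every author of a paper is a person."""
    _s, p, o = triple
    return (o, "type", "Person") if p == "author" else None

def person_rule(triple):
    """Every person is also an agent."""
    s, p, o = triple
    return (s, "type", "Agent") if (p, o) == ("type", "Person") else None

def run_swarm(facts, rules, steps=200, ttl=50, seed=0):
    """Let one walker per rule roam the graph; return deduced facts still alive."""
    rng = random.Random(seed)
    derived = {}  # deduced triple -> step at which it expires
    nodes = sorted({t[0] for t in facts} | {t[2] for t in facts})
    position = [rng.choice(nodes) for _ in rules]
    for step in range(steps):
        # Deduced facts disappear after some time.
        derived = {t: exp for t, exp in derived.items() if exp > step}
        visible = list(facts) + list(derived)
        for i, rule in enumerate(rules):
            here = position[i]
            local = [t for t in visible if here in (t[0], t[2])]
            # Apply this walker's single rule to the triples around its node.
            for t in local:
                new = rule(t)
                if new is not None and new not in facts:
                    derived[new] = step + ttl  # (re)start the fact's lifetime
            # Move along a random incident edge, or jump if the node is isolated.
            neighbours = [t[2] if t[0] == here else t[0] for t in local]
            position[i] = rng.choice(neighbours) if neighbours else rng.choice(nodes)
    return set(derived)

# Hypothetical asserted facts.
facts = [
    ("paper1", "author", "frank"),
    ("paper1", "author", "lynda"),
    ("paper2", "author", "frank"),
]
rules = [author_rule, person_rule]
result = run_swarm(facts, rules)
print(sorted(result))
```

Which facts are alive at the end depends on the walk and on `ttl`, so the output is sound (only valid consequences appear) but not guaranteed to be complete, which matches the trade-off the talk describes.
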
  • 59. Some early results • most of the derivations are produced • Lost: determinism, completeness • Gained: anytime, coherent, prioritised For which graphs does this work well or not?
  • 60. Closing: A call to all Semantic Web researchers
  • 62. A gazillion new open questions don’t just try to build things, also try to understand things don’t just ask how, also ask why