Despite its obvious success (largest knowledge base ever built, used in practice by companies and governments alike), we actually understand very little of the structure of the Web of Data. Its formal meaning is specified in logic, but with its scale, context dependency and dynamics, the Web of Data has outgrown its traditional model-theoretic semantics.
Is the meaning of a logical statement (an edge in the graph) dependent on the cluster ("context") in which it appears? Does a more densely connected concept (node) contain more information? Is the path length between two nodes related to their semantic distance?
Properties such as clustering, connectivity and path length are not described, much less explained by model-theoretic semantics. Do such properties contribute to the meaning of a knowledge graph?
To properly understand the structure and meaning of knowledge graphs, we should no longer treat knowledge graphs as (only) a set of logical statements, but treat them properly as a graph. But how to do this is far from clear.
In this talk, I report on some of our early results on some of these questions, but I ask many more questions for which we don't have answers yet.
The Web of Data: do we actually understand what we built?
1. Don’t ask “how”,
Ask “why”!
(with illustrations from the Web of Data)
Frank van Harmelen
Dept. of “Computer Science”
Creative Commons License:
allowed to share & remix,
but must attribute & non-commercial
The Web of Data:
do we actually understand
what we built?
(pssst: our theory has fallen way behind our technology,
we know a lot of “how”
but we don’t know much “why”)
2. Some expectation management
• Speculation
• Questions
• Hypotheses
If we knew what we
were talking about, it
wouldn’t be called
research
4. Computer Science should be like a natural science:
studying objects in the information universe,
and the laws that govern them.
And yes, I believe that the information universe exists and can be studied
5. Fortunately, I’m in good company
"Computer science is no more about computers
than astronomy is about telescopes”
-- Edsger W. Dijkstra
"we have to think of computation as a principle
and computers (only) as the tool”
-- Peter Denning
"Professor Shih-Fu Chang will receive a doctorate
for his many groundbreaking contributions to our
understanding of the digital universe“
-- Arnold Smeulders
6. Methodological Manifesto
Computer Science often:
given desired properties
design an object which has those properties
In this talk:
given a (very large & complex) object,
explain its observed properties
Not: “solving a problem”
But: “answering a question”
7. “The computer is not our object of study,
It’s our observational instrument”
9. Semantic Web in 4 principles
1. Give all things a name
2. Make a graph of relations between the things
at this point we have (only) a Giant Graph
3. Make sure all names are URIs
at this point we have (only) a Giant Global Graph
4. Add semantics (= predictable inference)
This gives us a Giant Global Knowledge Graph
http://www.youtube.com/watch?v=tBSdYi4EY3s
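The four principles can be sketched in a few lines of plain Python (the URIs and the inference rule below are invented for illustration; real systems use RDF with RDFS/OWL entailment):

```python
# P1 + P3: every thing gets a globally unique name (a URI).
# These URIs are made up for the example.
FRANK = "http://example.org/people/Frank"
LYNDA = "http://example.org/people/Lynda"
MARRIED_TO = "http://example.org/vocab/married-to"
TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PERSON = "http://example.org/vocab/Person"

# P2: a graph is just a set of (subject, predicate, object) edges.
graph = {
    (FRANK, MARRIED_TO, LYNDA),
    (FRANK, TYPE, PERSON),
}

# P4: semantics = predictable inference. One toy rule:
# both partners in a married-to edge are Persons.
def infer(triples):
    derived = set()
    for s, p, o in triples:
        if p == MARRIED_TO:
            derived.add((s, TYPE, PERSON))
            derived.add((o, TYPE, PERSON))
    return triples | derived

closed = infer(graph)
print((LYNDA, TYPE, PERSON) in closed)  # → True
```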
10. P3. Make sure all names are URIs
[<x> IsOfType <T>]
the names <x> and <T> can have different
owners & locations
e.g. < analgesic >
11. P4: Add semantics
Frank Lynda
married-to
• Frank is male
• married-to relates
males to females
• married-to relates
1 male to 1 female
• Lynda = Hazel
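The equality inference on this slide can be mimicked with a toy functional-property rule (the names and the rule encoding are illustrative only, not the actual reasoner):

```python
# If married-to relates 1 male to 1 female (a functional property),
# then two spouses of the same person must be the same individual.
triples = {
    ("Frank", "married-to", "Lynda"),
    ("Frank", "married-to", "Hazel"),
}

def functional_sameas(triples, prop):
    """Derive owl:sameAs-style equalities from a functional property."""
    value = {}   # subject -> first seen object
    same = set()
    for s, p, o in triples:
        if p != prop:
            continue
        if s in value and value[s] != o:
            same.add(tuple(sorted((value[s], o))))
        value.setdefault(s, o)
    return same

print(functional_sameas(triples, "married-to"))  # → {('Hazel', 'Lynda')}
```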
12. Did we get anywhere?
• Google = meaningful search
• NXP = data integration
• BBC = content re-use
• BestBuy = SEO (RDF-a)
• data.gov = data-publishing
Oracle DB, IBM DB2
Reuters,
New York Times, Guardian
Sears, Kmart, OverStock,
Volkswagen, Renault
GoodRelations ontology,
schema.org
Yahoo, Bing
20. Size of the current Semantic Web
~10^10 triples
Denny Vrandečić – AIFB, Universität Karlsruhe (TH), http://www.aifb.uni-karlsruhe.de/WBS
≈ 1 triple per web-page
26. What is this picture telling us?
• single connected component
• Dense clusters with sparse interconnections
• connectivity depends on a few nodes
• the degree distribution
is highly skewed,
• its structure varies
between aggregation levels.
27. What is this picture telling us?
• Does the meaning of a node
depend on the cluster it appears in?
• Does path-length correlate with semantic distance?
• Are highly connected nodes more certain?
• Mutual influence of
low-level and high-level
structure?
Logic?
28. Measuring what?
• degree distribution, P(d(v)=n) or P(d(v)>n)
• degree centrality: relative size of neighbourhood,
intuitive notion of local connectivity
• betweenness centrality:
fraction of all shortest paths that pass through a node,
how essential is the node for global connectivity,
likelihood of being visited on a graphwalk
• closeness centrality
1/average distance to all other nodes
where to start for a graphwalk
• average shortest path length
helps to tune upperbound on graphwalks
• number of (strongly) connected components
measure of coherence
32. OK, let’s measure
Degree distribution: BTC vs. BTC aggregated
This suggests a power-law distribution
at different scales
33. OK, let’s measure
• Comparing WoD 2009 & 2010:
increasingly power-law behaviour.
• top 5 by degree centrality in sameAs-aggregated
Preferential attachment?
Dataset                sameAs degree centrality
Revyu.com              0.039
Semanticweb.org        0.037
Dbpedia.org            0.027
Data.semanticweb.org   0.019
www.deri.ie            0.017
This guy owns 4 out of these 5!
Interesting socio-technical questions
34. But what should we measure?
• Treat sameAs nodes as single node?
(semantically yes, pragmatically no?)
• Is (undirected) connectedness meaningful,
instead of (directed) strongly connected?
(semantically no, pragmatically yes?)
???????
35. And what are “good” values?
• Degree distribution should be power-law?
(robust against random decay)
• Local clustering coefficient should be high?
(strongly connected “topics”)
• Betweenness impact of a sameAs-link
should be high?
(adds much extra information)
???????
36. And here’s another one:
usage of DBPedia types
(Gangemi et al, ISWC2011)
42. LOTUS:
Text search on LODLaundromat
• Filip Ilievski (ISWC 2016)
• Search 5 billion(!) text strings in
Linked Open Data (0.5 TB)
• From words to linked data
• Fuzzy matching (or precise, or substring, or …)
• http://lotus.lodlaundromat.org
44. Hotspots in Knowledge Graphs
• Observation:
realistic queries only hit a small part of the data (< 2%)
(DBPedia would need 500k queries to hit < 1%)
• Non-trivial to obtain these numbers
(YASGUI dataset, SWJ2015)
Dataset Size #queries Coverage
DBPedia 3.9 459M 1640 0.003%
Linked Geo Data 289M 81 1.917%
MetaLex 204M 4933 0.016%
Open-BioMed 79M 931 3.100%
Bio2RDF/KEGG 50M 1297 2.013%
SW Dog Food 240K 193 39.438%
45. Experiment
• Can we predict the popular part of the graph
without knowing the queries?
• Use graph-measures as selection thresholds
– indegree (easy)
– outdegree (easy)
– pagerank (doable, iterative)
– betweenness centrality (hard)
• Evaluate against the actual queries
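The cheapest of these selection functions, indegree, can be sketched as follows (the triples, the queried nodes, and the numbers are made up for illustration):

```python
from collections import Counter

# Hypothetical graph plus the set of nodes that actual queries touched.
triples = [
    ("a", "p", "hub"), ("b", "p", "hub"), ("c", "p", "hub"),
    ("hub", "q", "d"), ("c", "p", "e"), ("e", "p", "f"),
]
queried_nodes = {"hub", "d"}

# Cheap selection function: rank nodes by indegree, keep the top k.
indegree = Counter(o for _, _, o in triples)
def select_top(k):
    return {node for node, _ in indegree.most_common(k)}

# Recall of the query hotspot at each threshold k.
for k in (1, 2, 4):
    hit = queried_nodes & select_top(k)
    print(k, len(hit) / len(queried_nodes))
```

The same loop works with any node score (outdegree, PageRank, betweenness) in place of indegree; only the cost of computing the score changes.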
49. Which selection function s(T,,n)?
Google distance
where
f(x) is the number of Google hits for x
f(x,y) is the number of Google hits for
the tuple of search items x and y
M is the number of web pages indexed by Google
NGD(x,y) = (max{log f(x), log f(y)} − log f(x,y)) / (log M − min{log f(x), log f(y)})
50. Compute Google distance between URI’s for numbers and colors
(note: we’re abusing URI’s as words!)
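Assuming the hit counts are already available (in reality f comes from search-engine queries, which are not made here), the NGD formula is a few lines (all numbers below are invented):

```python
from math import log

M = 1e12  # assumed number of indexed pages
# Stand-in hit counts, keyed by sorted search terms.
hits = {("red",): 5e9, ("blue",): 4e9, ("blue", "red"): 1e9}

def f(*terms):
    """Stand-in for 'number of Google hits for these terms'."""
    return hits[tuple(sorted(terms))]

def ngd(x, y):
    fx, fy, fxy = log(f(x)), log(f(y)), log(f(x, y))
    return (max(fx, fy) - fxy) / (log(M) - min(fx, fy))

print(ngd("red", "blue"))
```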
51. 51
Evaluation:
ask queries over inconsistent datasets
Conclusion:
“Graph-growing” using Google Distance
gives a high-quality, sound approximation
Ontology #queries Unexpected Intended
MadCow+ 2594 0 93%
Communication 6576 0 96%
Transportation 6258 0 99%
Why does this
work so
unreasonably
well?
55. Do URL’s encode meaning?
Fraction of datasets with redundancy for types/predicates
at significance level > 0.99
BTW, this is 600,000 data points (RDF docs)
Properties
Types
We need a
semantics
that accounts
for this!
56. Inference as a measure
for information content
Nobody can
predict these
numbers
58. Inference by walking the graph
• Swarm of micro-reasoners
• One rule per micro-reasoner
• Walk the graph, applying rules when possible
• Deduced facts disappear after some time
Every author of a
paper is a person
Every person is
also an agent
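One way such a micro-reasoner could look (a toy sketch, not the actual swarm implementation): a single rule, applied while visiting random triples, with derived facts that expire:

```python
import random

random.seed(0)  # deterministic walk for the example

triples = {
    ("alice", "author-of", "paper1"),
    ("bob", "author-of", "paper2"),
}

derived = {}  # derived triple -> remaining time-to-live
TTL = 3

def step():
    # "Walk": visit a random stored triple and apply the one rule
    # this micro-reasoner carries: authors of papers are persons.
    s, p, o = random.choice(sorted(triples))
    if p == "author-of":
        derived[(s, "rdf:type", "Person")] = TTL
    # Age all derived facts; expired ones disappear again.
    for t in list(derived):
        derived[t] -= 1
        if derived[t] == 0:
            del derived[t]

for _ in range(10):
    step()
print(sorted(derived))
```

A swarm would run many such single-rule walkers concurrently; the expiring facts are what gives up completeness in exchange for anytime behaviour.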
59. Some early results
• most of the
derivations are
produced
• Lost:
determinism,
completeness
• Gained:
anytime,
coherent,
prioritised
For which
graphs does
this work well
or not?