Exploring Linked Data content through network analysis
1. Exploring Linked Data content through network analysis
Christophe Guéret (@cgueret)
Free University Amsterdam
Co-explorers: Stefan Schlobach, Shenghui Wang,
Paul Groth, Frank van Harmelen
http://latc-project.eu http://www.vu.nl
2. Outline of the talk
What is Linked Data?
What is there to be analysed?
Do we miss something?
New research directions and first results
November 23, 2011 Analysis of Linked Data 2/35
3. Linked Data (aka Semantic Web)
Linked Data
http://www.flickr.com/photos/erikcharlton/3337465138
4. What is the problem?
Frank and Christophe publish some open data
Roi wants to combine and enrich it
Frank's data:
Kennissen    Staad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
Christophe's data:
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
Marvel icons: mermer, DeviantArt
5. What is the problem?
Kennissen    Staad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
+
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
= ?
Data integration issue
“Kennissen”, “Staad”, “Ville”, “Pays” ?
“Paris” = “Parijs” ?
“Amsterdam” = “Amsterdam” ?
A lot of work, and it must be redone on every update
6. A solution
Do data integration at the data level
Use, and re-use, unambiguous identifiers
Use meta-level descriptions of the identifiers
Proposal: use the Web as a platform
Identifiers = URIs
Descriptions = de-referenced documents
7. Frank publishes his data
Kennissen    Staad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
Each line below is a “triple”:
ex:Christophe rdf:type ex:Acquaintance
ex:Peter rdf:type ex:Acquaintance
ex:David rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter ex:worksIn dbpedia:Barcelona
ex:David ex:worksIn dbpedia:Paris
Use of compact URIs
dbpedia = http://dbpedia.org/resource/
ex = http://example.org/
rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#
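A minimal sketch of how a compact URI expands to a full URI, using only the three prefixes listed on this slide. The function name `expand` is an assumption made for illustration, not part of any RDF library.

```python
# Prefix table copied from the slide.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(curie):
    """Turn a compact URI such as 'ex:Peter' into a full URI (hypothetical helper)."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


# One of Frank's triples, written with compact URIs and then expanded.
triple = ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam")
full_triple = tuple(expand(term) for term in triple)
print(full_triple)
```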
8. Christophe re-uses part of Frank's data to publish his data
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
Frank's triples (re-used):
ex:Christophe rdf:type ex:Acquaintance
ex:Peter rdf:type ex:Acquaintance
ex:David rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter ex:worksIn dbpedia:Barcelona
ex:David ex:worksIn dbpedia:Paris
Christophe's triples (new):
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Barcelona ex:isIn dbpedia:Spain
dbpedia:Paris ex:isIn dbpedia:France
9. Roi adds some more information
Triples added to the graph of the previous slide:
ex:Acquaintance rdf:label “Conocido”@es
dbpedia:Netherlands ex:isIn dbpedia:Europe
dbpedia:Spain ex:isIn dbpedia:Europe
dbpedia:France ex:isIn dbpedia:Europe
11. Reasoning with Semantics (Bonus!)
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Netherlands ex:isIn dbpedia:Europe
+ ex:isIn rdf:type owl:TransitiveProperty
= dbpedia:Amsterdam ex:isIn dbpedia:Europe
Example usage
Materialize implicit information
Check for consistency
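A rough sketch of what "materialize implicit information" can mean for a transitive property such as ex:isIn. It is a naive fixed-point loop over plain tuples, not a real OWL reasoner; the function name `materialize_transitive` is an assumption made for illustration.

```python
def materialize_transitive(triples, predicate):
    """Return the triples plus everything implied by `predicate` being transitive."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        # All (subject, object) pairs currently linked by the predicate.
        links = [(s, o) for (s, p, o) in closed if p == predicate]
        for s, mid in links:
            for mid2, o in links:
                # s -> mid and mid -> o together imply s -> o.
                if mid == mid2 and (s, predicate, o) not in closed:
                    closed.add((s, predicate, o))
                    changed = True
    return closed


# The two explicit facts from the slide.
triples = {
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
}
inferred = materialize_transitive(triples, "ex:isIn") - triples
print(inferred)
```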
12. Rough estimate of size
295 data sets, 31B facts in LOD Cloud
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
13. Lots of Data to analyze! :-)
http://www.flickr.com/photos/argonne/3323018571
14. But analyzing what exactly?
Table of facts published at different locations
A distributed Knowledge Base
Subject Predicate Object
ex:Christophe rdf:type ex:Acquaintance
ex:Christophe ex:worksIn dbpedia:Amsterdam
ex:Peter rdf:type ex:Acquaintance
... ... ...
Subject Predicate Object
dbpedia:Amsterdam ex:isIn dbpedia:Netherlands
dbpedia:Netherlands ex:isIn dbpedia:Europe
... ... ...
Subject Predicate Object
ex:Acquaintance rdf:label “Conocido”@es
... ... ...
15. Analysis workflow
1. Gather a snapshot of triples
2. Compute descriptive statistics
Top resources (subject, predicate, object)
Frequency of cross-link types (SP, SO, PO, ...)
Connected components
Path frequency
…
=> Tricky enough already: the data is really big!
=> We should be able to get more out of the data
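Two of the statistics listed above (top resources and connected components) can be sketched on a handful of triples taken from the earlier slides. This is a toy, in-memory illustration using only the standard library; a snapshot of the real data would not fit this approach.

```python
from collections import Counter, defaultdict

# A tiny "snapshot": triples taken from the earlier slides.
triples = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "rdf:type", "ex:Acquaintance"),
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
]

# Top resources per position in the triple.
subject_counts = Counter(s for s, _, _ in triples)
predicate_counts = Counter(p for _, p, _ in triples)
object_counts = Counter(o for _, _, o in triples)


def connected_components(triples):
    """Components of the undirected graph linking each subject to its object."""
    neighbours = defaultdict(set)
    for s, _, o in triples:
        neighbours[s].add(o)
        neighbours[o].add(s)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


components = connected_components(triples)
print(subject_counts.most_common(1), len(components))
```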
16. Can we explain that?
Suggestions
Started the graph
General knowledge
Very well known
17. or that?
Suggestions
All published by Bio2RDF
Well aware of each other
Overlapping domain
18. Could we predict the impact of ...
DBpedia being down for a while?
SIOC renaming “User” to “UserAccount”?
creating a dataset that turns out to be popular?
Analysing a set of triples is not enough
19. Are we overlooking something?
20. It's not only about the resources
Several entities related to the data: data publishers/consumers, resources (e.g. ex:something), Web servers
Interactions between all of them
21. There are different scales
Triple level versus resource-group level
Different data complexity at each scale
[Figure: the example graph from slide 9]
22. It is not a static network
Size and topology evolve over time
[Figure: snapshots of the network in 2007, 2008 and 2010]
23. Linked Data is a Complex System
Multiple scales of observation
Emergence of properties
The whole is more than the sum of the parts
=> Interactions/relations are important to understand the system behavior
=> We can benefit from a large body of research results in Complex Systems study
24. Initial findings and future work
Ya3hs3/2531493704 on Flickr
25. New analysis workflow
1. Gather a snapshot of triples
2. Gather information about other types of interactions
3. Create specific networks related to the research questions at hand
4. Run metrics, interpret results
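Step 3 can be illustrated by collapsing triples into a network of resource groups. In this sketch the grouping key is simply the compact-URI prefix, which is an assumption; any other grouping (dataset, hostname, publisher) could be passed in instead. The names `prefix_of` and `group_network` are hypothetical.

```python
from collections import Counter


def prefix_of(curie):
    """Group a resource by its compact-URI prefix, e.g. 'ex:Peter' -> 'ex'."""
    return curie.split(":", 1)[0]


def group_network(triples, group=prefix_of):
    """Weighted, directed edges between groups; links inside a group are skipped."""
    edges = Counter()
    for s, _, o in triples:
        source, target = group(s), group(o)
        if source != target:
            edges[(source, target)] += 1
    return edges


triples = [
    ("ex:Christophe", "rdf:type", "ex:Acquaintance"),
    ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam"),
    ("ex:Peter", "ex:worksIn", "dbpedia:Barcelona"),
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
]
network = group_network(triples)
print(dict(network))
```

Metrics (step 4) are then run on `network` rather than on the raw triples.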
26. The LOD is not what we think it is
LOD Cloud 2009/2010 vs BTC 2009 crawl
Crawled sample differs from the community-based view
LOD Cloud has a lumpy structure
Evolution of the LOD Cloud
Centrality changes
Increased density and connectivity
Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011)
Multi-scale Analysis of the Web Of Data: A Challenge to the Complex System's Community
Advances in Complex Systems 14 (04)
28. The tools we need don't exist
We need to flatten the networks to study them
Some specific aspects of the system
Existence of implicit links
Multi-relational and dynamic
Distributed
Hypergraph of relations
Christophe Guéret, Shenghui Wang, Paul Groth et al. (2011)
Multi-scale Analysis of the Web Of Data: A Challenge to the Complex System's Community
Advances in Complex Systems 14 (04)
29. Influence content<->social networks
Generate and bind two networks
ex:a
ex:b
ex:c
Measure evolution of degree, betweenness, clustering over time
Predict evolution
Shenghui Wang, Paul Groth (2010)
Measuring the dynamic bi-directional influence between content and social networks
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
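A toy illustration of tracking two of these metrics (degree and clustering) over time. The three snapshots are made up for the example and only reuse the ex:a, ex:b, ex:c labels from the slide; this is not the method or the data of the cited paper, and betweenness is left out to keep the sketch short.

```python
from itertools import combinations


def adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def mean_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def mean_clustering(adj):
    """Average local clustering coefficient; nodes with < 2 neighbours count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical snapshots: time step -> edges observed at that time.
snapshots = {
    1: [("ex:a", "ex:b")],
    2: [("ex:a", "ex:b"), ("ex:b", "ex:c")],
    3: [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:a", "ex:c")],
}
history = {}
for step, edges in snapshots.items():
    adj = adjacency(edges)
    history[step] = (mean_degree(adj), mean_clustering(adj))
print(history)
```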
30. Result for conferences
Shenghui Wang, Paul Groth (2010)
Measuring the dynamic bi-directional influence between content and social networks
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
31. Centrality to measure robustness
Map the BTC2010 to two networks
Semantic network based on namespaces
Host network based on hostnames
Measure robustness as the variance in betweenness centrality
Find weak spots
Optimize networks to increase robustness
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data : using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
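A sketch of the measure named above, the variance in betweenness centrality, on two made-up four-node networks. Betweenness is computed with Brandes' algorithm for unweighted, undirected graphs; the node names and edges are hypothetical and this is not the implementation used in the cited paper.

```python
from collections import deque
from statistics import pvariance


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


def robustness(adj):
    """Population variance of betweenness over all nodes."""
    return pvariance(betweenness(adj).values())


# Hypothetical networks: a star (one hub) and a ring (no hub).
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(robustness(star), robustness(ring))
```

Under this measure the star, where every shortest path crosses the hub, has a higher variance than the ring, where the load is spread evenly.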
32. Results on hostnames
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data : using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
33. Results on namespaces
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data : using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
34. Improving the network
Christophe Guéret, Paul Groth, Frank Van Harmelen et al. (2010)
Finding the Achilles Heel of the Web of Data : using network analysis for link-recommendation
Proceedings of the 9th International Semantic Web Conference (ISWC2010)
35. Conclusion
Take home message
Linked Data is not a simple knowledge base
Network analysis tools give new insights on the data
Results can be used to improve the network
Future work
Make resource-centric analysis rather than graph-centric analysis (big bottleneck now)
Tackle the time aspect of the data
Find more analyses to perform and what they tell us