STI Summit 2011 - Global data integration and global data mining
1. STI Summit
July 6th, 2011, Riga, Latvia
Global Data Integration
and Global Data Mining
Prof. Dr. Christian Bizer
Freie Universität Berlin
Germany
Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)
2. Outline
1. Topology of the Web of Data
What data is out there?
2. Global Data Integration
How to split the integration effort
3. Global Data Mining
The logical next step
3. Linked Data Deployment on the Web
Year  Datasets  Triples         Growth
2007  12        500,000,000
2008  45        2,000,000,000   300%
2009  95        6,726,000,000   236%
2010  203       26,930,509,703  300%
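The Growth column matches year-over-year growth in triple counts rather than in dataset counts. A minimal Python sketch that recomputes it from the figures above (the function name is illustrative, not from the talk):

```python
# Triple counts per year, taken from the deployment table above.
triples_by_year = {
    2007: 500_000_000,
    2008: 2_000_000_000,
    2009: 6_726_000_000,
    2010: 26_930_509_703,
}


def yearly_growth_percent(counts):
    """Return {year: percent growth over the previous year}, rounded."""
    years = sorted(counts)
    return {
        year: round(100 * (counts[year] - counts[prev]) / counts[prev])
        for prev, year in zip(years, years[1:])
    }


growth = yearly_growth_percent(triples_by_year)
print(growth)
```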
4. Uptake in the Government Domain
The EU is starting to publish Linked Data (LOD2, LATC)
Various other national efforts
W3C eGovernment Interest Group
5. Uptake in the Libraries Community
Institutions publishing Linked Data
Library of Congress (subject headings)
German National Library (PND dataset and subject headings)
Swedish National Library (Libris catalog)
Hungarian National Library (OPAC and Digital Library)
Europeana project just released data about 4 million artifacts
Growth of Library Linked Data (2009-2010): 1000%
W3C Library Linked Data Incubator Group
Goals:
1. Integrate library catalogs on a global scale.
2. Interconnect resources between repositories
(by topic, by location, by historical period, by ...).
6. LOD data set statistics as of November 2010
Domain         Data Sets  Triples         Percent  RDF Links    Percent
Cross-domain   20         1,999,085,950   7.42     29,105,638   7.36
Geographic     16         5,904,980,833   21.93    16,589,086   4.19
Government     25         11,613,525,437  43.12    17,658,869   4.46
Media          26         2,453,898,811   9.11     50,374,304   12.74
Libraries      67         2,237,435,732   8.31     77,951,898   19.71
Life sciences  42         2,664,119,184   9.89     200,417,873  50.67
User Content   7          57,463,756      0.21     3,402,228    0.86
Total          203        26,930,509,703           395,499,896
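The two Percent columns are each domain's share of all triples and of all RDF links. A small Python sketch that recomputes them from the counts above (the helper name `shares` is an assumption made for illustration):

```python
# (triples, RDF links) per domain, copied from the statistics table above.
lod_stats = {
    "Cross-domain": (1_999_085_950, 29_105_638),
    "Geographic": (5_904_980_833, 16_589_086),
    "Government": (11_613_525_437, 17_658_869),
    "Media": (2_453_898_811, 50_374_304),
    "Libraries": (2_237_435_732, 77_951_898),
    "Life sciences": (2_664_119_184, 200_417_873),
    "User Content": (57_463_756, 3_402_228),
}


def shares(stats, column):
    """Percent share of each domain for column 0 (triples) or 1 (links)."""
    total = sum(row[column] for row in stats.values())
    return {
        domain: round(100 * row[column] / total, 2)
        for domain, row in stats.items()
    }


triple_share = shares(lod_stats, 0)
link_share = shares(lod_stats, 1)
for domain in lod_stats:
    print(f"{domain:14} {triple_share[domain]:6.2f} {link_share[domain]:6.2f}")
```

Read side by side, the shares show that life sciences holds roughly a tenth of the triples but about half of the RDF links, while government data is the reverse.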
LOD Cloud Data Catalog on CKAN
http://www.ckan.net/group/lodcloud
More statistics
http://www4.wiwiss.fu-berlin.de/lodcloud/state/
7. What are the big players doing?
8. Structured Data becomes an SEO Topic
Data Snippets
Query Answer
9. Result: Further growth …
Usage of RDFa increased by 510%
between March 2009 and October 2010
430 million webpages contain RDFa
Source: Yahoo
http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
10. The Structural Continuum
The Web of Data is interwoven with the classic Web.
Unstructured text: HTML
Structured data:
RDFa embedded in HTML (Open Graph)
Microdata embedded in HTML (Schema.org)
Microformats embedded in HTML
Linked data: RDF/XML
11. Topology of the Web of Data
12. How to get the data?
Download the Billion Triples Challenge Dataset
2 billion triples (20GB gzipped)
crawled from the public Web of Linked Data in May/June 2011
http://challenge.semanticweb.org/
Download the Sindice Dump
12 billion triples (164GB gzipped, ~1.16TB uncompressed)
crawled from the public Web of Linked Data and
includes RDFa, Microformat, and wrapped API data
http://data.sindice.com/trec2011/download.html
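Dumps of this size are best processed as a stream rather than loaded into memory. The sketch below assumes the download is a gzipped, line-based N-Triples or N-Quads file (an assumption; check the format stated on the download pages) and counts statements per predicate. The sample file and the function name are hypothetical.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_predicates(path):
    """Stream a gzipped N-Triples/N-Quads file, counting statements per predicate."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Subject and predicate never contain whitespace, so the first
            # two tokens are safe to split off; the rest may hold literals.
            parts = line.split(None, 2)
            if len(parts) == 3:
                counts[parts[1]] += 1
    return counts


# Tiny hypothetical sample standing in for a real dump.
sample = (
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice Smith" <http://example.org/g> .\n'
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/g> .\n'
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> <http://example.org/g> .\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.nq.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write(sample)
    predicate_counts = count_predicates(path)

print(predicate_counts.most_common())
```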
13. 2. Global Data Integration
Applications hate heterogeneity!
The wild wild west My little world
14. The Dataspace Vision
Alternative to classic data integration systems in
order to cope with the growing number of data sources.
Properties of dataspaces
no upfront investment into a global schema
rely on pay-as-you-go data integration
give best-effort answers to queries
Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces:
A New Abstraction for Information Management, SIGMOD Rec. 2005.
Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford
to Pay As You Go, CIDR 2007
15. Linked Data relies on Pay-as-You-Go Idea
for Identity Management
for Schema/Vocabulary Management
16. Publish Identity Links on the Web
Identity Link
<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
owl:sameAs
<http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
You publish links pointing at other data sources.
Somebody else publishes links pointing at your
data source.
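Because owl:sameAs is symmetric and transitive, links published by different parties can be merged into identity clusters on the consumer side. A minimal union-find sketch in Python; the first link is the one shown on this slide, while the example.org URIs and the function name are hypothetical:

```python
def identity_clusters(same_as_links):
    """Group URIs connected by owl:sameAs links into clusters (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


links = [
    # Link from the slide.
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # Hypothetical link published by a third party.
    ("http://example.org/people/cb",
     "http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4"),
    # Hypothetical unrelated pair.
    ("http://example.org/people/x", "http://example.org/staff/y"),
]
clusters = identity_clusters(links)
for cluster in clusters:
    print(sorted(cluster))
```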
17. Effort Distribution between Publisher and Consumer
Consumer data mines
identity links
Effort
Distribution
Publishers or third
parties provide
identity links
18. Vocabularies on the Web of Data
Everyone can use whatever vocabularies she likes
to publish data on the Web.
Or invest effort and reuse Common Vocabularies
Friend-of-a-Friend for describing people and their social network
SIOC for describing forums and blogs
SKOS for representing topic taxonomies
Organization Ontology for describing the structure of organizations
GoodRelations provides terms for describing products and business entities
Music Ontology for describing artists, albums, and performances
Review Vocabulary provides terms for representing reviews
Many Linked Data sources use a mixture of common and
proprietary vocabulary terms.
19. Publish Vocabulary Links on the Web
Vocabulary Link
<http://xmlns.com/foaf/0.1/Person>
owl:equivalentClass
<http://dbpedia.org/ontology/Person> .
Simple Mappings: RDFS, OWL
rdfs:subClassOf, rdfs:subPropertyOf
owl:equivalentClass, owl:equivalentProperty
Complex Mappings: R2R
provides value transformation functions
structural transformations
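For the simple case, a consumer can apply published equivalence links as one-to-one term rewrites. The Python sketch below handles only that case (no value or structural transformations, and it is not the R2R API). The class mapping is the vocabulary link from this slide; the property mapping and the example.org terms are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def normalize_vocabulary(triples, term_mappings):
    """Rewrite predicates and rdf:type objects using one-to-one term mappings."""
    normalized = []
    for s, p, o in triples:
        new_p = term_mappings.get(p, p)
        # Only class URIs in rdf:type position are rewritten, not other objects.
        new_o = term_mappings.get(o, o) if p == RDF_TYPE else o
        normalized.append((s, new_p, new_o))
    return normalized


term_mappings = {
    # owl:equivalentClass link from the slide, read left to right.
    "http://xmlns.com/foaf/0.1/Person": "http://dbpedia.org/ontology/Person",
    # Hypothetical owl:equivalentProperty link.
    "http://example.org/vocab/fullName": "http://xmlns.com/foaf/0.1/name",
}
triples = [
    ("http://example.org/people/cb", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/people/cb", "http://example.org/vocab/fullName", "Christian Bizer"),
]
normalized = normalize_vocabulary(triples, term_mappings)
for triple in normalized:
    print(triple)
```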
20. Deployment of Vocabulary Links
Source: Linked Open Vocabularies,
http://labs.mondeca.com/dataset/lov
21. Effort Distribution between Publisher and Consumer
Consumer defines or
data mines mappings
Effort
Distribution
Publisher reuses
vocabularies
Publisher or third party
publishes mappings
22. Somebody-Pays-As-You-Go
The overall data integration effort is
split between the data publisher, the
data consumer and third parties.
Data Publisher
publishes data as RDF
sets identity links
reuses terms or publishes mappings
Third Parties
set identity links pointing at your data
publish mappings to the Web
Data Consumer
has to do the rest
using record linkage and schema matching
techniques
[Figure: fixed overall data integration effort, divided into Publisher's Effort, Third Party Effort, and Consumer's Effort]
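As an illustration of the consumer's share, the sketch below proposes identity links by comparing names with a string-similarity score. It is one simple record linkage heuristic, not the method used in the talk; the records, the threshold of 0.8, and the function name are all assumptions.

```python
from difflib import SequenceMatcher


def propose_identity_links(records_a, records_b, threshold=0.8):
    """Return (uri_a, uri_b, score) for record pairs whose names look alike."""
    links = []
    for uri_a, name_a in records_a.items():
        for uri_b, name_b in records_b.items():
            score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
            if score >= threshold:
                links.append((uri_a, uri_b, score))
    return links


# Hypothetical records from two data sources: URI -> name.
records_a = {
    "http://example.org/a/1": "Christian Bizer",
    "http://example.org/a/2": "Tom Heath",
}
records_b = {
    "http://example.org/b/1": "Chris Bizer",
    "http://example.org/b/2": "Tim Berners-Lee",
}
proposed = propose_identity_links(records_a, records_b)
for uri_a, uri_b, score in proposed:
    print(f"<{uri_a}> owl:sameAs <{uri_b}> .  # similarity {score:.2f}")
```

Comparing every pair is quadratic in the number of records, so at Web scale a blocking or indexing step would be needed before scoring.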
23. Research Directions
1. More research on pay-as-you-go data integration is needed.
2. More research on data mining mappings and
identity resolution heuristics is needed.
Identity links make it easier to mine vocabulary links.
Vocabulary links make it easier to mine identity links.
3. More research on SPAM detection and data quality
assessment is needed.
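The second direction, identity links helping to mine vocabulary links, can be sketched as follows: for entities already connected by owl:sameAs, count how often a property in one source carries the same value as a property in the other. This is a deliberately crude heuristic under stated assumptions (exact value match, single-valued properties); all data and names below are hypothetical.

```python
from collections import Counter


def mine_property_mappings(source_a, source_b, same_as_links, min_support=2):
    """Propose (prop_a, prop_b) pairs whose values agree on linked entities."""
    agreement = Counter()
    for uri_a, uri_b in same_as_links:
        desc_a = source_a.get(uri_a, {})
        desc_b = source_b.get(uri_b, {})
        for prop_a, value_a in desc_a.items():
            for prop_b, value_b in desc_b.items():
                if value_a == value_b:
                    agreement[(prop_a, prop_b)] += 1
    return [pair for pair, n in agreement.most_common() if n >= min_support]


# Hypothetical descriptions: entity URI -> {property: value}.
source_a = {
    "a:1": {"a:name": "Riga", "a:population": "700000"},
    "a:2": {"a:name": "Berlin", "a:population": "3400000"},
}
source_b = {
    "b:1": {"b:label": "Riga", "b:inhabitants": "700000"},
    "b:2": {"b:label": "Berlin", "b:inhabitants": "3500000"},
}
same_as_links = [("a:1", "b:1"), ("a:2", "b:2")]

candidates = mine_property_mappings(source_a, source_b, same_as_links)
print(candidates)
```

The population properties agree on only one linked pair and fall below the support threshold, which shows why exact matching needs to be replaced by tolerant value comparison in practice.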
24. LDIF – Linked Data Integration Framework
Combines vocabulary normalization and identity resolution
Currently only in-memory implementation
Next release: Hadoop-based implementation
http://www4.wiwiss.fu-berlin.de/bizer/ldif/
[Figure labels: Normalize vocabularies, Identity Resolution]
25. What can we do afterwards …
… build better entity search engines
26. 3. Global Data Mining
28. Think about interesting questions …
… that you can answer based on the Web of Data
… that require
aggregation
summarization
classification
association rule mining
… combined with
text mining
sentiment analysis
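As one example of the techniques listed above, the sketch below mines single-antecedent association rules of the form "entities that use property A also use property B" by brute force over all item pairs (not Apriori). The transactions, thresholds, and function name are assumptions chosen for illustration.

```python
from itertools import permutations


def association_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (a, b, support, confidence) for rules a -> b over item sets."""
    n = len(transactions)
    items = set().union(*transactions)
    rules = []
    for a, b in permutations(sorted(items), 2):
        count_a = sum(1 for t in transactions if a in t)
        count_ab = sum(1 for t in transactions if a in t and b in t)
        support = count_ab / n           # share of entities with both items
        confidence = count_ab / count_a  # share of a-entities that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical transactions: the set of properties used by each entity.
transactions = [
    {"foaf:name", "foaf:mbox"},
    {"foaf:name", "foaf:mbox", "foaf:homepage"},
    {"foaf:name"},
    {"foaf:name", "foaf:homepage"},
]
rules = association_rules(transactions)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}  support={support:.2f} confidence={confidence:.2f}")
```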
29. Everybody has the tools to find the answers
30. Research Directions
1. More research on data space profiling is needed.
2. More research on global data mining is needed.
Google, Yahoo, Microsoft, Facebook will get there soon.
31. Semantic Web Challenge
Submission Statistics
Year Open Track Billion Triple Track
2008 13 9
2009 16 3
2010 14 4
Do something interesting with the Billion Triple Data
and submit your results to the challenge by October 1st
present your results at the 10th International Semantic Web Conference
(ISWC2011), October 2011, Koblenz, Germany
32. Conclusions
The Web of Data is there
Linked Data, Microdata, RDFa, Microformats
Upcoming research topics
pay-as-you-go data integration
mapping discovery, schema clustering
identity resolution heuristics discovery
probabilistic data integration
data quality assessment
data space profiling
global data mining
33. Thanks!
References
Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global
Data Space. http://linkeddatabook.com/
Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data – The Story So Far
http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf