STI Summit
       July 6th, 2011, Riga, Latvia




Global Data Integration
and Global Data Mining

       Prof. Dr. Christian Bizer
        Freie Universität Berlin
               Germany



                         Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)
Outline



          1. Topology of the Web of Data
              What data is out there?


          2. Global Data Integration
              How to split the integration effort


          3. Global Data Mining
               The logical next step




Linked Data Deployment on the Web

  Year   Datasets     Triples        Growth
  2007     12         500.000.000
  2008     45        2.000.000.000   300%
  2009     95        6.726.000.000   236%
  2010     203      26.930.509.703   300%
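
The Growth column appears to be the relative year-over-year increase in triples. A minimal Python check of that reading, using only the figures from the table (the function name is illustrative):

```python
# Triple counts per year, copied from the table above.
triples = {
    2007: 500_000_000,
    2008: 2_000_000_000,
    2009: 6_726_000_000,
    2010: 26_930_509_703,
}


def growth_percent(old: int, new: int) -> int:
    """Relative increase from old to new, rounded to a whole percent."""
    return round((new - old) / old * 100)


years = sorted(triples)
# Pair each year with its predecessor and compute the increase.
growth = {
    year: growth_percent(triples[prev], triples[year])
    for prev, year in zip(years, years[1:])
}

for year, pct in growth.items():
    print(year, f"{pct}%")
```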




Uptake in the Government Domain




  The EU is starting to publish Linked Data (LOD2, LATC)
  Various other national efforts
  W3C eGovernment Interest Group

Uptake in the Libraries Community

  Institutions publishing Linked Data
     Library of Congress (subject headings)
     German National Library (PND dataset and subject headings)
     Swedish National Library (Libris catalog)
     Hungarian National Library (OPAC and Digital Library)
     Europeana project just released data about 4 million artifacts


  Growth of Library Linked Data (2009-2010): 1000%
  W3C Library Linked Data Incubator Group
  Goals:
    1. Integrate library catalogs on a global scale.
    2. Interconnect resources between repositories
       (by topic, by location, by historical period, by ...).


LOD data set statistics as of November 2010


 Domain          Data Sets         Triples    Percent      RDF Links    Percent
 Cross-domain        20      1,999,085,950      7.42      29,105,638      7.36
 Geographic          16      5,904,980,833     21.93      16,589,086      4.19
 Government          25     11,613,525,437     43.12      17,658,869      4.46
 Media               26      2,453,898,811      9.11      50,374,304     12.74
 Libraries           67      2,237,435,732      8.31      77,951,898     19.71
 Life sciences       42      2,664,119,184      9.89     200,417,873     50.67
 User Content         7         57,463,756      0.21       3,402,228      0.86
 Total              203     26,930,509,703               395,499,896


 LOD Cloud Data Catalog on CKAN
 http://www.ckan.net/group/lodcloud

 More statistics
 http://www4.wiwiss.fu-berlin.de/lodcloud/state/
What are the big players doing?




Structured Data becomes a SEO Topic


                                                      Data Snippets




                                                    Query Answer




Result: Further growth …


 Usage of RDFa increased 510%
  between March 2009 and October 2010
 430 million webpages contain RDFa




 Source: Yahoo
 http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
The Structural Continuum


    The Web of Data is interwoven with the classic Web.


         Unstructured text: HTML
         Structured data:
            RDFa embedded in HTML (Open Graph)
            Microdata embedded in HTML (Schema.org)
            Microformats embedded in HTML

         Linked data: RDF/XML




Topology of the Web of Data




How to get the data?


   Download the Billion Triples Challenge Dataset
      2 billion triples (20GB gzipped)
      crawled from the public Web of Linked Data in May/June 2011
      http://challenge.semanticweb.org/


   Download the Sindice Dump
      12 billion triples (164 GB gzipped, ~1.16 TB uncompressed)
      crawled from the public Web of Linked Data and
      includes RDFa, Microformat, and wrapped API data
      http://data.sindice.com/trec2011/download.html
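
Dumps of this size are usually processed as a stream rather than unpacked first. The following is a minimal Python sketch, assuming the dump is a gzipped, line-based N-Triples or N-Quads file (the slides do not state the format); the function names and the sample statements are illustrative only.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count statements per predicate in N-Triples/N-Quads text lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace in these formats,
        # so the second whitespace-separated token is the predicate.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


def count_predicates_in_dump(path: str) -> Counter:
    """Stream a gzipped dump line by line without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_predicates(handle)


# Hypothetical sample statements, standing in for a real dump file.
sample = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .',
    '<http://example.org/a> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/b> <http://example.org/g> .',
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/g> .',
]
sample_counts = count_predicates(sample)
print(sample_counts.most_common())
```

A predicate histogram like this is a cheap first profile of which vocabularies a crawl actually contains.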




2. Global Data Integration


           Applications hate heterogeneity!




     The wild wild west                      My little world
The Dataspace Vision

 An alternative to classic data integration systems,
 intended to cope with a growing number of data sources.

 Properties of dataspaces
     no upfront investment in a global schema
     rely on pay-as-you-go data integration
     give best-effort answers to queries


   Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces:
   A New Abstraction for Information Management, SIGMOD Rec. 2005.

   Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford
   to Pay As You Go, CIDR 2007




Linked Data relies on the Pay-as-You-Go Idea

  for Identity Management
  for Schema/Vocabulary Management




Publish Identity Links on the Web


                                                                      Identity Link
    <http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
    owl:sameAs
    <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .




     You publish links pointing at other data sources.
     Somebody else publishes links pointing at your
     data source.
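
Because owl:sameAs is symmetric and transitive, a consumer can merge links from different publishers into identity clusters. A minimal Python sketch using union-find follows; the function name and the example.org URIs are hypothetical, and only the first link is taken from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def same_as_clusters(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group URIs that are connected by owl:sameAs links (union-find)."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right  # merge the two clusters

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return list(clusters.values())


links = [
    # Identity link from the slide.
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # Hypothetical link published by a third party.
    ("http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer",
     "http://example.org/people/cbizer"),
    # Hypothetical, unrelated pair.
    ("http://example.org/people/a", "http://example.org/people/b"),
]
clusters = same_as_clusters(links)
for cluster in clusters:
    print(sorted(cluster))
```

Note that a single wrong link merges two clusters, which is one reason data quality assessment matters for consumers.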




Effort Distribution between Publisher and Consumer




Consumer data mines
    identity links




              Effort
           Distribution




 Publishers or third
  parties provide
   identity links


Vocabularies on the Web of Data

  Everyone can use whatever vocabularies she likes
   to publish data on the Web.
  Or invest effort and reuse Common Vocabularies
     Friend-of-a-Friend for describing people and their social network
     SIOC for describing forums and blogs
     SKOS for representing topic taxonomies
     Organization Ontology for describing the structure of organizations
     GoodRelations provides terms for describing products and business entities
     Music Ontology for describing artists, albums, and performances
     Review Vocabulary provides terms for representing reviews

  Many Linked Data sources use a mixture of common and
   proprietary vocabulary terms.


Publish Vocabulary Links on the Web


                                                                 Vocabulary Link
    <http://xmlns.com/foaf/0.1/Person>
    owl:equivalentClass
    <http://dbpedia.org/ontology/Person> .



     Simple Mappings: RDFS, OWL
         rdfs:subClassOf, rdfs:subPropertyOf
         owl:equivalentClass, owl:equivalentProperty

     Complex Mappings: R2R
         provides value transformation functions
         structural transformations
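
For the simple mapping case, a consumer can apply published vocabulary links as one-to-one term rewrites. The Python sketch below assumes the consumer has chosen DBpedia terms as its target vocabulary; the function name, the property mapping, and the sample data are illustrative. It does not cover value or structural transformations, for which the slide points to R2R.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def normalize_vocabulary(
    triples: Iterable[Triple],
    class_map: Dict[str, str],
    property_map: Dict[str, str],
) -> List[Triple]:
    """Rewrite class and property terms into a target vocabulary."""
    normalized: List[Triple] = []
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            obj = class_map.get(obj, obj)  # map the class, keep rdf:type
        else:
            predicate = property_map.get(predicate, predicate)
        normalized.append((subject, predicate, obj))
    return normalized


# Derived from the owl:equivalentClass link on the slide.
class_map = {
    "http://xmlns.com/foaf/0.1/Person": "http://dbpedia.org/ontology/Person",
}
# Illustrative property mapping, treating foaf:name as a kind of label.
property_map = {
    "http://xmlns.com/foaf/0.1/name": "http://www.w3.org/2000/01/rdf-schema#label",
}

source = [
    ("http://example.org/p1", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/p1", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/p1", "http://example.org/vocab/age", "42"),
]
normalized = normalize_vocabulary(source, class_map, property_map)
for triple in normalized:
    print(triple)
```

Terms without a mapping pass through unchanged, which matches the pay-as-you-go idea: integration quality improves as more mappings become available.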




Deployment of Vocabulary Links




Source: Linked Open Vocabularies,
http://labs.mondeca.com/dataset/lov
Effort Distribution between Publisher and Consumer




Consumer defines or
data mines mappings




               Effort
            Distribution



  Publisher reuses
   vocabularies

Publisher or third party
 publishes mappings


Somebody-Pays-As-You-Go

  The overall data integration effort is
  split between the data publisher, the
  data consumer and third parties.

  [Diagram: a fixed overall data integration effort, divided into
   publisher's effort, third party effort, and consumer's effort]

   Data Publisher
      publishes data as RDF
      sets identity links
      reuses terms or publishes mappings

   Third Parties
      set identity links pointing at your data
      publish mappings to the Web

   Data Consumer
      has to do the rest
      using record linkage and schema matching
       techniques
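
As a rough illustration of the consumer's share, the Python sketch below proposes identity links by comparing entity names with a string similarity measure. This is one simple record linkage heuristic, not the method used by any particular tool; the function names, the threshold, and the example.org data are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity of two names, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def propose_identity_links(
    source: Dict[str, str],
    target: Dict[str, str],
    threshold: float = 0.9,
) -> List[Tuple[str, str, float]]:
    """Compare every source name with every target name (quadratic cost)."""
    links: List[Tuple[str, str, float]] = []
    for source_uri, source_name in source.items():
        for target_uri, target_name in target.items():
            score = name_similarity(source_name, target_name)
            if score >= threshold:
                links.append((source_uri, target_uri, score))
    return links


# Hypothetical URI-to-name tables from two data sources.
source = {
    "http://example.org/a/1": "Christian Bizer",
    "http://example.org/a/2": "Tom Heath",
}
target = {
    "http://example.org/b/x": "christian bizer",
    "http://example.org/b/y": "Tim Berners-Lee",
}
links = propose_identity_links(source, target)
for source_uri, target_uri, score in links:
    print(source_uri, "owl:sameAs", target_uri, f"(score {score:.2f})")
```

The all-pairs comparison does not scale to billions of triples; real systems reduce the candidate pairs first, for example by blocking on a shared key.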
Research Directions


 1. More research on pay-as-you-go data integration is needed.


 2. More research on data mining mappings and
    identity resolution heuristics is needed.
     Identity links make it easier to mine vocabulary links.
     Vocabulary links make it easier to mine identity links.



 3. More research on SPAM detection and data quality
    assessment is needed.




LDIF – Linked Data Integration Framework

  Combines vocabulary normalization and identity resolution
     Currently only in-memory implementation
     Next release: Hadoop-based implementation

   http://www4.wiwiss.fu-berlin.de/bizer/ldif/

  [Diagram labels: Normalize vocabularies, Identity Resolution]




What can we do afterwards …

   … build better entity search engines




3. Global Data Mining




Think about interesting questions …

 … that you can answer based on the Web of Data
 … that require
     aggregation
     summarization
     classification
     association rule mining

 … combined with
     text mining
     sentiment analysis
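
To make one of the listed techniques concrete, the Python sketch below computes support and confidence for a single association rule over entity descriptions. The entities and the rule are hypothetical; in practice the property sets would come from integrated Web data.

```python
from typing import Iterable, Set, Tuple


def rule_metrics(
    transactions: Iterable[Set[str]],
    antecedent: Set[str],
    consequent: Set[str],
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule antecedent -> consequent."""
    rows = list(transactions)
    with_antecedent = [row for row in rows if antecedent <= row]
    with_both = [row for row in with_antecedent if consequent <= row]
    support = len(with_both) / len(rows) if rows else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Each set lists the properties used to describe one hypothetical entity.
entities = [
    {"foaf:name", "foaf:mbox", "foaf:knows"},
    {"foaf:name", "foaf:knows"},
    {"foaf:name"},
    {"dc:title"},
]
support, confidence = rule_metrics(entities, {"foaf:knows"}, {"foaf:name"})
print(f"support={support}, confidence={confidence}")
```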




Everybody has the tools to find the answers




Research Directions


 1. More research on data space profiling is needed.


 2. More research on global data mining is needed.



  Google, Yahoo, Microsoft, Facebook will get there soon.




Semantic Web Challenge

  Submission Statistics

    Year       Open Track          Billion Triple Track
    2008            13                      9
    2009            16                      3
    2010            14                      4



  Do something interesting with the Billion Triple Data
     and submit your results to the challenge by October 1st
     present your results at the 10th International Semantic Web Conference
      (ISWC2011), October 2011, Koblenz, Germany




Conclusions

  The Web of Data is there
     Linked Data, Microdata, RDFa, Microformats


  Upcoming research topics
     pay-as-you-go data integration
     mapping discovery, schema clustering
     identity resolution heuristics discovery
     probabilistic data integration
     data quality assessment
     data space profiling
     global data mining




Thanks!




References
   Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global
    Data Space. http://linkeddatabook.com/
   Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data – The Story So Far
    http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

STI Summit 2011 - Global data integration and global data mining

  • 1. STI Summit, July 6th, 2011, Riga, Latvia. Global Data Integration and Global Data Mining. Prof. Dr. Christian Bizer, Freie Universität Berlin, Germany. Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)
  • 2. Outline
      1. Topology of the Web of Data: What data is out there?
      2. Global Data Integration: How to split the integration effort
      3. Global Data Mining: The logical next step
  • 3. Linked Data Deployment on the Web
      Year   Datasets   Triples          Growth
      2007   12            500,000,000
      2008   45          2,000,000,000   300%
      2009   95          6,726,000,000   236%
      2010   203        26,930,509,703   300%
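
The growth column can be reproduced from the triple counts. The following is a minimal Python sketch, assuming that "growth" means the year-over-year percentage increase in triples; the names `triples_by_year` and `yearly_growth` are illustrative and not from the talk.

```python
# Triple counts per year, copied from the table on slide 3.
triples_by_year = {
    2007: 500_000_000,
    2008: 2_000_000_000,
    2009: 6_726_000_000,
    2010: 26_930_509_703,
}


def yearly_growth(counts):
    """Return {year: percentage increase over the previous year}, in whole percent."""
    years = sorted(counts)
    return {
        year: round((counts[year] - counts[prev]) / counts[prev] * 100)
        for prev, year in zip(years, years[1:])
    }


growth = yearly_growth(triples_by_year)
for year, pct in growth.items():
    print(f"{year}: {pct}%")
```

Under that assumption the computed values agree with the percentages shown on the slide.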
  • 4. Uptake in the Government Domain
      ◦ The EU is starting to publish Linked Data (LOD2, LATC)
      ◦ Various other national efforts
      ◦ W3C eGovernment Interest Group
  • 5. Uptake in the Libraries Community
      ◦ Institutions publishing Linked Data
          ◦ Library of Congress (subject headings)
          ◦ German National Library (PND dataset and subject headings)
          ◦ Swedish National Library (Libris - catalog)
          ◦ Hungarian National Library (OPAC and Digital Library)
          ◦ Europeana project just released data about 4 million artifacts
      ◦ Growth of Library Linked Data (2009-2010): 1000%
      ◦ W3C Library Linked Data Incubator Group
      ◦ Goals:
          1. Integrate library catalogs on a global scale.
          2. Interconnect resources between repositories (by topic, by location, by historical period, by ...).
  • 6. LOD data set statistics as of November 2010
      Domain          Data Sets   Triples          Percent   RDF Links     Percent
      Cross-domain    20           1,999,085,950    7.42      29,105,638    7.36
      Geographic      16           5,904,980,833   21.93      16,589,086    4.19
      Government      25          11,613,525,437   43.12      17,658,869    4.46
      Media           26           2,453,898,811    9.11      50,374,304   12.74
      Libraries       67           2,237,435,732    8.31      77,951,898   19.71
      Life sciences   42           2,664,119,184    9.89     200,417,873   50.67
      User Content    7               57,463,756    0.21       3,402,228    0.86
      Total           203         26,930,509,703             395,499,896
      LOD Cloud Data Catalog on CKAN: http://www.ckan.net/group/lodcloud
      More statistics: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
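
The percentage columns follow directly from the absolute counts. As a rough check, the sketch below recomputes each domain's share of triples and of RDF links from the figures on slide 6; the dictionary layout and the helper name `shares` are assumptions made for illustration.

```python
# (data sets, triples, RDF links) per domain, copied from the table on slide 6.
lod_stats = {
    "Cross-domain": (20, 1_999_085_950, 29_105_638),
    "Geographic": (16, 5_904_980_833, 16_589_086),
    "Government": (25, 11_613_525_437, 17_658_869),
    "Media": (26, 2_453_898_811, 50_374_304),
    "Libraries": (67, 2_237_435_732, 77_951_898),
    "Life sciences": (42, 2_664_119_184, 200_417_873),
    "User Content": (7, 57_463_756, 3_402_228),
}


def shares(stats, column):
    """Return each domain's percentage of the column total."""
    total = sum(row[column] for row in stats.values())
    return {domain: 100 * row[column] / total for domain, row in stats.items()}


total_datasets = sum(row[0] for row in lod_stats.values())
total_triples = sum(row[1] for row in lod_stats.values())
total_links = sum(row[2] for row in lod_stats.values())

triple_share = shares(lod_stats, 1)
link_share = shares(lod_stats, 2)

for domain in lod_stats:
    print(f"{domain:14s} {triple_share[domain]:6.2f} {link_share[domain]:6.2f}")
```

The recomputed totals match the last row of the table. The shares also make the imbalance visible: life sciences data holds about a tenth of the triples but about half of the RDF links.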
  • 7. What are the big players doing?
  • 8. Structured Data becomes an SEO Topic: Data Snippets, Query Answer
  • 9. Result: Further growth
      ◦ … usage of RDFa has increased 510% between March, 2009 and October, 2010
      ◦ 430 million webpages contain RDFa
      ◦ Source: Yahoo, http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
  • 10. The Structural Continuum
      The Web of Data is interwoven with the classic Web.
      ◦ Unstructured text: HTML
      ◦ Structured data:
          ◦ RDFa embedded in HTML (Open Graph)
          ◦ Microdata embedded in HTML (Schema.org)
          ◦ Microformats embedded in HTML
      ◦ Linked data: RDF/XML
  • 11. Topology of the Web of Data
  • 12. How to get the data?
      ◦ Download the Billion Triples Challenge Dataset
          ◦ 2 billion triples (20GB gzipped)
          ◦ crawled from the public Web of Linked Data in May/June 2011
          ◦ http://challenge.semanticweb.org/
      ◦ Download the Sindice Dump
          ◦ 12 billion triples (164GB gzipped, ~1.16TB uncompressed)
          ◦ crawled from the public Web of Linked Data
          ◦ includes RDFa, Microformat, and wrapped API data
          ◦ http://data.sindice.com/trec2011/download.html
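
Dumps of this size are normally processed as a stream rather than loaded into memory. The sketch below counts statements in one gzipped file, assuming the dump uses a line-based serialization such as N-Triples or N-Quads (one statement per line, `#` for comments). The function name and the file name in the usage comment are hypothetical.

```python
import gzip


def count_statements(path):
    """Count statements in a gzipped, line-based RDF dump, one line at a time."""
    count = 0
    # "rt" decompresses on the fly and yields text lines; bad bytes are replaced
    # instead of aborting a long-running scan.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            line = line.strip()
            if line and not line.startswith("#"):  # skip blank lines and comments
                count += 1
    return count


# Usage (hypothetical file name):
# print(count_statements("dump-chunk-000.nq.gz"))
```

The same loop is the natural place to hook in per-statement processing, for example collecting the predicates used by each source.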
  • 13. 2. Global Data Integration
      Applications hate heterogeneity!
      ◦ The wild wild west
      ◦ My little world
  • 14. The Dataspace Vision
      Alternative to classic data integration systems in order to cope with a growing number of data sources.
      Properties of dataspaces:
      ◦ no upfront investment into a global schema
      ◦ rely on pay-as-you-go data integration
      ◦ give best effort answers to queries
      Franklin, M., Halevy, A., and Maier, D.: From Databases to Dataspaces: A New Abstraction for Information Management, SIGMOD Rec. 2005.
      Madhavan, J., et al.: Web-scale Data Integration: You Can Only Afford to Pay As You Go, CIDR 2007.
  • 15. Linked Data relies on Pay-as-You-Go Idea
      ◦ for Identity Management
      ◦ for Schema/Vocabulary Management
  • 16. Publish Identity Links on the Web
      Identity Link:
      <http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4> owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
      ◦ You publish links pointing at other data sources.
      ◦ Somebody else publishes links pointing at your data source.
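
A consumer can merge identity links from several publishers into clusters of URIs that denote the same entity. The sketch below does this with a small union-find structure. It treats `owl:sameAs` as symmetric and transitive, which is a simplification, and the third URI (`example.org`) is a made-up third-party link added only for illustration.

```python
from collections import defaultdict


def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links into clusters (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
            uri = parent[uri]
        return uri

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return list(clusters.values())


links = [
    # The identity link shown on slide 16.
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # Hypothetical link published by a third party.
    ("http://example.org/people/cbizer",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
]
clusters = same_as_clusters(links)
```

Because the two links share one URI, all three URIs end up in a single cluster, even though no publisher linked the first and third directly.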
  • 17. Effort Distribution between Publisher and Consumer
      Effort Distribution:
      ◦ Consumer data mines identity links
      ◦ Publishers or third parties provide identity links
  • 18. Vocabularies on the Web of Data
      ◦ Everyone can use whatever vocabularies she likes to publish data on the Web.
      ◦ Or invest effort and reuse Common Vocabularies
          ◦ Friend-of-a-Friend for describing people and their social network
          ◦ SIOC for describing forums and blogs
          ◦ SKOS for representing topic taxonomies
          ◦ Organization Ontology for describing the structure of organizations
          ◦ GoodRelations provides terms for describing products and business entities
          ◦ Music Ontology for describing artists, albums, and performances
          ◦ Review Vocabulary provides terms for representing reviews
      ◦ Many Linked Data sources use a mixture of common and proprietary vocabulary terms.
  • 19. Publish Vocabulary Links on the Web
      Vocabulary Link:
      <http://xmlns.com/foaf/0.1/Person> owl:equivalentClass <http://dbpedia.org/ontology/Person> .
      ◦ Simple Mappings: RDFS, OWL
          ◦ rdfs:subClassOf, rdfs:subPropertyOf
          ◦ owl:equivalentClass, owl:equivalentProperty
      ◦ Complex Mappings: R2R
          ◦ provides value transformation functions
          ◦ structural transformations
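
For the simple mappings, a consumer can apply published vocabulary links as plain term rewriting. The sketch below covers only the `owl:equivalentClass` / `owl:equivalentProperty` case by replacing mapped properties and mapped classes in `rdf:type` statements. It is not R2R and does no value or structural transformation; the function name, the tuple representation of triples and the `example.org` subject are assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
DBO_PERSON = "http://dbpedia.org/ontology/Person"

# Source term -> target term, derived from the vocabulary link on slide 19.
term_map = {FOAF_PERSON: DBO_PERSON}


def normalize_vocabulary(triples, term_map):
    """Rewrite properties and rdf:type classes into the target vocabulary."""
    normalized = []
    for s, p, o in triples:
        p = term_map.get(p, p)      # equivalent properties
        if p == RDF_TYPE:
            o = term_map.get(o, o)  # equivalent classes
        normalized.append((s, p, o))
    return normalized


# Hypothetical input data using the FOAF class.
data = [("http://example.org/people/alice", RDF_TYPE, FOAF_PERSON)]
result = normalize_vocabulary(data, term_map)
```

After normalization, an application written against the DBpedia ontology sees the resource as a `dbpedia.org/ontology/Person` without knowing that the publisher used FOAF.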
  • 20. Deployment of Vocabulary Links
      Source: Linked Open Vocabularies, http://labs.mondeca.com/dataset/lov
  • 21. Effort Distribution between Publisher and Consumer
      Effort Distribution:
      ◦ Consumer defines or data mines mappings
      ◦ Publisher reuses vocabularies
      ◦ Publisher or third party publishes mappings
  • 22. Somebody-Pays-As-You-Go
      The overall data integration effort is split between the data publisher, the data consumer and third parties.
      ◦ Data Publisher
          ◦ publishes data as RDF
          ◦ sets identity links
          ◦ reuses terms or publishes mappings
      ◦ Third Parties
          ◦ set identity links pointing at your data
          ◦ publish mappings to the Web
      ◦ Data Consumer
          ◦ has to do the rest
          ◦ using record linkage and schema matching techniques
      Figure: fixed overall data integration effort, divided into Publisher's Effort, Third Party Effort and Consumer's Effort.
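
When neither publishers nor third parties have set identity links, the consumer falls back on record linkage. The sketch below is one very simple instance of that idea: it compares labels from two sources with a string similarity measure from Python's standard library and proposes a link above a threshold. The data, the URIs, the threshold of 0.9 and the function name are all illustrative assumptions; real record linkage would use blocking and several attributes.

```python
from difflib import SequenceMatcher


def link_records(source_a, source_b, threshold=0.9):
    """Propose identity links between two {uri: label} sources by label similarity."""
    links = []
    for uri_a, label_a in source_a.items():
        for uri_b, label_b in source_b.items():
            score = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
            if score >= threshold:
                links.append((uri_a, uri_b))
    return links


# Hypothetical records from two unlinked data sources.
source_a = {
    "http://example.org/a/1": "Christian Bizer",
    "http://example.org/a/2": "Tom Heath",
}
source_b = {
    "http://example.org/b/x": "christian bizer",
    "http://example.org/b/y": "Tim Berners-Lee",
}
proposed = link_records(source_a, source_b)
```

Each proposed pair could then be published as an `owl:sameAs` link, which moves that piece of effort from future consumers to a third party.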
  • 23. Research Directions
      1. More research on pay-as-you-go data integration is needed.
      2. More research on data mining mappings and identity resolution heuristics is needed.
          ◦ Identity links make it easier to mine vocabulary links.
          ◦ Vocabulary links make it easier to mine identity links.
      3. More research on SPAM detection and data quality assessment is needed.
  • 24. LDIF – Linked Data Integration Framework
      ◦ Combines vocabulary normalization and identity resolution
      ◦ Currently only an in-memory implementation
      ◦ Next release: Hadoop-based implementation
      ◦ http://www4.wiwiss.fu-berlin.de/bizer/ldif/
      Pipeline: Normalize vocabularies, then Identity Resolution
  • 25. What can we do afterwards …
      … build better entity search engines
  • 26. 3. Global Data Mining
  • 27. Christian Bizer: Global Data Integration – STI Summit, Riga (6/7/2011)
  • 28. Think about interesting questions …
      … that you can answer based on the Web of Data
      … that require
          ◦ aggregation
          ◦ summarization
          ◦ classification
          ◦ association rule mining
      … combined with
          ◦ text mining
          ◦ sentiment analysis
  • 29. Everybody has the tools to find the answers
  • 30. Research Directions
      1. More research on data space profiling is needed.
      2. More research on global data mining is needed.
      ◦ Google, Yahoo, Microsoft, Facebook will get there soon.
  • 31. Semantic Web Challenge
      Submission Statistics
      Year   Open Track   Billion Triple Track
      2008   13           9
      2009   16           3
      2010   14           4
      ◦ Do something interesting with the Billion Triple Data
      ◦ and submit your results to the challenge by October 1st
      ◦ present your results at the 10th International Semantic Web Conference (ISWC2011), October 2011, Koblenz, Germany
  • 32. Conclusions
      ◦ The Web of Data is there
          ◦ Linked Data, Microdata, RDFa, Microformats
      ◦ Upcoming research topics
          ◦ pay-as-you-go data integration
          ◦ mapping discovery, schema clustering
          ◦ identity resolution heuristics discovery
          ◦ probabilistic data integration
          ◦ data quality assessment
          ◦ data space profiling
          ◦ global data mining
  • 33. Thanks! References
      ◦ Textbook: Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. http://linkeddatabook.com/
      ◦ Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data – The Story So Far. http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf