SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
On Statistical Characteristics
of Real-life Knowledge Graphs
Weining Qian
Institute for Data Science and Engineering
East China Normal University
wnqian@sei.ecnu.edu.cn
NSFC key project on Big Data Benchmarks
• Theory and Methods of Benchmarking Big Data
Management Systems
– 2015.1 – 2019.12
2
East China
Normal University
Renmin University
of China
Institute of Computing Technology
Chinese Academy of Sciences
6/22/2016 LDBC TUC 2016
Research focuses
• BigDataBench Suite (@ict.ac)
– http://prof.ict.ac.cn
• Benchmarking transaction processing in
NewSQL systems (@ecnu)
– SecKillBench
• Benchmarking Big Data systems (@rmu)
• Benchmarks for graph data (@ecnu)
– Social media, knowledge graphs, ...
36/22/2016 LDBC TUC 2016
Knowledge graphs
4
knowledge
basesemantic
web
knowledge
graph
linked data
6/22/2016 LDBC TUC 2016
Some KG‘s
5
• YAGO
– 10M entities in 350K classes
– 120M facts for 100 relations
– 100 languages
– 95% accuracy
• DBPedia
– 4M entities in 250 classes
– 500M facts for 6000 properties
– live updates
• Kosmix
– 6.5M concepts, 6.7M concept
instances,
– 165M relationship instances
• Freebase
– 40M entities in 15000 topics
– 1B facts for 4000 properties
– core of Google Knowledge
Graph
• Google Knowledge Graph
– 600M entities in 15000 topics
– 20B facts
• Probase
– 2.7 million+ concepts
• And many domain/application-
specific knowledge graphs
6/22/2016 LDBC TUC 2016
A natural question
• Knowledge graph can serve as the backbone
of many Web-scale applications, such as
search engine, question answering, text
understanding etc.
• How to effectively and efficiently manage a
large-scale knowledge graph?
– MySQL, Oracle, Neo4j, TITAN, Trinity, or other
triple stores???
66/22/2016 LDBC TUC 2016
Social networks vs. Knowledge graphs
• Though there are some benchmarks for social
networks exist
– Facebook LinkBench, LDBC SNB, BSMA, ...
• Knowledge graph is different with social network
– More semantic labels in both entities and relations
– Topic or domain sensitive
– Contains various kinds of knowledge
– Hard to define a unified schema
76/22/2016 LDBC TUC 2016
Why study their statistical characteristics?
• To better understand knowledge graphs
• To help the selection of seeding data sets in
benchmarks
• To help the development of data generators
86/22/2016 LDBC TUC 2016
Characteristics of large-scale graphs
• Previous research works on analyzing structural
properties of large scale graphs, e.g.
– [Broder et al. Comput. Netw. 2000] studied the web
structure as a graph via a series of metrics, e.g degree,
diameter, component.
– [Kumar et al. KDD, 2006] studied the dynamic social
network’s structure properties, e.g. degree, hop etc.
– [Boccaletti et al. Phys. Rep. 2006] surveyed the studies of
the structure and dynamics of complex network.
96/22/2016 LDBC TUC 2016
Real-life knowledge graphs
• YAGO2
– A huge semantic knowledge graph based on WordNet,
Wikipedia and GeoNames
– 10+ million entities, 120+ million facts
• Separate the YAGO2 into three sub-graphs
– YagoTax: Taxonomy tree of YAGO2
– YagoFact: Facts in YAGO2
– YagoWiki: Hyperlink relations in YAGO2 based on
Wikipedia
106/22/2016 LDBC TUC 2016
Real-life knowledge graphs
• WordNet
– A lexical network for the English language.
– Synonym set as node and semantic relation as edge.
– 98,000 entities, 154,000 relationships
• DBpedia
– A multi-language knowledge base extracted from
Wikipedia info-boxes
– English version of DBPedia
– 4.58 million things and 2,795 different kinds of properties
116/22/2016 LDBC TUC 2016
Real-life knowledge graphs
• Enterprise Knowledge Graph (EKG)
– Describes an enterprise relationships in Chinese
– Extracted from reports from enterprises in Shanghai Stock
Market
– Used for credit and risk analysis in financial companies
– A domain specific knowledge graph
– Seven kinds of relationships between two entities
• Assignment, hold, subcompany, changename, manager,
cooperate, merge
– 51,853 entities and 430,973 relationships.
126/22/2016 LDBC TUC 2016
Plus two social networks
• SNRand
– 0.2 million randomly selected users
– 5 million fellowship relations between users
• SNRank
– 0.2 million most active users.
– 36+ million fellowship relations between users
• The raw data is collected from a famous social
media platform named Sina Weibo in China
136/22/2016 LDBC TUC 2016
Statistical characteristics
146/22/2016 LDBC TUC 2016
Four kinds of distributions
• Distribution of degrees
– In-degree and out-degree
– Power-law distribution
• Distribution of hops
– Reflects the connectivity cost inside a graph
• Distribution of connected components
– Strongly and weakly connected components
– Reflects the connectivity of a graph
• Distribution of clustering coefficients
– Measures the nodes’ tendency to cluster together
156/22/2016 LDBC TUC 2016
In-degrees and out-degrees
16
All the in/out-degree distributions exhibit the power-law (or piece-wise power-
law), except for some initial segments that deviate the power-law.
6/22/2016 LDBC TUC 2016
Size of connected components
17
Both the strongly and weakly connected component distributions of knowledge
graphs exhibit the power-law distribution in general. While the social networks
are nearly in a whole strongly connected component.
6/22/2016 LDBC TUC 2016
Distance between two vertices
• Social networks are
“small worlds”
• Taxonomy’s
diameter is large
(tree-alike)
• Basically, the larger
the network is, the
smaller the diameter
is
186/22/2016 LDBC TUC 2016
196/22/2016 LDBC TUC 2016
Cluster coefficient
• Taxonomy and social
networks are
different
• All other KG’s are of
power-law
distributions
206/22/2016 LDBC TUC 2016
Cluster coefficient
• Taxonomy and social
networks are
different
• All other KG’s are of
power-law
distributions
216/22/2016 LDBC TUC 2016
Node degrees of different parts in EKG
22
Different relationships show different out-degree distributions
6/22/2016 LDBC TUC 2016
Size of connected components of
different parts in EKG
23
Few connected components are much larger than others
6/22/2016 LDBC TUC 2016
Different parts in EKG
24
Highly clustered Small-world network
6/22/2016 LDBC TUC 2016
Conclusions and discussions
25
• KG’s are different to SN’s
– Taxonomy/ontology + fact
• Following different statistical distributions
– KG’s are labeled
• Different subgraphs are of different sizes and characteristics
• Both triple stores and relational databases have reasons to be
used
– The key is to avoid joins over power-law distributed data
Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou:
On
Statistical Characteristics of Real-Life Knowledge Graphs. BPOE 2015: 37-49
6/22/2016 LDBC TUC 2016
谢谢!
Thanks!
http://dase.ecnu.edu.cn
wnqian@sei.ecnu.edu.cn
Statistical characteristics
6/22/2016 LDBC TUC 2016 27

Weitere ähnliche Inhalte

Andere mochten auch

Plc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhiPlc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhi
7532993375
 

Andere mochten auch (7)

Close to Home Cruises
Close to Home Cruises Close to Home Cruises
Close to Home Cruises
 
Informe de composición de drogas ilíicitas en Euskadi 2015
Informe de composición de drogas ilíicitas en Euskadi 2015Informe de composición de drogas ilíicitas en Euskadi 2015
Informe de composición de drogas ilíicitas en Euskadi 2015
 
Plc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhiPlc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhi
 
Stili comunicativi
Stili comunicativiStili comunicativi
Stili comunicativi
 
Challenges faced by women entrepreneurs in the present technological era
Challenges faced by women entrepreneurs in the present technological eraChallenges faced by women entrepreneurs in the present technological era
Challenges faced by women entrepreneurs in the present technological era
 
Termòmetre Lingüístic: Un eina de recerca per la pràctica docent
Termòmetre Lingüístic: Un eina de recerca  per la pràctica docent Termòmetre Lingüístic: Un eina de recerca  per la pràctica docent
Termòmetre Lingüístic: Un eina de recerca per la pràctica docent
 
Sidetur
SideturSidetur
Sidetur
 

Ähnlich wie Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs.

An Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-MakingAn Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-Making
Raed Mansour
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
IIRindia
 
Provenance and social network analysis for recommender systems: a literature...
Provenance and social network analysis for recommender  systems: a literature...Provenance and social network analysis for recommender  systems: a literature...
Provenance and social network analysis for recommender systems: a literature...
IJECEIAES
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Citadelh2020
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Gayane Sedrakyan
 

Ähnlich wie Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs. (20)

Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET- Predicting Social Network Communities Structure Changes and Detection ...IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET- Predicting Social Network Communities Structure Changes and Detection ...
 
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open Data
 
Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPs
 
Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)
 
ONS local presents clustering
ONS local presents clusteringONS local presents clustering
ONS local presents clustering
 
An Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-MakingAn Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-Making
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
 
Provenance and social network analysis for recommender systems: a literature...
Provenance and social network analysis for recommender  systems: a literature...Provenance and social network analysis for recommender  systems: a literature...
Provenance and social network analysis for recommender systems: a literature...
 
NBK update briefing October 2017
NBK update briefing October 2017NBK update briefing October 2017
NBK update briefing October 2017
 
Towards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data GraphTowards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data Graph
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
 
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
 
UNIT_1-BD.pptx
UNIT_1-BD.pptxUNIT_1-BD.pptx
UNIT_1-BD.pptx
 
Survey on Location Based Recommendation System Using POI
Survey on Location Based Recommendation System Using POISurvey on Location Based Recommendation System Using POI
Survey on Location Based Recommendation System Using POI
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
 
Overview of XSEDE Systems Engineering
Overview of XSEDE Systems EngineeringOverview of XSEDE Systems Engineering
Overview of XSEDE Systems Engineering
 

Mehr von LDBC council

Mehr von LDBC council (20)

8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
 
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
 
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
 
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
 
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
 
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
 
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
 
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
 
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
 
8th TUC Meeting -
8th TUC Meeting - 8th TUC Meeting -
8th TUC Meeting -
 
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
 
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
 
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
LDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusionsLDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusions
 
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOXParallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
 
MODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service SelectionMODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service Selection
 
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
 
LDBC SNB Benchmark Auditing
LDBC SNB Benchmark AuditingLDBC SNB Benchmark Auditing
LDBC SNB Benchmark Auditing
 
Social Network Benchmark Interactive Workload
Social Network Benchmark Interactive WorkloadSocial Network Benchmark Interactive Workload
Social Network Benchmark Interactive Workload
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs.

  • 1. On Statistical Characteristics of Real-life Knowledge Graphs Weining Qian Institute for Data Science and Engineering East China Normal University wnqian@sei.ecnu.edu.cn
  • 2. NSFC key project on Big Data Benchmarks • Theory and Methods of Benchmarking Big Data Management Systems – 2015.1 – 2019.12 2 East China Normal University Renmin University of China Institute of Computing Technology Chinese Academy of Sciences 6/22/2016 LDBC TUC 2016
  • 3. Research focuses • BigDataBench Suite (@ict.ac) – http://prof.ict.ac.cn • Benchmarking transaction processing in NewSQL systems (@ecnu) – SecKillBench • Benchmarking Big Data systems (@rmu) • Benchmarks for graph data (@ecnu) – Social media, knowledge graphs, ... 36/22/2016 LDBC TUC 2016
  • 5. Some KG‘s 5 • YAGO – 10M entities in 350K classes – 120M facts for 100 relations – 100 languages – 95% accuracy • DBPedia – 4M entities in 250 classes – 500M facts for 6000 properties – live updates • Kosmix – 6.5M concepts, 6.7M concept instances, – 165M relationship instances • Freebase – 40M entities in 15000 topics – 1B facts for 4000 properties – core of Google Knowledge Graph • Google Knowledge Graph – 600M entities in 15000 topics – 20B facts • Probase – 2.7 million+ concepts • And many domain/application- specific knowledge graphs 6/22/2016 LDBC TUC 2016
  • 6. A natural question • Knowledge graph can serve as the backbone of many Web-scale applications, such as search engine, question answering, text understanding etc. • How to effectively and efficiently manage a large-scale knowledge graph? – MySQL, Oracle, Neo4j, TITAN, Trinity, or other triple stores??? 66/22/2016 LDBC TUC 2016
  • 7. Social networks vs. Knowledge graphs • Though there are some benchmarks for social networks exist – Facebook LinkBench, LDBC SNB, BSMA, ... • Knowledge graph is different with social network – More semantic labels in both entities and relations – Topic or domain sensitive – Contains various kinds of knowledge – Hard to define a unified schema 76/22/2016 LDBC TUC 2016
  • 8. Why study their statistical characteristics? • To better understand knowledge graphs • To help the selection of seeding data sets in benchmarks • To help the development of data generators 86/22/2016 LDBC TUC 2016
  • 9. Characteristics of large-scale graphs • Previous research works on analyzing structural properties of large scale graphs, e.g. – [Broder et al. Comput. Netw. 2000] studied the web structure as a graph via a series of metrics, e.g degree, diameter, component. – [Kumar et al. KDD, 2006] studied the dynamic social network’s structure properties, e.g. degree, hop etc. – [Boccaletti et al. Phys. Rep. 2006] surveyed the studies of the structure and dynamics of complex network. 96/22/2016 LDBC TUC 2016
  • 10. Real-life knowledge graphs • YAGO2 – A huge semantic knowledge graph based on WordNet, Wikipedia and GeoNames – 10+ million entities, 120+ million facts • Separate the YAGO2 into three sub-graphs – YagoTax: Taxonomy tree of YAGO2 – YagoFact: Facts in YAGO2 – YagoWiki: Hyperlink relations in YAGO2 based on Wikipedia 106/22/2016 LDBC TUC 2016
  • 11. Real-life knowledge graphs • WordNet – A lexical network for the English language. – Synonym set as node and semantic relation as edge. – 98,000 entities, 154,000 relationships • DBpedia – A multi-language knowledge base extracted from Wikipedia info-boxes – English version of DBPedia – 4.58 million things and 2,795 different kinds of properties 116/22/2016 LDBC TUC 2016
  • 12. Real-life knowledge graphs • Enterprise Knowledge Graph (EKG) – Describes an enterprise relationships in Chinese – Extracted from reports from enterprises in Shanghai Stock Market – Used for credit and risk analysis in financial companies – A domain specific knowledge graph – Seven kinds of relationships between two entities • Assignment, hold, subcompany, changename, manager, cooperate, merge – 51,853 entities and 430,973 relationships. 126/22/2016 LDBC TUC 2016
  • 13. Plus two social networks • SNRand – 0.2 million randomly selected users – 5 million fellowship relations between users • SNRank – 0.2 million most active users. – 36+ million fellowship relations between users • The raw data is collected from a famous social media platform named Sina Weibo in China 136/22/2016 LDBC TUC 2016
  • 15. Four kinds of distributions • Distribution of degrees – In-degree and out-degree – Power-law distribution • Distribution of hops – Reflects the connectivity cost inside a graph • Distribution of connected components – Strongly and weakly connected components – Reflects the connectivity of a graph • Distribution of clustering coefficients – Measures the nodes’ tendency to cluster together 156/22/2016 LDBC TUC 2016
  • 16. In-degrees and out-degrees 16 All the in/out-degree distributions exhibit the power-law (or piece-wise power- law), except for some initial segments that deviate the power-law. 6/22/2016 LDBC TUC 2016
  • 17. Size of connected components 17 Both the strongly and weakly connected component distributions of knowledge graphs exhibit the power-law distribution in general. While the social networks are nearly in a whole strongly connected component. 6/22/2016 LDBC TUC 2016
  • 18. Distance between two vertices • Social networks are “small worlds” • Taxonomy’s diameter is large (tree-alike) • Basically, the larger the network is, the smaller the diameter is 186/22/2016 LDBC TUC 2016
  • 20. Cluster coefficient • Taxonomy and social networks are different • All other KG’s are of power-law distributions 206/22/2016 LDBC TUC 2016
  • 21. Cluster coefficient • Taxonomy and social networks are different • All other KG’s are of power-law distributions 216/22/2016 LDBC TUC 2016
  • 22. Node degrees of different parts in EKG 22 Different relationships show different out-degree distributions 6/22/2016 LDBC TUC 2016
  • 23. Size of connected components of different parts in EKG 23 Few connected components are much larger than others 6/22/2016 LDBC TUC 2016
  • 24. Different parts in EKG 24 Highly clustered Small-world network 6/22/2016 LDBC TUC 2016
  • 25. Conclusions and discussions 25 • KG’s are different to SN’s – Taxonomy/ontology + fact • Following different statistical distributions – KG’s are labeled • Different subgraphs are of different sizes and characteristics • Both triple stores and relational databases have reasons to be used – The key is to avoid joins over power-law distributed data Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou:
On Statistical Characteristics of Real-Life Knowledge Graphs. BPOE 2015: 37-49 6/22/2016 LDBC TUC 2016