SlideShare ist ein Scribd-Unternehmen logo
1 von 50
LDBC
Industry-strength benchmarks
for
Graph and RDF
Data Management
Peter Boncz
 make
competing
products
comparable
 accelerate
progress,
make
technology
viable
Why Benchmarking?
© Jim Gray, 2005
What is the LDBC?
Linked Data Benchmark Council = LDBC
 Industry entity similar to TPC (www.tpc.org)
 Focusing on graph and RDF store benchmarking
Kick-started by an EU project
 Runs from September 2012 – March 2015
 9 project partners:
 Will continue independently after the EU project
LDBC Benchmark Design
Developed by so-called “task forces”
 Requirements analysis and use case selection.
◦ Technical User Community (TUC)
 Benchmark specification.
◦ data generator
◦ query workload
◦ metrics
◦ reporting format
 Benchmark implementation.
◦ tools (query drivers, data generation, validation)
◦ test evaluations
 Auditing
◦ auditing guide
◦ auditor training
LDBC: what systems?
Benchmarks for:
 RDF stores (SPARQL speaking)
◦ Virtuoso, OWLIM, BigData, Allegrograph,…
 Graph Database systems
◦ Neo4j, DEX, InfiniteGraph, …
 Graph Programming Frameworks
◦ Giraph, Green Marl, Grappa, GraphLab,…
 Relational Database systems
LDBC: functionality
Benchmarks for:
 Transactional updates in (RDF) graphs
 Business Intelligence queries over
graphs
 Graph Analytics (e.g. graph clustering)
 Complex RDF workload, e.g. including
reasoning, or for data integration
Anything relevant for RDF and graph
data management systems
Roadmap for the Keynote
Choke-point based benchmark design
 What are Choke-points?
◦ examples from good-old TPC-H
◦  relational database benchmarking
 A Graph benchmark Choke-Point, in-
depth:
◦ Structural Correlation in Graphs
◦ and what we do about it in LDBC
Database Benchmark Design
Desirable properties:
 Relevant.
 Representative.
 Understandable.
 Economical.
 Accepted.
 Scalable.
 Portable.
 Fair.
 Evolvable.
 Public.
Jim Gray (1991) The Benchmark Handbook for Database
and Transaction Processing Systems
Dina Bitton, David J. DeWitt, Carolyn Turbyfill (1993)
Benchmarking Database Systems: A Systematic Approa
Multiple TPCTC papers, e.g.:
Karl Huppler (2009) The Art of Building a Good Benchma
Stimulating Technical
Progress
 An aspect of ‘Relevant’
 The benchmark metric
◦ depends on,
◦ or, rewards:
solving certain
technical challenges
“Choke Point”
(not commonly solved by technology at
benchmark design time)
Benchmark Design with Choke Points
Choke-Point = well-chosen difficulty in the
workload
 “difficulties in the workloads”
◦ arise from Data (distribs)+Query+Workload
◦ there may be different technical solutions to
address the choke point
 or, there may not yet exist optimizations (but
should not be NP hard to do so)
 the impact of the choke point may differ among
systems
Benchmark Design with Choke
Points
Choke-Point = well-chosen difficulty in the
workload
 “difficulties in the workloads”
 “well-chosen”
◦ the majority of actual systems do not
handle the choke point very well
◦ the choke point occurs or is likely to occur
in actual or near-future workloads
Example: TPC-H choke points
 Even though it was designed without
specific choke point analysis
 TPC-H contained a lot of interesting
challenges
◦ many more than Star Schema Benchmark
◦ considerably more than Xmark (XML DB
benchmark)
◦ not sure about TPC-DS (yet)
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
TPC-H choke point areas
(1/3)
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
TPC-H choke point areas
(2/3)
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
TPC-H choke point areas
(3/3)
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
CP1.4 Dependent GroupBy
KeysSELECT c_custkey, c_name, c_acctbal,
sum(l_extendedprice * (1 - l_discount)) as revenue,
n_name, c_address, c_phone, c_comment
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey and l_orderkey = o_orderkey
and o_orderdate >= date '[DATE]'
and o_orderdate < date '[DATE]' + interval '3' month
and l_returnflag = 'R‘ and c_nationkey = n_nationkey
GROUP BY
c_custkey, c_name, c_acctbal, c_phone, n_name,
c_address, c_comment
ORDER BY revenue DESC
Q10
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
CP1.4 Dependent GroupBy
KeysSELECT c_custkey, c_name, c_acctbal,
sum(l_extendedprice * (1 - l_discount)) as revenue,
n_name, c_address, c_phone, c_comment
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey and l_orderkey = o_orderkey
and o_orderdate >= date '[DATE]'
and o_orderdate < date '[DATE]' + interval '3' month
and l_returnflag = 'R‘ and c_nationkey = n_nationkey
GROUP BY
c_custkey, c_name, c_acctbal, c_phone,
c_address, c_comment, n_name
ORDER BY revenue DESC
Q10
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
CP1.4 Dependent GroupBy
Keys
 Functional dependencies:
c_custkey  c_name, c_acctbal, c_phone,
c_address, c_comment, c_nationkey  n_name
 Group-by hash table should exclude
the colored attrs  less CPU+ mem
footprint
 in TPC-H, one can choose to declare
primary and foreign keys (all or nothing)
◦ this optimization requires declared keys
◦ Key checking slows down RF
(insert/delete)
Exasol:
“foreign key check” phase after
load
CP2.2 Sparse Joins
 Foreign key (N:1) joins towards a
relation with a selection condition
◦ Most tuples will *not* find a match
◦ Probing (index, hash) is the most
expensive activity in TPC-H
 Can we do better?
◦ Bloom filters!
TPCTC 2013:
www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
CP2.2 Sparse Joins
 Foreign key (N:1) joins towards a
relation with a selection condition
2G cycles 29M probes  cost would have been 14G cycles ~= 7 sec
1.5G cycles 200M probes  85% eliminated
probed: 200M tuples
result: 8M tuples
 1:25 join hit ratio
Q21
Vectorwise:
TPC-H joins typically accelerate
4x
Queries accelerate 2x
CP5.2 Subquery Rewrite
SELECT sum(l_extendedprice) / 7.0 as avg_yearly
FROM lineitem, part
WHERE p_partkey = l_partkey
and p_brand = '[BRAND]'
and p_container = '[CONTAINER]'
and l_quantity <( SELECT 0.2 * avg(l_quantity)
FROM lineitem
WHERE l_partkey = p_partkey)
This subquery can be extended with restrictions
from the outer query.
SELECT 0.2 * avg(l_quantity)
FROM lineitem
WHERE l_partkey = p_partkey
and p_brand = '[BRAND]'
and p_container = '[CONTAINER]'
+ CP5.3 Overlap between Outer- and Subquery.
Q17
Hyper:
CP5.1+CP5.2+CP5.3
results in 500x faster
Q17
Choke Points
 Hidden challenges in a benchmark
influence database system design, e.g. TPC-H
 Functional Dependency Analysis in aggregation
 Bloom Filters for sparse joins
 Subquery predicate propagation
 LDBC explicitly designs benchmarks
looking at choke-point “coverage”
◦ requires access to database kernel
architects
Roadmap for the Keynote
Choke-point based benchmark design
 What are Choke-points?
◦ examples from good-old TPC-H
 Graph benchmark Choke-Point, in-
depth:
◦ Structural Correlation in Graphs
◦ and what we do about it in LDBC
 Wrap up
Data correlations between attributes
SELECT personID from person
WHERE firstName = AND addressCountry = ‘Germany’‘Joachim’
SELECT personID from person
WHERE firstName = AND addressCountry = ‘Italy’‘Cesare’
 Query optimizers may underestimate or overestimate the result size
of conjunctive predicates
Anti-Correlation
Loew PrandelliJoachim CesareCesare Joachim
SELECT COUNT(*)
FROM paper pa1 JOIN conferences cn1 ON pa1.journal = jn1.ID
paper pa2 JOIN conferences cn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
cn1.name = ‘VLDB’ AND cn2.name =
Data correlations between attributes
‘SIGMOD’
SELECT COUNT(*)
FROM paper pa1 JOIN conferences cn1 ON pa1.journal = cn1.ID
paper pa2 JOIN conferences cn2 ON pa2.journal = cn2.ID
WHERE pa1.author = pa2.author AND
cn1.name = ‘VLDB’ AND cn2.name =
Data correlations over joins
‘Nature’‘SIGMOD’
 A challenge to the optimizers to adjust estimated join hit ratio
pa1.author = pa2.author
depending on other predicates
Correlated predicates are still a frontier area in database
research
LDBC Social Network Benchmark (SNB)
User
User
User
User
Photo
InRelationShip
User
“Yamaku
”
“EPFL”
“Switzerland”
like
 What makes graphs interesting are the connectivity patterns
• who is connected to who?
 structure typically depends on the (values) attributes of nodes
 Structural Correlation ( choke point)
• amount of common friends
• shortest path between two persons
search complexity in a social network varies wildly between
• two random persons
• e.g. colleagues at the same company
 No existing graph benchmark specifically tests for the effects of
correlations
 Synthetic graphs used for benchmarking do not have structural
correlations
Handling Correlation: a choke point for Graph D
Need a data generator generating synthetic
graph with data/structure correlations
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 How do data generators generate values? E.g. FirstName
Generating Correlated Property Values
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 How do data generators generate values? E.g. FirstName
 Value Dictionary D()
• a fixed set of values, e.g.,
{“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”, .. }
 Probability density function F()
• steers how the generator chooses values
 cumulative distribution over dictionary entries determines which value to pick
• could be anything: uniform, binomial, geometric, etc…
 geometric (discrete exponential) seems to explain many natural phenomena
Generating Property Values
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 How do data generators generate values? E.g. FirstName
 Value Dictionary D()
 Probability density function F()
 Ranking Function R()
• Gives each value a unique rank between one and |D|
determines which value gets which probability
• Depends on some parameters (parameterized function)
 value frequency distribution becomes correlated by the parameters or R()
Generating Correlated Property Values
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 How do data generators generate values? E.g. FirstName
 Value Dictionary D()
{“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”,“Leon”,“Orr
 Probability density function F()
geometric distribution
 Ranking Function R(gender,country,birthyear)
• gender, country, birthyear  correlation parameters
Generating Correlated Property Values
How to implement R()?
We need a table storing
|Gender| X |Country| X |BirthYear| X |D|
Solution:
- Just store the rank of the top-N values, not all|D|
- Assign the rank of the other dictionary values
randomly
limited #combinations
Potentially
Many! 
Compact Correlated Property Value Generation
Using geometric distribution for function
F()
 Main source of dictionary values from DBpedia (http://dbpedia.org)
 Various realistic property value correlations ()
e.g.,
(person.location,person.gender,person.birthDay)  person.firstName
person.location  person.lastName
person.location  person.university
person.createdDate  person.photoAlbum.createdDate
….
Correlated Value Property in LDBC SNB
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
Correlated Edge Generation
P4
P5
Student
“Anna”
“University of
Leipzig”
“Germany”
“1990”
P1
“University
of Leipzig”
“Laura”
“1990”
<Britney
Spears>
<Britney
Spears>
P3
“University
of Leipzig”
“1990”
P2
“University of
Amsterdam”
“Netherlands”
Correlated Edge Generation
P4
P5
Student
“Anna”
“University of
Leipzig”
“Germany”
“1990”
P1
“University
of Leipzig”
“Laura”
“1990”
<Britney
Spears>
<Britney
Spears>
P3
“University
of Leipzig”
“1990”
P2
“University of
Amsterdam”
“Netherlands”
Correlated Edge Generation
P4
P5
Student
“Anna”
“University of
Leipzig”
“Germany”
“1990”
P1
“University
of Leipzig”
“Laura”
“1990”
<Britney
Spears>
<Britney
Spears>
P3
“University
of Leipzig”
“1990”
P2
“University of
Amsterdam”
“Netherlands”
Simple approach
P
4
P
5
Student
“Anna”
“University of
Leipzig”
“Germany
”
“1990”
P
1
“University
of Leipzig”
“Laura
”
“1990
”
<Britney
Spears>
<Britney
Spears>
P
3
“Universit
y of
Leipzig” “1990
”
P
2
“University of
Amsterdam”
“Netherland
s”
Danger: this is very expensive to compute on a large graph!
(quadratic, random access)
• Compute similarity of two nodes
based on their (correlated)
properties.
• Use a probability density function
wrt to this similarity for connecting
nodes
connection
probability
highly similar  less similar
Our observation
P
4
P
5
Student
“Anna”
“University of
Leipzig”
“Germany
”
“1990”
P
1
“University
of Leipzig”
“Laura
”
“1990
”
<Britney
Spears>
<Britney
Spears>
P
3
“Universit
y of
Leipzig” “1990
”
P
2
“University of
Amsterdam”
“Netherland
s”
Probability that two nodes are connected is skewed w.r.t the
similarity between the nodes (due to probability distr.)
connection
probability
highly similar  less similar
Window
Trick: disregard nodes with too large similarity distanc
(only connect nodes in a similarity window)
Correlation Dimensions
 Similar metric
Sort nodes on similarity (similar nodes are brought near each other)
 Probability function
Pick edge between two nodes based on their ranked distance
(e.g. geometric distribution, again)
Similarity metric +
Probability function
P1
London
P5
London
P3
Eton
P2
Eton
P4
Cambridge
<Ranking along the “Having study together” dimension>
we use space filling curves (e.g. Z-order) to get a linear dimension
 Sort nodes using MapReduce on similarity metric
 Reduce function keeps a window of nodes to generate edges
• Keep low memory usage (sliding window approach)
 Slide the window for multiple passes, each pass corresponds to one cor
dimension (multiple MapReduce jobs)
• for each node we choose degree per pass (also using a prob. function
steers how many edges are picked in the window for that node
Generate edges along correlation dimensions
W
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 Having studied together
 Having common interests (hobbies)
 Random dimension
• motivation: not all friendships are explainable (…)
(of course, these two correlation dimensions are still a gross simplification o
but this provides some interesting material for benchmark queries)
Correlation Dimensions in LDBC SNB
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 Social graph characteristics
• Output graph has similar characteristics as observed in real social
network (i.e., “small-world network” characteristics)
- Power-law social degree distribution
- Low average path-length
- High clustering coefficient
 Scalability
• Generates up to 1.2 TB of data (1.2 million users) in half an hour
- Runs on a cluster of 16 nodes
(part of the SciLens cluster, www.scilens.org)
• Scales out linearly
Evaluation (… see the TPCTC 2012 paper)
TPCTC 2012:
www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
 correlation between values (“properties”) and connection pattern in graphs
affects many real-world data management tasks
use as a choke point in the Social Network Benchmark
 generating huge correlated graphs is hard!
MapReduce algorithm that approximates correlation probabilities with
windowed-approach
See: for more info
• https://github.com/ldbc
• SNB task-force wiki http://www.ldbc.eu:8090/display/TUC
Summary
Roadmap for the Keynote
Choke-point based benchmark design
 What are Choke-points?
◦ examples from good-old TPC-H
 Graph Choke-Point In depth
◦ Structural Correlation in Graphs
◦ And what we do about it in LDBC
 Wrap up
LDBC Benchmark Status
 Social Network Benchmark
◦ Interactive Workload
 Lookup queries + updates
 Navigation between friends and posts
Graph DB, RDF DB, Relational DB
◦ Business Intelligence Workload
 Heavy Joins, Group-By + navigation!
 Graph DB, RDF DB, Relational DB
◦ Graph Analytics
 Graph Diameter, Graph Clustering, etc.
 Graph Programming Frageworks, Graph DB (RDF
DB?, Relational DB?)
LDBC Benchmark Status
 Social Network Benchmark
 Semantic Publishing Benchmark
◦ BBC use case (BBC data + queries)
 Continuous updates
 Aggregation queries
 Light-weight RDF reasoning
LDBC Next Steps
 Benchmark Interim Reports
◦ November 2013
◦ SNB and Semantic Publishing
 Meet LDBC @ GraphConnect
◦ 3rd Techical User Community (TUC)
meeting
◦ London, November 19, 2013
Conclusion
 LDBC: a new graph/RDF
benchmarking initiative
◦ EU initatiated, Industry supported
◦ benchmarks under development (SNB,
SPB)
 more to follow
 Choke-point based benchmark
development
◦ Graph Correlation
LDBC
thank you very much.
Questions?

Weitere ähnliche Inhalte

Ähnlich wie Keynote IDEAS 2013 - Peter Boncz

CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docxCHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
cravennichole326
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
Automation of MultiDimensional DB Design (poster)
Automation of MultiDimensional DB Design (poster)Automation of MultiDimensional DB Design (poster)
Automation of MultiDimensional DB Design (poster)
Rim Moussa
 

Ähnlich wie Keynote IDEAS 2013 - Peter Boncz (20)

CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docxCHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
CHAPTER 10 SystemArchitectureChapter 10 is the final chapter.docx
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development
 
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingAgile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
 
Project Controls Expo - 31st Oct 2012 - Accurate Management Reports on 1me, e...
Project Controls Expo - 31st Oct 2012 - Accurate Management Reports on 1me, e...Project Controls Expo - 31st Oct 2012 - Accurate Management Reports on 1me, e...
Project Controls Expo - 31st Oct 2012 - Accurate Management Reports on 1me, e...
 
Capella Days 2021 | An example of model-centric engineering environment with ...
Capella Days 2021 | An example of model-centric engineering environment with ...Capella Days 2021 | An example of model-centric engineering environment with ...
Capella Days 2021 | An example of model-centric engineering environment with ...
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
System X - About Benchmarks
System X - About BenchmarksSystem X - About Benchmarks
System X - About Benchmarks
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
 
Se notes
Se notesSe notes
Se notes
 
Automation of MultiDimensional DB Design (poster)
Automation of MultiDimensional DB Design (poster)Automation of MultiDimensional DB Design (poster)
Automation of MultiDimensional DB Design (poster)
 
COCOMO MODEL
COCOMO MODELCOCOMO MODEL
COCOMO MODEL
 

Mehr von LDBC council

Mehr von LDBC council (20)

8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
 
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
 
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
 
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
 
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
 
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
 
Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Gr...
Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Gr...Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Gr...
Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Gr...
 
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
 
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
 
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
 
8th TUC Meeting -
8th TUC Meeting - 8th TUC Meeting -
8th TUC Meeting -
 
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
 
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
 
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
LDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusionsLDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusions
 
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOXParallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
 
MODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service SelectionMODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service Selection
 
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
 
LDBC SNB Benchmark Auditing
LDBC SNB Benchmark AuditingLDBC SNB Benchmark Auditing
LDBC SNB Benchmark Auditing
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Keynote IDEAS 2013 - Peter Boncz

  • 1. LDBC Industry-strength benchmarks for Graph and RDF Data Management Peter Boncz
  • 3. What is the LDBC? Linked Data Benchmark Council = LDBC  Industry entity similar to TPC (www.tpc.org)  Focusing on graph and RDF store benchmarking Kick-started by an EU project  Runs from September 2012 – March 2015  9 project partners:  Will continue independently after the EU project
  • 4. LDBC Benchmark Design Developed by so-called “task forces”  Requirements analysis and use case selection. ◦ Technical User Community (TUC)  Benchmark specification. ◦ data generator ◦ query workload ◦ metrics ◦ reporting format  Benchmark implementation. ◦ tools (query drivers, data generation, validation) ◦ test evaluations  Auditing ◦ auditing guide ◦ auditor training
  • 5. LDBC: what systems? Benchmarks for:  RDF stores (SPARQL speaking) ◦ Virtuoso, OWLIM, BigData, Allegrograph,…  Graph Database systems ◦ Neo4j, DEX, InfiniteGraph, …  Graph Programming Frameworks ◦ Giraph, Green Marl, Grappa, GraphLab,…  Relational Database systems
  • 6. LDBC: functionality Benchmarks for:  Transactional updates in (RDF) graphs  Business Intelligence queries over graphs  Graph Analytics (e.g. graph clustering)  Complex RDF workload, e.g. including reasoning, or for data integration Anything relevant for RDF and graph data management systems
  • 7. Roadmap for the Keynote Choke-point based benchmark design  What are Choke-points? ◦ examples from good-old TPC-H ◦  relational database benchmarking  A Graph benchmark Choke-Point, in- depth: ◦ Structural Correlation in Graphs ◦ and what we do about it in LDBC
  • 8. Database Benchmark Design Desirable properties:  Relevant.  Representative.  Understandable.  Economical.  Accepted.  Scalable.  Portable.  Fair.  Evolvable.  Public. Jim Gray (1991) The Benchmark Handbook for Database and Transaction Processing Systems Dina Bitton, David J. DeWitt, Carolyn Turbyfill (1993) Benchmarking Database Systems: A Systematic Approa Multiple TPCTC papers, e.g.: Karl Huppler (2009) The Art of Building a Good Benchma
  • 9. Stimulating Technical Progress  An aspect of ‘Relevant’  The benchmark metric ◦ depends on, ◦ or, rewards: solving certain technical challenges “Choke Point” (not commonly solved by technology at benchmark design time)
  • 10. Benchmark Design with Choke Points Choke-Point = well-chosen difficulty in the workload  “difficulties in the workloads” ◦ arise from Data (distribs)+Query+Workload ◦ there may be different technical solutions to address the choke point  or, there may not yet exist optimizations (but should not be NP hard to do so)  the impact of the choke point may differ among systems
  • 11. Benchmark Design with Choke Points Choke-Point = well-chosen difficulty in the workload  “difficulties in the workloads”  “well-chosen” ◦ the majority of actual systems do not handle the choke point very well ◦ the choke point occurs or is likely to occur in actual or near-future workloads
  • 12. Example: TPC-H choke points  Even though it was designed without specific choke point analysis  TPC-H contained a lot of interesting challenges ◦ many more than Star Schema Benchmark ◦ considerably more than Xmark (XML DB benchmark) ◦ not sure about TPC-DS (yet) TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 13. TPC-H choke point areas (1/3) TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 14. TPC-H choke point areas (2/3) TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 15. TPC-H choke point areas (3/3) TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 16. CP1.4 Dependent GroupBy KeysSELECT c_custkey, c_name, c_acctbal, sum(l_extendedprice * (1 - l_discount)) as revenue, n_name, c_address, c_phone, c_comment FROM customer, orders, lineitem, nation WHERE c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month and l_returnflag = 'R‘ and c_nationkey = n_nationkey GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment ORDER BY revenue DESC Q10 TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 17. CP1.4 Dependent GroupBy KeysSELECT c_custkey, c_name, c_acctbal, sum(l_extendedprice * (1 - l_discount)) as revenue, n_name, c_address, c_phone, c_comment FROM customer, orders, lineitem, nation WHERE c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month and l_returnflag = 'R‘ and c_nationkey = n_nationkey GROUP BY c_custkey, c_name, c_acctbal, c_phone, c_address, c_comment, n_name ORDER BY revenue DESC Q10 TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 18. TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf CP1.4 Dependent GroupBy Keys  Functional dependencies: c_custkey  c_name, c_acctbal, c_phone, c_address, c_comment, c_nationkey  n_name  Group-by hash table should exclude the colored attrs  less CPU+ mem footprint  in TPC-H, one can choose to declare primary and foreign keys (all or nothing) ◦ this optimization requires declared keys ◦ Key checking slows down RF (insert/delete) Exasol: “foreign key check” phase after load
  • 19. CP2.2 Sparse Joins  Foreign key (N:1) joins towards a relation with a selection condition ◦ Most tuples will *not* find a match ◦ Probing (index, hash) is the most expensive activity in TPC-H  Can we do better? ◦ Bloom filters! TPCTC 2013: www.cwi.nl/~boncz/tpctc2013_boncz_neumann_erling.pdf
  • 20. CP2.2 Sparse Joins  Foreign key (N:1) joins towards a relation with a selection condition 2G cycles 29M probes  cost would have been 14G cycles ~= 7 sec 1.5G cycles 200M probes  85% eliminated probed: 200M tuples result: 8M tuples  1:25 join hit ratio Q21 Vectorwise: TPC-H joins typically accelerate 4x Queries accelerate 2x
  • 21. CP5.2 Subquery Rewrite SELECT sum(l_extendedprice) / 7.0 as avg_yearly FROM lineitem, part WHERE p_partkey = l_partkey and p_brand = '[BRAND]' and p_container = '[CONTAINER]' and l_quantity <( SELECT 0.2 * avg(l_quantity) FROM lineitem WHERE l_partkey = p_partkey) This subquery can be extended with restrictions from the outer query. SELECT 0.2 * avg(l_quantity) FROM lineitem WHERE l_partkey = p_partkey and p_brand = '[BRAND]' and p_container = '[CONTAINER]' + CP5.3 Overlap between Outer- and Subquery. Q17 Hyper: CP5.1+CP5.2+CP5.3 results in 500x faster Q17
  • 22. Choke Points  Hidden challenges in a benchmark influence database system design, e.g. TPC-H  Functional Dependency Analysis in aggregation  Bloom Filters for sparse joins  Subquery predicate propagation  LDBC explicitly designs benchmarks looking at choke-point “coverage” ◦ requires access to database kernel architects
  • 23. Roadmap for the Keynote Choke-point based benchmark design  What are Choke-points? ◦ examples from good-old TPC-H  Graph benchmark Choke-Point, in- depth: ◦ Structural Correlation in Graphs ◦ and what we do about it in LDBC  Wrap up
  • 24. Data correlations between attributes SELECT personID from person WHERE firstName = AND addressCountry = ‘Germany’‘Joachim’ SELECT personID from person WHERE firstName = AND addressCountry = ‘Italy’‘Cesare’  Query optimizers may underestimate or overestimate the result size of conjunctive predicates Anti-Correlation Loew PrandelliJoachim CesareCesare Joachim
  • 25. SELECT COUNT(*) FROM paper pa1 JOIN conferences cn1 ON pa1.journal = jn1.ID paper pa2 JOIN conferences cn2 ON pa2.journal = jn2.ID WHERE pa1.author = pa2.author AND cn1.name = ‘VLDB’ AND cn2.name = Data correlations between attributes ‘SIGMOD’
  • 26. SELECT COUNT(*) FROM paper pa1 JOIN conferences cn1 ON pa1.journal = cn1.ID paper pa2 JOIN conferences cn2 ON pa2.journal = cn2.ID WHERE pa1.author = pa2.author AND cn1.name = ‘VLDB’ AND cn2.name = Data correlations over joins ‘Nature’‘SIGMOD’  A challenge to the optimizers to adjust estimated join hit ratio pa1.author = pa2.author depending on other predicates Correlated predicates are still a frontier area in database research
  • 27. LDBC Social Network Benchmark (SNB) User User User User Photo InRelationShip User “Yamaku ” “EPFL” “Switzerland” like
  • 28.  What makes graphs interesting are the connectivity patterns • who is connected to who?  structure typically depends on the (values) attributes of nodes  Structural Correlation ( choke point) • amount of common friends • shortest path between two persons search complexity in a social network varies wildly between • two random persons • e.g. colleagues at the same company  No existing graph benchmark specifically tests for the effects of correlations  Synthetic graphs used for benchmarking do not have structural correlations Handling Correlation: a choke point for Graph D Need a data generator generating synthetic graph with data/structure correlations TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 29.  How do data generators generate values? E.g. FirstName Generating Correlated Property Values TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 30.  How do data generators generate values? E.g. FirstName  Value Dictionary D() • a fixed set of values, e.g., {“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”, .. }  Probability density function F() • steers how the generator chooses values  cumulative distribution over dictionary entries determines which value to pick • could be anything: uniform, binomial, geometric, etc…  geometric (discrete exponential) seems to explain many natural phenomena Generating Property Values TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 31.  How do data generators generate values? E.g. FirstName  Value Dictionary D()  Probability density function F()  Ranking Function R() • Gives each value a unique rank between one and |D| determines which value gets which probability • Depends on some parameters (parameterized function)  value frequency distribution becomes correlated by the parameters or R() Generating Correlated Property Values TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 32. TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf  How do data generators generate values? E.g. FirstName  Value Dictionary D() {“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”,“Leon”,“Orr  Probability density function F() geometric distribution  Ranking Function R(gender,country,birthyear) • gender, country, birthyear  correlation parameters Generating Correlated Property Values How to implement R()? We need a table storing |Gender| X |Country| X |BirthYear| X |D| Solution: - Just store the rank of the top-N values, not all|D| - Assign the rank of the other dictionary values randomly limited #combinations Potentially Many! 
  • 33. Compact Correlated Property Value Generation Using geometric distribution for function F()
  • 34.  Main source of dictionary values from DBpedia (http://dbpedia.org)  Various realistic property value correlations () e.g., (person.location,person.gender,person.birthDay)  person.firstName person.location  person.lastName person.location  person.university person.createdDate  person.photoAlbum.createdDate …. Correlated Value Property in LDBC SNB TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 35. Correlated Edge Generation P4 P5 Student “Anna” “University of Leipzig” “Germany” “1990” P1 “University of Leipzig” “Laura” “1990” <Britney Spears> <Britney Spears> P3 “University of Leipzig” “1990” P2 “University of Amsterdam” “Netherlands”
  • 36. Correlated Edge Generation P4 P5 Student “Anna” “University of Leipzig” “Germany” “1990” P1 “University of Leipzig” “Laura” “1990” <Britney Spears> <Britney Spears> P3 “University of Leipzig” “1990” P2 “University of Amsterdam” “Netherlands”
  • 37. Correlated Edge Generation P4 P5 Student “Anna” “University of Leipzig” “Germany” “1990” P1 “University of Leipzig” “Laura” “1990” <Britney Spears> <Britney Spears> P3 “University of Leipzig” “1990” P2 “University of Amsterdam” “Netherlands”
  • 38. Simple approach P 4 P 5 Student “Anna” “University of Leipzig” “Germany ” “1990” P 1 “University of Leipzig” “Laura ” “1990 ” <Britney Spears> <Britney Spears> P 3 “Universit y of Leipzig” “1990 ” P 2 “University of Amsterdam” “Netherland s” Danger: this is very expensive to compute on a large graph! (quadratic, random access) • Compute similarity of two nodes based on their (correlated) properties. • Use a probability density function wrt to this similarity for connecting nodes connection probability highly similar  less similar
  • 39. Our observation P 4 P 5 Student “Anna” “University of Leipzig” “Germany ” “1990” P 1 “University of Leipzig” “Laura ” “1990 ” <Britney Spears> <Britney Spears> P 3 “Universit y of Leipzig” “1990 ” P 2 “University of Amsterdam” “Netherland s” Probability that two nodes are connected is skewed w.r.t the similarity between the nodes (due to probability distr.) connection probability highly similar  less similar Window Trick: disregard nodes with too large similarity distanc (only connect nodes in a similarity window)
  • 40. Correlation Dimensions  Similar metric Sort nodes on similarity (similar nodes are brought near each other)  Probability function Pick edge between two nodes based on their ranked distance (e.g. geometric distribution, again) Similarity metric + Probability function P1 London P5 London P3 Eton P2 Eton P4 Cambridge <Ranking along the “Having study together” dimension> we use space filling curves (e.g. Z-order) to get a linear dimension
  • 41.  Sort nodes using MapReduce on similarity metric  Reduce function keeps a window of nodes to generate edges • Keep low memory usage (sliding window approach)  Slide the window for multiple passes, each pass corresponds to one cor dimension (multiple MapReduce jobs) • for each node we choose degree per pass (also using a prob. function steers how many edges are picked in the window for that node Generate edges along correlation dimensions W TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 42.  Having studied together  Having common interests (hobbies)  Random dimension • motivation: not all friendships are explainable (…) (of course, these two correlation dimensions are still a gross simplification o but this provides some interesting material for benchmark queries) Correlation Dimensions in LDBC SNB TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 43.  Social graph characteristics • Output graph has similar characteristics as observed in real social network (i.e., “small-world network” characteristics) - Power-law social degree distribution - Low average path-length - High clustering coefficient  Scalability • Generates up to 1.2 TB of data (1.2 million users) in half an hour - Runs on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org) • Scales out linearly Evaluation (… see the TPCTC 2012 paper) TPCTC 2012: www.cwi.nl/~boncz/tpctc2012_pham_boncz_erling.pdf
  • 44.  correlation between values (“properties”) and connection pattern in graphs affects many real-world data management tasks use as a choke point in the Social Network Benchmark  generating huge correlated graphs is hard! MapReduce algorithm that approximates correlation probabilities with windowed-approach See: for more info • https://github.com/ldbc • SNB task-force wiki http://www.ldbc.eu:8090/display/TUC Summary
  • 45. Roadmap for the Keynote Choke-point based benchmark design  What are Choke-points? ◦ examples from good-old TPC-H  Graph Choke-Point In depth ◦ Structural Correlation in Graphs ◦ And what we do about it in LDBC  Wrap up
  • 46. LDBC Benchmark Status  Social Network Benchmark ◦ Interactive Workload  Lookup queries + updates  Navigation between friends and posts Graph DB, RDF DB, Relational DB ◦ Business Intelligence Workload  Heavy Joins, Group-By + navigation!  Graph DB, RDF DB, Relational DB ◦ Graph Analytics  Graph Diameter, Graph Clustering, etc.  Graph Programming Frageworks, Graph DB (RDF DB?, Relational DB?)
  • 47. LDBC Benchmark Status  Social Network Benchmark  Semantic Publishing Benchmark ◦ BBC use case (BBC data + queries)  Continuous updates  Aggregation queries  Light-weight RDF reasoning
  • 48. LDBC Next Steps  Benchmark Interim Reports ◦ November 2013 ◦ SNB and Semantic Publishing  Meet LDBC @ GraphConnect ◦ 3rd Techical User Community (TUC) meeting ◦ London, November 19, 2013
  • 49. Conclusion  LDBC: a new graph/RDF benchmarking initiative ◦ EU initatiated, Industry supported ◦ benchmarks under development (SNB, SPB)  more to follow  Choke-point based benchmark development ◦ Graph Correlation
  • 50. LDBC thank you very much. Questions?

Hinweis der Redaktion

  1. Sources for generating edges correlated with property values
  2. Sources for generating edges correlated with property values
  3. Sources for generating edges correlated with property values
  4. Sources for generating edges correlated with property values
  5. Sources for generating edges correlated with property values
  6. Note: The similarity metric must be mapped to a linear metric