Challenges in the Design of a Graph Database Benchmark

© Prof. Dr.-Ing. Wolfgang Lehner
FOSDEM‘12 – Graph Processing DevRoom
Marcus Paradies

> Outline
 Motivation
 Challenges
 Thoughts on Graph Data Generation
 Thoughts on Query Workload
 Summary and Outlook
 Discussion
FOSDEM 2012

> Motivation
 Graph databases are gaining momentum
 Enterprise corporations are getting interested
 How to compare the available graph database vendors?
 Main issue: Results from benchmarks are not comparable
 Lack of standardization in the data model and query language
 What are "typical" graph operations?

> Challenges

> Challenge #1: Application Domain
 Graph data is not homogeneous
 Graph data from different domains follows different patterns
 Examples:
 Social Network Analysis (SNA)
 Protein Interaction Analysis
 Recommendation Systems
 Supply Chain Management (Vehicle Routing, CRM)
 Fraud Detection in Financial Systems
 …
Challenge: Find an application domain which represents a graph data pattern
common in many different scenarios.

> Challenge #2: Graph Data Model
What flavours of graph data models are commonly used?
 Directed Graph
 Undirected Graph
 Mixed Graph
 Multi Graph
 (Plain) Property Graph
 (Structured Property Graph)
 Hyper Graph
Challenge: Find a graph data model suited for the majority of use cases
from various domains.
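
To make the property graph flavour concrete, here is a minimal in-memory sketch in Python. It assumes a directed multigraph in which nodes and edges carry key/value properties. The names `PropertyGraph`, `Edge`, `add_node`, `add_edge` and `out_edges` are hypothetical and chosen for this illustration only; they do not correspond to any vendor API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """Directed multigraph whose nodes and edges carry key/value properties."""
    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = dict(properties)

    def add_edge(self, source, target, label, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, label, dict(properties))
        self.edges.append(edge)
        return edge

    def out_edges(self, node_id):
        # Linear scan; a real system would keep an adjacency index.
        return [e for e in self.edges if e.source == node_id]


g = PropertyGraph()
g.add_node("alice", kind="person", age=34)
g.add_node("bob", kind="person")
g.add_edge("alice", "bob", "knows", since=2010)
g.add_edge("alice", "bob", "works_with")  # parallel edge: multigraph
```

Parallel edges between the same pair of nodes are allowed here, so the sketch also covers the multi graph case. Undirected and hyper graphs would need a different edge representation.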

> Challenge #3: Querying Graph Data
 Large variety in graph processing and manipulation languages
 Each graph database vendor implements its own query language/API
 Reason: No standardized graph query language is available

Challenge: Find a way to abstract from the zoo of available query languages.

> Challenge #4: Defining the Workload
 The workload to be defined depends on the underlying
query/manipulation language
 Should complex (algorithmic) operations be part of a database benchmark?
 Which algorithms to pick?
 Social Network Analysis → Find communities
 Supply Chain Management → Find maximal flow
 Web of Data → Find pattern matches
 How are concurrent users represented?
 What about transactionality?

> Thoughts on Graph Data Generation

> Graph Data Generation - Patterns
 Understanding graph patterns (characteristics) is crucial for a good graph
data generator
 What are distinguishing characteristics of graphs?
 How can we identify graph patterns on large graphs?
 Three main patterns [1]:
 Power law distributed
 Small diameters
 Community Effects

> Pattern 1 – Power law distributed
 The degree distributions of most real-world graph data sets follow a power law
 Examples:
 Internet router graph
 Subsets of the WWW
 Citation Graphs
source: [2]
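
For a data generator, one well-known way to obtain a heavy-tailed, power-law-like degree distribution is preferential attachment, where new nodes prefer to link to nodes that already have many edges. The talk does not prescribe a particular generator, so the following Python sketch is only an illustration under that assumption. The function names and the parameters `n`, `m` and `seed` are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=0):
    """Return an undirected edge list with n nodes; each new node links to m existing nodes."""
    rng = random.Random(seed)
    edges = []
    # Start from a small clique of m + 1 nodes so every node has degree >= m.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a node with probability proportional to its degree.
    endpoints = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


def degree_histogram(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


edges = preferential_attachment(n=2000, m=2)
hist = degree_histogram(edges)
for k in sorted(hist)[:5]:
    print(k, hist[k])
```

Plotting the histogram on log-log axes is the usual visual check: a power law shows up as a roughly straight line, with many low-degree nodes and a few high-degree hubs.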

> Pattern 2 – Small Diameters
 Effective diameter (eccentricity): Minimum number of hops within which a
fraction (e.g. 90%) of all connected pairs of nodes can reach each other
 Other measures exist as well, but are not applicable to disconnected graphs
 In most use cases, diameter is much smaller than the size of the graph
 Examples:
 97% eccentricity of around 16 for path lengths in the WWW
 Average path length around 6 for Epinions social network
source: [1]
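
The effective diameter defined above can be computed directly on small graphs by running a breadth-first search from every node and taking the requested quantile of the hop counts over all connected pairs. The following Python sketch assumes an unweighted graph given as a dict mapping each node to a set of neighbors; the function names are hypothetical.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adj, fraction=0.9):
    """Smallest hop count h such that at least `fraction` of all connected
    ordered node pairs are within h hops of each other."""
    hop_counts = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                hop_counts.append(d)
    if not hop_counts:
        return 0
    hop_counts.sort()
    # Position of the pair that completes the requested fraction.
    idx = max(0, math.ceil(fraction * len(hop_counts)) - 1)
    return hop_counts[idx]


# Path a - b - c - d - e plus an isolated node f (unreachable pairs are ignored).
path = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": set(),
}
print(effective_diameter(path, 0.9))
```

Because only connected pairs are counted, the measure stays defined on disconnected graphs. The all-pairs search costs one BFS per node, so large graphs would need sampling or an approximation instead.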

> Pattern 3 – Community Effects
 Community: A set of nodes in which each node is closer to the other
nodes in the community than to nodes outside the community.
 Communities can be found in many real-world graphs, especially social
networks and collaboration networks
 Clustering Coefficient C: A measure that quantifies the "clumpiness" of a
graph
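
The slide does not spell out a formula for C. One common definition, assumed here, is the local clustering coefficient of a node with k neighbors: the number of edges among those neighbors divided by the k(k-1)/2 edges that could exist, averaged over all nodes. A minimal Python sketch for an undirected graph stored as a dict of neighbor sets (function names are hypothetical):

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Triangle a-b-c with a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(average_clustering(graph))
```

A generator that matches the degree distribution but produces a much lower average clustering coefficient than the target domain would miss the community effect described above.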

> Thoughts on Query Workload

> Query Workload - Operations
 Graph Manipulation Operations
 Add/Update/Remove Nodes from the Graph
 Add/Update/Remove Edges from the Graph
 Add/Update/Remove Edge attributes
 Add/Update/Remove Node attributes
 Graph Query Operations
 Retrieve selection of nodes from given filter expression
 Getting the neighbors of a set of nodes (possibly with edge filter constraints)
 Graph Traversals
 Based on basic query operations
 Exploration of neighborhood from a given set of start nodes
 Terminated by the number of steps and/or edge/node filter constraints
 Graph Analytical Operations
 Aggregation operations such as sum, avg, min, max
 Aggregations on node-level and on edge-level
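
One possible way to describe these operations independently of any query language is a small driver interface that each system under test implements, with traversals built on top of the basic neighbor lookup. The Python sketch below is an assumption about how that could look, not a proposed standard; `GraphStore`, `InMemoryStore` and `traverse` are hypothetical names.

```python
from abc import ABC, abstractmethod


class GraphStore(ABC):
    """Hypothetical driver interface a benchmark could ask each system to implement."""

    @abstractmethod
    def add_node(self, node_id, **attrs): ...

    @abstractmethod
    def add_edge(self, src, dst, **attrs): ...

    @abstractmethod
    def select_nodes(self, predicate): ...

    @abstractmethod
    def neighbors(self, node_id, edge_filter=None): ...


class InMemoryStore(GraphStore):
    """Reference adapter backed by plain dictionaries."""

    def __init__(self):
        self.nodes = {}
        self.out = {}

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs
        self.out.setdefault(node_id, [])

    def add_edge(self, src, dst, **attrs):
        self.out[src].append((dst, attrs))

    def select_nodes(self, predicate):
        return {n for n, a in self.nodes.items() if predicate(a)}

    def neighbors(self, node_id, edge_filter=None):
        return {dst for dst, a in self.out[node_id]
                if edge_filter is None or edge_filter(a)}


def traverse(store, start_nodes, max_steps, edge_filter=None):
    """Breadth-first exploration from start_nodes, bounded by max_steps."""
    visited = set(start_nodes)
    frontier = set(start_nodes)
    for _ in range(max_steps):
        frontier = {m for n in frontier
                    for m in store.neighbors(n, edge_filter)} - visited
        if not frontier:
            break
        visited |= frontier
    return visited


store = InMemoryStore()
for name, age in [("a", 30), ("b", 41), ("c", 25), ("d", 52)]:
    store.add_node(name, age=age)
store.add_edge("a", "b", weight=1)
store.add_edge("b", "c", weight=5)
store.add_edge("c", "d", weight=5)

print(traverse(store, {"a"}, max_steps=2))
print(store.select_nodes(lambda attrs: attrs["age"] > 40))
```

The traversal uses only `neighbors`, which mirrors the point that traversals are based on the basic query operations; aggregations could be layered on `select_nodes` in the same way.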

> Query Workload - Measures
 Closely related to benchmark capabilities
 Measures from relational benchmarks apply such as
 Average query response time
 Transactions per second (throughput)
 Additional measures for graph traversals
 Traversals per second
 What about distributed scenarios?
 What about concurrent users?
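
As an illustration of the first two measures, the following Python sketch times a list of operations and reports the average response time and the throughput. It is a single-client harness only and makes no attempt to model concurrent users or distributed setups, which the slide leaves open. The function name `run_workload` and the injectable `clock` parameter are assumptions made for this example.

```python
import time


def run_workload(operations, clock=time.perf_counter):
    """Run zero-argument callables sequentially and report basic measures."""
    latencies = []
    start = clock()
    for op in operations:
        t0 = clock()
        op()
        latencies.append(clock() - t0)
    elapsed = clock() - start
    count = len(latencies)
    return {
        "operations": count,
        "avg_response_time_s": sum(latencies) / count if count else 0.0,
        "throughput_ops_per_s": count / elapsed if elapsed > 0 else float("inf"),
    }


# Example: time 1000 dictionary lookups standing in for graph queries.
data = {i: i * i for i in range(1000)}
report = run_workload([lambda i=i: data[i] for i in range(1000)])
print(report["operations"], report["avg_response_time_s"])
```

Passing a fake `clock` makes the harness itself testable. A "traversals per second" figure follows the same pattern with traversal calls as the operations.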

> Summary and Outlook
 The graph data distribution is highly important for a graph database benchmark
 Application domains have very specific graph characteristics
 A graph database benchmark has to provide abstract and high-level graph
operation descriptions
 Feel free to contact me if you want to contribute:
marcus.paradies@gmail.com

> Discussion

> Theses
 A benchmark based on social network data is nice, but might not be that
representative for large enterprise applications
 Algorithms should NOT be part of a graph database benchmark
 Only support basic operations such as simple lookups and path traversals
 The underlying graph data model should be a simple property graph
 A graph database has to scale in terms of data size as well as number of
concurrent users
 ....

> References
[1] Graph Mining: Laws, Generators, and Algorithms (2006)
[2] http://konect.uni-koblenz.de/
[3] A Discussion on the Design of Graph Database Benchmarks (2010)
