Titan is an open source distributed graph database build on top of Cassandra that can power real-time applications with thousands of concurrent users over graphs with billions of edges. Graphs are a versatile data model for capturing and analyzing rich relational structures. Graphs are an increasingly popular way to represent data in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.
This presentation discusses Titan's data model, query language, and novel techniques in edge compression, data layout, and vertex-centric indices which facilitate the representation and processing of Big Graph Data across a Cassandra cluster. We demonstrate Titan's performance on a large scale benchmark evaluation using Twitter data.
Presented at the Cassandra 2012 Summit.
1. TITAN
BIG GRAPH DATA WITH CASSANDRA
#TITANDB #GRAPHDB #CASSANDRA12
Matthias Broecheler, CTO AURELIUS
August VIII, MMXII THINKAURELIUS.COM
2. Abstract
Titan is an open source distributed graph database build on top of
Cassandra that can power real-time applications with thousands of
concurrent users over graphs with billions of edges. Graphs are a versatile
data model for capturing and analyzing rich relational structures. Graphs
are an increasingly popular way to represent data in a wide range of
domains such as social networking, recommendation engines,
advertisement optimization, knowledge representation, health care,
education, and security.
This presentation discusses Titan's data model, query language, and novel
techniques in edge compression, data layout, and vertex-centric indices
which facilitate the representation and processing of Big Graph Data
across a Cassandra cluster. We demonstrate Titan's performance on a
large scale benchmark evaluation using Twitter data.
3. Titan Graph Database
ï§âŻ supports real time local traversals (OLTP)
ï§âŻ is highly scalable
ï§âŻ in the number of concurrent users
ï§âŻ in the size of the graph
ï§âŻ is open source under the Apache2 license
ï§âŻ builds on top of Apache Cassandra for
distribution and replication
17. name: âMuscle building for beginnersâ
name: Hercules
type: book
bought
bought
name: âHow to deal with Father issuesâ
name: Newton
type: book
bought
18. name: âMuscle building for beginnersâ
name: Hercules
type: book
bought
recommend
bought
name: âHow to deal with Father issuesâ
name: Newton
type: book
bought
Traversal
19. name: âDancing with the Starsâ
type: DVD
in-Cart
name: âMuscle building for beginnersâ
name: Hercules
type: book
bought
bought
name: âHow to deal with Father issuesâ
name: Newton
type: book
bought
viewed
name: âFriends forever braceletâ
type: Accessory
20. name: âDancing with the Starsâ
type: DVD
in-Cart
name: âMuscle building for beginnersâ
name: Hercules
type: book
bought
friends
bought
name: âHow to deal with Father issuesâ
name: Newton
type: book
bought
viewed
name: âFriends forever braceletâ
type: Accessory
21. name: âDancing with the Starsâ
type: DVD
in-Cart
name: âMuscle building for beginnersâ
name: Hercules
type: book
bought
time:24
friends
bought
time:22
name: âHow to deal with Father issuesâ
name: Newton
type: book
bought
time:20
viewed
name: âFriends forever braceletâ
type: Accessory
23. name: Neptune
name: Alcmene
type: god
type: god
X
brother
mother
name: Saturn
name: Jupiter
name: Hercules
type: titan
type: god
type: demigod
X
father
father
battled
brother
time:12
name: Pluto
name: Cerberus
type: god
type: monster
pet
Path
Finding
24. name: Neptune
name: Alcmene
type: god
type: god
X
brother
mother
name: Saturn
name: Jupiter
name: Hercules
type: titan
type: god
type: demigod
X
father
father
battled
brother
time:12
name: Pluto
name: Cerberus
type: god
type: monster
pet
Path
Finding
31. Titan Features
ï§âŻ numerous concurrent users
ï§âŻ real-time traversals (OLTP)
ï§âŻ high availability
ï§âŻ dynamic scalability
ï§âŻ built on Apache Cassandra
32. Titan Ecosystem
ï§âŻ Native Blueprints Graph
Server
Implementation
Graph
Algorithms
ï§âŻ Gremlin Query Object-Graph
Mapper
Language
Traversal
Language
ï§âŻ Rexster Server
DataïŹow
Processing
ï§âŻ any Titan graph can be exposed Generic
Graph API
as a REST endpoint
37. Titan Storage Model
ï§âŻ Adjacency list in one 5
column family
ï§âŻ Row key = vertex id
ï§âŻ Each property and edge
5
in one column
ï§âŻ Denormalized, i.e. stored twice
ï§âŻ Direction and label/key as column preïŹx
ï§âŻ Use slice predicate for quick retrieval
38. Connecting Titan
titan$ bin/gremlin.sh!
,,,/!
(o o)!
-----oOOo-(_)-oOOo-----!
gremlin> conf = new BaseConfiguration();!
==>org.apache.commons.configuration.BaseConfiguration@763861e6!
gremlin> conf.setProperty("storage.backend","cassandra");!
gremlin> conf.setProperty("storage.hostname","77.77.77.77");!
gremlin> g = TitanFactory.open(conf);
==>titangraph[cassandra:77.77.77.77]!
gremlin>!
40. Defining Property Keys
Each type has a unique name
gremlin> g.makeType().name(âtimeâ).! The allowed data type
! ! dataType(Long.class).!
! ! functional().! If a key is functional, each vertex can
! ! makePropertyKey();! have at most one property for this key
gremlin> g.makeType().name(âtextâ).dataType(String.class).!
! ! functional().makePropertyKey();!
gremlin> g.makeType().name(ânameâ).dataType(String.class).!
! ! indexed().!
! ! unique().!
! ! functional().makePropertyKey();!
41. Defining Property Keys
gremlin> g.makeType().name(âtimeâ).!
! ! dataType(Long.class).!
! ! functional().!
! ! makePropertyKey();!
gremlin> g.makeType().name(âtextâ).dataType(String.class).!
! ! functional().makePropertyKey();!
gremlin> g.makeType().name(ânameâ).dataType(String.class).!
! ! indexed().! Creates and maintains an index over property values
! ! unique().!
Ensures that each property value is uniquely
! ! functional().makePropertyKey();! associated with only one vertex by acquiring a lock.
42. Titan Indexing
ï§âŻ Vertices can be retrieved by
property key + value
name : Hercules
5
ï§âŻ Titan maintains index in a
name : Jupiter
9
separate column family as
graph is updated
ï§âŻ Only need to deïŹne a
property key as .index()
43. Titan Locking
ï§âŻ Locking ensures consistency
when it is needed
name : Hercules
5
ï§âŻ Titan uses time stamped
quorum reads and writes on 9
separate CFs for locking
ï§âŻ Uses
name :
name :
Jupiter
Hercules
ï§âŻ Property uniqueness: .unique()
father
ï§âŻ Functional edges: .functional()
ï§âŻ Global ID management
x
name :
father
Pluto
46. Defining Edge Labels
gremlin> g.makeType().name(âfollowsâ).!
! ! primaryKey(time).!
! ! makeEdgeLabel();!
gremlin> g.makeType().name(âtweetsâ).!
! ! primaryKey(time).makeEdgeLabel();!
gremlin> g.makeType().name(âstream).!
! ! primaryKey(time).!
! ! unidirected().!
Store edges of this label only in outgoing direction
! ! makeEdgeLabel();!
47. Vertex-Centric Indices
ï§âŻ Sort and index edges per
vertex by primary key
ï§âŻ Primary key can be composite
ï§âŻ Enables efïŹcient focused
traversals
ï§âŻ Only retrieve edges that matter
ï§âŻ Uses slice predicate for quick,
index-driven retrieval
59. Twitter Benchmark
ï§âŻ 1.47 billion followship edges
and 41.7 million users
ï§âŻ Loaded into Titan using BatchGraph
ï§âŻ Twitter in 2009, crawled by Kwak et. al
ï§âŻ 4 Transaction Types
ï§âŻ Create Account (1%)
ï§âŻ Publish tweet (15%)
ï§âŻ Read stream (76%)
ï§âŻ Recommendation (8%)
Kwak, H., Lee, C., Park, H., Moon, S., âWhat is
ï§âŻ Follow recommended user (30%)
Twitter, a Social Network or a News Media?,â
World Wide Web Conference, 2010.
60. Benchmark Setup
ï§âŻ 6 cc1.4xl Cassandra nodes
ï§âŻ in one placement group
ï§âŻ Cassandra 1.10
ï§âŻ 40 m1.small worker machines
ï§âŻ repeatedly running transactions
ï§âŻ simulating servers handling user
requests
ï§âŻ EC2 cost: $11/hour
61. Benchmark Results
Transaction Type
Number of tx
Mean tx time
Std of tx time
Create account
379,019
115.15 ms
5.88 ms
Publish tweet
7,580,995
18.45 ms
6.34 ms
Read stream
37,936,184
6.29 ms
1.62 ms
Recommendation
3,793,863
67.65 ms
13.89 ms
Total
49,690,061
Runtime
2.3 hours
5,900 tx/sec
62. Peak Load Results
Transaction Type
Number of tx
Mean tx time
Std of tx time
Create account
374,860
172.74 ms
10.52 ms
Publish tweet
7,517,667
70.07 ms
19.43 ms
Read stream
37,618,648
24.40 ms
3.18 ms
Recommendation
3,758,266
229.83 ms
29.08 ms
Total
49,269,441
Runtime
1.3 hours
10,200 tx/sec
63. Benchmark Conclusion
Titan  can  handle  10s  of  thousands  of  concurrent  users Â
with  short  response  5mes  even  for  complex  traversals Â
on  a  simulated  social  networking  applica5on  based  on Â
real-Ââworld  network  data  with  billions  of  edges  and Â
millions  of  users  in  a  standard  EC2  deployment. Â
For  more  informa5on  on  the  benchmark: Â
hDp://thinkaurelius.com/2012/08/06/5tan-Ââprovides-Ââreal-Ââ5me-Ââbig-Ââgraph-Ââdata/ Â
64. Future Titan
ï§âŻ Titan+Cassandra embedding
ï§âŻ sending Gremlin queries into
the cluster
ï§âŻ Graph partitioning together
with ByteOrderedPartitioner
ï§âŻ data locality = better performance
ï§âŻ Let us know what you need!
65. Titan goes OLAP
Map/Reduce
Load & Compress
Analysis results
back into Titan
Stores a massive-scale Batch processing of large Runs global graph algorithms
property graph allowing real- graphs with Hadoop
on large, compressed,
time traversals and updates
in-memory graphs