This session explores building graph databases on AWS, examining common use cases, design patterns, and best practices. We then discuss the main options for running graph databases on AWS and go deeper into the Amazon DynamoDB storage backend plugin for Titan launched earlier this year. The Amazon Fulfillment team will share their story of running the Titan graph database on DynamoDB to track inventory going in and out of the company's fulfillment network. They provide best practices on running an efficient graph database at massive scale.
2. What to Expect from the Session
• Who are we?
• General overview of graph database technology
• AWS architecture examples
• Amazon Fulfillment technology’s “Inventory Notification Graph”
• Amazon DynamoDB Storage Backend for Titan
4. What is a graph? What is a graph database?
• A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties. A tree is a special case of a graph.
• A graph database uses a property graph as the data model and includes a query language.
• Other possible data models are hypergraphs, triple stores, RDF.
5. Graph data modeling
• NoSQL data models – Document, Key-Value, Columnar,
Graph, Mixed
• CAP and ACID
• Start with the use case, then develop the data model:
• As a Student, I want to know other Students in my Class who
know about a Subject
• Student KNOWS Subject, Student BELONGS_TO Class
[Diagram: Student -KNOWS-> Subject; Student -BELONGS_TO-> Class]
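The use case above can be sketched as a traversal over a tiny in-memory property graph. This is only an illustration in plain Python, not Titan or Gremlin code; the `PropertyGraph` class, the `classmates_who_know` helper, and the student and class names are all hypothetical.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory graph with labelled, directed edges (illustrative only)."""

    def __init__(self):
        self.out_edges = defaultdict(set)  # (source, label) -> targets
        self.in_edges = defaultdict(set)   # (target, label) -> sources

    def add_edge(self, source, label, target):
        self.out_edges[(source, label)].add(target)
        self.in_edges[(target, label)].add(source)

    def out(self, vertex, label):
        return self.out_edges.get((vertex, label), set())

    def in_(self, vertex, label):
        return self.in_edges.get((vertex, label), set())


def classmates_who_know(graph, student, subject):
    """Other students in the same class as `student` who KNOW `subject`."""
    result = set()
    for class_ in graph.out(student, "BELONGS_TO"):
        for other in graph.in_(class_, "BELONGS_TO"):
            if other != student and subject in graph.out(other, "KNOWS"):
                result.add(other)
    return result


g = PropertyGraph()
g.add_edge("alice", "BELONGS_TO", "class-101")
g.add_edge("bob", "BELONGS_TO", "class-101")
g.add_edge("carol", "BELONGS_TO", "class-101")
g.add_edge("dave", "BELONGS_TO", "class-202")
g.add_edge("alice", "KNOWS", "graphs")
g.add_edge("bob", "KNOWS", "graphs")
g.add_edge("dave", "KNOWS", "graphs")

print(sorted(classmates_who_know(g, "alice", "graphs")))  # ['bob']
```

The query starts from one known vertex and follows edges outward, which is why the model is derived from the use case: the edge labels are exactly the verbs in the user story.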
6. Graph vs. relational database
Graph
• Need to traverse a graph without JOINs
• Queries have a starting location (MATCH ON x)
• Normalized attribute to enable filtering
• Dynamic schema
Relational
• Columnar analytics
• Tables denormalized for performance
• Cluster and fault management
• Recursive query support in the query optimizer
7. Titan: distributed graph database
• Distributed graph
• Storage layer has plug-in architecture
• Native TinkerPop implementation
• Full text search with Lucene, SOLR, Elasticsearch
• HA using multi-master replication (Cassandra cluster)
• Scalability using DynamoDB
8. Neo4j
• Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable in the JVM
• HA when distributed, uses Paxos for master election
• Attempts to load DB into RAM, larger is better. Efficient spilling to disk.
• Primary query language is Cypher, supports Gremlin
9. AWS deployment for Neo4j
[Diagram: Availability Zones behind a Write ELB and a Read ELB]
ELB health checks (HTTP GET):
/db/manage/server/ha/master
/db/manage/server/ha/slave
/db/manage/server/ha/available
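A minimal sketch of how the two load balancers could use these health checks to separate writes from reads. It simulates the checks in plain Python rather than calling AWS or Neo4j. The ELB labels, node names, and the `endpoint_status` helper are hypothetical, and the 200/404 behavior of the endpoints is an assumption about Neo4j HA rather than something stated on the slide.

```python
# Health-check path per load balancer; the ELB labels are hypothetical.
HEALTH_CHECK_PATH = {
    "write-elb": "/db/manage/server/ha/master",
    "read-elb": "/db/manage/server/ha/slave",
}


def endpoint_status(node_role, path):
    """Simulated HTTP status for a health-check path.

    Assumption: a node answers 200 when it holds the role named by the
    last path segment and 404 otherwise.
    """
    asked_role = path.rsplit("/", 1)[-1]
    return 200 if node_role == asked_role else 404


def in_service(elb, cluster):
    """Nodes an ELB would keep in service, given each node's current role."""
    path = HEALTH_CHECK_PATH[elb]
    return [name for name, role in cluster.items()
            if endpoint_status(role, path) == 200]


cluster = {"neo4j-a": "master", "neo4j-b": "slave", "neo4j-c": "slave"}
print(in_service("write-elb", cluster))  # ['neo4j-a']
print(in_service("read-elb", cluster))   # ['neo4j-b', 'neo4j-c']
```

Under this assumption, a master election changes which node passes the write ELB's check, so clients keep using the same two endpoints after a failover.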
10. Analytics on graphs
• OLAP not OLTP
• Leverages the Hadoop / MapReduce framework
• GraphX is graph analytics on Spark, in memory; functional-like, “declarative” programming model
• Giraph is graph processing on MapReduce / HDFS; procedural, vertex-centric programming model
• Aggregation type queries over the entire graph
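To make "vertex-centric" concrete, here is a small sketch of the idea in plain Python: connected components computed by label propagation in synchronous supersteps. It does not use Giraph or GraphX; the function name and the sample graph are hypothetical, and a real framework would distribute the vertices across workers.

```python
def connected_components(adjacency):
    """Label each vertex with the smallest vertex id in its component.

    In every superstep each active vertex sends its label to its neighbours;
    a vertex adopts the smallest label received and stays active only if
    its label changed. The loop ends when no vertex is active.
    """
    labels = {vertex: vertex for vertex in adjacency}
    active = set(adjacency)
    supersteps = 0
    while active:
        # Message-passing phase: collect messages per receiving vertex.
        inbox = {}
        for vertex in active:
            for neighbour in adjacency[vertex]:
                inbox.setdefault(neighbour, []).append(labels[vertex])
        # Compute phase: each vertex looks only at its own messages.
        active = set()
        for vertex, messages in inbox.items():
            smallest = min(messages)
            if smallest < labels[vertex]:
                labels[vertex] = smallest
                active.add(vertex)
        supersteps += 1
    return labels, supersteps


# Undirected graph given as symmetric adjacency lists: {1, 2, 3} and {4, 5}.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, supersteps = connected_components(adjacency)
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

The computation touches every vertex, which is what makes this an OLAP-style job rather than a traversal from a single starting point.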
11. TinkerPop
• Apache Incubator graph framework supporting both
OLAP and OLTP.
• Gremlin, a query language for graph traversals.
Supports analysis, modification, and queries.
• Gremlin Structured API, a generic connector framework
or API. Interface to a backend graph engine.
12. Graph DB use cases
• Social
• Recommendation
• Classic network problems
• Deep hierarchies
• Sensor analysis with geo-spatial constraints
• Fraud detection
• Identity and Access Management
13. Recommendation engine example
[Diagram: Neo4j cluster and EMR, with writes and reads; events "Buy" and "like item"]
• “People who bought this item also bought”
• Custom email: “Something you recently looked at has changed”
16. Approaches
Manual Research
• All tools emit events
• Humans trace the events
• Difficult to follow as search space increases
• Developed queries, but took too long to run
Unique Identifiers
• Every item gets a unique identifier
• Easy to get all related events
• Expensive
• Impractical for some items
18. Why not use a relational or NoSQL database?
• Relational Database
• Knew data volume would be huge and keep growing
• Did not want to vertically scale
• JOINs on table will be expensive
• Use case required high availability
• NoSQL Store
• Would be the same solution without all the functionality built
into the TinkerPop Graph Framework
19. Why a graph?
• No way to index just the events we need
• Need to perform search from receive to stow and vice
versa; i.e., requires many hops to find the data
• Need to process messages out of order
• Graphs provide a simple mental model
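The two requirements above (searching in both directions and tolerating out-of-order messages) can be sketched with a small event graph in plain Python. This is not the team's implementation; the `EventGraph` class and the event identifiers are hypothetical.

```python
from collections import defaultdict, deque


class EventGraph:
    """Events as vertices, 'earlier led to later' links as directed edges."""

    def __init__(self):
        self.next_events = defaultdict(set)
        self.prev_events = defaultdict(set)

    def record_link(self, earlier, later):
        """Store one link; the order in which links arrive does not matter."""
        self.next_events[earlier].add(later)
        self.prev_events[later].add(earlier)

    def reachable(self, start, forward=True):
        """Breadth-first search over any number of hops, in either direction."""
        neighbours = self.next_events if forward else self.prev_events
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in neighbours.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(start)
        return seen


graph = EventGraph()
# Messages arrive out of order: the later hop is recorded first.
graph.record_link("move-7", "stow-9")
graph.record_link("receive-1", "move-7")

print(sorted(graph.reachable("receive-1")))               # ['move-7', 'stow-9']
print(sorted(graph.reachable("stow-9", forward=False)))   # ['move-7', 'receive-1']
```

Because each message only adds an edge, the graph converges to the same shape regardless of arrival order, and the same structure answers both "where did this receive end up?" and "where did this stow come from?".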
22. Cassandra: Titan lessons learned
• No one on our team had experience managing or
configuring a Cassandra cluster
• Needed to manage a cluster
• Team manually replaces hosts as EC2 swaps them out
• Does not handle time series data well
• We ran two producers against two keyspaces so we
could efficiently drop old data
23. DynamoDB: Titan
• Massively scalable
• No more tuning and host management
• Team was already familiar with DynamoDB
• Risky because there was no existing Titan implementation on DynamoDB
25. DynamoDB: single-item data model
Hash Key (hk) | Attribute              | Attribute                   | Attribute                   | Attribute                  | Attribute
Vertex id 1   | Property – Name Justin | Edge (out) – Friend: Anna   | Edge (out) – Friend: Kris   | Edge (out) – Likes: Movies | Hidden Property – Exists
Vertex id 2   | Property – Name Anna   | Edge (out) – Friend: Justin | Edge (out) – Likes: Books   | Hidden Property – Exists   |
Vertex id 3   | Property – Name Kris   | Edge (out) – Friend: Justin | Edge (out) – Likes: Movies  | Hidden Property – Exists   |
Vertex id 4   | Property – Name Movies | Edge (out) – Friend: Justin | Edge (out) – Likes: Kris    | Hidden Property – Exists   |
Vertex id 5   | Property – Name Books  | Edge (out) – Friend: Anna   | Hidden Property – Exists    |                            |
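As a rough illustration of the single-item layout, the sketch below builds one dictionary per vertex, with every property and edge stored as an attribute of that item. The attribute naming scheme (`property:…`, `edge-out:…`, `hidden:exists`) and the helper names are hypothetical; the actual backend uses its own encoding, and a plain `dict` stands in for the DynamoDB table.

```python
def to_single_item(vertex_id, properties, edges):
    """Pack a vertex, its properties, and its out-edges into one item."""
    item = {"hk": vertex_id}
    for name, value in properties.items():
        item[f"property:{name}"] = value
    for label, target in edges:
        item[f"edge-out:{label}:{target}"] = True
    item["hidden:exists"] = True  # marker attribute, as in the table above
    return item


def edge_targets(table, vertex_id, label):
    """Read one item by hash key and pick out the edges with a given label."""
    prefix = f"edge-out:{label}:"
    item = table[vertex_id]
    return sorted(key[len(prefix):] for key in item if key.startswith(prefix))


table = {}  # stand-in for a table keyed by hash key only
item = to_single_item(
    1,
    {"Name": "Justin"},
    [("Friend", "Anna"), ("Friend", "Kris"), ("Likes", "Movies")],
)
table[item["hk"]] = item

print(edge_targets(table, 1, "Friend"))  # ['Anna', 'Kris']
```

Everything about a vertex comes back in a single read, at the cost of the whole vertex having to fit in one item.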
26. DynamoDB: multiple-item data model
Hash Key (hk) | Range Key (rk) | Value (v)
Vertex id 1   | Range key      |
Vertex id 1   | Property id    | Property – Name Justin
Vertex id 1   | Edge id        | Edge (out) – Friend Anna
Vertex id 1   | Edge id        | Edge (out) – Friend Kris
Vertex id 2   | Range key      |
Vertex id 2   | Property id    | Property – Name Anna
Vertex id 2   | Edge id        | Edge (out) – Friend Justin
Vertex id 2   | Edge id        | Edge (out) – Friend Brooks
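The multiple-item layout can be sketched the same way: one row per property or edge, addressed by hash key plus range key. The `MultiItemTable` class and the range-key naming are hypothetical stand-ins for a DynamoDB table and the backend's real encoding.

```python
class MultiItemTable:
    """Stand-in for a table with a hash key (hk) and a range key (rk)."""

    def __init__(self):
        self.items = {}  # (hk, rk) -> value

    def put(self, hk, rk, value):
        self.items[(hk, rk)] = value

    def query(self, hk, rk_prefix=""):
        """All rows for one hash key, optionally narrowed by range-key prefix."""
        return sorted(
            (rk, value)
            for (item_hk, rk), value in self.items.items()
            if item_hk == hk and rk.startswith(rk_prefix)
        )


def put_vertex(table, vertex_id, properties, edges):
    """Write one row per property and one row per out-edge."""
    for name, value in properties.items():
        table.put(vertex_id, f"property:{name}", value)
    for label, target in edges:
        table.put(vertex_id, f"edge-out:{label}:{target}", target)


table = MultiItemTable()
put_vertex(table, 1, {"Name": "Justin"}, [("Friend", "Anna"), ("Friend", "Kris")])
put_vertex(table, 2, {"Name": "Anna"}, [("Friend", "Justin")])

print(table.query(1, "edge-out:Friend:"))
# [('edge-out:Friend:Anna', 'Anna'), ('edge-out:Friend:Kris', 'Kris')]
```

One reason to prefer this layout for highly connected vertices is that a DynamoDB item is limited to 400 KB, so a vertex with many edges may not fit in a single item, whereas rows under one hash key can keep growing.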
27. DynamoDB: how does it scale?
• Close to 100 billion vertices
• Terabytes of data
• Without corresponding increase in latency
28. DynamoDB: Titan lessons learned
• Use Titan explicit partitioning on large graphs
• Partition across multiple graphs for time series data
• Able to achieve stable performance at scale
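One way to read "partition across multiple graphs for time series data" is to route each event to a graph named after its time bucket, so that old data can be removed by dropping a whole graph. The sketch below shows only that routing logic; the 30-day bucket size, the `inventory-graph` prefix, and the function names are assumptions, not details from the talk.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_index(timestamp, bucket_days):
    """Number of whole buckets between the epoch and `timestamp`."""
    return (timestamp - EPOCH) // timedelta(days=bucket_days)


def graph_name_for(timestamp, bucket_days=30, prefix="inventory-graph"):
    """Name of the graph that should hold events from `timestamp`."""
    return f"{prefix}-{bucket_index(timestamp, bucket_days)}"


def expired_graphs(graph_names, now, keep_buckets=2, bucket_days=30):
    """Graphs older than the `keep_buckets` most recent buckets."""
    oldest_kept = bucket_index(now, bucket_days) - keep_buckets + 1
    return [name for name in graph_names
            if int(name.rsplit("-", 1)[1]) < oldest_kept]


now = datetime(2015, 10, 8, tzinfo=timezone.utc)
current = graph_name_for(now)
old = graph_name_for(now - timedelta(days=90))
print(expired_graphs([current, old], now) == [old])  # True
```

Dropping an entire graph avoids deleting old vertices and edges one at a time.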
29. How to get started
• GitHub Repository
• DynamoDB Local
• CloudFormation Template
30. Resources
• Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem
• Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson
• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler
• Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels
• Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr
• Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis
• Amazon DynamoDB Storage Backend for Titan FAQ
• Amazon DynamoDB Storage Backend for Titan Documentation