Data Day Texas presentation on our decision to switch to a graph database at WellAware. It gives an overview of the major factors behind the decision, the challenges we’ve faced, and the lessons learned along the way, to assist anyone looking to take the plunge into the world of graph databases.
10. A Toy Example
http://coachesbythenumbers.com/sportsource-college-football-data-packages/
2005 College Football Data
● Team names & conferences
● Game record with dates and scores
● Interesting questions:
○ Records for all teams in conference X
○ Top 25 ranking using record + strength of opponents
○ Three team loop (A beat B beat C beat A)
● Source code: https://github.com/njeirath/titan-perf-tester
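The three team loop question can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the linked repository: the function name `three_team_loops` and the placeholder team names are assumptions, and games are assumed to be (winner, loser) pairs.

```python
def three_team_loops(games):
    """Return every A-beat-B-beat-C-beat-A loop exactly once.

    `games` is an iterable of (winner, loser) pairs. This input shape is an
    assumption made for the sketch, not the dataset's real format.
    """
    beat = {}
    for winner, loser in games:
        beat.setdefault(winner, set()).add(loser)

    loops = set()
    for a, a_beat in beat.items():
        for b in a_beat:
            for c in beat.get(b, ()):
                if a in beat.get(c, ()):
                    # Rotate so the alphabetically smallest team comes first;
                    # A->B->C, B->C->A and C->A->B then count as one loop.
                    cycle = (a, b, c)
                    i = cycle.index(min(cycle))
                    loops.add(cycle[i:] + cycle[:i])
    return sorted(loops)


# Placeholder teams, made up for illustration only.
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
print(three_team_loops(games))
```

Each loop is found three times (once per starting team), so the rotation step is what keeps the result free of duplicates.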
11. Toy Models
Graph
Vertex (label: team)
  name: Purdue
  conf: Big 10
Vertex (label: team)
  name: IU
  conf: Big 10
Edge (label: beat), from Purdue to IU
  date: 11/19/05
  score: 41-14
SQL
Table: Teams
  team_id
  conference
  name
Table: Beat
  winner
  loser
  win_score
  lose_score
13. Example: Get Big 10 Records
SQL
SELECT win_record.NAME,
       win_record.wins,
       Count(l)
FROM   (SELECT teams.team_id,
               teams.NAME AS NAME,
               Count(w)   AS wins
        FROM   teams
               JOIN beat AS w
                 ON teams.team_id = w.winner
        WHERE  conference = 'Big Ten Conference'
        GROUP  BY teams.NAME,
                  teams.team_id) AS win_record
       JOIN beat AS l
         ON team_id = l.loser
GROUP  BY win_record.NAME,
          win_record.wins
ORDER  BY win_record.wins DESC;
Gremlin
g.V().order().by(__.outE().count(), decr)
    .has('conference', 'Big Ten Conference')
    .as('team', 'wins', 'losses')
    .select('team', 'wins', 'losses')
        .by('name')
        .by(__.outE().count())
        .by(__.inE().count())
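Both queries boil down to counting outgoing "beat" edges (wins) and incoming ones (losses) for each team in a conference. A minimal Python sketch of that idea follows, assuming teams are given as a name-to-conference mapping and games as (winner, loser) pairs. The function name `conference_records` and the team "Other U" are made up for illustration.

```python
from collections import Counter


def conference_records(teams, games, conference):
    """Return (name, wins, losses) for each team in `conference`, most wins first.

    `teams` maps team name -> conference and `games` holds (winner, loser)
    pairs. Both shapes are assumptions made for this sketch.
    """
    wins = Counter(winner for winner, _ in games)    # out-degree per team
    losses = Counter(loser for _, loser in games)    # in-degree per team
    rows = [
        (name, wins[name], losses[name])
        for name, conf in teams.items()
        if conf == conference
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# "Other U" is a placeholder opponent, not a team from the dataset.
teams = {"Purdue": "Big 10", "IU": "Big 10", "Other U": "Elsewhere"}
games = [("Purdue", "IU"), ("Other U", "Purdue")]
print(conference_records(teams, games, "Big 10"))
```

A team with no wins or no losses still appears here with a count of 0, whereas the inner JOINs in the SQL version drop such teams from the result.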
14. Example: Top 25 Ranking
SQL
SELECT teams.name,
       ranks.rank
FROM   (SELECT beat.winner,
               Sum(rec.wins) AS rank
        FROM   (SELECT teams.team_id,
                       Count(w) AS wins
                FROM   teams
                       JOIN beat AS w
                         ON w.winner = teams.team_id
                GROUP  BY teams.team_id) AS rec
               JOIN beat
                 ON beat.loser = rec.team_id
        GROUP  BY beat.winner
        ORDER  BY rank DESC
        LIMIT  25) AS ranks
       JOIN teams
         ON teams.team_id = ranks.winner
ORDER  BY ranks.rank DESC;
Gremlin
g.V().order().by(__.out().out().count(), decr)
    .as('team', 'score', 'wins', 'losses')
    .select('team', 'score', 'wins', 'losses')
        .by('name')
        .by(__.out().out().count())
        .by(__.outE().count())
        .by(__.inE().count())
    .limit(25)
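The ranking score used in both queries is the number of two-step "beat" paths from a team, which equals the sum of the win counts of every opponent it beat. A minimal Python sketch of that scoring, assuming games are (winner, loser) pairs, follows. The function name `rank_teams`, the alphabetical tie-break, and the placeholder teams are assumptions for illustration.

```python
from collections import Counter


def rank_teams(games, limit=25):
    """Rank teams by the summed win counts of the opponents they beat.

    This matches counting out().out() paths in the graph. Ties are broken
    alphabetically, which is an arbitrary choice made for this sketch.
    """
    wins = Counter(winner for winner, _ in games)
    score = Counter()
    for winner, loser in games:
        # Each win is worth as many points as the beaten team has wins.
        score[winner] += wins[loser]
    teams = {team for game in games for team in game}
    ranked = sorted(teams, key=lambda team: (-score[team], team))
    return [(team, score[team]) for team in ranked[:limit]]


# Placeholder teams, made up for illustration only.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
print(rank_teams(games))
```

Beating a winless team adds nothing to the score, so a long record against weak opponents ranks below a shorter record against strong ones.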
15. /r/mildlyinteresting/
1. Texas
2. USC
3. Penn State
4. Ohio State
5. Virginia Tech
6. TCU
7. West Virginia
8. Louisiana State
9. Alabama
10. Oregon
11. Louisville
12. Georgia
13. UCLA
14. Miami (FL)
1. Texas
2. USC
3. Penn State
4. Virginia Tech
5. LSU
6. Ohio State
7. Georgia
8. TCU
9. West Virginia
10. Alabama
11. Boston College
12. Oklahoma
13. Florida
14. UCLA
http://www.collegefootballpoll.com/2005_archive_computer_rankings.html
2005 End of Season Computer Rankings
Our Query Results
16. Developer Opinion
● ORMs
○ Move to graph, lost Django ORM
○ ORM/OGM option at the time was Totorom
● Query Language
○ Gremlin seems more intuitive
17. Episode II: Migration
Essentially an ETL operation:
1. Export tables (table name --> vertex label, columns --> vertex properties)
2. Export FK/Join tables (FK/Join table name --> edge label)
Teams
team_id  conference  name
559      Big 10      Purdue
306      Big 10      Indiana
...
Beat
winner  loser  win_score  lose_score
559     306    41         14
...
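The two export steps can be sketched as plain functions that turn table rows into vertices and join-table rows into edges. This is a minimal illustration, assuming rows arrive as dictionaries. The function names and the vertex/edge dictionary layout are hypothetical and not tied to any Titan loading API.

```python
def table_to_vertices(vertex_label, rows):
    """Step 1: one vertex per row, with every column kept as a property."""
    return [{"label": vertex_label, "properties": dict(row)} for row in rows]


def join_table_to_edges(edge_label, rows, out_column, in_column):
    """Step 2: one edge per FK/join row; remaining columns become edge properties."""
    edges = []
    for row in rows:
        properties = {
            key: value
            for key, value in row.items()
            if key not in (out_column, in_column)
        }
        edges.append({
            "label": edge_label,
            "out": row[out_column],   # source vertex, identified by its table ID
            "in": row[in_column],     # target vertex, identified by its table ID
            "properties": properties,
        })
    return edges


# Rows taken from the example tables above.
teams = [
    {"team_id": 559, "conference": "Big 10", "name": "Purdue"},
    {"team_id": 306, "conference": "Big 10", "name": "Indiana"},
]
beat = [{"winner": 559, "loser": 306, "win_score": 41, "lose_score": 14}]

vertices = table_to_vertices("team", teams)
edges = join_table_to_edges("beat", beat, "winner", "loser")
print(vertices)
print(edges)
```

The edges still refer to vertices by their relational IDs at this point, so a real load would need to resolve those to graph vertices before writing.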
Challenges:
● Dealing with indices
● Migrating a production DB
18. Challenges with Index
Relational DB indices (row IDs) are local to each table, while graph IDs are global
student table
ID  Name   Teacher
1   Kyle   1
2   Stan   1
3   Kenny  1
...
teacher table
ID  Teacher
1   Garrison
...
Vertex (label: student)
  pg_id: 1
Vertex (label: teacher)
  pg_id: 1
Unique key is vertex label + pg_id
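The composite key can be sketched as a lookup from (vertex label, pg_id) to a global graph ID. This is a minimal in-memory illustration: the class name `VertexIndex` and the sequential graph IDs are assumptions, and a real graph database would assign its own IDs and maintain the index itself.

```python
class VertexIndex:
    """Maps (vertex label, pg_id) pairs to globally unique graph IDs."""

    def __init__(self):
        self._next_graph_id = 1
        self._by_key = {}

    def add(self, label, pg_id):
        key = (label, pg_id)
        if key in self._by_key:
            raise ValueError(f"duplicate vertex key: {key}")
        graph_id = self._next_graph_id
        self._next_graph_id += 1
        self._by_key[key] = graph_id
        return graph_id

    def lookup(self, label, pg_id):
        return self._by_key[(label, pg_id)]


index = VertexIndex()
student_id = index.add("student", 1)   # Kyle, row ID 1 in the student table
teacher_id = index.add("teacher", 1)   # Garrison, row ID 1 in the teacher table

# Same pg_id, different labels, so the two rows get different graph IDs.
print(student_id, teacher_id)
```

Looking up by pg_id alone would be ambiguous here, which is why the label has to be part of the key.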
19. Migrating a Production DB
Potentially large amounts of data - batch loading optimizations
Static
Time series
Step 1: Move static
Step 2: Reroute requests and data
Step 3: Move old TS
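One common form of batch loading is to write rows in fixed-size chunks and commit once per chunk rather than once per row. The sketch below shows only that chunking pattern. The names `batches` and `load_in_batches` and the `commit` callback are hypothetical, and this is not a description of the optimizations actually used in our migration.

```python
from itertools import islice


def batches(items, size):
    """Yield lists of at most `size` items until `items` is exhausted."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(rows, size, commit):
    """Call `commit` once per chunk and return the number of commits made."""
    commits = 0
    for chunk in batches(rows, size):
        commit(chunk)   # stand-in for writing one transaction to the graph
        commits += 1
    return commits


written = []
print(load_in_batches(range(5), 2, written.append))
print(written)
```

The same helper works for both the static data in step 1 and the old time series in step 3, since it only assumes an iterable of rows.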
20. Episode III: Operating Graph
Usual benefits of NoSQL
● Designed for scalability - built in sharding, redundancy, etc.
○ Ex: Titan pluggable with Cassandra/HBase
● Usually allows on the fly schema changes
○ Flexible migrations avoid DB downtime
The underlying DB technology still requires expertise, tuning, monitoring, etc.
21. Performance
If not considered early, OLTP performance can potentially be an issue
Consider Titan architecture:
Server
Titan JVM (Gremlin evaluated here)
Storage Backend
g.V().has('name', 'Purdue')   // index retrieval
    .out('beat')              // edge traversal
    .values('name')           // vertex property retrieval
22. Dealing with Performance
● Understand storage structures
● Understand Cassandra characteristics
○ Ex: Generally deletes are bad
● Talks on Titan+Cassandra tuning:
○ Ted Wilmes - Cassandra Summit 2015:
■ Slides: http://www.slideshare.net/twilmes/modeling-the-iot-with-titandb-and-cassandra
■ Video: https://vimeopro.com/user35188327/cassandra-summit-2015/video/143695770
○ Nakul Jeirath - Graph Day TX:
http://s3.thinkaurelius.com/docs/titan/1.0.0/data-model.html
23. Our Approach
Lots of real-time data, tiny bit of relatively static data
Some optimization, mostly caching of static data
Heavily optimized real-time
Static: code optimization + caching
Time series: model changes + code optimization
24. Maturity of Graph
● Query languages
○ SQL allows relatively easy switching between relational DB vendors
○ TinkerPop plays that role for graph but is not universally supported today
● Version upgrades
○ Currently on Titan 0.4.4
○ 0.4.4 --> 0.5.*: not storage compatible (requires ETL to upgrade)
○ 0.4.4 --> 1.*: not storage compatible, query code rewrite
25. Summary
● Development
○ Gremlin easier to work with than SQL (opinion)
○ Tools for SQL more mature and varied but graph is catching up
● Migration
○ Relational --> Graph generally requires ETL
● Operation
○ NoSQL benefits of distributed, scalable, schemaless DBs
○ Performance can be an issue if not considered early
○ Graph vendor/version coupling but will improve with maturity