The six degrees problem is a classic party game and a typical use case for the power and efficiency of graph databases. But even with a powerful graph database, a complex data ecosystem like IMDB (the Internet Movie Database) can return a dizzying amount of data within six degrees of separation from the source. How do you draw business value from such large sets of highly connected data? In this session, we will discuss strategies for using a distributed graph database to analyze highly connected, complex data sets and derive business value from them through navigational queries and visualization.
2. What are we talking about today?
Not that Bacon. This Bacon!
• Intro to the Six Degrees Problem
• What is a Graph Database?
• Why Bacon in Graph Database?
• How we solved the problem
Images Courtesy of IMDB (www.imdb.com)
3. Six Degrees of Bacon
“…any individual involved in the Hollywood, California film industry can be linked through his or her film roles to actor Kevin Bacon within six steps”
[http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon]
Gina Menza
A Tale of Two Kevins
4. Why Six Degrees of Bacon?
Actor               Age   # of Projects
Kevin Bacon         54    76
Harrison Ford       70    70
Tom Cruise          50    40
Julia Roberts       45    50
Tom Hanks           56    73
Denzel Washington   58    53
Michael Caine       80    157
Kiefer Sutherland   46    82
Kevin Bacon
5. Bacon Numbers in Google
In the summer of 2012, Google started to allow users to find the Bacon number of any actor simply by following his or her name with “bacon number”.
[Figure: a www.google.com result and its graphical representation. Morgan Freeman appeared in The Dark Knight Rises; Gary Oldman appeared in The Dark Knight Rises; Gary Oldman appeared in Criminal Law; Kevin Bacon appeared in Criminal Law.]
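The Morgan Freeman to Kevin Bacon chain shown above can be found with a breadth-first search over a graph in which actors and titles are both vertices. The following is a minimal sketch in plain Java, not Google's or InfiniteGraph's implementation; the class `BaconPath`, its method names and the tiny in-memory data set are assumptions made for illustration.

```java
import java.util.*;

public class BaconPath {
    // Undirected adjacency list; actors and titles are both plain string vertices.
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Record that an actor appeared in a title (stored in both directions).
    public void addAppearance(String actor, String title) {
        adjacency.computeIfAbsent(actor, k -> new LinkedHashSet<>()).add(title);
        adjacency.computeIfAbsent(title, k -> new LinkedHashSet<>()).add(actor);
    }

    // Breadth-first search; returns the shortest alternating actor/title path,
    // or an empty list if the two vertices are not connected.
    public List<String> shortestPath(String from, String to) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(from, null); // the start vertex has no parent
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (current.equals(to)) {
                LinkedList<String> path = new LinkedList<>();
                for (String v = to; v != null; v = parent.get(v)) {
                    path.addFirst(v);
                }
                return path;
            }
            for (String next : adjacency.getOrDefault(current, Collections.emptySet())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    // Each actor-to-actor step crosses two edges (actor -> title -> actor),
    // so the Bacon number is half the number of edges on the path.
    public int baconNumber(String actor) {
        List<String> path = shortestPath(actor, "Kevin Bacon");
        return path.isEmpty() ? -1 : (path.size() - 1) / 2;
    }

    // Hypothetical sample data mirroring the figure.
    public static BaconPath sample() {
        BaconPath g = new BaconPath();
        g.addAppearance("Morgan Freeman", "The Dark Knight Rises");
        g.addAppearance("Gary Oldman", "The Dark Knight Rises");
        g.addAppearance("Gary Oldman", "Criminal Law");
        g.addAppearance("Kevin Bacon", "Criminal Law");
        return g;
    }

    public static void main(String[] args) {
        BaconPath g = sample();
        System.out.println(g.shortestPath("Morgan Freeman", "Kevin Bacon"));
        System.out.println(g.baconNumber("Morgan Freeman"));
    }
}
```

Breadth-first order guarantees that the first time the target is reached, the path is a shortest one, which is what a Bacon number requires.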
7. The Physical Data Model
• Difference between relational & graph databases

Rows/Columns/Tables:

Meetings
P1      P2       Place    Time
Alice   Bob      Denver   5-27-10

Calls
From    To       Time     Duration
Bob     Carlos   13:20    25
Bob     Charlie  17:10    15

Payments
From    To       Date     Amount
Carlos  Charlie  5-12-10  100000

Relationship/Graph Optimized:

Alice  -[Met 5-27-10]-  Bob
Bob    -[Called 13:20]- Carlos
Bob    -[Called 17:10]- Charlie
Carlos -[Paid 100000]-  Charlie
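To make the contrast concrete, here is a hedged sketch that holds the calls from the slide both ways: as flat rows that are filtered by a key column, and as an adjacency map that is followed from a vertex. `ModelComparison`, `CallRecord` and both lookup methods are illustrative names, not part of any database API, and the unindexed scan is a deliberate simplification of what a relational engine does.

```java
import java.util.*;

public class ModelComparison {
    // Relational style: one flat row per call, as in the Calls table.
    public static final class CallRecord {
        final String from, to, time;
        final int duration;
        CallRecord(String from, String to, String time, int duration) {
            this.from = from;
            this.to = to;
            this.time = time;
            this.duration = duration;
        }
    }

    static final List<CallRecord> CALLS = Arrays.asList(
        new CallRecord("Bob", "Carlos", "13:20", 25),
        new CallRecord("Bob", "Charlie", "17:10", 15));

    // Relational lookup (no index assumed): scan every row and compare the key column.
    public static List<String> calledByScan(String caller) {
        List<String> result = new ArrayList<>();
        for (CallRecord row : CALLS) {
            if (row.from.equals(caller)) {
                result.add(row.to);
            }
        }
        return result;
    }

    // Graph style: each vertex keeps direct references to its neighbors.
    static final Map<String, List<String>> CALLED = new HashMap<>();
    static {
        for (CallRecord row : CALLS) {
            CALLED.computeIfAbsent(row.from, k -> new ArrayList<>()).add(row.to);
        }
    }

    // Graph lookup: follow the edges stored with the vertex, no table scan.
    public static List<String> calledByTraversal(String caller) {
        return CALLED.getOrDefault(caller, Collections.emptyList());
    }

    public static void main(String[] args) {
        System.out.println(calledByScan("Bob"));
        System.out.println(calledByTraversal("Bob"));
    }
}
```

Both lookups return the same people; the difference is that the second one costs the same no matter how many unrelated calls the data set holds.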
9. Who is Gina Menza?
• How do we get meaning from highly connected data?
Gina Menza
Jury Forewoman Miss Jeffries
10. Strength of Connections Matters!
• Why 6 degrees of separation and not 3.74?
• We need analysis tools in order to
– identify and filter out “unimportant” data and
– infer what needs to be filtered as we investigate it.
“When considering another person in the world, a friend of your friend knows a friend of their friend”
- Facebook
12. Graph Analysis
• Why use Graph Databases for graph analysis?
– Dynamic on Live Data
– Feedback/Inference
– Optimized for concurrent user access
– Handles big data problems
– Native Graph Traversal API
– Manage memory efficiently
13. Paths to Bacon
Using the IMDB (www.imdb.com) data set, we can study how many paths can be found by degrees of separation from Kevin Bacon. Out of 5,067,124 nodes and 11,505,797 edges, we get the following:

Bacon Number (Degree of Separation / 2)   # of People
1                                         2823
2                                         323677
3                                         1088560
4                                         272905
5                                         22533
6                                         2300

[Chart: # of People by Bacon Number (1 to 6), y-axis from 0 to 1,200,000]
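A distribution like this can be produced by one breadth-first sweep outward from Kevin Bacon. Because people and titles are both vertices, a person reached at edge depth d has Bacon number d / 2. The sketch below is an illustration under those assumptions; `BaconHistogram`, `countByBaconNumber`, the `isPerson` predicate and the five-edge sample graph are hypothetical and far smaller than the IMDB data set.

```java
import java.util.*;
import java.util.function.Predicate;

public class BaconHistogram {
    // Returns a map of Bacon number -> number of people, up to maxBaconNumber.
    public static SortedMap<Integer, Integer> countByBaconNumber(
            Map<String, List<String>> adjacency, String start,
            Predicate<String> isPerson, int maxBaconNumber) {
        SortedMap<Integer, Integer> counts = new TreeMap<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            int d = depth.get(current);
            // People sit at even depths, so d / 2 is exact for them.
            if (d > 0 && isPerson.test(current)) {
                counts.merge(d / 2, 1, Integer::sum);
            }
            if (d >= 2 * maxBaconNumber) {
                continue; // do not expand beyond the requested Bacon number
            }
            for (String next : adjacency.getOrDefault(current, Collections.emptyList())) {
                if (!depth.containsKey(next)) {
                    depth.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        return counts;
    }

    // Adds an undirected person/title edge.
    static void link(Map<String, List<String>> g, String person, String title) {
        g.computeIfAbsent(person, k -> new ArrayList<>()).add(title);
        g.computeIfAbsent(title, k -> new ArrayList<>()).add(person);
    }

    // Hypothetical sample: two titles and four people.
    public static SortedMap<Integer, Integer> sample() {
        Map<String, List<String>> g = new HashMap<>();
        link(g, "Kevin Bacon", "Criminal Law");
        link(g, "Gary Oldman", "Criminal Law");
        link(g, "Gary Oldman", "The Dark Knight Rises");
        link(g, "Morgan Freeman", "The Dark Knight Rises");
        link(g, "Christian Bale", "The Dark Knight Rises");
        Set<String> titles = new HashSet<>(
            Arrays.asList("Criminal Law", "The Dark Knight Rises"));
        return countByBaconNumber(g, "Kevin Bacon", v -> !titles.contains(v), 6);
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```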
14. Big Data + Graph = Big Graph Data
4 Degrees of Kevin Bacon
(Breadth First up to 20K connections)
Images generated using the IG Visualizer
15. Analyzing Bacon
• To be able to perform meaningful analysis, these are things that you will need:
– Ingest IMDB Dataset – About 50 formatted compressed files (largest > 200 MB)
– Custom algorithm support to perform meaningful analysis
  • Optimize queries to get results back in reasonable time
– Visualization tool to test and view the results of the navigation (optional)
16. How IG Sizzles Your Bacon
Ingest
Update
Navigate
Massive graph data requires efficient and intelligent tools to analyze and understand it.
17. Super Simple Java API
// Create the actor and movie vertices and add them to the graph
Actor bacon = new Actor("Kevin Bacon");
imdbGraphDB.addVertex(bacon);
Movie apollo = new Movie("Apollo 13", 1995);
imdbGraphDB.addVertex(apollo);
// Connect them with an ActedIn edge that carries the character name
ActedIn bacon2apollo = new ActedIn("Jack Swigert");
imdbGraphDB.addEdge(bacon2apollo, bacon, apollo,
    EdgeKind.BIDIRECTIONAL, 1 /* weight */);
Ingest
18. Scaling Writes
• Big/Fast data demands write performance
• Most NoSQL solutions allow you to scale writes by…
– Partitioning the data
– Understanding your consistency requirements
– Allowing you to defer conflicts
Ingest
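The first of these, partitioning the data, can be sketched as hash-routing each write to one of N partitions so that writers do not contend on a single store. This is a generic illustration, not how any particular NoSQL product or InfiniteGraph does it; `PartitionedWriter` and its in-memory lists standing in for partitions are assumptions.

```java
import java.util.*;

public class PartitionedWriter {
    // Each inner list stands in for one physical partition.
    private final List<List<String>> partitions = new ArrayList<>();

    public PartitionedWriter(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Math.floorMod keeps the index non-negative even for negative hash codes.
    public int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Route the write to the partition that owns the key.
    public void write(String key) {
        partitions.get(partitionFor(key)).add(key);
    }

    // Number of records held by one partition.
    public int sizeOf(int partition) {
        return partitions.get(partition).size();
    }

    // Number of records across all partitions.
    public int totalRecords() {
        int total = 0;
        for (List<String> p : partitions) {
            total += p.size();
        }
        return total;
    }

    public static void main(String[] args) {
        PartitionedWriter writer = new PartitionedWriter(4);
        for (String name : Arrays.asList("Kevin Bacon", "Tom Hanks", "Apollo 13")) {
            writer.write(name);
            System.out.println(name + " -> partition " + writer.partitionFor(name));
        }
    }
}
```

Hashing on the key makes routing deterministic, so a later read for the same key goes to the same partition without a lookup table.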
21. Trade-offs
• Excellent for efficient use of page cache
• Able to maintain full database consistency
• Achieves highest ingest rate in distributed environments
• Almost always has highest “perceived” rate
• Trading off:
  • Eventual consistency in graph (connections)
  • Updates are still atomic, isolated and durable but phased
  • External agent performs graph building
Ingest
23. Scaling Reads and Query
[Diagram: Application(s) call a Distributed API, which fans out to Partition 1, Partition 2, Partition 3, ... Partition n, each with its own Processor]
Partitioning and read replicas… easy, right?
24. Why are Graphs Different?
[Diagram: Application(s) call a Distributed API, which fans out to Partition 1, Partition 2, Partition 3, ... Partition n, each with its own Processor]
Navigate
25. Distributed Navigation
• Detect local hops and perform in-memory traversal
• Send the partial path to the distributed processing to continue the navigation
• Intelligently cache remote data when accessed frequently
• Route tasks to other hosts when it is optimal
Navigate
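The first two bullets can be simulated in a single process: a traversal follows edges in memory while the next vertex lives on the same partition, and hands the partial path to the owning partition's work queue when it does not. The sketch below only illustrates that idea; `DistributedNavigationSketch`, its `Task` type and the hand-off counter are hypothetical, and real caching and task routing are left out.

```java
import java.util.*;

public class DistributedNavigationSketch {
    private final Map<String, List<String>> edges = new HashMap<>();
    private final Map<String, Integer> partitionOf = new HashMap<>();
    private int remoteHandOffs = 0;

    // Vertices must be added before the edges that connect them.
    public void addVertex(String vertex, int partition) {
        partitionOf.put(vertex, partition);
        edges.putIfAbsent(vertex, new ArrayList<>());
    }

    public void addEdge(String a, String b) {
        edges.get(a).add(b);
        edges.get(b).add(a);
    }

    public int remoteHandOffs() {
        return remoteHandOffs;
    }

    // A unit of work: a partial path plus the partition that must continue it.
    private static final class Task {
        final List<String> path;
        final int partition;
        Task(List<String> path, int partition) {
            this.path = path;
            this.partition = partition;
        }
    }

    // Finds all simple paths from start to target with at most maxDepth edges.
    public List<List<String>> navigate(String start, String target, int maxDepth) {
        List<List<String>> results = new ArrayList<>();
        Deque<Task> tasks = new ArrayDeque<>();
        tasks.add(new Task(Collections.singletonList(start), partitionOf.get(start)));
        while (!tasks.isEmpty()) {
            Task task = tasks.poll();
            extend(task.path, task.partition, target, maxDepth, tasks, results);
        }
        return results;
    }

    private void extend(List<String> path, int partition, String target,
                        int maxDepth, Deque<Task> tasks, List<List<String>> results) {
        String last = path.get(path.size() - 1);
        if (last.equals(target)) {
            results.add(path);
            return;
        }
        if (path.size() - 1 >= maxDepth) {
            return;
        }
        for (String next : edges.get(last)) {
            if (path.contains(next)) {
                continue; // keep paths simple
            }
            List<String> longer = new ArrayList<>(path);
            longer.add(next);
            int owner = partitionOf.get(next);
            if (owner == partition) {
                // Local hop: continue in memory on the same partition.
                extend(longer, partition, target, maxDepth, tasks, results);
            } else {
                // Remote hop: ship the partial path to the owning partition.
                remoteHandOffs++;
                tasks.add(new Task(longer, owner));
            }
        }
    }

    public static void main(String[] args) {
        DistributedNavigationSketch nav = new DistributedNavigationSketch();
        nav.addVertex("A", 0);
        nav.addVertex("B", 0);
        nav.addVertex("C", 1);
        nav.addVertex("D", 1);
        nav.addEdge("A", "B");
        nav.addEdge("B", "C");
        nav.addEdge("C", "D");
        System.out.println(nav.navigate("A", "D", 3));
        System.out.println("remote hand-offs: " + nav.remoteHandOffs());
    }
}
```

In this toy layout only the B to C edge crosses partitions, so the whole navigation costs a single hand-off; placing closely related vertices together is what keeps that number small.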
27. GraphViews
Leveraging Schema in the Graph
[Diagram: schema graph connecting the types Patient, Prescription, Drug, Ingredient, Outcome, Complaint, Visit, Allergy and Physician]
Navigate
28. Schema Enables Views
• GraphViews are extremely powerful
• Allow Big Data to appear small!
• Connection inference can lead to exponential gains in query performance
• Views are reusable between queries
• Views can be persisted
• Built into the native kernel
Navigate
29. Problem of Supernodes
In graph theory, a “supernode” is a vertex with a disproportionately high number of connected edges. Supernodes make it difficult to run a navigational query in real time, because a great deal of effort can go into pursuing paths through them that turn out to be unfruitful.
Navigate
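One simple way to spot candidate supernodes is to flag every vertex whose degree exceeds a chosen threshold. The sketch below assumes an in-memory adjacency map; `SupernodeFinder`, the threshold value and the sample vertex names are hypothetical, and a suitable threshold depends on the data set.

```java
import java.util.*;

public class SupernodeFinder {
    // Returns the vertices whose edge count exceeds the threshold, sorted by name.
    public static List<String> findSupernodes(Map<String, List<String>> adjacency,
                                              int degreeThreshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : adjacency.entrySet()) {
            if (entry.getValue().size() > degreeThreshold) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    // Hypothetical sample: an awards show linked to five people, a movie linked to two.
    public static Map<String, List<String>> sample() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("Awards Show", Arrays.asList(
            "Person 1", "Person 2", "Person 3", "Person 4", "Person 5"));
        g.put("Apollo 13", Arrays.asList("Person 1", "Person 2"));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(findSupernodes(sample(), 3));
    }
}
```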
30. Supernodes in Bacon
Navigate
In the IMDB data set, some examples of supernodes may be talk
shows, awards shows, compilations or variety shows.
31. How to avoid supernodes
1. Policies set on the navigation, such as NoRevisitPolicy, MaximumResultCountPolicy and MaximumPathDepthPolicy, customize the overall behavior of the navigation.
PolicyChain policies = new PolicyChain();
// Only traverse the same vertex once
policies.addPolicy(new NoRevisitPolicy());
// limits the number of paths that will be returned to 10K
policies.addPolicy(new MaximumResultCountPolicy(10000));
// limits the path depth to 6
policies.addPolicy(new MaximumPathDepthPolicy(6));
Navigate
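To show what limits of this kind constrain, here is a plain-Java path search with the same three knobs: no vertex revisited within a path, a cap on returned paths, and a cap on path depth. It does not use the InfiniteGraph API, and the real policy classes may behave differently (for example, they may track revisits across the whole navigation); `PolicySketch` and its fields are illustrative assumptions.

```java
import java.util.*;

public class PolicySketch {
    private final Map<String, List<String>> adjacency;
    private final int maxResults; // analogous to a maximum result count
    private final int maxDepth;   // analogous to a maximum path depth, in edges

    public PolicySketch(Map<String, List<String>> adjacency, int maxResults, int maxDepth) {
        this.adjacency = adjacency;
        this.maxResults = maxResults;
        this.maxDepth = maxDepth;
    }

    public List<List<String>> findPaths(String start, String target) {
        List<List<String>> results = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();
        path.addLast(start);
        Set<String> onPath = new HashSet<>(Collections.singleton(start));
        walk(start, target, path, onPath, results);
        return results;
    }

    private void walk(String current, String target, Deque<String> path,
                      Set<String> onPath, List<List<String>> results) {
        if (results.size() >= maxResults) {
            return; // result-count limit reached
        }
        if (current.equals(target)) {
            results.add(new ArrayList<>(path));
            return;
        }
        if (path.size() - 1 >= maxDepth) {
            return; // depth limit reached
        }
        for (String next : adjacency.getOrDefault(current, Collections.emptyList())) {
            if (!onPath.add(next)) {
                continue; // already on this path: do not revisit
            }
            path.addLast(next);
            walk(next, target, path, onPath, results);
            path.removeLast();
            onPath.remove(next);
        }
    }

    // Hypothetical diamond-shaped graph: A-B, A-C, B-D, C-D.
    public static Map<String, List<String>> diamond() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("A", Arrays.asList("B", "C"));
        g.put("B", Arrays.asList("A", "D"));
        g.put("C", Arrays.asList("A", "D"));
        g.put("D", Arrays.asList("B", "C"));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(new PolicySketch(diamond(), 10000, 6).findPaths("A", "D"));
        System.out.println(new PolicySketch(diamond(), 1, 6).findPaths("A", "D"));
    }
}
```

Each limit prunes the search at a different point: the revisit check stops cycles, the depth cap bounds how far a path through a supernode can wander, and the result cap ends the search once enough paths are in hand.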
32. How to avoid supernodes
2. Graph View to exclude or limit types
GraphView view = new GraphView();
// Excludes all instances of TvShow from navigation
view.excludeClass(myDb.getTypeId(TvShow.class.getName()));
// Excludes all movies made for TV/Video
view.excludeClass(myDb.getTypeId(Movie.class.getName()),
    "details.madeForTv || details.madeForVideo");
// Excludes all WorkedOn edges
view.excludeClass(myDb.getTypeId(WorkedOn.class.getName()));
// Includes ActedIn edges whose characterName does not contain "Himself"
view.includeClass(myDb.getTypeId(ActedIn.class.getName()),
    "!CONTAINS(characterName, \"Himself\")");
Navigate
[Diagram: Kevin Bacon (Actor) linked to The Following (TV Show) as Ryan Hardy, to Behind the Scenes (Movie) as Himself, and to Apollo 13 (Movie) as Jack Swigert]
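The effect of such a view on Kevin Bacon's three roles above can be imitated with ordinary Java predicates: drop edges that lead to an excluded type, then drop edges whose character name fails the filter. This is a conceptual sketch only, not the GraphView implementation; `ViewFilterSketch`, `Edge` and `visibleTargets` are hypothetical names.

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ViewFilterSketch {
    // One outgoing edge from an actor: where it leads and the role played.
    public static final class Edge {
        final String targetName;
        final String targetType;
        final String characterName;
        Edge(String targetName, String targetType, String characterName) {
            this.targetName = targetName;
            this.targetType = targetType;
            this.characterName = characterName;
        }
    }

    // Keeps only the edges that the view leaves visible to the navigation.
    public static List<String> visibleTargets(List<Edge> edges,
                                              Set<String> excludedTypes,
                                              Predicate<Edge> edgeFilter) {
        return edges.stream()
            .filter(e -> !excludedTypes.contains(e.targetType)) // type exclusion
            .filter(edgeFilter)                                 // property predicate
            .map(e -> e.targetName)
            .collect(Collectors.toList());
    }

    // Sample data taken from the diagram on the slide.
    public static List<String> sample() {
        List<Edge> baconEdges = Arrays.asList(
            new Edge("The Following", "TvShow", "Ryan Hardy"),
            new Edge("Behind the Scenes", "Movie", "Himself"),
            new Edge("Apollo 13", "Movie", "Jack Swigert"));
        return visibleTargets(baconEdges,
            Collections.singleton("TvShow"),
            e -> !e.characterName.contains("Himself"));
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Of the three edges, only the one to Apollo 13 survives both filters, so a navigation through this view never touches the TV show or the appearance as himself.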
33. How to avoid supernodes
3. Using these policies and the graph view, we can limit the size of the result set in our navigation:
Navigator navigator = bacon.navigate(view,
Guide.SIMPLE_BREADTH_FIRST, Qualifier.ANY,
new VertexPredicate(Person.class, ""),
policies, myResultHandler);
navigator.start();
Navigate
34. Filtered Views in Bacon
The results of this navigation would look something like this…
Navigate
35. Why InfiniteGraph?
• Objectivity/DB is a proven foundation
– Building distributed databases since 1993
– A complete database management system
  • Concurrency, transactions, cache, schema, query, indexing
• It’s a Graph Specialist!
– Simple but powerful API tailored for data navigation
– Easy to configure distribution model
36. Advanced Configured Placement
• Physically co-locate “closely related” data
• Driven through a declarative placement model
• Dramatically speeds “local” reads
[Diagram: one group of Patient Data Page(s) and two groups of Facility Data Page(s). Vertices shown: Mr Citizen, two Visits, Primary Physician, San Jose Facility, Sunnyvale, Dr Jones, Dr Smith, Dr Blake, Dr Quinn. Edge labels shown: Has, With, At, Located.]
37. Fully Distributed Data Model
[Diagram: HostA, HostB, HostC ... HostX spread across Zone 1 and Zone 2; layers shown are IG Core/API, Distributed Object and Relationship Persistence Layer, and Customizable Placement; an AddVertex() call enters at the top]
38. Polyglot NoSQL Architectures
[Diagram components: Users; Applications; Distributed Data Processing Platform; Document; Graph Database; RDBMS; Partitioned Distributed DB (often Document / KV); External/Legacy Data; Transformation; MDM; Business]