This document discusses using graph analytics and Gremlin on family trees to gain insights. It provides examples of using centrality analysis like degree, eigenvector, and PageRank centrality to determine the most important individuals. Path analysis like simplePath and cyclicPath are demonstrated to find lineage lengths. Pattern detection with Gremlin match queries helps identify patterns like individuals who married their first cousins. Finally, it shows combining techniques to analyze which women who married cousins had the most family connections.
2. Hello!
I am David Bechberger
Sr. Architect for Data and Analytics at Gene by
Gene, a bioinformatics company specializing
in genetic genealogy.
You can find me at:
@bechbd
www.linkedin.com/in/davebechberger
4. What this talk isn’t
◎A thorough review of graph analytic
techniques
◎A review of all graph analytic frameworks
◎A deep dive into any of the techniques we
discuss
5. What this talk is
◎Where to start with Graph Analytics
◎OLTP and OLAP in Gremlin
◎Practical Examples using …..
7. Or more specifically this
[Diagram: graph data model. Vertex labels: tree, individual, family, name. Edge labels: owns, member_of, is_known_as, is_spouse, is_first_cousin.]
8. Example - Find the names of all family members in a tree
[Diagram: example graph. Tree T1; families F1 and F2; individuals I1 to I4; names Bob, Steve, Joan, Rick. Edges: owns, is_known_as, and member_of with the roles Husband, Wife and Son.]
9. Gremlin Example - Finding the names of all family members
for tree owner
g.V().has('tree', 'unique_id', 'T1')
.out('owns')
.sideEffect(
out('is_known_as').properties('full_name')
.store('name')
)
.out('member_of').in('member_of')
.sideEffect(
out('is_known_as').properties('full_name')
.store('name')
)
.cap('name')
10. Apache TinkerPop Gremlin OLTP and OLAP
◎TinkerPop supports both
◎Gremlin can be used to
query in either
◎But there are differences….
11. Gremlin OLTP versus OLAP
OLTP
◎ Depth First
◎ Lazy Evaluation - Low memory usage
◎ Real-time (ms/sub-sec)
OLAP
◎ Breadth First
◎ Eager evaluation - High memory usage
◎ Long Running (min/hour)
12. Limitations
OLTP
◎ Cannot run certain queries or steps (e.g. pageRank, bulk loading)
◎ Limited time a query can run
◎ Local operations
OLAP
◎ Some steps are prohibitive like path(), simplePath(), etc.
◎ Barrier Steps (count(), min(), max(), etc.)
◎ Global Operations
13. What insights are we going to gain
◎Who in this tree is the most important?
◎Who in this tree is 6 degrees from Kevin
Bacon?
◎Who in this tree married their first cousin?
16. Example - Who is a member of the most families?
g.V().hasLabel('individual')
.project('person', 'degree')
.by('full_name')
.by(bothE('member_of').count())
.order().by(select('degree'), decr).limit(5)
18. Example - Who is the most important individual?
g.V().hasLabel('individual')
.repeat(
groupCount('m').by('full_name')
.out('member_of').in('member_of')
.timeLimit(100)
).times(5).cap('m')
.order(local).by(values, decr)
.limit(local, 5).next()
20. Example - Whose lineage exerts the most influence over this
family tree?
g.withComputer().V().hasLabel('individual')
.pageRank()
.by(bothE('member_of')).by('rank')
.order().by('rank', decr)
.valueMap('full_name', 'rank').limit(5)
21. Answer
Degree
Name Value
Henry VIII 7
Charlemagne 6
Jan 5
Ferdinand VII 5
Philip II 5
EigenVector
Name Value
Mary 149950
Margret 124221
Henry VIII 107539
Son 90715
Daughter 86961
PageRank
Name Value
Joan of the Tower 0.784
Edward III 0.774
Elenor 0.774
John of Eltham 0.719
Frederick William III 0.681
30. Example - How long is the lineage between Queen Victoria
and Henry VIII?
SimplePath
g.V('@I1@').repeat(timeLimit(60000)
.out('member_of').in('member_of')
.simplePath()).until(hasId('@I828@'))
.path().limit(1).count(local)
CyclicPath
g.V('@I1@').repeat(timeLimit(60000)
.out('member_of').in('member_of')
.cyclicPath()).until(hasId('@I828@'))
.path().limit(1).count(local)
34. Pattern Detection in Gremlin
◎Gremlin has the ability to be imperative
○ g.V().in().out()......
◎Or Declarative
○ g.V().match(
__.as('a').....as('b'), //predicate 1
__.as('b').....as('c'), //predicate 2
__.as('c').where('c', eq('b')).as('c')
).select('b', 'c')
35. Example - Who is married to their first cousin?
g.V().match(
__.as('e').has('individual','sex','M').as('husband'),
__.as('husband').in('is_spouse').as('wifes'),
__.as('husband').both('is_first_cousin').as('cousin'),
__.as('cousin').where('cousin',eq('wifes')).as('wife')
).select('husband','wife')
.by('full_name').fold().unfold()
36. Answer
Husband Wife
1 Albert Augustus Charles Victoria /Hanover/
2 Leopold_I Margaret Teresa
3 Alexander_I the_Fierce Sybil
4 Philip_IV Mariana of_Austria
39. Example - Which women who married their first cousin had
the greatest number of families?
g.V().match(
__.as('e').has('individual','sex','M').as('husband'),
__.as('husband').in('is_spouse').as('wifes'),
__.as('husband').both('is_first_cousin').as('cousin'),
__.as('cousin').where('cousin',eq('wifes')).as('wife')
).select('wife')
.project('person','degree')
.by('full_name')
.by(bothE('member_of').count())
.order().by(select('degree'), decr).limit(5)
Background of nearly 20 years in full stack development in .NET, C, Java/Scala, and pretty much everything else
Switched to working almost exclusively on big data problems several years ago
Spent the last few years leveraging graph databases to build out high performance data platforms
If you have questions on using .NET and graph databases feel free to come talk to me.
Current role is Sr Architect for data and analytics building out our next generation data and analytics platform
I like to think of this talk as “Things I wish I knew 18 months ago about graphs”
Well known model
Going to use a European Royal Family Tree
Based on GEDCOM - 1995 Standard by the LDS church
Basically it's a linked data structure where all records are atomic units (individual/family/name/note) that contain pointers to each other
Start at a tree
Move to the root owner and to their name
Traverse out to families
Then from families to other individuals and their name
Here is what an example query on our model looks like….
As you can see, the basis of this model, as it was brought over from GEDCOM, can make queries more verbose than one would normally strive for in order to retrieve what should be a relatively simple set of data
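The same walk can be sketched outside Gremlin to make the steps easier to follow. This is a minimal Python sketch, not the query engine's behavior: the dictionaries, the pairing of individuals I1 to I4 with the four names, and the helper `family_member_names` are all hypothetical and only mirror the tree -> owner -> families -> members -> names steps.

```python
# Hypothetical toy graph; which individual carries which name is an assumption.
OWNS = {"T1": ["I1"]}
MEMBER_OF = {"I1": ["F1", "F2"], "I2": ["F1"], "I3": ["F2"], "I4": ["F2"]}
IS_KNOWN_AS = {"I1": "Bob", "I2": "Steve", "I3": "Joan", "I4": "Rick"}


def family_member_names(tree_id):
    """Collect the owner's name, then the names of everyone in the owner's families."""
    names = []
    for owner in OWNS.get(tree_id, []):                  # out('owns')
        names.append(IS_KNOWN_AS[owner])                 # first store('name')
        for family in MEMBER_OF.get(owner, []):          # out('member_of')
            for person, families in MEMBER_OF.items():   # in('member_of')
                if family in families:
                    names.append(IS_KNOWN_AS[person])    # second store('name')
    return names
```

Like the traversal it imitates, the sketch does not deduplicate, so the owner shows up once per family they belong to.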
OLTP - Depth first: serial stream processing to provide depth first traversals into the data.
Can be thought of as a stream processor where graph traversers arrive from the left -> an instruction is processed on those traversers -> mutated traversers are sent out the right
OLAP - Unlike OLTP queries, OLAP queries are breadth first, meaning that they run in a logically parallel manner and use message passing to communicate between vertices.
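The difference in evaluation order can be illustrated with a small sketch. The code below is plain Python over a hypothetical four-vertex graph, not TinkerPop's execution model: `depth_first` is a lazy generator that hands out one vertex at a time, while `breadth_first` eagerly builds every level of the frontier before returning.

```python
# Hypothetical toy graph: vertex -> adjacent vertices.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def depth_first(start):
    """Lazily yield vertices depth first, one at a time."""
    stack, seen = [start], set()
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        yield vertex  # the caller can stop here without visiting the rest
        stack.extend(reversed(GRAPH[vertex]))


def breadth_first(start):
    """Eagerly expand the whole frontier, level by level."""
    levels, frontier, seen = [], [start], {start}
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for vertex in frontier:
            for neighbor in GRAPH[vertex]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return levels
```

Calling `next(depth_first("a"))` touches a single vertex, which is the low-memory behavior described for OLTP; `breadth_first("a")` holds each full level in memory, which is closer to the OLAP trade-off.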
OLTP - Has its limitations, most notably that certain complex operations (such as running pageRank, bulk loading, and global operations) are not allowed or not appropriate for a transactional workload
OLAP - This scatter/gather methodology allows for working on massive scales of data, but it also prevents some steps (such as path() and simplePath()) from being executed and keeps others, such as order(), from being meaningful. It also has the disadvantage that some steps within a Gremlin query require all of the data to be in the same location to process. Steps such as count(), min(), max(), group(), etc. are known as barrier steps and require that all the data return to a single location to be processed before being sent back out to workers.
OLTP - Use when your query is going to touch only a portion of the data or a subgraph e.g. Give me the average age of people in my family?
OLAP - Use when your query is going to touch all/a significant amount of the data in the graph e.g. Give me the average age of everyone in my family tree?
Centrality Analysis is about determining what is the most important in your graph. This sort of analysis is quite common when performing social network analysis, looking for key infrastructure points and examining biological networks.
Unfortunately, defining what it means to be important is really dependent on the circumstances. One other important thing to remember is that these sorts of algorithms measure the importance of a vertex in a graph, which may or may not be correlated with its influence. For finding the most influential nodes in a graph there are other node influence metrics you would want to investigate.
Degree Centrality - a measure of the number of edges associated with a vertex
Degree Centrality looks at the number of connections a vertex has and uses that to determine its relative importance. This can be further refined by counting only inward or only outward edges. In degree centrality, the larger the number, the more influential the vertex
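As a rough illustration of the counting involved, here is a minimal Python sketch. The edge list and the helper `top_by_degree` are hypothetical; it simply counts member_of edges per individual and keeps the largest counts, which is the idea behind the project/count/order query on the slide.

```python
from collections import Counter

# Hypothetical member_of edges as (individual, family) pairs.
MEMBER_OF = [
    ("Henry", "F1"), ("Henry", "F2"), ("Henry", "F3"),
    ("Anne", "F2"),
    ("Jane", "F3"), ("Jane", "F4"),
]


def top_by_degree(edges, limit=5):
    """Count edges per individual and return the highest counts first."""
    degree = Counter(person for person, _family in edges)
    return degree.most_common(limit)
```

With this toy data, `top_by_degree(MEMBER_OF, 2)` puts the individual with three family memberships first.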
Eigenvector Centrality - a measure of the influence of a vertex on the graph that uses the relative importance of the adjacent vertices to determine the importance of that vertex. I.e. a vertex that has many edges but is connected to few influential vertices will be ranked lower than a vertex that has fewer edges but more important adjacent vertices
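One common way to approximate this measure is power iteration, where every vertex repeatedly takes the sum of its neighbors' scores. The sketch below assumes that approach; it is not the Gremlin query from the slides, and the four-vertex graph and the function name are hypothetical.

```python
def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration: each score becomes the sum of the neighbors' scores."""
    scores = {vertex: 1.0 for vertex in adjacency}
    for _ in range(iterations):
        updated = {v: sum(scores[n] for n in adjacency[v]) for v in adjacency}
        norm = max(updated.values()) or 1.0  # rescale so the top score is 1.0
        scores = {v: value / norm for v, value in updated.items()}
    return scores


# Hypothetical undirected graph; every edge is listed in both directions.
ADJACENCY = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
```

In this toy graph `c` ends up with the top score, `a` and `b` tie behind it, and `d` trails because its only neighbor is its sole source of importance.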
PageRank - Made famous by Sergey Brin and Larry Page at Google for ranking web links. It works similarly to Eigenvector Centrality but adds a scaling factor to the results. This algorithm is well documented but far from something you would want to create yourself. Luckily Gremlin has a prebuilt step to help us with this.
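Purely to show where the scaling factor enters, and not as a substitute for the prebuilt step, here is a small Python sketch of the textbook iteration. Treating the scaling factor as the usual damping factor, the value 0.85, the toy link structure, and the function name are all assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along out-links."""
    n = len(out_links)
    ranks = {vertex: 1.0 / n for vertex in out_links}
    for _ in range(iterations):
        # Every vertex starts each round with the undamped base share.
        updated = {vertex: (1.0 - damping) / n for vertex in out_links}
        for vertex, targets in out_links.items():
            if targets:
                share = damping * ranks[vertex] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for target in updated:
                    updated[target] += damping * ranks[vertex] / n
        ranks = updated
    return ranks


# Hypothetical directed graph: vertex -> vertices it links to.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"]}
```

Here `c` collects rank from both `a` and `b`, `a` receives rank only from `c`, and `b` keeps just the base share because nothing links to it.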
The interesting part about this answer is not necessarily the answer itself but the fact that each method produced distinctly different answers
Example why you need to understand your question to choose the correct method
Why do these examples matter?
Paths are the walk through the graph defined by a traversal
Path object contains
All Labels “as(xxxx)”
All Vertices
All Edges
All sideeffects/datastructures
Path traversals tend to be on the slow side and they are computationally expensive as the entire path is stored for each traverser. This can expand exponentially as the size of the
Simple path queries are pretty much what they sound like.
Shortest path between two vertices in a graph.
Minimize the amount of computation that is required
simplePath filters out paths that contain repeats in them.
Simple path queries are often useful if you want to find the shortest connection between two things such as in a transportation network, between patterns or subgraphs or in social network analysis
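A breadth-first search that refuses to revisit a vertex is one way to picture this. The sketch below is plain Python under that assumption; the adjacency data, the vertex ids and the helper `shortest_simple_path` are hypothetical.

```python
from collections import deque

# Hypothetical directed adjacency between individuals.
ADJACENCY = {
    "I1": ["I2", "I3"],
    "I2": ["I1", "I4"],
    "I3": ["I4"],
    "I4": ["I5"],
    "I5": [],
}


def shortest_simple_path(adjacency, start, goal):
    """Breadth-first search that only extends paths without repeated vertices."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in path:  # the simplePath() idea: drop repeats
                queue.append(path + [neighbor])
    return None
```

Because paths are explored in order of length, the first path that reaches the goal is also a shortest one, and the edge from I2 back to I1 is never followed.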
cyclic paths are paths that repeat back on themselves.
Using something like a cyclic path can be a first step in trying to detect communities or clusters within your graph
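For contrast with the simple path case, this minimal Python sketch collects only the walks that do loop back on themselves. The graph, the hop limit and the function name are hypothetical, and the hop limit is there only to keep the search bounded.

```python
def cyclic_paths(adjacency, start, max_hops):
    """Return paths from start that revisit a vertex within max_hops steps."""
    found = []

    def walk(path):
        if len(path) - 1 == max_hops:
            return
        for neighbor in adjacency.get(path[-1], []):
            extended = path + [neighbor]
            if neighbor in path:
                found.append(extended)  # the path closes back on itself
            else:
                walk(extended)

    walk([start])
    return found


# Hypothetical directed graph with one cycle: a -> b -> c -> a.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
```

With a limit of three hops the only cyclic path found is a, b, c, a; with a limit of two hops the cycle cannot close and nothing is returned.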
How about we find the quickest lineage between Queen Victoria and Henry VIII instead?
There are a few key things to note here:
Where the “until” sits matters when you do a repeat. If it is before the “repeat” it is a while/do loop; if it is after, it is a do/while loop
Adding a time limit to your traversal can help prevent a never-ending query
If you go in and examine the path objects returned by these two queries you will notice that the difference between the simple and cyclic paths is that the cyclic path circles through her husband Albert to continue on to Henry VIII
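The point about where the “until” sits can be shown with two small loops. This is a Python sketch with hypothetical helper names; `step` stands in for the repeated traversal body and `done` for the until condition.

```python
def until_then_repeat(value, step, done):
    """until(...) before repeat(...): test first, then run the body (while/do)."""
    while not done(value):
        value = step(value)
    return value


def repeat_then_until(value, step, done):
    """repeat(...) before until(...): run the body once, then test (do/while)."""
    while True:
        value = step(value)
        if done(value):
            return value
```

If the starting value already satisfies the condition, the first form returns it untouched while the second still applies the body once.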
Finding how you are related to others in your family tree is a rather straightforward matter of counting the ups and downs in generations found by the simplest path.
Finding clusters of people in your tree can be used to help identify areas in your tree where familial marriages were common
Gremlin has the ability to work as both an imperative language as well as a declarative language
In the imperative model you usually write queries as we described earlier. You start with some stream over vertices -> you then move left to right taking in data -> processing that data -> then emitting the processed data
On the other hand the declarative model uses a different approach. In the declarative model the user defines a base set of nodes -> then describes one or more patterns that the data needs to match. Once the query is submitted, the Gremlin engine determines the optimal query to run to find that pattern within the graph
One of the neat features of Gremlin is that you are able to intermix the two types of syntax within the same traversal.
Personally I find writing the declarative syntax powerful but I struggle every time I work with it.
1. So what we are doing here is first defining a predicate containing all the males
2. Next we are defining a predicate containing all of those men's wives
3. We then are defining a predicate containing all of those men's cousins
4. Finally we are matching everyone who is both a cousin and a wife
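The four steps above can be sketched as set operations. This is a minimal Python sketch over made-up data, not the match() step itself: the ids, the edge sets and the helper `married_first_cousins` are hypothetical, and the is_spouse edge is assumed to point from wife to husband, as the `in('is_spouse')` in the query suggests.

```python
# Hypothetical toy data; ids and relationships are illustrative only.
SEX = {"h1": "M", "h2": "M", "w1": "F", "w2": "F", "c1": "F"}
IS_SPOUSE = {("w1", "h1"), ("w2", "h2")}        # (wife, husband)
IS_FIRST_COUSIN = {("h1", "w1"), ("h2", "c1")}  # treated as undirected


def married_first_cousins():
    """Return (husband, wife) pairs where the wife is also a first cousin."""
    matches = []
    husbands = sorted(p for p, sex in SEX.items() if sex == "M")   # step 1
    for husband in husbands:
        wives = {w for w, h in IS_SPOUSE if h == husband}          # step 2
        cousins = {b for a, b in IS_FIRST_COUSIN if a == husband}  # step 3
        cousins |= {a for a, b in IS_FIRST_COUSIN if b == husband}
        for wife in sorted(wives & cousins):                       # step 4
            matches.append((husband, wife))
    return matches
```

Only `h1` qualifies here: his wife `w1` is also his cousin, while `h2` has a cousin `c1` who is not his wife.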
Yes, I understand this is an interesting traversal from an inquisitive perspective, but it is also relevant from a genetic perspective.
Endogamous populations, ones that marry within specific groups, have a greater chance of inheriting familial genetic defects than people who marry within the larger population. While cousin marriages were common across many parts of the European Royal family tree, one famous example was Queen Victoria. She married her first cousin Albert. Due to this close relationship between the parents, several of Queen Victoria's children ended up with hemophilia, which is a genetic defect on the X chromosome inherited from parents.
Why do these examples matter?
Well in our business our customers are very interested in expanding their family trees. If we are able to use pattern matching algorithms to suggest potential matches in other people’s family trees then we are able to quickly and effectively provide them with the ability to expand their family trees.
When it comes to it this is a bit of a strange query.
Intermixing these sorts of graph analytical tools to gain more valuable insight into your data
This query is also an example of how you can leverage both declarative and imperative syntaxes in the same query.
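The combination can be pictured as two stages: a pattern-matching stage that yields a set of people, followed by a degree count over just those people. The Python sketch below assumes the first stage has already produced its result; the ids, the edge list and the helper `rank_by_families` are hypothetical.

```python
from collections import Counter

# Assumed output of the pattern-matching stage (hypothetical ids).
COUSIN_WIVES = ["w1", "w2", "w3"]

# Hypothetical member_of edges as (individual, family) pairs.
MEMBER_OF = [
    ("w1", "F1"),
    ("w2", "F1"), ("w2", "F2"), ("w2", "F3"),
    ("w3", "F4"), ("w3", "F5"),
    ("x", "F9"),  # not part of the matched set, so it is ignored
]


def rank_by_families(people, edges, limit=5):
    """Count member_of edges for the matched people only, highest first."""
    degree = Counter({person: 0 for person in people})
    for person, _family in edges:
        if person in degree:
            degree[person] += 1
    return degree.most_common(limit)
```

With this toy data, `rank_by_families(COUSIN_WIVES, MEMBER_OF, 2)` lists `w2` with three families ahead of `w3` with two.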