2. KMW Technology Overview
Boston-based software consulting and
professional services organization.
Founded in 2010.
Seven consultants with deep industry
experience.
Boutique firm specializing in Search
and Big Data technologies.
Custom Connectors, Pipelines,
Search, Analytics, and UI
development.
3. Search vs. Join vs. Graph
Which query should I use?
Search is for flat data, no relationships
◦ Data is often de-normalized; updates can require
large amounts of re-indexing.
Join is for one level of relationships
◦ Data is normalized, but when more than 2 tables are
involved, join queries must be nested.
Graph is for arbitrary depth/levels of
relationships.
◦ Data can be completely normalized, arbitrary
numbers of tables can be joined together.
A one level hop on a graph is roughly
equivalent to a join query.
4. What is a Graph?
A generic representation of all data
models.
“One data model to rule them all”!
G = <V,E> ?!?!
Vertices/Nodes
◦ Can have properties as key value pairs.
Edges
◦ Can have properties as key value pairs.
5. Graph Traversal
There are many graph traversal /
exploration algorithms: DFS, BFS, A*,
Alpha–beta, etc.
Solr graph query implements BFS
(breadth-first search). Each hop expands
the “Frontier” of the graph: it explores
all current edges in a single step, also
known as a “hop”.
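The frontier idea can be sketched in a few lines of Python. This is a toy in-memory illustration, not Solr's implementation: the adjacency dictionary, the function name `bfs_hops`, and the `max_depth` argument are all assumptions made for the example.

```python
def bfs_hops(edges, roots, max_depth=-1):
    """Breadth-first traversal; returns {node: hop at which it was first reached}.

    edges     -- dict mapping a node to the nodes it points at (hypothetical data)
    roots     -- starting nodes (hop 0)
    max_depth -- number of hops to follow; -1 means no limit
    """
    visited = {}
    frontier = set(roots)
    hop = 0
    while frontier:
        # Record everything in the current frontier.
        for node in frontier:
            visited[node] = hop
        if max_depth != -1 and hop >= max_depth:
            break
        # One "hop": follow every edge leaving the current frontier at once.
        next_frontier = set()
        for node in frontier:
            for neighbor in edges.get(node, ()):
                if neighbor not in visited:  # already seen, so a cycle is skipped
                    next_frontier.add(neighbor)
        frontier = next_frontier
        hop += 1
    return visited


# A small graph with a cycle: D points back at A.
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
all_hops = bfs_hops(edges, ["A"])
one_hop = bfs_hops(edges, ["A"], max_depth=1)
```

The traversal stops on its own when the frontier becomes empty, which is why the cycle from D back to A does not loop forever.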
6. Key Features and Design Goals
“Graph is a Filter on top of your data”
-someone
Designed for large scale, large numbers of
edges, and very deep traversals.
Limited memory usage for traversal
Cycle detection for “free”
Highly cacheable
Support multiValued fields for nodes and/or
edges
Support filters during the traversal
Follow Every Edge! No edge left behind!
Works with Facets & Facet Queries!
7. A Word about Memory Usage
One bit set to rule them all!
BitSet provides cycle detection implicitly.
(Have I been here before?)
BitSet holds one bit per document in the index.
A 100 million doc index uses only about 12
MB per query! (Same size as 1 filter
cache entry!)
Additional bitsets may be used during
query execution depending on query
params. (leaf nodes and root nodes
bitsets)
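The memory figure follows from one bit per document. The sketch below checks the arithmetic and shows how a "have I been here before?" test falls out of the same structure. It is a plain Python stand-in, not the bit set class Solr actually uses; `bitset_bytes` and `DocBitSet` are names invented for the example.

```python
def bitset_bytes(num_docs):
    """Bytes needed for one bit per document, rounded up to a whole byte."""
    return (num_docs + 7) // 8


class DocBitSet:
    """Minimal bit set keyed by document number (illustrative only)."""

    def __init__(self, num_docs):
        self.bits = bytearray(bitset_bytes(num_docs))

    def test_and_set(self, doc_id):
        """Mark doc_id as visited; return True if it was already marked."""
        index, mask = doc_id >> 3, 1 << (doc_id & 7)
        seen = bool(self.bits[index] & mask)
        self.bits[index] |= mask
        return seen


size = bitset_bytes(100_000_000)   # 12,500,000 bytes, i.e. roughly 12 MB

visited = DocBitSet(16)
first_visit = visited.test_and_set(5)    # not seen before
second_visit = visited.test_and_set(5)   # seen: a cycle led back here
```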
8. Graph Query Parser Syntax
Parameter        Default  Description
from                      Field containing the node id.
to                        Field containing the edge id(s).
maxDepth         -1       The number of hops to traverse from the root of the graph.
                          -1 means traverse until all edges and documents have been
                          collected. maxDepth=1 behaves similarly to a JOIN.
traversalFilter  null     Arbitrary query string to apply at each hop of the traversal.
returnRoot       true     true or false: whether the documents matching the root
                          query should be returned.
returnOnlyLeaf   false    true or false: return only documents in the result set
                          that do not have a value in the “to” field.
useAutn          true     Performance trade-off based on use case. Mileage may vary.
Uses Solr’s query parser plugin and “local params” syntax:
{!graph param="value" … }
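As a rough illustration of the local-params syntax, the helper below assembles a graph query string from a dictionary of parameters. The function name `graph_query` is hypothetical, and the sketch assumes parameter values contain no double quotes (it refuses them instead of guessing at an escaping rule).

```python
def graph_query(root_query, params):
    """Build a '{!graph ...}root_query' string from a dict of local params.

    params is a plain dict so that the key "from" (a Python keyword) can be used.
    Booleans are rendered as true/false; every value is wrapped in double quotes.
    """
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        value = str(value)
        if '"' in value:
            raise ValueError("double quotes in values are not handled by this sketch")
        parts.append('%s="%s"' % (name, value))
    return "{!graph " + " ".join(parts) + "}" + root_query


q = graph_query(
    "sense_lemma:jaguar",
    {"from": "synset_id", "to": "hypernym_id", "maxDepth": 9, "returnRoot": False},
)
```

Here `q` holds the string `{!graph from="synset_id" to="hypernym_id" maxDepth="9" returnRoot="false"}sense_lemma:jaguar`.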
9. Princeton Wordnet
Princeton Wordnet has an ontology for many of the
words in the English language. Its
relationships form hierarchies of words that
run from more general to more specific
classes.
https://wordnet.princeton.edu/
Words have a “sense”, or meaning.
Hypernym is a more general related word.
Hyponym is a more specific related word.
◦ Jaguar is a type of Cat
◦ Large Cat is a type of Animal
Intersections of this hierarchy can answer
questions: “Is a jaguar an animal?”
10. Wordnet Hypernym Traversal
Start traversing from the word sense “jaguar” up the hypernym graph 9 levels.
+{!graph from="synset_id" to="hypernym_id" maxDepth=9}sense_lemma:jaguar
11. Wordnet Graph Intersections
Is a jaguar an animal? Query for an
intersection between the two graphs.
If a graph intersection exists, the answer is yes!
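The intersection test can be mimicked on a toy data set. The sketch below is an in-memory approximation, not a Solr query: the documents and their ids are made up, and only the field names `synset_id`, `hypernym_id`, and `sense_lemma` come from the query shown earlier.

```python
# Hypothetical miniature of a WordNet-style index.
DOCS = [
    {"synset_id": "s1", "sense_lemma": "jaguar", "hypernym_id": ["s2"]},
    {"synset_id": "s2", "sense_lemma": "big cat", "hypernym_id": ["s3"]},
    {"synset_id": "s3", "sense_lemma": "animal", "hypernym_id": []},
    {"synset_id": "s4", "sense_lemma": "oak", "hypernym_id": ["s5"]},
    {"synset_id": "s5", "sense_lemma": "plant", "hypernym_id": []},
]
BY_ID = {doc["synset_id"]: doc for doc in DOCS}


def hypernym_closure(lemma, max_depth=9):
    """Ids reachable from the senses of `lemma` by walking hypernym_id upward."""
    frontier = {d["synset_id"] for d in DOCS if d["sense_lemma"] == lemma}
    reached = set(frontier)
    for _ in range(max_depth):
        # Collect edge ids from the frontier, then match them against node ids.
        edge_ids = {e for sid in frontier for e in BY_ID[sid]["hypernym_id"]}
        frontier = {sid for sid in edge_ids if sid in BY_ID} - reached
        if not frontier:
            break
        reached |= frontier
    return reached


def is_a(specific, general):
    """True if the upward graph from `specific` intersects the senses of `general`."""
    targets = {d["synset_id"] for d in DOCS if d["sense_lemma"] == general}
    return bool(hypernym_closure(specific) & targets)


jaguar_is_animal = is_a("jaguar", "animal")
jaguar_is_plant = is_a("jaguar", "plant")
```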
12. OpenCV, Video Recognition
Imagine indexing each frame of video
from security cameras. Pass each
frame of video through OpenCV for
object recognition & face recognition.
Each frame stores its own frame number
and the frame number of the previous frame.
Search for object/face “A” detected,
followed by object/face “B” detected,
across all of your video streams.
13. Users, Items and Actions
Model your browsing/purchase history as
◦ Users (have an ID)
◦ Items (have an ID, metadata, category, etc)
◦ Actions (link between user and Items, such
as rating, purchase, like/dislike)
User -> Action -> Item -> Action -> User …
Use Graph + maxDepth to control how far to go.
maxDepth = 2 gets from a user to an item.
maxDepth = 4 gets from one user to a new set
of users, and on and on.
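The effect of maxDepth on this model can be tried out on a handful of made-up records. Everything in the sketch is an assumption for illustration: the record ids, the `type` and `edge_ids` fields, and the `traverse` helper, which imitates a depth-limited traversal in memory instead of calling Solr.

```python
# Hypothetical records: each action links a user and an item in both directions.
DOCS = {
    "u1": {"type": "user", "edge_ids": ["a1"]},
    "a1": {"type": "action", "edge_ids": ["u1", "i1"]},
    "i1": {"type": "item", "edge_ids": ["a1", "a2"]},
    "a2": {"type": "action", "edge_ids": ["i1", "u2"]},
    "u2": {"type": "user", "edge_ids": ["a2"]},
}


def traverse(root_ids, max_depth, return_root=True):
    """Follow edge_ids for at most max_depth hops, never revisiting a record."""
    roots = set(root_ids)
    visited = set(roots)
    frontier = set(roots)
    for _ in range(max_depth):
        frontier = {e for n in frontier for e in DOCS[n]["edge_ids"]} - visited
        if not frontier:
            break
        visited |= frontier
    return visited if return_root else visited - roots


def of_type(ids, doc_type):
    """Keep only the ids whose record has the given type."""
    return {i for i in ids if DOCS[i]["type"] == doc_type}


items_for_u1 = of_type(traverse(["u1"], max_depth=2), "item")
similar_users = of_type(traverse(["u1"], max_depth=4, return_root=False), "user")
```

With two hops the traversal from `u1` reaches the item `i1`; with four hops and the root excluded it reaches the other user, `u2`.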
14. Actions occur over time
These events can’t easily be
aggregated or flattened onto a record.
Model this as a “person” record, with a
set of “action” records.
Each action record has the id of the
“previous” action.
Search for an action, graph traverse
based on person id to another action,
then finally to the person record.
15. Find similar users
Graph traversal from a user (or set of
users) through their actions to items
they like, to find similar users, and out
to items they like.
Now, exclude the original starting set
with returnRoot=false.
16. Graph Query For Security
Graph queries are elegant and simple
to use for traversing security
hierarchies such as LDAP and AD
Custom security models that are
hierarchical or folder based in nature.
21. Security Query
Single security query term to traverse the entire graph
{!graph from="node_id" to="edge_ids" returnOnlyLeaf="true"}id:user_1
The query is applied as a filter query (fq) on the query request;
the normal query is used for filtering against documents.
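One way to picture this is as a pair of request parameters: the user's search in `q` and the graph query in `fq`. The sketch below only builds and URL-encodes those parameters. The helper name `secured_params` and the sample search text are assumptions, and the user id is inserted as-is with no escaping.

```python
from urllib.parse import urlencode, parse_qs


def secured_params(user_query, user_id):
    """Pair a normal query with a graph filter rooted at the user's record."""
    acl_filter = (
        '{!graph from="node_id" to="edge_ids" returnOnlyLeaf="true"}id:' + user_id
    )
    return {"q": user_query, "fq": acl_filter}


params = secured_params("quarterly report", "user_1")
query_string = urlencode(params)       # ready to append to a select URL
round_trip = parse_qs(query_string)    # decode again to check the encoding
```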
22. FoaF
Friend of a Friend of a Friend of a Friend…
2 ways to model in the index.
Multi-valued “friendid” field that points to other
person records.
◦ More efficient and faster search.
◦ Filter traversal based on metadata on the person
record.
Single-valued field on a document that
represents the link/edge between two person
records.
◦ More flexible, but slower search.
◦ Can filter edges with metadata about the edge
record.
23. Graph Analytics via Faceting
What do my friends’ friends who live in
Boston like?
Identify a graph/dataset with a graph query
that selects the people records.
Use facets to generate analytics on the result
set based on the values in the person record
“like” field.
Use drill down to understand characteristics
of different demographics/cohorts.
Get counts at various levels using maxDepth
graph queries as facet queries.
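A request along these lines might combine a graph query with standard facet parameters. The sketch below only assembles the parameter list. The field names `id`, `city`, and the root query `id:me` are hypothetical, `friendid` and `like` are taken from the slides, and `foaf` is a helper invented for the example.

```python
from urllib.parse import urlencode, parse_qs


def foaf(max_depth):
    """Graph query over the multi-valued friendid field, excluding the root person."""
    return (
        '{!graph from="id" to="friendid" maxDepth=%d returnRoot="false"}id:me'
        % max_depth
    )


params = [
    ("q", foaf(2)),              # friends and friends of friends
    ("fq", "city:Boston"),       # keep only people who live in Boston
    ("rows", "0"),               # only the facet counts are of interest
    ("facet", "true"),
    ("facet.field", "like"),     # what do those people like?
]
# Counts at several depths, expressed as facet queries.
params += [("facet.query", foaf(depth)) for depth in (1, 2, 3)]

query_string = urlencode(params)
decoded = parse_qs(query_string)
```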
24. What next?
Edge weights & Relevancy
◦ Based on tf/idf or bm25?
◦ Based on numerical field values (min/max/sum/avg
weight application)?
Min distance computation
Better support for D3.js and other Visualization
tools
Driving directions?
Distributed Traversal via Kafka frontier query
broker
SparkRDD Support? GraphX?
minDepth parameter? Only return records that
are at least N hops away?