Key-value and column stores are not the only data models Scylla can power. In this presentation, learn the what, why, and how of building and deploying a graph data system in the cloud, backed by the power of Scylla.
Powering a Graph Data System with Scylla + JanusGraph
1. Powering a Graph Data System with Scylla + JanusGraph
Ryan Stauffer, Founder & CEO
2. Presenter
Ryan Stauffer, Founder & CEO
Ryan founded Enharmonic to change the way we interact with data. He has experience building modern data solutions for fast-moving companies, both as a consultant and as the leader of Data Strategy and Analytics at Private Equity-backed Driven Brands. He received his MBA from Washington University in St. Louis, and has additional experience in Investment Banking and as a U.S. Army Infantry Officer. In his free time, he makes music and tries to set PRs running up Potrero Hill.
5. Graph Data System
We can break down the concept of a “Graph Data System” into 2 pieces:
■ Graph - we’re modelling our data as a property graph
● Vertices model logical entities (Customer, Product, Order)
● Edges model logical relationships between entities (PURCHASED, IN_ORDER)
● Properties model attributes of entities/relationships (name, purchaseDate)
■ Data System - we use several components in a single system to store
and retrieve our data
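To make these terms concrete, here is a toy sketch in plain Python — not the JanusGraph API, just illustrative data structures with hypothetical names:

```python
# Toy property graph: vertices and edges both carry a label plus properties.
vertices = {
    1: {"label": "Customer", "name": "Ada"},
    2: {"label": "Product", "name": "Widget"},
    3: {"label": "Order", "orderId": "O-100"},
}
edges = [
    # (out_vertex, edge_label, in_vertex, edge_properties)
    (1, "PURCHASED", 2, {"purchaseDate": "2019-11-05"}),
    (2, "IN_ORDER", 3, {}),
]

def out_edges(v, label=None):
    """Edges leaving vertex v, optionally filtered by edge label."""
    return [e for e in edges if e[0] == v and (label is None or e[1] == label)]

# Which Products has Customer 1 PURCHASED?
purchased = [vertices[e[2]]["name"] for e in out_edges(1, "PURCHASED")]
```

A graph database generalizes exactly this shape — labeled vertices and edges, each with a property map — and adds indexing, persistence, and a traversal language on top.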
8. 3 Core Benefits
■ Flexibility
■ Schema support
■ OLTP & OLAP support (Distinct from Scylla Workload Prioritization)
9. Flexibility
The “killer feature” of a graph data model is flexibility
■ Changing database schemas to support new business logic and data
sources is tough!
■ The nature of a graph’s data model makes it easier to evolve the data
model over time
■ Iterate on our model to match our understanding as we learn,
without having to start from scratch
■ In practice
● Incorporate fresh data sources without breaking existing workloads
● Write query results directly to the graph as new vertices & edges
● Share production-quality data between teams
10. Schema Support
By supporting a defined schema, our data system can enforce business
logic, and minimize duplicative application code
■ Flexible schema support out-of-the-box
■ We can pre-define the properties and datatypes that are possible for
a given vertex or edge, without requiring that each vertex/edge
contain every property
■ We can pre-define which edge types are allowed to connect a pair of
vertices, without requiring every pair of vertices to have this edge
■ Simplifies testing on new use cases
■ Separates data integrity maintenance from business logic
11. OLTP + OLAP
■ Transactional (graph-local) workloads
● Begin with a small number of vertices (found with the help of an index)
● Traverse across a reasonably small number of edges and vertices
● Goal is to minimize latency
● With Scylla, we can achieve scalable, single-digit millisecond response
■ Analytical (graph-global) workloads
● Travel to all (or a substantial portion) of the vertices and edges
● Includes many classic graph algorithms
● Goal is to maximize throughput (might leverage Spark)
■ The same traversal language (Gremlin) can be used to write both
types of workloads
■ At the graph level -> distinct from Scylla workload prioritization
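The graph-local vs. graph-global distinction can be sketched in plain Python (a toy illustration, not Gremlin):

```python
from collections import deque

# Tiny adjacency-list graph: vertex -> outgoing neighbors.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": ["a"],
}

def local_traversal(start, max_hops):
    """Graph-local (OLTP-style): begin at one vertex (found via an index)
    and visit only vertices within max_hops edges - work is bounded by
    the neighborhood size, not the graph size."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        v, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for n in graph[v]:
            if n not in seen:
                seen.add(n)
                frontier.append((n, hops + 1))
    return seen

def global_scan():
    """Graph-global (OLAP-style): touch every vertex, e.g. an
    out-degree count - work scales with the whole graph."""
    return {v: len(nbrs) for v, nbrs in graph.items()}
```

`local_traversal("a", 1)` touches only `a` and its direct neighbors, while `global_scan()` visits every vertex — which is why the former is latency-bound and the latter throughput-bound (and a good fit for Spark).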
14. Kubernetes
■ Open-source system for managing containerized applications
■ Groups application containers into logical units
■ Builds abstractions on top of the basic resources
● Compute
● Memory
● Disk
● Network
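In manifests, those abstractions surface as declarative fields. A hypothetical container-spec fragment requesting compute and memory (disk and network are handled separately, by volumes and Services) might look like:

```yaml
# Hypothetical resource requests/limits for one container.
resources:
  requests:
    cpu: "500m"    # half a CPU core
    memory: 1Gi
  limits:
    cpu: "1"
    memory: 2Gi
```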
15. Deployment Overview
(Diagram: Client → Load Balancer → JanusGraph Deployment → stateful backends, managed via Stateful Set, Headless Service, and Storage Class)
■ The “stateful” components of our system are Scylla & Elasticsearch
■ JanusGraph is deployed as a stateless server that stores and
retrieves data to and from the stateful systems
16. Scylla
■ Use your existing deployment == Zero lift!
■ New keyspace for JanusGraph data
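JanusGraph can create its keyspace automatically on first connect, but you can also pre-create it. A sketch for a single-datacenter dev setup (adjust the replication settings for production):

```cql
-- Optional: pre-create the keyspace JanusGraph will write to.
CREATE KEYSPACE IF NOT EXISTS graphdev
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```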
18. Elasticsearch - Manifest Summary
Storage Class:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: elasticsearch-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

Headless Service:
kind: Service
metadata:
  name: es
  labels: { app: es }
spec:
  clusterIP: None
  ports:
  - port: 9200
  - port: 9300
  selector:
    app: es

Stateful Set:
kind: StatefulSet
metadata: ...
spec:
  serviceName: es
  replicas: 3
  selector: { matchLabels: { app: es }}
  template:
    metadata: { labels: { app: es }}
    spec:
      containers:
      - name: elasticsearch
        image: .../elasticsearch-oss:6.6.0
        env:
        - name: discovery.zen.ping.unicast.hosts
          value: "es-0.es.default.svc.cluster.local,..."
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata: { name: data }
    spec:
      accessModes: [ ReadWriteOnce ]
      storageClassName: elasticsearch-ssd
19. Elasticsearch - Deploy
$ kubectl apply -f elasticsearch.yaml
storageclass.storage.k8s.io/elasticsearch-ssd created
service/es created
statefulset.apps/elasticsearch created
$ kubectl get all -l app=elasticsearch
NAME READY AGE
statefulset.apps/elasticsearch 3/3 2m10s
NAME READY STATUS RESTARTS AGE
pod/elasticsearch-0 1/1 Running 0 2m9s
pod/elasticsearch-1 1/1 Running 0 87s
pod/elasticsearch-2 1/1 Running 0 44s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/es ClusterIP None <none> 9200/TCP,9300/TCP 2m9s
21. JanusGraph Image
■ There are already official JanusGraph images on Docker Hub
$ docker pull janusgraph/janusgraph:0.4.0
■ You can also build your own using the JanusGraph project build
scripts and push it to a private image repository (ex: GCP)
$ git clone https://github.com/JanusGraph/janusgraph-docker.git
$ cd janusgraph-docker
$ sudo ./build-images.sh 0.4
# Push the image to your private project repository
$ docker tag janusgraph/janusgraph:0.4.0 gcr.io/$PROJECT/janusgraph:0.4.0
$ gcloud auth configure-docker
$ docker push gcr.io/$PROJECT/janusgraph:0.4.0
23. JanusGraph Console - Manifest Summary
■ Run JanusGraph in a Pod, and connect to it directly
● Graph is only accessible through this console connection, but actions are persisted
in Scylla and Elasticsearch
kind: Pod
spec:
  containers:
  - name: janusgraph
    image: .../janusgraph:0.4.0
    env:
    - name: JANUS_PROPS_TEMPLATE
      value: cql-es
    - name: janusgraph.storage.hostname
      value: 10.138.0.3
    - name: janusgraph.storage.cql.keyspace
      value: graphdev
    - name: janusgraph.index.search.hostname
      value: "es-0.es.default.svc.cluster.local,..."
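Once exec'd into the pod, the Gremlin Console opens the graph via the factory and can define schema — a sketch (the properties path shown is the one used by the official 0.4 image; the image prepopulates it from the env vars above):

```groovy
// Inside the pod's Gremlin Console
graph = JanusGraphFactory.open('/etc/opt/janusgraph/janusgraph.properties')
mgmt = graph.openManagement()
mgmt.makeVertexLabel('Product').make()
mgmt.makePropertyKey('name').dataType(String.class).make()
mgmt.makePropertyKey('productId').dataType(String.class).make()
mgmt.commit()
```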
26. JanusGraph Server - Manifest Summary
■ Deploy JanusGraph as a standalone server
Deployment:
kind: Deployment
labels:
  app: janusgraph
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: janusgraph
        image: .../janusgraph:0.4.0
        env:
        - name: JANUS_PROPS_TEMPLATE
          value: cql-es
        - name: janusgraph.storage.hostname
          value: 10.138.0.3
        - name: janusgraph.storage.cql.keyspace
          value: graphdev
        - name: janusgraph.index.search.hostname
          value: "es-0.es.default.svc.cluster.local,..."

Service:
kind: Service
metadata:
  name: janusgraph-service-lb
spec:
  type: LoadBalancer
  selector:
    app: janusgraph
  ports:
  - name: gremlin-server-websocket
    protocol: TCP
    port: 8182
    targetPort: 8182
● Uses TinkerPop Gremlin Server
● Graph will be accessible to a wide range of client languages (Python, Java, JS, etc.)
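Once the load balancer has an external IP, a Gremlin Console can attach over the websocket port — a sketch, assuming a client-side remote.yaml whose hosts entry points at the load balancer IP and port 8182:

```groovy
// From any Gremlin Console on the client side
:remote connect tinkerpop.server conf/remote.yaml
:remote console
g.V().hasLabel('Product').count()
```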
27. JanusGraph Server - Deploy
$ kubectl apply -f janusgraph.yaml
service/janusgraph-service-lb created
deployment.apps/janusgraph-server created
$ kubectl get all -l app=janusgraph
NAME READY STATUS RESTARTS AGE
pod/janusgraph-server-5d77dd9ddf-nc87p 1/1 Running 0 1m2s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/janusgraph-service-lb LoadBalancer 10.0.12.109 35.121.171.101 8182/TCP 1m3s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/janusgraph-server 1/1 1 1 1m3s
NAME DESIRED CURRENT READY AGE
replicaset.apps/janusgraph-server-5d77dd9ddf 1 1 1 1m2s
28. A Better Way - Helm Charts
■ Nobody has time to manage all of these individual manifest files!
■ Use Helm (https://helm.sh) - the “package manager” for k8s
■ Makes it easy to define, deploy & upgrade Kubernetes applications
■ You can find our opinionated take on deploying JanusGraph with
Helm at https://github.com/EnharmonicAI/janusgraph-helm
31. Thank you Stay in touch
Any questions?
Ryan Stauffer
ryan@enharmonic.ai
@RyantheStauffer
Editor's notes
Let's give another round of applause to Brian. Everything he said applies here – now we'll just dig into the technical pieces a bit more.
I'm Ryan Stauffer, I'm the founder and CEO of a Bay Area startup called Enharmonic. I first got excited about graph databases several years back when I was leading data analytics and strategy for a large automotive aftermarket company. We were trying to build a unified model of data for the automotive aftermarket that combined data from across our different verticals. Using the source data in its existing form – hundreds of tables, and hundreds of millions of rows & columns - was leading us down a really bad path. It became clear that insights would be much easier if we used a graph data model, where we can explicitly model our data as real-world business concepts. Ever since then, I’ve viewed graph data systems as a core part of the solution for how to ask and answer better questions about our businesses.
For a little backdrop about what we'll be talking about – what do we do at Enharmonic? Well, we're working to solve the problem of how companies interact with their data. We provide a clean, visual interface that lets business decision-makers directly access their data with free-text search and point-click-and-drag actions. Data is modeled and retrieved as logical business concepts like Customers, Products, and Orders. Our system recommends analyses that make sense based on the data, and then goes ahead and executes those with just a few clicks. To make this possible, we use lots of automation on the backend – and sitting behind everything, we use a graph data system.
Brian discussed graphs in the last session, so I'm not going to rehash everything, but I do want to do a brief level-set. So what do I mean when I say "Graph Data System"?
We can break that into 2 parts: "Graph" & "Data System"
By "graph" we mean that we're modelling our data as a property graph, using Vertices, Edges & Properties.
Vertices model entities like Customers or Products
Edges model relationships between entities, like how one Customer KNOWS another Customer, or a Customer HAS PURCHASED a Product.
Properties model attributes of entities and relationships, like the name and age of a Customer.
By "Data System" we mean that several distinct components combine to form a single, logical system.
There are several options for graph databases out there on the market, but when we need a combination of scalability, flexibility, and performance, we can look to a system built of JanusGraph, Scylla, and Elasticsearch.
This single logical data system is structured into 3 parts:
- In the center we have JanusGraph, a Java application that clients communicate with directly.
- It serves as the abstraction layer that lets us interact with our data as a graph.
- JG will write to and read from Scylla, where our data is ultimately persisted.
- We can optionally add Elasticsearch to help us with advanced indexing and text search capabilities
So that sounds interesting, but why do we want to do this at all?
I think there are 3 core benefits of this graph data system.
- Flexibility
- Schema support
- Support for both transactional & analytical workloads
The killer feature of using a graph is its flexibility
- Business logic changes, application requirements change, and it can often be a real problem trying to support that with traditional databases
- Using a graph means our data model isn't set in stone.
- We can iterate and evolve the data model by adding additional vertices and edges to meet our new needs, without throwing out everything that already works.
- We can also write analytics results directly back to the graph, explicitly connecting to our primary data.
- This simplifies the ways that teams can collaborate and share insights, while allowing for powerful data provenance capabilities.
Schema support is a real "nice-to-have" when it comes to separating business logic from lower-level database integrity issues.
JanusGraph, unlike some other graph databases out there, supports defining a schema for data, but doesn't require that we do this.
Basically, we can apply useful constraints to what is allowed and disallowed on our graph.
For example, we can ensure that name and age properties are only allowed to be written to a Customer vertex, but we don't require that every Customer vertex have all of these properties (this minimizes the need for pointless null field values!)
We can also specify that a Product and Customer vertex are allowed to be related with a HAS_PURCHASED edge, but we don't require that each Product vertex have that edge.
This sort of clear schema flexibility is difficult to replicate outside of a graph environment.
This separates data integrity maintenance from our business logic – letting our DB take over DB tasks, without offloading them onto the application layer.
- Finally, with this graph data system, we can execute both transactional and analytical workloads with the same data system and same query language – Gremlin.
- We access data by “traversing” our graph, travelling from vertex to vertex by means of connecting edges.
- We can think of a transactional workload as one where we travel to a small number of vertices and edges, and where our goal is to minimize latency.
- An analytical workload, on the other hand, is one where we travel to all, or a substantial portion, of our vertices and edges. Our goal here is to maximize throughput.
- Backed by the high-IO performance of Scylla, we can achieve scalable, single-digit millisecond response for transactional workloads. We can also leverage Spark to handle large scale analytical workloads
It's easy to talk about all of this in theory, but how do we go about actually deploying it?
1st of all, WHERE are we going to deploy this?
In a production environment, it makes sense to deploy Scylla on either VMs or bare metal.
For JanusGraph & ES, there are many advantages to deploying on Kubernetes
Q – Quick show of hands, who is using Kubernetes today?
Q – Who has tried deploying Scylla on top of Kubernetes?
(Yannis Zarkadas gave a great talk earlier today on using the Scylla Operator to manage Scylla on K8s – if you missed it I highly recommend checking out the talk online.)
Kubernetes is an open source system for managing containerized applications
Allows you to group and manage application containers as logical units
Fundamentally, it's about building and interacting with abstractions on top of basic resources
(Compute, memory, disk, network)
I'm not going to touch every last detail of the k8s manifests, but I do want to dive into the fundamentals of the k8s resources you'll be using.
Now even if setting up our pieces on k8s seems pedantic, remember that this greatly simplifies the process of installing and managing a complex application. As many of you probably know, it's significantly easier to do it this way versus installing and upgrading each app and its dependencies manually at the VM level.
Walkthrough the details of deploying the whole system.
Big picture, we have 2 types of components – stateful and stateless
Stateful components are Scylla and Elasticsearch, where we'll actually persist our data. Everything else is stateless and ephemeral. Our actual JanusGraph app pods, for instance, are stateless, and if one dies, we simply spin up a new one in its place.
So what does this look like?
A client (maybe an app, maybe our little Scylla monster up here) issues queries to JanusGraph. Those queries hit a load balancer and are passed to 1 or more pods managed as part of a JanusGraph deployment.
JanusGraph app is what presents the "graph" view of data, and it does it by intermediating between the client and stateful apps.
Most data is put in Scylla, over here on the left.
For more advanced indexing, we use Elasticsearch, which we deploy as a Stateful Set and Headless Service.
Diving into more detail, we start with Scylla.
We can actually use your existing Scylla cluster, meaning there's 0 lift!
The one thing we'll do is create a new keyspace to hold graph data.
To give us more advanced indexing capabilities, we'll deploy Elasticsearch as well.
We deploy it on Kubernetes in 3 parts.
- Headless Service
- Stateful Set
- Storage Class
ES is stateful, so it needs to persist data, which we'll accomplish by means of a stateful set.
Now, a stateful set is just used to manage 1 or more replica pods, which are the nodes in our ES cluster. But it does this in a unique way. It assigns numbers to each pod and the disks that are mounted to it. This way, we consistently mount the same disk to the same pod #.
This gives us a reliably stateful system, where even if individual pods fail, they're safely recreated automatically by Kubernetes.
We define a storage class – what type of disks do we want to mount to our Elasticsearch nodes? In this case, we'll choose SSDs.
We'll define a headless service. We set clusterIP to None, specify our standard ES ports, and provide a selector to target our stateful set pods.
The last step is to define our stateful set. This references the Storage Class and Headless Service we just defined, so I color-coded the important bits.
For storage, shown in blue, our goal is to define a disk from our elasticsearch-ssd storage class for each ES node and mount it to that node. To do this, we'll define a Volume Claim Template, and define a volume mount that mounts the disk at our ES data path.
For networking, shown in red, we specify the Headless Service name. We'll also define 1 environment variable, that allows for ES node discovery.
Q – I THINK THERE'S A TYPO HERE ON THE SELECTOR FOR THE HEADLESS SERVICE.
Assuming we put all of this into a single manifest file, we can deploy Elasticsearch to our Kubernetes cluster with a single "apply" command
After a little bit of initialization, we can see the Ready status of our stateful set, the 3 pods it controls, and the service that routes network traffic to these pods.
Now, for the last and most important piece of the puzzle – JanusGraph.
We'll deploy this on Kubernetes as well.
There are already official JanusGraph images available on Docker Hub, and for these examples we'll be using version 0.4.0
You could also build your own using the JanusGraph project build scripts, and push that image to a private image repository (for example, Google Cloud Platform)
Now how do we use JanusGraph?
Let's start with a minimal example. Not for production use - but illustrates how this all works.
We'll deploy a single pod to get console access to our system.
We'll run JanusGraph in a single pod, and connect to it directly.
That means that the graph is only accessible through the console connection, but all of our actions are still persisted in Scylla and Elasticsearch.
Now, the standard JanusGraph Docker image includes some great templating and presets, which allow us to configure our connection to our storage and indexing backends with just a few environment variables.
We're using Scylla + Elasticsearch, so we set cql-es as our JanusGraph properties template.
We set the hostname as 1 or more of the Scylla cluster hostnames
We set the keyspace as a new, clean Scylla keyspace where we'll store all of our graph data.
Finally, supply the K8s cluster hostnames for our Elasticsearch nodes.
With that manifest file, we can create a pod, then connect to it with an interactive terminal.
This will bring up a Gremlin Console.
The JG Docker image will prepopulate a standard janusgraph.properties file that reflects the env var configuration we just set up.
We use a factory to create a graph instance, and then we can do whatever we'd like to!
For example, we can start by defining a schema for a Product vertex with name and productId properties.
If we want to actually move to a real environment, we need to support multiple users and applications, probably written in different languages.
To handle this we deploy JanusGraph server.
On Kubernetes, we'll do this as a Deployment, which manages 1 or more stateless replica pods.
We put a load balancer in front of it, exposed on an external or internal IP depending on the use case.
When we deploy JanusGraph as a standalone server, we're actually using the Apache TinkerPop Gremlin Server underneath the hood, which will accept Gremlin language queries issued from applications written in multiple languages (Python, Java, JS, etc.)
The Service is pretty simple just a LoadBalancer that will route network requests to our pods. We're using port 8182 because that's the standard gremlin websocket port.
We manage those pods as a single deployment. We specify the number of replicas, the image, and setup the environment variables just like we did before.
We apply our manifest, and check that everything is running. The key parts are the Load Balancer and Deployment.
Once our LB has its IP assigned, we're able to connect to our JG pods with a client application. Now we can issue queries, store data – do whatever we want!
Now, some of that description of K8s manifest got pretty pedantic. There's got to be a better way, right?
There is – Helm Charts!
Q – With a show of hands, who uses Helm Charts?
Awesome.
We can think of Helm as a package manager for k8s. It lets us template out and group related manifest files into logical packages called Charts.
This makes it easy to define, deploy and upgrade Kubernetes applications with single commands.
We just released our own opinionated take on how to deploy JanusGraph as a Helm Chart on GitHub. If you like saving time and energy, please check it out and use it.
Kubernetes gives us tremendous power, and makes it easy to deploy JanusGraph on top of Scylla.
With our deployment up and running, we have a flexible, scalable graph data system that we can use as the bedrock for an exciting new generation of applications.
Thank you for your time.
If you'd like to stay in touch, you can follow me on Twitter or connect with me on LinkedIn. You can also contact me directly via email.
I think we have a few more minutes, so what questions do you have?