The document discusses how graph databases and GPU acceleration can help analyze large datasets with billions of edges and nodes. It provides examples of precision medicine and cyber defense applications that involve very large graph problems. The document also summarizes Blazegraph's GPU acceleration technology, which can provide 200-300x faster performance for graph analytics compared to Apache Spark. Blazegraph uses its DASL language to simplify programming graph algorithms for multi-GPU execution.
2. http://blazegraph.com/
Exploding Data Volumes Requires New Approach
for Relationship Insight
SYSTAPâ˘, LLC. Š 2006-2015 All Rights Reserved
Graph databases are designed to analyze diverse entities and relationships.
Todayâs datasets have billions of edges and nodes.
A (very small) Knowledge Graph
2
3. http://blazegraph.com/
Powers Business
with Billion Edge
Datasets
ââŻInformation
Management
ââŻRetrieval
ââŻIndustrial
ââŻIoT
ââŻCyber
ââŻDefense
ââŻIntelligence
ââŻFinancial
Services
ââŻFraud
Detection
ââŻLife Sciences
ââŻPrecision
Medicine
SYSTAPâ˘, LLC. Š 2006-2016 All Rights Reserved 3
Blazegraph Vision: The Scalable Solution for Graphs
500+ weekly downloads
Thousands of active deployments
Blazegraph
ââŻEnterprise
High Availability
ââŻEnterprise
GPU Acceleration
ââŻEmbedded and
Single Server
Deployments
4. http://blazegraph.com/
Precision Medicine and Hospitals
â˘âŻ 5,686 Hospitals in the United States
â˘âŻ Top 10% have 100B+ Edge Graph Problems
â˘âŻ 256 K-80 GPUs for a 100B Edge Graph Integration Application
â˘âŻ Syapse Precision Medicine Video
4
3/10/15
5-10 B Edge Graph Problems
5. http://blazegraph.com/
Uncovering inďŹuence links in molecular knowledge networks to streamline personalized medicine | Shin, Dmitriy et al.Journal of Biomedical
Informatics , Volume 52 , 394 - 405"
Finding the Next Cure for Cancer is a
Billion+ Edge Graph Challenge
9. http://blazegraph.com/
Blazegraphâ˘: Embedded and Single Server
â˘âŻ High performance, Scalable
â⯠50B edges/node
â⯠RDF/SPARQL level query language
â⯠Efficient Graph Traversal
â⯠High 9s solution
â˘âŻ Property graphs
â⯠Blueprints, gremlin, rextser
â˘âŻ REST API (NSS)
â˘âŻ Extension points
â⯠Stored queries for custom application logic on
the server.
â⯠Custom services & indices
â⯠Custom functions
â⯠Vertex-centric programs
Embedded Server
Standalone Server
9
JVM
Journal
WAR
Journal
10. http://blazegraph.com/
Blazegraphâ˘: High Availability
â˘âŻ Shared nothing architecture
â⯠Same data on each node
â⯠Coordinate only at commit
â⯠Transparent load balancing
â˘âŻ Scaling
â⯠50 billion triples or quads
â⯠Query throughput scales linearly
â˘âŻ Self healing
â⯠Automatic failover
â⯠Automatic resync after disconnect
â⯠Online single node disaster recovery
â˘âŻ Online Backup
â⯠Online snapshots (full backups)
â⯠HA Logs (incremental backups)
â˘âŻ Point in time recovery (offline)
10
HAService
Quorum
k=3
size=3
follower
leader
HAService
HAService
11. http://blazegraph.com/
The Billion-Edge Graph Challenge:
Scaling Up Requires the Right Paradigm and Hardware
SYSTAP, LLC DBA Blazegraph. Š 2006-2016 All Rights Reserved 11
https://datatake.files.wordpress.com/2015/09/latency.png
TypeofCacheorMemory
Access Latency Per Clock Cycle
4.3
3
4
11
11
11
14
18
38
167
0 50 100 150
L1 Cache sequential access
L1 Cache in Page Random access
L1 Cache in Full Random access
L2 Cache sequential access
L2 Cache in Page Random access
L2 Cache in Full Random access
L3 Cache sequential access
L3 Cache in Page Random access
L3 Cache in Full Random access
Main memory
CPU Cache Access Latencies in Clock Cycles
Graph Cache Thrash
The CPU just waits for graph
data from main memory...
12. http://blazegraph.com/
Blazegraph Multi-GPU: Extreme Scale
Traverse 1B Edges/Sec (GTEP) 40x more affordably!
SYSTAP, LLC DBA Blazegraph. Š 2006-2016 All Rights Reserved
Cray XMT-2
$~180K per GTEP
COST
COST
Large Hadoop Cluster
$~18M per GTEP 1 GTEP = 1 Billion Traversed Edges Per Second
Blazegraph with
GPU Clusters
$16K per GTEP (K40)
$4K per GTEP (Pascal)
13. http://blazegraph.com/
With familiar graph APIs; 200-300x acceleration with almost no code changes.
Enabling GPU Acceleration without Code Changes
SYSTAP, LLC DBA Blazegraph. Š 2006-2016 All Rights Reserved 13
Blazegraph GPU
Acceleration
NVIDIA â¨
Tesla GPU
Graph DB
Blazegraph Plug-in for GPU Acceleration Solves âGraph Cache Thrashâ
Funded by and
co-developed with:
14. http://blazegraph.com/
172,632 results
187ms on Blazegraph GPU
172,632 results
53,960ms on Blazegraph CPU
LUBM Query #9 U1000 (167,697,263 Edges w/Inference)
Just Add GPUs âTesting Shows 200-300x Speed-up
SYSTAP, LLC DBA Blazegraph. Š 2006-2016 All Rights Reserved 14
Graph
Database
Graph
Database
290X
15. http://blazegraph.com/
1403
1843
700
0
200
400
600
800
1000
1200
1400
1600
1800
2000
700M Edges Single Node
Xeon 2650 (128G) vs 2 K80
(48G)
1.98 Edges 16 EC2
r3.xlarge (488G) vs 16 K40s
(192G)
1.98 Edges 16 EC2
r3.4xlarge (1952G) vs 16
K40s (192G)
1.98 Edges Spark CPU
Baseline
GPUs 700x-1800x Faster for Graphs
Compared to Apache Spark on 700M and 1.9B Edge Graph
SYSTAPâ˘, LLC. Š 2006-2015 All Rights Reserved
Speed-up(Higherisfaster)
1
15
Speed-up over Baseline Spark CPU Configuration
16. http://blazegraph.com/
Did you hear the one about the Analytics Developer who is
also a Parallel Programming expert?
SYSTAPâ˘, LLC
Š 2006-2015 All Rights Reserved
16
Need: Simpler, smarter algorithms that run eďŹciently on GPUs without requiring knowledge of the parallel
programming or device-speciďŹc opTmizaTon.
Algorithm Developer Parallel Programmer
17. http://blazegraph.com/
Blazegraph DASL for GPU Acceleration
SYSTAPâ˘, LLC. Š 2006-2015 All Rights Reserved 17
Blazegraph DASL â A Domain specific language for graphs with Accelerated Scala using Linear algebra
âŚyou can see why we call it âDASLâ
Automation Enabling Analytics Coders to Optimize for GPU
Delivering ease of use of Spark and Scala for graph algorithms and predictive
analytics⌠your analytic code can work on GPUs and Blazegraph.
DASL Executor Multi-GPU Extension
DASL Translator
DASL Graph Algorithms
âŚwith the speed of CUDAâŚ
âŚGraph and Machine
Learning AlgorithmsâŚ
18. http://blazegraph.com/
DASL your analytics
â˘âŻ DASL is functional, domain-specific language (DSL) for graph and machine
learning algorithms
â˘âŻ Provides of linear algebra primitives supporting the DSL designed for efficient,
multi-GPU execution
â˘âŻ Reduce the barrier of graphs on GPUS, enabling simpler and smarter
algorithms
â˘âŻ Co-opt existing data ecosystems, such as Apache Spark and Hadoop, for ease
of adoption and mission impact
SYSTAPâ˘, LLC
Š 2006-2015 All Rights Reserved
18
3/10/1
5
Graph and Machine Learning
Algorithms
1000X
DASL Executor Multi-GPU Extension
DASL Translator
DASL Graph Algorithms
22. http://blazegraph.com/
Case Study - NetFlow Data Analysis
â˘âŻ What is NetFlow data?
â˘âŻ NetFlow is a network protocol developed by Cisco for
collecting IP traffic information and monitoring network
traffic. http://www.solarwinds.com/what-is-netflow.aspx
â˘âŻ Sample NetFlow record
{"start_time":"2007-08-01 14:31:02.946â,âduration":0.000,
âsrc_addr":"122.166.71.110", âdst_addr":"9.155.118.136",
"src_port":10822,"dst_port":13567,"protocol":"UDP
â,"tcp_flags":"......","input_packets":1,"input_bytes":46}
â˘âŻ How is NetFlow collected?
â˘âŻ Generated by routers and forwarded to collection point
â˘âŻ Processed out of pcap records using tools such as nfcapd/
nfdump
22
23. http://blazegraph.com/
Graph Algorithms Applicable to NetFlow
â˘âŻ Path algorithms (BFS, SSSP, APSP,
CC) and their extensions (closeness,
betweenness, etc)
â˘âŻ Ranking algorithms (cardinality,
PageRank, BadRank, SecureRank)
â˘âŻ Clustering algorithms (canopy, k-
means, Jaccard, community detection,
etc)
â˘âŻ Many others!
23
24. http://blazegraph.com/
Challenges of Network Flow Analysis
24
â˘âŻ The volume of NetFlow packets could not previously be efficiently analyzed or
visualized to identify threats in real time
â⯠One hour of collection on a modest network can easily generate more than 100M NetFlow
records
â⯠Visualizing this data to identify points of interest is nearly impossible due to the âhairball
effectâ, even after time windowing the data
â⯠These issues largely led to attempts at batch processing the data on large clusters of
machines using Hadoop and/or Spark
â⯠Pushing the problem into the batch space puts organizations in the unenviable position of
hoping to identify why and by whom their network was attacked last month rather than being
able to identify an attack as it is occurring.
28. http://blazegraph.com/
Other new capabilities for high-speed, large-scale
graph processing
New technologies to deliver to deliver an HPC capability for
processing very large scale graphs at high speed.
â˘âŻ Blazegraph DASL provides a high-level way to author graph
analytics and integrate with Apache Spark and Hadoop Data
ecosystems.
â˘âŻ Power8 / OpenPOWER bring high speed interconnects for
moving data quickly (3X faster)
â˘âŻ New NVIDIA P100 GPUs provide stacked memory, 720 GB/s
of memory bandwidth, and unified memory addressing.
SYSTAPâ˘, LLC. Š 2006-2015 All Rights Reserved 28
30. http://blazegraph.com/
Users
Original
data
sources
FrontendBackend
Knowledge Graph Creation
Museum visitor
Museums and other sources
â˘âŻ Data crawling
â˘âŻ Data transformation (CIDOC-CRM)
â˘âŻ Data Interlinking
â˘âŻ Data enrichment / Information
extraction
Cards
Social networks
DBpedia BriTsh Museum Data User Data
Mobile App
â˘âŻ HTML5 Templates + CSS for mobile
devices
â˘âŻ Google Glass App
â˘âŻ QR Code recognition
â˘âŻ Pattern / image recognition
â˘âŻ Context reasoning
Knowledge Graph Exploration
â˘âŻ Templates for visualization (CIDOC-CRM
and external data)
â˘âŻ Timeline, Maps
â˘âŻ PivotViewer for visual exploration
â˘âŻ Semantic Search
â˘âŻ Graph Analytics (GAS)
Website visitorData Expert
Research Space / British Museum Project Architecture
and Use Cases
31. http://blazegraph.com/
System will either have
their own terminology to
qualify place
Or
More commonly it will be
implicit and not explicit.
Relationships both
support and bypass this
heterogeneity at the same
time
33. http://blazegraph.com/
FROM means
â˘âŻ Thing has been
located at a place
or
â˘âŻ Thing was created at a
place
or
â˘âŻ Thing was created by
a person from a place
or
â˘âŻ Thing was part of an
event that happened at
a place
or
â˘âŻ Thing was acquired at
a place
Research Space / British Museum Video
35. http://blazegraph.com/
This subset of FROM is just ârefers toâ India â These
objects may not be in India, or from India, but make
some reference to India. India is a subject.