Graph Data: a New Data Management Frontier -- Huawei’s view and Call for Collaboration by Demai Ni:
Huawei provides enterprise databases and is actively exploring the latest technology to deliver an end-to-end data management solution on the cloud. We are looking to bridge classic RDBMS and graph databases on a distributed platform.
3. An Industry-Leading Integrated Big Data Analysis Platform
[Architecture diagram: integrated data analysis platform]
Application scenarios:
• Finance cloud (real-time credit and risk control)
• Safe City cloud (relationship analysis, track analysis, and set collision)
• Terminal cloud (precision marketing, customer portrait, and group categorization)
• On-premise/on-cloud data warehouse and data mart
Platform layers:
• Interface layer: unified data analysis interface (SQL + API)
• Engine layer: OLAP analysis engine, graph database engine, relationship analysis engine, GIS engine, IoT engine, real-time analysis engine
• Storage layer: unified data storage (DFV)
• Data layer: unstructured, semi-structured, and structured data (GIS, JSON, XML, TXT, CSV, log, numeral, date/time, character), covering characteristic data, relationship data, track data, behavior data, and time series data
Cross-layer services:
• Unified cloud management: deploy, monitor, O&M, cluster, commission, track
• Unified metadata
• Data governance and integration: data integration, data quality
Demai.Ni@huawei.com
4. Objective: PB-level Enterprise Data Warehouse Solution
MPPDB
• Full SQL support, application transparency
• Open platform with top performance
• PB-level data management with scalability
ELK
• Full SQL support (99 TPC-DS queries)
• High capability
• Transactional IUD (insert, update, delete) on HDFS
• Compatible with all Hadoop platforms
FusionInsight MPPDB: a massively parallel processing database by Huawei, with the capability of a PB-level enterprise data warehouse for Big Data solutions.
FusionInsight ELK: MPPDB on Hadoop, which provides a unified SQL solution.
[Diagram: FusionInsight big data platform]
• Unified entrance and unified source management
• Components: LFS, stream processing, machine learning, data mining, MPPDB (enterprise data warehouse), ELK (interactive analysis)
• Storage: CarbonData, HDFS
• Access: No-SQL connection (SQL-like/API), standard SQL
5. FusionInsight MPPDB: PB-Level High-performance, Cloud and on Premise
[Architecture diagram, top to bottom]
• Application layer: telecommunications, finance, and government & public security, with workloads such as centralized operation analysis, xDR query, EDW, integrated data warehouse, and public security information search
• Interface layer: standard ANSI SQL, JDBC, and ODBC interfaces
• MPP cluster: data nodes connected by an SCTP large-scale cluster communication network
• Hardware + OS: Linux 64-bit, universal x86 architecture (SUSE Linux and Red Hat, or cloud OS/storage)
Key features:
• Comprehensive SQL capabilities and smooth application migration: TPC-H/TPC-DS SQL statements run directly without modification, and transactions and stored procedures are supported.
• Best-performing open platform in the industry: Based on x86 servers and an open Linux platform, Huawei MPPDB supports column storage, vectorization, all-parallel execution, and a self-learning optimizer, achieving high performance in interactive SQL queries and responding to TB-level data correlation analysis requests within seconds.
• Auto-scaling supporting PB-level data processing: Based on the MPP architecture and unique SCTP-based large-scale cluster communication technology, Huawei MPPDB provides a solution supporting 256 physical nodes and 10000+ cores, allowing auto-scaling from TB to PB.
Comprehensive tool chain: data migration, SQL development, cluster management
[Diagram: FusionInsight MPPDB cluster with CN nodes in front of a grid of data nodes (DN), running on many cores]
7. Graph Database vs. Relational Database
Relational databases do a wonderful job managing data, except for RELATIONSHIPS.
Air-routes problem: find routes from San Francisco (SFO) to Shenzhen (SZX) with two stops.
Relational database: JOIN, O(M × N), fixed number of operations
    select a1.code, r1.dest, r2.dest, r3.dest
    from airports a1
    join routes r1 on a1.code = r1.src
    join routes r2 on r1.dest = r2.src
    join routes r3 on r2.dest = r3.src
    where a1.code = 'SFO'
      and r3.dest = 'SZX';
Graph database: hop (walk), O(1), flexible iteration
    g.V().has('code', 'SFO')
         .out()
         .out()
         .out()
         .has('code', 'SZX')
         .path().by('code')
Example courtesy of Kelvin Lawrence (https://github.com/krlawrence/graph)
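The contrast between the SQL join and the Gremlin walk above can be tried out with a minimal Python sketch. It is only an illustration under stated assumptions: the route list is invented toy data (not the real air-routes dataset), the standard `sqlite3` module stands in for an RDBMS, and a plain adjacency-list walk stands in for the Gremlin traversal. The helper names `routes_via_sql` and `routes_via_walk` are hypothetical.

```python
import sqlite3

# Hypothetical toy data for illustration only; not the real air-routes dataset.
ROUTES = [
    ("SFO", "LAX"), ("SFO", "SEA"), ("LAX", "HKG"), ("SEA", "PVG"),
    ("HKG", "SZX"), ("PVG", "SZX"), ("SFO", "HKG"),
]


def routes_via_sql(src, dst):
    """Relational approach: one self-join of `routes` per hop (three hops)."""
    con = sqlite3.connect(":memory:")
    con.execute("create table airports (code text primary key)")
    con.execute("create table routes (src text, dest text)")
    codes = sorted({code for edge in ROUTES for code in edge})
    con.executemany("insert into airports values (?)", [(c,) for c in codes])
    con.executemany("insert into routes values (?, ?)", ROUTES)
    rows = con.execute(
        """
        select a1.code, r1.dest, r2.dest, r3.dest
        from airports a1
        join routes r1 on a1.code = r1.src
        join routes r2 on r1.dest = r2.src
        join routes r3 on r2.dest = r3.src
        where a1.code = ? and r3.dest = ?
        """,
        (src, dst),
    ).fetchall()
    con.close()
    return sorted(rows)


def routes_via_walk(src, dst, hops=3):
    """Graph approach: follow outgoing edges hop by hop from an adjacency list."""
    adjacency = {}
    for s, d in ROUTES:
        adjacency.setdefault(s, []).append(d)
    paths = [(src,)]
    for _ in range(hops):
        # Extend every partial path by each outgoing edge of its last airport.
        paths = [p + (nxt,) for p in paths for nxt in adjacency.get(p[-1], [])]
    return sorted(p for p in paths if p[-1] == dst)


sql_result = routes_via_sql("SFO", "SZX")
walk_result = routes_via_walk("SFO", "SZX")
print(sql_result)
print(walk_result)
```

In the SQL version the number of hops is fixed by the number of joins written into the query, while in the walk it is just the `hops` argument, which mirrors the "fixed number of operations" versus "flexible iteration" point on the slide.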
8. Key Limitations with Existing RDBMS for Graph
• Too many expensive joins amongst tables
• Too many self-references instead of walking the graph
• Unstructured and semi-structured data do not fit the two-dimensional (row and column) model of an RDBMS
• Flexible and frequently changing schemas
• SQL (Structured Query Language) is not a native way to express graph relations
Graph Database vs. Relational Database
| | RDBMS | Graph Database |
|---|---|---|
| Relation | Amongst tables (or self-reference) | Vertex to vertex |
| Operator | Expensive join | Native edge/link path |
| Results | Emphasizes accurate and exact results | Estimated results are more common, for performance |
| Model | Entity-relation model/relational algebra | Nodes, relations, properties, and labels |
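The "nodes, relations, properties, and labels" model in the comparison above can be sketched in a few lines of Python. This is an illustrative assumption of how such a model might look in memory, not the storage layout of any particular graph database; the `Vertex`, `Edge`, and `neighbors` names and the sample airports are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    label: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # direct references to Edge objects


@dataclass
class Edge:
    label: str
    target: Vertex
    properties: dict = field(default_factory=dict)


def neighbors(vertex, edge_label):
    """One hop: follow the edge references stored on the vertex itself.

    The cost depends on the vertex's own degree, not on the size of a table.
    """
    return [e.target for e in vertex.out_edges if e.label == edge_label]


# Hypothetical sample data.
sfo = Vertex("SFO", "airport", {"city": "San Francisco"})
hkg = Vertex("HKG", "airport", {"city": "Hong Kong"})
szx = Vertex("SZX", "airport", {"city": "Shenzhen"})
sfo.out_edges.append(Edge("route", hkg))
hkg.out_edges.append(Edge("route", szx))

one_hop = [v.vid for v in neighbors(sfo, "route")]
two_hop = [w.vid for v in neighbors(sfo, "route") for w in neighbors(v, "route")]
print(one_hop, two_hop)
```

Because each vertex holds its own edges, a hop is a pointer follow rather than a join against a routes table, which is the "native edge/link path" operator in the comparison.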
9. Why Is Graph Important to Huawei?
Network Provider, Mobile Device & IoT, Cloud Computing
Network traffic
Cyber Attack
Social Relations
Ads/Recommendation
Spatial-temporal Data
Data Center Mgmt
Fault Detection
Logistics Analysis
13. Top Challenges for a Super-Large Graph
• Distribution key? Edge-cut, vertex-cut, or random-cut partitioning works, but none works well for data locality and data rebalancing.
• Massive parallelism? A graph walk is a pipelined, iterative operator.
• Incremental data/mining? Inserting, updating, or deleting n vertices/edges may require re-computation of the graph pattern, with O(n²) or O(n³) complexity for incremental query answering.
• Where does the data come from? Often flat files or a relational database, and ETL is toooooo slooooow!
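The distribution-key trade-off can be made concrete with a small Python sketch. It is a simplified illustration, assuming plain hash placement: `edge_cut` assigns each vertex to one partition and counts the edges that cross partitions, while `vertex_cut` assigns each edge to one partition and measures how many copies of each vertex result. The function names, the CRC32-based hash, and the sample edges are assumptions made for this example, not part of any product.

```python
import zlib
from collections import defaultdict


def stable_hash(key):
    """Deterministic hash (Python's built-in hash() of str is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def edge_cut(edges, num_parts):
    """Place vertices by hash; an edge is 'cut' when its endpoints differ in partition."""
    owner = {v: stable_hash(v) % num_parts for edge in edges for v in edge}
    cut_edges = sum(1 for s, d in edges if owner[s] != owner[d])
    return owner, cut_edges


def vertex_cut(edges, num_parts):
    """Place edges by hash; a vertex is replicated to every partition holding its edges."""
    replicas = defaultdict(set)
    for s, d in edges:
        part = stable_hash(s + "->" + d) % num_parts
        replicas[s].add(part)
        replicas[d].add(part)
    # Average number of copies per vertex (1.0 means no replication).
    factor = sum(len(parts) for parts in replicas.values()) / len(replicas)
    return dict(replicas), factor


# Hypothetical sample edges.
EDGES = [("SFO", "LAX"), ("SFO", "SEA"), ("LAX", "HKG"),
         ("SEA", "PVG"), ("HKG", "SZX"), ("PVG", "SZX")]

owner, cut_edges = edge_cut(EDGES, num_parts=3)
replicas, factor = vertex_cut(EDGES, num_parts=3)
print("cut edges:", cut_edges, "of", len(EDGES))
print("replication factor:", round(factor, 2))
```

Every cut edge is a hop that crosses the network, and every extra replica is state that must be kept in sync, so both numbers are rough proxies for the locality and rebalancing costs named in the first challenge.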
15. Why JanusGraph?
“JanusGraph is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster, … a transactional database that can support thousands of concurrent users, complex traversals, and analytic graph queries.” – JanusGraph README@github
Key features/issues to be considered:
• Bulk load into the backend store
  Why: Flat files and RDBMS are common data sources, and current load performance is very slow (at the GB/hour level) and not user-friendly.
• Framework for various partition methods and dynamic balancing
  Why: Current edge-cut, vertex-cut, or random partitioning works, but can we do better, and can we rebalance the data?
• Support visibility labels (issue 493 for HBase)
  Why: Security and access control
• JanusGraph + HBase + Solr (or ES, Lucene) tutorial with real use cases
  Why: Resolve real-world problems one at a time
• Incremental load/update
  Why: Periodic (daily, hourly?) data refresh with good performance
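As a rough illustration of the incremental load/update item, the sketch below computes the difference between two edge snapshots and applies only that difference, instead of reloading the whole graph. It is plain Python over an in-memory adjacency map and does not use any JanusGraph, HBase, or Solr API; the function names and the snapshot data are hypothetical.

```python
def build_adjacency(edges):
    """Full load: build the adjacency map from scratch."""
    adjacency = {}
    for s, d in edges:
        adjacency.setdefault(s, set()).add(d)
    return adjacency


def edge_delta(old_edges, new_edges):
    """Return (added, removed) edges between two snapshots."""
    old, new = set(old_edges), set(new_edges)
    return sorted(new - old), sorted(old - new)


def apply_delta(adjacency, added, removed):
    """Incremental load: touch only the edges that changed."""
    for s, d in removed:
        targets = adjacency.get(s, set())
        targets.discard(d)
        if not targets:
            adjacency.pop(s, None)  # drop vertices left without outgoing edges
    for s, d in added:
        adjacency.setdefault(s, set()).add(d)
    return adjacency


# Hypothetical daily snapshots.
yesterday = [("SFO", "LAX"), ("LAX", "HKG"), ("HKG", "SZX")]
today = [("SFO", "LAX"), ("LAX", "PVG"), ("PVG", "SZX"), ("HKG", "SZX")]

added, removed = edge_delta(yesterday, today)
graph = apply_delta(build_adjacency(yesterday), added, removed)
print("added:", added, "removed:", removed)
```

The work done by `apply_delta` scales with the size of the change rather than the size of the graph, which is the property a periodic refresh needs; a real backend would also have to update indexes and any precomputed patterns.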
16. Thanks and Call for Collaboration!
JanusGraph and the open source community
Collaboration with academia, industry, start-ups, and you!
Join us!
Demai.Ni@huawei.com