At the Gartner Data & Analytics Summit 2017, Barry Zane, Vice President of Engineering, and Ben Szekely, Vice President of Solutions, discussed how Cambridge Semantics' Anzo Smart Data Lake® empowers business users with on-demand analytics on rich data through the use of graph database technology. These are the slides from their presentation.
10. Large Scale Graph Analytics
PROBLEM
Graph is a simple, clean model for standard analytic queries and allows you to do more.
But Graph has had terrible performance for standard analytic queries against large-scale data.
If you can’t do the standard “data warehouse” queries at scale, you won’t get to the algorithms that only Graph can perform!
SOLUTION
Build a Graph engine designed for large-scale analytics.
Leverage parallel computing - lots of hardware. Scale to hundreds of servers.
Extend the SPARQL language to backfill functionality present in SQL.
Deploy through a user interface that automatically writes the SPARQL and visualizes the results.
11. Analytic Landscape
ROLAP - Relational online analytics
•Broad adoption, 45 years of technology evolution
•Based on declarative SQL for business analysts
•Formal ANSI/ISO standard since 1986
GOLAP - Graph based online analytics
•Narrow adoption, accelerating over past 15 years
•Based on declarative SPARQL for business analysts
•Formal W3C standard since 2008
Hadoop (Spark) - Offline batch analytics
•Growing adoption since Hadoop was created in 2005 (Spark in 2012)
•All queries programmed in Java/Scala/Python…
•Apache and community standards
•Limited only by programmer’s talents and available APIs
12. GOLAP is Real Relational Data Warehouse, Really
Relational databases are predefined “rectangular” tables of rows and columns.
–Very natural for subjects (aka rows) with a number of known attributes common to all/most of the subjects.
–Allows columns to be links (aka keys) to other tables’ subjects.
Challenged by:
–Sparsity
–One-to-many needs a separate “join table”
–You need to understand the data in advance
Graphs are real relational, really. Just a little different than the points above!
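To make the contrast concrete, here is a minimal Python sketch that stores facts as subject-predicate-object triples instead of rectangular rows. It is an illustration only, not how Anzo or any RDF store is implemented; the sample data and the `describe` helper are invented for this example. A missing attribute simply has no triple (sparsity), and a one-to-many relationship is just several triples with the same subject and predicate, so no join table is needed.

```python
from collections import defaultdict

# Hypothetical sample data: each fact is one (subject, predicate, object) triple.
triples = [
    ("joe", "name", "Joe"),
    ("joe", "hobby", "chess"),
    ("joe", "hobby", "sailing"),   # one-to-many: just another triple
    ("mary", "name", "Mary"),
    ("mary", "spouse", "joe"),     # a link (key) to another subject
    # mary has no "hobby" triple: a sparse attribute takes no space
]

def describe(subject, triples):
    """Collect every attribute of a subject without knowing the schema in advance."""
    attrs = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            attrs[p].append(o)
    return dict(attrs)

joe = describe("joe", triples)
mary = describe("mary", triples)
```

Here `describe` discovers which attributes each subject has by reading the data, which is the "you need to understand the data in advance" challenge turned around.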
13. RDF/SPARQL… like RDB/SQL, but...
Standard SQL aggregates, joins, etc., plus simple and powerful relationship capabilities.
“How is Joe related to Mary”
–In SQL Relational
•Are they spouses?
•Are they siblings?
•Are they friends?
•Do they have the same hobby?
•… enumerate the choices, EXPLODES with degrees of separation
–In SPARQL Graph
•How is Joe related to Mary?
•… you can directly specify degrees of separation
Pretty exciting: essentially all the power of SQL, but you can do more, with more diverse data, where the data tells you about itself rather than you having to know it in advance.
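The "How is Joe related to Mary?" question can be sketched in a few lines of plain Python. This is not SPARQL and not the engine described in the slides; it is a breadth-first search over a small, invented set of triples, with a hypothetical `how_related` helper and a `max_degrees` parameter standing in for the idea of specifying degrees of separation directly. Note that the code never enumerates relationship types such as spouse or sibling; it just follows whatever edges the data contains.

```python
from collections import deque

# Hypothetical sample graph as (subject, predicate, object) triples.
triples = [
    ("joe", "sibling", "ann"),
    ("ann", "friend", "mary"),
    ("joe", "hobby", "chess"),
    ("bob", "hobby", "chess"),
    ("bob", "spouse", "mary"),
]

def how_related(start, goal, triples, max_degrees):
    """Return the shortest chain of edges linking start to goal, or None.

    Edges are followed in either direction, and the search gives up
    once a chain would exceed max_degrees hops.
    """
    neighbors = {}
    for s, p, o in triples:
        neighbors.setdefault(s, []).append((p, o))
        neighbors.setdefault(o, []).append((p, s))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_degrees:
            continue  # do not expand beyond the allowed degrees of separation
        for predicate, nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, predicate, nxt)]))
    return None

path = how_related("joe", "mary", triples, max_degrees=3)
too_far = how_related("joe", "mary", triples, max_degrees=1)
```

With this sample data, `path` is the two-hop chain through `ann`, and `too_far` is `None` because no single edge links the two.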
14. The Smart Data Lake is the “database”
• Data cached in HDFS, AWS/GCP buckets
• Multiple Graph Query Engine instances, usually on subsets
• Ephemeral in-memory operation
• Short term instances - load, query, toss
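The "load, query, toss" pattern can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `DATA_LAKE` is an in-memory dictionary standing in for data cached in HDFS or a cloud bucket, and `ephemeral_engine` is a hypothetical helper, not part of any real product API. It loads only the requested subsets into memory, lets the caller query them, and discards the in-memory copy afterwards while the cached source stays untouched.

```python
from contextlib import contextmanager

# Hypothetical stand-in for data cached in HDFS or an AWS/GCP bucket.
DATA_LAKE = {
    "people": [("joe", "sibling", "ann"), ("ann", "friend", "mary")],
    "hobbies": [("joe", "hobby", "chess"), ("bob", "hobby", "chess")],
}

@contextmanager
def ephemeral_engine(subsets):
    """Load the named subsets into memory, yield them, then discard them."""
    graph = [t for name in subsets for t in DATA_LAKE[name]]  # load
    try:
        yield graph                                           # query
    finally:
        graph.clear()                                         # toss

# A short-lived instance over just the "hobbies" subset.
with ephemeral_engine(["hobbies"]) as g:
    chess_players = sorted(s for s, p, o in g if p == "hobby" and o == "chess")
```

After the `with` block exits, the in-memory graph is empty, but `DATA_LAKE` still holds the original triples, so another instance can be started on a different subset.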