1© Cloudera, Inc. All rights reserved.
A brave new world in mutable big data:
Relational storage
Todd Lipcon
Software Engineer at Cloudera
Apache Kudu founder and PMC chair
Introduction
About me
• Engineer at Cloudera since 2009
• Hadoop core (HDFS, MR1)
• HBase stability and performance
• Started Kudu project in 2012 (bias alert!)
• My 9th Strata NYC!
Feel free to tweet questions @tlipcon or find me on the Kudu Slack
A brief history of databases
Incomplete, distilled, and semi-accurate
1960s, 1970s
1960s, 1970s: ISAM / VSAM
“A database system where an application developer directly uses an
application programming interface to search indexes in order to locate
records in data files.” - Wikipedia “ISAM”
• Files contain records (originally fixed-length, later variable-length)
• Files stored on disks and applications directly access them (file-system locking)
• Later added networked access (client-server model), hierarchical records
• Still a simple API:
• Seek by key, Read, Write, Insert, Delete
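The record-at-a-time API described on this slide can be sketched in a few lines of Python. This is a hypothetical in-memory illustration, not real ISAM/VSAM code: the class name `IndexedFile` and its methods are assumptions, and a sorted key list stands in for the on-disk index.

```python
import bisect


class IndexedFile:
    """Toy keyed file: a sorted key index plus a record store (hypothetical)."""

    def __init__(self):
        self._keys = []      # sorted index of keys
        self._records = {}   # key -> record (stands in for the data file)

    def insert(self, key, record):
        if key in self._records:
            raise KeyError(f"duplicate key: {key}")
        bisect.insort(self._keys, key)
        self._records[key] = record

    def write(self, key, record):
        if key not in self._records:
            raise KeyError(f"no such key: {key}")
        self._records[key] = record

    def delete(self, key):
        del self._records[key]
        self._keys.remove(key)

    def seek(self, key):
        """Return the index position of the first key >= the given key."""
        return bisect.bisect_left(self._keys, key)

    def read(self, position):
        """Return (key, record) at an index position, or None past the end."""
        if position >= len(self._keys):
            return None
        k = self._keys[position]
        return k, self._records[k]


parts = IndexedFile()
parts.insert(300, "bolt")
parts.insert(100, "nut")
parts.insert(200, "washer")

# The application walks the index itself: seek, then read sequentially.
pos = parts.seek(150)
scanned = []
while (entry := parts.read(pos)) is not None:
    scanned.append(entry)
    pos += 1
```

Note that the scan loop lives in the application, so the application has to know how the data is keyed and ordered.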
Probably the only slide at Strata with COBOL on it
Source: http://www.mainframes360.com/2010/03/ksds-files-random-processing.html
Failings of ISAM/VSAM
A Relational Model of Data for Large Shared Data Banks (Codd, 1970)
• Applications and physical data layout are too tightly coupled
• e.g., a database of parts might originally be ordered by part number and
later reordered
• an inventory app that inadvertently depends on that order breaks unexpectedly
• Hard to make general-purpose programs that run against ISAM/VSAM
datasets
• Proposed a new model: relational databases
• All entities modeled by peer tables with relationships between them
• Programs use declarative access (DB decides physical operations necessary)
Origins of SQL (1974)
• Originally SEQUEL (Structured English QUEry Language)
• Renamed to SQL due to trademark issues
• Designed to be easy to write, read, and maintain
• “is intended for users who are more comfortable with an English-keyword
format than with the terse mathematical notation of SQUARE.”
• Solves the coupling issue:
• Application: specify what should be returned
• Database: figure out how to return it
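The split between "what" and "how" can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `parts` table and its columns are hypothetical, chosen only to echo the parts-inventory example; SQLite is a stand-in for any SQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [(300, "bolt", 40), (100, "nut", 15), (200, "washer", 0)],
)

# The application states WHAT it wants (in-stock parts, sorted by name)...
query = "SELECT name, qty FROM parts WHERE qty > 0 ORDER BY name"
in_stock = conn.execute(query).fetchall()

# ...and a physical change (adding an index) does not change the answer;
# the database decides HOW, and whether, to use it.
conn.execute("CREATE INDEX parts_by_name ON parts (name)")
assert conn.execute(query).fetchall() == in_stock
```

The query text never mentions storage order or indexes, so the physical layout can change without touching the application.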
Explosion of SQL Popularity
• IBM, Oracle, Microsoft, Informix, and others joined the party
• ANSI standard in 1986
• Ecosystem growth:
• Business Intelligence tools
• Object-Relational Mappers
• Extract-Transform-Load tools (ETL)
• Open source SQL databases
• mSQL, MySQL, PostgreSQL, etc
• LAMP stack
ORCL
All good things must come to an end?
The beginnings of the NoSQL “movement”
First ever NoSQL meetup (2009)
Jay Kreps (Confluent)
Me!
I wasn’t keen on the
NoSQL buzzword!
NoSQL search interest over time
What happened in
Jan 2012???
NoSQL complaints
• Tool compatibility? BI? ETL? ORMs?
• Consistency
• denormalization is tough
• hard to program against weak semantics
• Access path sensitivity
• Have to tightly couple applications with
physical data model
• No ad-hoc access
• Complex application code to perform
simple aggregations
Some of these
critiques sound
awfully familiar...
1970s Database People
credit @jrecursive (2010)
f***
Not-Only SQL
People wanted their SQL back, and NoSQL
developers gave it!
• Cassandra - CQL (late 2011)
• HBase - Phoenix (Jan 2013)
• HDFS - Hive (2009), Impala (2012), Drill (2012),
Spark SQL (2014), Presto (2013)
Meanwhile in RDBMS land
Original complaints still relevant?
Most OLTP apps fit in 1TB
of RAM and flash!
Shared-nothing OLAP available
and works well now
Maybe NoSQL and SQL have
converged?
“It is perhaps fair to say that from the perspective of many
engineers working on the Google infrastructure, the SQL vs.
NoSQL dichotomy may no longer be relevant.”
Source: “Spanner: Becoming a SQL System”
Part 2:
Evaluating a Not-Only SQL Database
What kind of application?
• OLTP? OLAP? HTAP (Hybrid Transactional/Analytic Processing)?
• Next-gen data apps are all hybrid (streaming ingest, constant analytics)
• “Combining OLTP, OLAP, and full-text search capabilities in a single system
remains at the top of customer priorities.” - Spanner: Becoming a SQL System
HTAP Application Architecture
• Realtime ingest (high performance writes)
• Throughput and latency both important
• Concurrent SQL reads
• BI apps demand interactive performance
• Often a time-series component
• IoT, transaction data, click logs, etc.
• High Availability/Geo-redundancy
[Diagram components: Browser tracing, Web logs, Kafka, Kudu, Impala, JDBC access, Marketing Dept., Developers, Web-app]
Evaluating an HTAP Data Store
• SQL support
• Semantics (eventual vs strict consistency, transactional support, features)
• Performance (ingest with concurrent analytics)
• Availability (multi-datacenter)
• Deployment Model
• Cost
Narrowing the options
System        | Original use case  | Deployment      | Semantics
HBase         | Web indexing       | Anywhere        | single-row ACID
Cassandra     | OLTP (web serving) | Anywhere        | eventual
Cloud Spanner | OLTP               | SaaS-only (GCE) | full ACID
HDFS          | OLAP               | Physical HW     | bulk access only
Kudu          | HTAP               | Anywhere        | single-row ACID
Callouts:
• Similar storage implementations (SSTable, Log-Structured-Merge)
• Let’s compare with Spanner since it’s shiny, new, and similar to Kudu!
• Only store originally designed for HTAP
Not-Only-SQL in Depth:
Comparing Cloud Spanner and Kudu+Impala
Apache Kudu: Scalable and fast tabular storage
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tabular
• Represents data in structured tables like a relational database
• Strict schema, finite column count, no BLOBs
• Individual record-level access to 100+ billion row tables
Cloud Spanner at a glance
Kudu vs Spanner: Consistency and Availability
                             | Kudu                              | Spanner                                  | Winner?
Concurrency control          | MVCC (with HybridTime)            | MVCC (with TrueTime)                     | Spanner (but needs atomic clock hardware!)
Read-only (analytic) queries | Consistent Snapshot Isolation     | Consistent Snapshot Isolation            | Tie
Transactions                 | Single-row ACID                   | Multi-row ACID (small sets of rows only) | Spanner
Availability/Replication     | Replicated log (Raft, 3 replicas) | Replicated log (Multi-Paxos, 3 replicas) | Tie
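To make the MVCC and snapshot-isolation rows concrete, here is a minimal sketch of a multi-version store in Python. It is an illustration under stated assumptions, not Kudu's or Spanner's implementation: the class `MvccStore` is hypothetical, and a simple logical counter stands in for the timestamps that HybridTime or TrueTime would assign.

```python
import bisect


class MvccStore:
    """Toy multi-version store: each key keeps a timestamp-ordered version list."""

    def __init__(self):
        self._versions = {}  # key -> ([timestamps], [values]), sorted by timestamp
        self._clock = 0      # logical counter standing in for a real clock

    def write(self, key, value):
        """Commit a new version of one row and return its commit timestamp."""
        self._clock += 1
        stamps, values = self._versions.setdefault(key, ([], []))
        stamps.append(self._clock)
        values.append(value)
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts, else None."""
        stamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, snapshot_ts)
        return values[i - 1] if i else None


store = MvccStore()
t1 = store.write("row1", "v1")
t2 = store.write("row2", "a")
t3 = store.write("row1", "v2")

# A scan pinned at t2 keeps seeing the old row1 even after the later write.
snapshot = {k: store.read(k, t2) for k in ("row1", "row2")}
latest = {k: store.read(k, t3) for k in ("row1", "row2")}
```

Because old versions are kept rather than overwritten, a long analytic scan can read a consistent snapshot while writes continue.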
Kudu vs Spanner: Data Access
                       | Kudu                                                      | Spanner                                                 | Winner?
Programmatic APIs      | Java, C++, Python                                         | C#, Go, Java, Node, PHP, Python, Ruby                   | Spanner
Secondary Indexes      | no                                                        | supported                                               | Spanner
SQL                    | via Impala or Spark (SQL 2003 w/ Analytic extensions)     | Built-in (simple ANSI99 queries only, no write support) | Kudu
Ecosystem Integrations | Spark, Impala, Flume, Apex, StreamSets, et al.            | ?? (very limited)                                       | Kudu
Kudu vs Spanner: operational factors
                       | Kudu                    | Spanner                | Winner?
Partitioning           | Hash or range, explicit | Range only (automatic) | <it depends>
Load balancing         | manual                  | automatic              | Spanner
Deployment Environment | on-prem or cloud        | SaaS only (lock-in)    | Kudu
Ops model              | operate yourself        | SaaS (no ops)          | Spanner
Licensing              | Apache License          | closed source          | Kudu
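The difference between hash and range partitioning can be sketched as follows. This is an illustrative Python sketch only: the function names, the CRC32 hash, and the split points are assumptions, not the hash function or partition layout that either system actually uses.

```python
import bisect
import zlib


def hash_partition(key: str, num_buckets: int) -> int:
    """Hash partitioning: spread keys across buckets, regardless of key order."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(key: str, split_points: list[str]) -> int:
    """Range partitioning: partition i holds keys in [split[i-1], split[i])."""
    return bisect.bisect_right(split_points, key)


splits = ["2017-04", "2017-07", "2017-10"]  # hypothetical quarterly split points
keys = ["2017-01-15", "2017-05-02", "2017-09-30", "2017-12-25"]

range_ids = [range_partition(k, splits) for k in keys]
hash_ids = [hash_partition(k, 4) for k in keys]
```

Range partitioning keeps neighboring keys together, which suits time-range scans but can concentrate new writes on the last partition; hash partitioning spreads writes out but scatters a key range across all buckets. That trade-off is one reason the partitioning row is marked "it depends".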
Checkpoint so far
• Systems are really pretty similar
• No accident - Kudu’s replication, partitioning, and data model inherit a lot
from Spanner
• Current feature gaps
• Spanner ahead on transactional feature set (OLTP focus)
• Kudu ahead on analytic feature set (OLAP focus)
What about underlying storage and performance?
Spanner Storage - SSTable / Log-Structured Merge
• SSTable (sorted-string table)
• same storage format as BigTable (inherited code)
• row-oriented design
• Each row <cola, colb, colc, ...> stored on disk in that format
• Optimal for OLTP (read 1 row = 1 disk seek)
• Inefficient for OLAP (high CPU on scans)
• not schema-aware
• little opportunity for type-specific compression techniques, etc.
“SSTables have proven to be remarkably robust even when used for schematized
data consisting largely of small values, often traversed by column. But they are
ultimately a poor fit and leave a lot of performance on the table.”
Kudu Storage - Columnar + Deltas
• Stores most of its data in an internal columnar format
• Each column stored, encoded, and compressed separately, in small chunks
• Similar to Parquet, with enhancements:
• Indexes allow fast seeking by key or by position (for low-latency read)
• Delta Stores allow tracking of updated and deleted rows
[Diagram: base columnar data (c1 c2 c3 c4) + deltas (recently changed rows: d1 d2), combined at read time into rows of (c1, c2, c3, c4): (1, hi, 0.1, N), (3, bye, 0.2, N), (2, cat, 0.1, N), (1, dog, 0.5, Y)]
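The "columnar base plus deltas" idea can be sketched in Python as follows. This is a simplified illustration, not Kudu's actual on-disk format: the dictionary layout, the delta record shape, and the `scan` function are all assumptions, and the sample values are only loosely modeled on the slide's diagram.

```python
# Base data: one list per column, all the same length (row i = i-th entries).
base = {
    "id":   [1, 2, 3],
    "name": ["hi", "cat", "bye"],
    "val":  [0.1, 0.1, 0.2],
}

# Deltas: changes recorded since the base was written, keyed by row position.
deltas = [
    {"row": 1, "op": "update", "col": "name", "value": "dog"},
    {"row": 1, "op": "update", "col": "val", "value": 0.5},
    {"row": 2, "op": "delete"},
]


def scan(base, deltas, columns):
    """Yield current rows for the requested columns, applying deltas at read time."""
    num_rows = len(next(iter(base.values())))
    deleted = set()
    updates = {}  # (row, col) -> latest value
    for d in deltas:
        if d["op"] == "delete":
            deleted.add(d["row"])
        else:
            updates[(d["row"], d["col"])] = d["value"]
    for row in range(num_rows):
        if row in deleted:
            continue
        # Only the requested columns are touched, which is the columnar win.
        yield tuple(updates.get((row, c), base[c][row]) for c in columns)


rows = list(scan(base, deltas, ["id", "name", "val"]))
names_only = list(scan(base, deltas, ["name"]))
```

The base columns are never rewritten on update; the reader merges the deltas in, so scans over a few columns stay cheap while individual rows remain mutable.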
So how much does it really
matter?
Analytics benchmarks
Benchmark setup
Cloud Spanner
5 “nodes” (unknown specs)
us-central1 region (multi-zone)
Price: $0.90/node/hr * 5 nodes = $3240/month
Kudu on GCE
5 n1-standard-16 (16 vCPU, 60 GB RAM)
us-central1 region (multi-zone)
500 GB Persistent SSD disk each
Price: $0.54/node/hr * 5 + 500GB * $0.17/GB/mo * 5 = $2366.80/month (30% lower!)
*drops to $1009 if preemptible is used!
* factoring in sustained-use discount
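The pricing arithmetic on this slide can be rechecked with a short calculation. The 720-hour month is an assumption (30 days times 24 hours); with it, the Spanner figure reproduces exactly, while the recomputed Kudu total comes out a couple of dollars above the slide's $2366.80, presumably because the slide's hourly rate is rounded or its month length differs.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days * 24 hours

# Cloud Spanner: 5 nodes at $0.90 per node-hour.
spanner_monthly = 0.90 * 5 * HOURS_PER_MONTH

# Kudu on GCE: 5 VMs at $0.54 per hour plus 500 GB of persistent SSD each.
kudu_compute = 0.54 * 5 * HOURS_PER_MONTH
kudu_storage = 500 * 0.17 * 5
kudu_monthly = kudu_compute + kudu_storage

# Relative saving, using the Kudu total printed on the slide.
slide_kudu_monthly = 2366.80
saving = 1 - slide_kudu_monthly / spanner_monthly

print(f"Spanner: ${spanner_monthly:,.2f}/month")
print(f"Kudu (recomputed): ${kudu_monthly:,.2f}/month")
print(f"Saving vs Spanner: {saving:.0%}")
```

Under these assumptions the saving works out to roughly 27%, which the slide rounds to 30%.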
Test 1: TPC-H Data Loading
• Used a separate node to load the TPC-H “LINEITEM” table
• 600M rows, 75GB in CSV format
• Multi-threaded Java program* to load, followed best practices
*Loader available at https://github.com/toddlipcon/spanner-kudu-comparison
Test 2: TPC-H Queries
• SELECT COUNT(*)
• TPC-H Q1, Q6: simple GROUP BY/SUM/COUNT queries that scan the whole table
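For readers unfamiliar with these queries, the sketch below runs a `COUNT(*)` and a filter-plus-`SUM` query modeled on the shape of TPC-H Q6 against a few hand-made rows. SQLite (via Python's `sqlite3`) and the four sample rows are stand-ins chosen for illustration; they are not what the benchmark ran, and only the LINEITEM columns the query needs are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE lineitem (
           l_quantity REAL, l_extendedprice REAL,
           l_discount REAL, l_shipdate TEXT)"""
)
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?, ?)",
    [
        (10, 1000.0, 0.06, "1994-03-01"),  # matches every predicate
        (30, 2000.0, 0.06, "1994-05-01"),  # quantity too high
        (5, 500.0, 0.10, "1994-07-01"),    # discount out of range
        (8, 800.0, 0.05, "1995-02-01"),    # shipped outside the year
    ],
)

row_count = conn.execute("SELECT COUNT(*) FROM lineitem").fetchone()[0]

# Q6-style filter + SUM: every row is examined, but only four columns are read.
revenue = conn.execute(
    """SELECT SUM(l_extendedprice * l_discount)
       FROM lineitem
       WHERE l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01'
         AND l_discount BETWEEN 0.05 AND 0.07
         AND l_quantity < 24"""
).fetchone()[0]
```

Queries of this shape touch every row but only a handful of columns, which is why the storage layout matters so much for them.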
Test 3: YCSB Loading
• Standard YCSB benchmark
• Configured as recommended in the
cloudspanner/README file
• Experienced many errors, timeouts, and multi-minute stalls loading Spanner
• eventually succeeded on third try
• so take these results with a grain of salt!
YCSB Throughput (Load and random-read)
YCSB Latencies (for read workload)
YCSB Workload A (50/50 read/write mix)
Kudu is not optimized for high update-rate scenarios. See KUDU-749
Benchmark summary
• Kudu ingests data at least 4x faster
• Stability issues with Cloud Spanner ingestion (cause unknown)
• Kudu performs simple analytic queries 10-100x faster
• Spanner wins on high-percentile tail latencies
• Kudu performance degrades significantly in 50/50 R/W mix workload
• Reminders:
• Kudu cluster has 30% lower cost, and can be run on any provider!
• Kudu doesn’t have the same rich OLTP feature set as Spanner (indexes,
multi-row transactions, etc)
Conclusions
• NoSQL and SQL are converging again
• We now get “best of both worlds” from both communities!
• Many different excellent choices are now available for building hybrid
transactional/analytic applications
• Understand the trade-offs before settling on an architecture
• Seemingly small details can make orders-of-magnitude difference
• Consider non-functional differences as well (licensing, deployment, lock-in,
etc)
Acknowledgements
• Spanner team for publishing papers, especially SIGMOD 2017 (“Spanner:
Becoming a SQL System”)
• Cloud Spanner team and developer advocates (Deepti Srivastava, Robert
Kubis)
• Siamak Tazari (YCSB binding for Cloud Spanner)
• Cloudera (paying my GCE bill)
kudu.apache.org
@tlipcon | @ApacheKudu

[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

A brave new world in mutable big data: Comparing Apache Kudu and Google Cloud Spanner

  • 1. 1© Cloudera, Inc. All rights reserved. A brave new world in mutable big data: Relational storage Todd Lipcon Software Engineer at Cloudera Apache Kudu founder and PMC chair
  • 2. Introduction
  • 3. About me
      • Engineer at Cloudera since 2009
      • Hadoop core (HDFS, MR1)
      • HBase stability and performance
      • Started Kudu project in 2012 (bias alert!)
      • My 9th Strata NYC!
      Feel free to tweet questions @tlipcon or find me on the Kudu Slack
  • 4. A brief history of databases
      Incomplete, distilled, and semi-accurate
  • 5. 1960s, 1970s
  • 6. 1960s, 1970s: ISAM / VSAM
      “A database system where an application developer directly uses an application programming interface to search indexes in order to locate records in data files.” - Wikipedia “ISAM”
      • Files contain records (originally fixed-length, later variable-length)
      • Files stored on disks and applications directly access them (file-system locking)
      • Later added networked access (client-server model), hierarchical records
      • Still a simple API:
          • Seek by key, Read, Write, Insert, Delete
  • 7. Probably the only slide at Strata with COBOL on it
      Source: http://www.mainframes360.com/2010/03/ksds-files-random-processing.html
  • 8. Failings of ISAM/VSAM
      A Relational Model of Data for Large Shared Data Banks (Codd, 1970)
      • Applications and physical data layout are too tightly coupled
          • e.g., a database of parts might be originally ordered by part number and later changed
          • inventory app inadvertently depends on order (unexpected breaks)
      • Hard to make general-purpose programs that run against ISAM/VSAM datasets
      • Proposed a new model: relational databases
          • All entities modeled by peer tables with relationships between them
          • Programs use declarative access (DB decides physical operations necessary)
  • 9. Origins of SQL (1974)
      • Originally SEQUEL (Structured English QUEry Language)
      • Renamed to SQL due to trademark issues
      • Designed to be easy to write, read, and maintain
          • “is intended for users who are more comfortable with an English-keyword format than with the terse mathematical notation of SQUARE.”
      • Solves the coupling issue:
          • Application: specify what should be returned
          • Database: figure out how to return it
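The split described on slide 9 (the application says what to return, the database decides how) can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in database; the parts table, its columns, and the index name are hypothetical and do not come from the talk.

```python
import sqlite3

# In-memory database with a hypothetical "parts" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no INTEGER, name TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [(3, "bolt", 40), (1, "nut", 15), (2, "washer", 0)],
)

# The application only states WHAT it wants, including the order it relies on.
QUERY = "SELECT name FROM parts WHERE qty > 0 ORDER BY part_no"

before = conn.execute(QUERY).fetchall()

# Change the physical layout by adding an index. The query text is untouched;
# the database is free to pick a different access path.
conn.execute("CREATE INDEX parts_by_name ON parts (name)")
after = conn.execute(QUERY).fetchall()

print(before)  # [('nut',), ('bolt',)]
print(after)   # [('nut',), ('bolt',)]
```

Because the ordering is requested explicitly rather than inherited from the file layout, the inventory-app breakage described on slide 8 does not occur when the physical organization changes.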
  • 10. Explosion of SQL Popularity
      • IBM, Oracle, Microsoft, Informix, and others joined the party
      • ANSI standard in 1986
      • Ecosystem growth:
          • Business Intelligence tools
          • Object-Relational Mappers
          • Extract-Transform-Load tools (ETL)
      • Open source SQL databases
          • mSQL, MySQL, PostgreSQL, etc.
          • LAMP stack
  • 11. ORCL
  • 12.
  • 13. All good things must come to an end?
  • 14.
  • 15. The beginnings of the NoSQL “movement”
  • 16.
  • 17. First ever NoSQL meetup (2009)
  • 18. Jay Kreps (Confluent) Me!
  • 19. I wasn’t keen on the NoSQL buzzword!
  • 20. NoSQL search interest over time
      What happened in Jan 2012???
  • 21. NoSQL complaints
      • Tool compatibility? BI? ETL? ORMs?
      • Consistency
          • denormalization is tough
          • hard to program against weak semantics
      • Access path sensitivity
          • Have to tightly couple applications with physical data model
      • No ad-hoc access
          • Complex application code to perform simple aggregations
      Some of these critiques sound awfully familiar...
      1970s Database People
  • 22. credit @jrecursive (2010) f***
  • 23. Not-Only SQL
      People wanted their SQL back, and NoSQL developers gave it!
      • Cassandra - CQL (late 2011)
      • HBase - Phoenix (Jan 2013)
      • HDFS - Hive (2009), Impala (2012), Drill (2012), Spark SQL (2014), Presto (2013)
  • 24. Meanwhile in RDBMS land
      • Original complaints still relevant?
      • Most OLTP apps fit in 1TB of RAM and flash!
      • Shared-nothing OLAP available and works well now
      • Maybe NoSQL and SQL have converged?
  • 25. “It is perhaps fair to say that from the perspective of many engineers working on the Google infrastructure, the SQL vs. NoSQL dichotomy may no longer be relevant.”
      Source: “Spanner: Becoming a SQL System”
  • 26. Part 2: Evaluating a Not-Only SQL Database
  • 27. What kind of application?
      • OLTP? OLAP? HTAP (Hybrid Transactional/Analytic Processing)
      • Next-gen data apps are all hybrid (streaming ingest, constant analytics)
      • “Combining OLTP, OLAP, and full-text search capabilities in a single system remains at the top of customer priorities.” - Spanner: Becoming a SQL System
  • 28. HTAP Application Architecture
      • Realtime ingest (high performance writes)
          • Throughput and latency both important
      • Concurrent SQL reads
          • BI apps demand interactive performance
      • Often a time-series component
          • IoT, transaction data, click logs, etc.
      • High Availability/Geo-redundancy
      [Diagram labels: Browser tracing, Web logs, Kafka, Kudu, Impala, JDBC access, Marketing Dept., Developers, Web-app]
  • 29. Evaluating an HTAP Data Store
      • SQL support
      • Semantics (eventual vs strict consistency, transactional support, features)
      • Performance (ingest with concurrent analytics)
      • Availability (multi-datacenter)
      • Deployment Model
      • Cost
  • 30. Narrowing the options
      Store | Original use case | Deployment | Semantics
      HBase | Web indexing | Anywhere | single-row ACID
      Cassandra | OLTP (web serving) | Anywhere | eventual
      Cloud Spanner | OLTP | SaaS-only (GCE) | full ACID
      HDFS | OLAP | Physical HW | bulk access only
      Kudu | HTAP | Anywhere | single-row ACID
      Notes on the slide:
      • Similar storage implementations (SSTable, Log-Structured-Merge)
      • Let’s compare with Spanner since it’s shiny, new, and similar to Kudu!
      • Only store originally designed for HTAP
  • 31. Not-Only-SQL in Depth: Comparing Cloud Spanner and Kudu+Impala
  • 32. Apache Kudu: Scalable and fast tabular storage
      Scalable
      • Tested up to 275 nodes (~3PB cluster)
      • Designed to scale to 1000s of nodes and tens of PBs
      Fast
      • Millions of read/write operations per second across cluster
      • Multiple GB/second read throughput per node
      Tabular
      • Represents data in structured tables like a relational database
          • Strict schema, finite column count, no BLOBs
      • Individual record-level access to 100+ billion row tables
  • 33. Cloud Spanner at a glance
  • 34. Kudu vs Spanner: Consistency and Availability
      Factor | Kudu | Spanner | Winner?
      Concurrency control | MVCC (with HybridTime) | MVCC (with TrueTime) | Spanner (but needs atomic clock hardware!)
      Read-only (analytic) queries | Consistent Snapshot Isolation | Consistent Snapshot Isolation | Tie
      Transactions | Single-row ACID | Multi-row ACID (small sets of rows only) | Spanner
      Availability/Replication | Replicated log (Raft, 3 replicas) | Replicated log (Multi-Paxos, 3 replicas) | Tie
  • 35. Kudu vs Spanner: Data Access
      Factor | Kudu | Spanner | Winner?
      Programmatic APIs | Java, C++, Python | C#, Go, Java, Node, PHP, Python, Ruby | Spanner
      Secondary Indexes | no | supported | Spanner
      SQL | via Impala or Spark (SQL 2003 w/ Analytic extensions) | Built-in (simple ANSI99 queries only, no write support) | Kudu
      Ecosystem Integrations | Spark, Impala, Flume, Apex, StreamSets, et al. | ?? (very limited) | Kudu
  • 36. Kudu vs Spanner: operational factors
      Factor | Kudu | Spanner | Winner?
      Partitioning | Hash or range, explicit | Range only (automatic) | <it depends>
      Load balancing | manual | automatic | Spanner
      Deployment Environment | on-prem or cloud | SaaS only (lock-in) | Kudu
      Ops model | operate yourself | SaaS (no ops) | Spanner
      Licensing | Apache License | closed source | Kudu
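As a rough illustration of the partitioning row on slide 36, the sketch below contrasts hash and range assignment of keys to partitions. It is not Kudu's or Spanner's actual implementation: the function names, the use of CRC32 as the hash, and the split points are all assumptions chosen for the example.

```python
import bisect
import zlib


def hash_partition(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket by hashing it (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(key: str, split_points: list) -> int:
    """Assign a key to the range it falls into.

    split_points must be sorted; partition i holds keys in
    [split_points[i-1], split_points[i]), with open-ended first and last ranges.
    """
    return bisect.bisect_right(split_points, key)


# Hypothetical month-style keys and split points.
splits = ["2017-01", "2017-06"]
keys = ["2016-12", "2017-03", "2017-04", "2017-06"]

by_range = [range_partition(k, splits) for k in keys]
by_hash = [hash_partition(k, 4) for k in keys]

print(by_range)  # [0, 1, 1, 2]
```

With time-ordered keys, range partitioning keeps neighbouring keys together, which helps range scans but can concentrate inserts on one partition, while hashing spreads keys across buckets at the cost of scan locality. That trade-off is one way to read the slide's "it depends".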
  • 37. Checkpoint so far
      • Systems are really pretty similar
          • No accident - Kudu’s replication, partitioning, and data model inherit a lot from Spanner
      • Current feature gaps
          • Spanner ahead on transactional feature set (OLTP focus)
          • Kudu ahead on analytic feature set (OLAP focus)
      What about underlying storage and performance?
  • 38. Spanner Storage - SSTable / Log-Structured Merge
      • SSTable (sorted-string table)
          • same storage format as BigTable (inherited code)
      • row-oriented design
          • Each row <cola, colb, colc, ...> stored on disk in that format
          • Optimal for OLTP (read 1 row = 1 disk seek)
          • Inefficient for OLAP (high CPU on scans)
      • not schema-aware
          • little opportunity for type-specific compression techniques, etc.
      “SSTables have proven to be remarkably robust even when used for schematized data consisting largely of small values, often traversed by column. But they are ultimately a poor fit and leave a lot of performance on the table.”
  • 39. Kudu Storage - Columnar + Deltas
      • Stores most of its data in an internal columnar format
          • Each column stored, encoded, and compressed separately, in small chunks
      • Similar to Parquet, with enhancements:
          • Indexes allow fast seeking by key or by position (for low-latency read)
          • Delta Stores allow tracking of updated and deleted rows
      [Diagram: base columnar data (columns c1 to c4, with sample rows) + deltas (recently changed rows: d1, d2), combined at read time]
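The columnar-plus-deltas idea on slide 39 can be sketched as a toy in-memory structure. This is a simplified illustration under stated assumptions, not Kudu's on-disk format: the class name, method names, and sample data are hypothetical, and real delta stores also involve encoding, compression, and compaction that are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnarRowSet:
    """Toy model: immutable base data stored per column, plus a delta store."""

    columns: dict                                   # column name -> list of values
    updates: dict = field(default_factory=dict)     # row index -> {column: new value}
    deleted: set = field(default_factory=set)       # row indexes marked as deleted

    def update(self, row: int, column: str, value) -> None:
        # Record the change in the delta store; base columns stay untouched.
        self.updates.setdefault(row, {})[column] = value

    def delete(self, row: int) -> None:
        self.deleted.add(row)

    def scan(self, wanted: list):
        """Read only the requested columns, applying deltas at read time."""
        num_rows = len(next(iter(self.columns.values()), []))
        for row in range(num_rows):
            if row in self.deleted:
                continue
            delta = self.updates.get(row, {})
            yield tuple(delta.get(c, self.columns[c][row]) for c in wanted)


rows = ColumnarRowSet({"id": [1, 2, 3], "name": ["hi", "bye", "cat"], "score": [0.1, 0.2, 0.1]})
rows.update(1, "name", "dog")
rows.delete(2)

print(list(rows.scan(["id", "name"])))  # [(1, 'hi'), (2, 'dog')]
```

The scan touches only the `id` and `name` lists and never reads `score`, which is the property that makes columnar layouts attractive for analytic queries.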
  • 40. So how much does it really matter?
      Analytics benchmarks
  • 41. Benchmark setup
      Cloud Spanner
      • 5 “nodes” (unknown specs)
      • us-central1 region (multi-zone)
      • Price: $0.90/node/hr * 5 nodes = $3240/month
      Kudu on GCE
      • 5 n1-standard-16 (16vCPU, 60G RAM)
      • us-central1 region (multi-zone)
      • 500G Persistent SSD disk each
      • Price: $0.54/node/hr * 5 + 500GB * $0.17/GB/mo * 5 = $2366.80/month
          • drops to $1009 if preemptible is used!
          • factoring in sustained-use discount
      • 30% Lower!
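The price lines on slide 41 can be reproduced with a few lines of arithmetic. The sketch assumes a 30-day (720-hour) month, which matches the slide's Spanner total; the function name and parameters are hypothetical. Using the hourly rate as printed on the slide ($0.54) gives a Kudu total a few dollars above the slide's $2366.80, presumably because the slide's rate is rounded.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def monthly_cost(nodes, price_per_node_hour, disk_gb_per_node=0, disk_price_per_gb_month=0.0):
    """Compute cost plus optional persistent-disk cost for one month."""
    compute = nodes * price_per_node_hour * HOURS_PER_MONTH
    storage = nodes * disk_gb_per_node * disk_price_per_gb_month
    return compute + storage


spanner = monthly_cost(nodes=5, price_per_node_hour=0.90)
kudu = monthly_cost(nodes=5, price_per_node_hour=0.54,
                    disk_gb_per_node=500, disk_price_per_gb_month=0.17)
savings = 1 - kudu / spanner

print(f"Spanner: ${spanner:.2f}/month")   # Spanner: $3240.00/month
print(f"Kudu:    ${kudu:.2f}/month")      # Kudu:    $2369.00/month
print(f"Kudu is {savings:.0%} cheaper")   # Kudu is 27% cheaper
```

With these inputs the difference works out to roughly 27%, which the slide rounds to 30%.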
  • 42. Test 1: TPCH Data Loading
      • Used a separate node to load the TPC-H “LINEITEM” table
          • 600M rows, 75GB in CSV format
      • Multi-threaded Java program* to load, followed best practices
      *Loader available at https://github.com/toddlipcon/spanner-kudu-comparison
  • 43. Test 2: TPCH Queries
      • SELECT COUNT(*)
      • TPCH Q1, Q6: simple GROUP BY/SUM/COUNT which scan the whole table
  • 44. Test 3: YCSB Loading
      • Standard YCSB benchmark
      • Configured as recommended in the cloudspanner/README file
      • Experienced many errors, timeouts, and multi-minute stalls loading Spanner
          • eventually succeeded on third try
          • so take these results with a grain of salt!
  • 45. YCSB Throughput (Load and random-read)
  • 46. YCSB Latencies (for read workload)
  • 47. YCSB Workload A (50/50 read/write mix)
      Kudu is not optimized for high update-rate scenarios. See KUDU-749
  • 48. Benchmark summary
      • Kudu ingests data at least 4x faster
          • Stability issues with Cloud Spanner ingestion (cause unknown)
      • Kudu performs simple analytic queries 10-100x faster
      • Spanner wins on high-percentile tail latencies
      • Kudu performance degrades significantly in 50/50 R/W mix workload
      • Reminders:
          • Kudu cluster has 30% lower cost, and can be run on any provider!
          • Kudu doesn’t have the same rich OLTP feature set as Spanner (indexes, multi-row transactions, etc.)
  • 49. Conclusions
  • 50. Conclusions
      • NoSQL and SQL are converging again
          • We now get “best of both worlds” from both communities!
      • Many different excellent choices are now available for building hybrid transactional/analytic applications
      • Understand the trade-offs before settling on an architecture
          • Seemingly small details can make orders-of-magnitude difference
          • Consider non-functional differences as well (licensing, deployment, lock-in, etc.)
  • 51. Acknowledgements
      • Spanner team for publishing papers, especially SIGMOD 2017 (“Spanner: Becoming a SQL System”)
      • Cloud Spanner team and developer advocates (Deepti Srivastava, Robert Kubis)
      • Siamak Tazari (YCSB binding for Cloud Spanner)
      • Cloudera (paying my GCE bill)
  • 52. kudu.apache.org
      @tlipcon | @ApacheKudu