SlideShare a Scribd company logo
1 of 72
Download to read offline
Breakthrough OLAP
Performance with
Cassandra and Spark
Evan Chan
August 2015
Who am I?
Distinguished Engineer,
@evanfchan
User and contributor to Spark since 0.9, Cassandra since 0.6
Co-creator and maintainer of
TupleJump
http://velvia.github.io
Spark Job Server
About Tuplejump
is a big data technology leader providing solutions for
rapid insights from data.
Tuplejump
- the first Spark-Cassandra integration
- an open source Lucene indexer for Cassandra
- open source HDFS for Cassandra
Calliope
Stargate
SnackFS
Didn't I attend the same talk last year?
Similar title, but mostly new material
Will reveal new open source projects! :)
Problem Space
Need analytical database / queries on structured big data
Something SQL-like, very flexible and fast
Pre-aggregation too limiting
Fast data / constant updates
Ideally, want my queries to run over fresh data too
Example: Video analytics
Typical collection and analysis of consumer events
3 billion new events every day
Video publishers want updated stats, the sooner the better
Pre-aggregation only enables simple dashboard UIs
What if one wants to offer more advanced analysis, or a
generic data query API?
Eg, top countries filtered by device type, OS, browser
Requirements
Scalable - rules out PostGreSQL, etc.
Easy to update and ingest new data
Not traditional OLAP cubes - that's not what I'm talking
about
Very fast for analytical queries - OLAP not OLTP
Extremely flexible queries
Preferably open source
Parquet
Widely used, lots of support (Spark, Impala, etc.)
Problem: Parquet is read-optimized, not easy to use for writes
Cannot support idempotent writes
Optimized for writing very large chunks, not small updates
Not suitable for time series, IoT, etc.
Often needs multiple passes of jobs for compaction of small
files, deduplication, etc.
 
People really want a database-like abstraction, not a file format!
Turns out this has been solved before!
Even .Facebook uses Vertica
MPP Databases
Easy writes plus fast queries, with constant transfers
Automatic query optimization by storing intermediate query
projections
Stonebraker, et. al. - paper (Brown Univ)CStore
What's wrong with MPP Databases?
Closed source
$$$
Usually don't scale horizontally that well (or cost is prohibitive)
Cassandra
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
Perfect for ingestion of real time / machine data
Best of breed storage technology, huge community
BUT: Simple queries only
OLTP-oriented
Apache Spark
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort
etc.
SQL, machine learning, streaming, graph, R, many more plugins
all on ONE platform - feed your SQL results to a logistic
regression, easy!
Huge number of connectors with every single storage
technology
Spark provides the missing fast, deep
analytics piece of Cassandra!
Spark and Cassandra
OLAP Architectures
Separate Storage and Query Layers
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency -
independent!
Spark as Cassandra's Cache
Spark SQL
Appeared with Spark 1.0
In-memory columnar store
Parquet, Json, Cassandra connector, Avro, many more
SQL as well as DataFrames (Pandas-style) API
Indexing integrated into data sources (eg C* secondary
indexes)
Write custom functions in Scala .... take that Hive UDFs!!
Integrates well with MLBase, Scala/Java/Python
Connecting Spark to Cassandra
Datastax's
Tuplejump
Spark Cassandra Connector
Calliope
 
Get started in one line with spark-shell!
bin/spark-shell
--packagescom.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3
--confspark.cassandra.connection.host=127.0.0.1
Caching a SQL Table from Cassandra
DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):
valsqlContext=neworg.apache.spark.sql.SQLContext(sc)
valdf=sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table"->"gdelt","keyspace"->"test"))
.load()
df.registerTempTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECTcount(monthyear)FROMgdelt").show()
 
Spark does no caching by default - you will always be reading
from C*!
How Spark SQL's Table Caching Works
Spark Cached Tables can be Really Fast
GDELT dataset, 4 million rows, 60 columns, localhost
Method secs
Uncached 317
Cached 0.38
 
Almost a 1000x speedup!
On an 8-node EC2 c3.XL cluster, 117 million rows, can run
common queries 1-2 seconds against cached dataset.
Tuning Connector Partitioning
spark.cassandra.input.split.size
Guideline: One split per partition, one partition per CPU core
Much more parallelism won't speed up job much, but will
starve other C* requests
Lesson #1: Take Advantage of Spark
Caching!
Problems with Cached Tables
Still have to read the data from Cassandra first, which is slow
Amount of RAM: your entire data + extra for conversion to
cached table
Cached tables only live in Spark executors - by default
tied to single context - not HA
once any executor dies, must re-read data from C*
Caching takes time: convert from RDD[Row] to compressed
columnar format
Cannot easily combine new RDD[Row] with cached tables
(and keep speed)
Problems with Cached Tables
If you don't have enough RAM, Spark can cache your tables
partly to disk. This is still way, way, faster than scanning an entire
C* table. However, cached tables are still tied to a single Spark
context/application.
Also: rdd.cache()is NOT the same as SQLContext's
cacheTable!
What about C* Secondary Indexing?
Spark-Cassandra Connector and Calliope can both reduce I/O by
using Cassandra secondary indices. Does this work with caching?
No, not really, because only the filtered rows would be cached.
Subsequent queries against this limited cached table would not
give you expected results.
Tachyon Off-Heap Caching
Intro to Tachyon
Tachyon: an in-memory cache for HDFS and other binary data
sources
Keeps data off-heap, so multiple Spark applications/executors
can share data
Solves HA problem for data
Wait, wait, wait!
What am I caching exactly? Tachyon is designed for caching files
or binary blobs.
A serialized form of CassandraRow/CassandraRDD?
Raw output from Cassandra driver?
What you really want is this:
Cassandra SSTable -> Tachyon (as row cache) -> CQL -> Spark
Bad programmers worry about the code. Good
programmers worry about data structures.
- Linus Torvalds
 
Are we really thinking holistically about data modelling, caching,
and how it affects the entire systems architecture?
Efficient Columnar Storage in Cassandra
Wait, I thought Cassandra was columnar?
How Cassandra stores your CQL Tables
Suppose you had this CQL table:
CREATETABLE(
departmenttext,
empIdtext,
firsttext,
lasttext,
ageint,
PRIMARYKEY(department,empId)
);
How Cassandra stores your CQL Tables
PartitionKey 01:first 01:last 01:age 02:first 02:last 02:age
Sales Bob Jones 34 Susan O'Connor 40
Engineering Dilbert P ? Dogbert Dog 1
 
Each row is stored contiguously. All columns in row 2 come after
row 1.
To analyze only age, C* still has to read every field.
Cassandra is really a row-based, OLTP-oriented datastore.
Unless you know how to use it otherwise :)
The traditional row-based data storage
approach is dead
- Michael Stonebraker
Columnar Storage (Memory)
Name column
0 1
0 1
 
Dictionary: {0: "Barak", 1: "Hillary"}
 
Age column
0 1
46 66
Columnar Storage (Cassandra)
Review: each physical row in Cassandra (e.g. a "partition key")
stores its columns together on disk.
 
Schema CF
Rowkey Type
Name StringDict
Age Int
 
Data CF
Rowkey 0 1
Name 0 1
Age 46 66
Columnar Format solves I/O
Compression
Dictionary compression - HUGE savings for low-cardinality
string columns
RLE, other techniques
Reduce I/O
Only columns needed for query are loaded from disk
Batch multiple rows in one cell for efficiency (avoid cluster key
overhead)
Columnar Format solves Caching
Use the same format on disk, in cache, in memory scan
Caching works a lot better when the cached object is the
same!!
No data format dissonance means bringing in new bits of data
and combining with existing cached data is seamless
So, why isn't everybody doing this?
No columnar storage format designed to work with NoSQL
stores
Efficient conversion to/from columnar format a hard problem
Most infrastructure is still row oriented
Spark SQL/DataFrames based on RDD[Row]
Spark Catalyst is a row-oriented query parser
All hard work leads to profit, but mere talk leads
to poverty.
- Proverbs 14:23
Columnar Storage Performance Study
 
http://github.com/velvia/cassandra-gdelt
GDELT Dataset
1979 to now
60 columns, 250 million+ rows, 250GB+
Let's compare Cassandra I/O only, no caching or Spark
Global Database of Events, Language, and Tone
The scenarios
1. Narrow table - CQL table with one row per partition key
2. Wide table - wide rows with 10,000 logical rows per partition
key
3. Columnar layout - 1000 rows per columnar chunk, wide rows,
with dictionary compression
First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression.
Compaction performed before read benchmarks.
Query and ingest times
Scenario Ingest Read all
columns
Read one
column
Narrow
table
1927
sec
505 sec 504 sec
Wide
table
3897
sec
365 sec 351 sec
Columnar 93 sec 8.6 sec 0.23 sec
 
On reads, using a columnar format is up to 2190x faster, while
ingestion is 20-40x faster.
Of course, real life perf gains will depend heavily on query,
table width, etc. etc.
Disk space usage
Scenario Disk used
Narrow table 2.7 GB
Wide table 1.6 GB
Columnar 0.34 GB
The disk space usage helps explain some of the numbers.
Towards Extreme Query Performance
The filo project
is a binary data vector library
designed for extreme read performance with minimal
deserialization costs.
http://github.com/velvia/filo
Designed for NoSQL, not a file format
random or linear access
on or off heap
missing value support
Scala only, but cross-platform support possible
What is the ceiling?
This Scala loop can read integers from a binary Filo blob at a rate
of 2 billion integers per second - single threaded:
defsumAllInts():Int={
vartotal=0
for{i<-0untilnumValuesoptimized}{
total+=sc(i)
}
total
}
Vectorization of Spark Queries
The project.Tungsten
Process many elements from the same column at once, keep data
in L1/L2 cache.
Coming in Spark 1.4 through 1.6
Hot Column Caching in Tachyon
Has a "table" feature, originally designed for Shark
Keep hot columnar chunks in shared off-heap memory for fast
access
Introducing FiloDB
 
http://github.com/velvia/FiloDB
What's in the name?
Rich sweet layers of distributed, versioned database goodness
Distributed
Apache Cassandra. Scale out with no SPOF. Cross-datacenter
replication. Proven storage and database technology.
Versioned
Incrementally add a column or a few rows as a new version. Easily
control what versions to query. Roll back changes inexpensively.
Stream out new versions as continuous queries :)
Columnar
Parquet-style storage layout
Retrieve select columns and minimize I/O for OLAP queries
Add a new column without having to copy the whole table
Vectorization and lazy/zero serialization for extreme
efficiency
100% Reactive
Built completely on the Typesafe Platform:
Scala 2.10 and SBT
Spark (including custom data source)
Akka Actors for rational scale-out concurrency
Futures for I/O
Phantom Cassandra client for reactive, type-safe C* I/O
Typesafe Config
Spark SQL Queries!
SELECTfirst,last,ageFROMcustomers
WHERE_version>3ANDage<40LIMIT100
Read to and write from Spark Dataframes
Append/merge to FiloDB table from Spark Streaming
FiloDB vs Parquet
Comparable read performance - with lots of space to improve
Assuming co-located Spark and Cassandra
On localhost, both subsecond for simple queries (GDELT
1979-1984)
FiloDB has more room to grow - due to hot column caching
and much less deserialization overhead
Lower memory requirement due to much smaller block sizes
Much better fit for IoT / Machine / Time-series applications
Limited support for types
array / set / map support not there, but will be added later
Where FiloDB Fits In
Use regular C* denormalized tables for OLTP and single-key
lookups
Use FiloDB for the remaining ad-hoc or more complex
analytical queries
Simplify your analytics infrastructure!
No need to export to Hadoop/Parquet/data warehouse.
Use Spark and C* for both OLAP and OLTP!
Perform ad-hoc OLAP analysis of your time-series, IoT data
Simplify your Lambda Architecture...
( )https://www.mapr.com/developercentral/lambda-architecture
With Spark, Cassandra, and FiloDB
Ma, where did all the components go?
You mean I don't have to deal with Hadoop?
Use Cassandra as a front end to store IoT data first
Exactly-Once Ingestion from Kafka
New rows appended via Kafka
Writes are idempotent - no need to dedup!
Converted to columnar chunks on ingest and stored in C*
Only necessary columnar chunks are read into Spark for
minimal I/O
You can help!
Send me your use cases for OLAP on Cassandra and Spark
Especially IoT and Geospatial
Email if you want to contribute
Thanks...
to the entire OSS community, but in particular:
Lee Mighdoll, Nest/Google
Rohit Rai and Satya B., Tuplejump
My colleagues at Socrata
 
If you want to go fast, go alone. If you want to go
far, go together.
-- African proverb
DEMO TIME
GDELT: Regular C* Tables vs FiloDB
Extra Slides
When in doubt, use brute force
- Ken Thompson
Automatic Columnar Conversion using
Custom Indexes
Write to Cassandra as you normally do
Custom indexer takes changes, merges and compacts into
columnar chunks behind scenes
Implementing Lambda is Hard
Use real-time pipeline backed by a KV store for new updates
Lots of moving parts
Key-value store, real time sys, batch, etc.
Need to run similar code in two places
Still need to deal with ingesting data to Parquet/HDFS
Need to reconcile queries against two different places

More Related Content

What's hot

What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Yoshiyasu SAEKI
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
BigData_TP3 : Spark
BigData_TP3 : SparkBigData_TP3 : Spark
BigData_TP3 : SparkLilia Sfaxi
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo ProductsMikio Hirabayashi
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisKnoldus Inc.
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB
 
トランザクションの設計と進化
トランザクションの設計と進化トランザクションの設計と進化
トランザクションの設計と進化Kumazaki Hiroki
 
Cassandra techniques de modelisation avancee
Cassandra techniques de modelisation avanceeCassandra techniques de modelisation avancee
Cassandra techniques de modelisation avanceeDuyhai Doan
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 

What's hot (20)

What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
BigData_TP3 : Spark
BigData_TP3 : SparkBigData_TP3 : Spark
BigData_TP3 : Spark
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo Products
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Chapitre 4 no sql
Chapitre 4 no sqlChapitre 4 no sql
Chapitre 4 no sql
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
トランザクションの設計と進化
トランザクションの設計と進化トランザクションの設計と進化
トランザクションの設計と進化
 
Cassandra techniques de modelisation avancee
Cassandra techniques de modelisation avanceeCassandra techniques de modelisation avancee
Cassandra techniques de modelisation avancee
 
Real time data quality on Flink
Real time data quality on FlinkReal time data quality on Flink
Real time data quality on Flink
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 

Similar to Breakthrough OLAP performance with Cassandra and Spark

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiCodemotion Dubai
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingDibyendu Bhattacharya
 
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Applicationsupertom
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at PollfishPollfish
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlDavid Daeschler
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergyniallmilton
 
cassandra
cassandracassandra
cassandraAkash R
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for SysadminsNathan Milford
 

Similar to Breakthrough OLAP performance with Cassandra and Spark (20)

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Cassndra (4).pptx
Cassndra (4).pptxCassndra (4).pptx
Cassndra (4).pptx
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Application
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
 
cassandra
cassandracassandra
cassandra
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 

More from Evan Chan

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureEvan Chan
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 

More from Evan Chan (15)

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 

Recently uploaded

Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 

Recently uploaded (20)

Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 

Breakthrough OLAP performance with Cassandra and Spark

  • 1. Breakthrough OLAP Performance with Cassandra and Spark Evan Chan August 2015
  • 2. Who am I? Distinguished Engineer, @evanfchan User and contributor to Spark since 0.9, Cassandra since 0.6 Co-creator and maintainer of TupleJump http://velvia.github.io Spark Job Server
  • 3. About Tuplejump is a big data technology leader providing solutions for rapid insights from data. Tuplejump - the first Spark-Cassandra integration - an open source Lucene indexer for Cassandra - open source HDFS for Cassandra Calliope Stargate SnackFS
  • 4. Didn't I attend the same talk last year? Similar title, but mostly new material Will reveal new open source projects! :)
  • 5. Problem Space Need analytical database / queries on structured big data Something SQL-like, very flexible and fast Pre-aggregation too limiting Fast data / constant updates Ideally, want my queries to run over fresh data too
  • 6. Example: Video analytics Typical collection and analysis of consumer events 3 billion new events every day Video publishers want updated stats, the sooner the better Pre-aggregation only enables simple dashboard UIs What if one wants to offer more advanced analysis, or a generic data query API? Eg, top countries filtered by device type, OS, browser
  • 7. Requirements Scalable - rules out PostGreSQL, etc. Easy to update and ingest new data Not traditional OLAP cubes - that's not what I'm talking about Very fast for analytical queries - OLAP not OLTP Extremely flexible queries Preferably open source
  • 8. Parquet Widely used, lots of support (Spark, Impala, etc.) Problem: Parquet is read-optimized, not easy to use for writes Cannot support idempotent writes Optimized for writing very large chunks, not small updates Not suitable for time series, IoT, etc. Often needs multiple passes of jobs for compaction of small files, deduplication, etc.   People really want a database-like abstraction, not a file format!
  • 9. Turns out this has been solved before! Even .Facebook uses Vertica
  • 10. MPP Databases Easy writes plus fast queries, with constant transfers Automatic query optimization by storing intermediate query projections Stonebraker, et. al. - paper (Brown Univ)CStore
  • 11. What's wrong with MPP Databases? Closed source $$$ Usually don't scale horizontally that well (or cost is prohibitive)
  • 12. Cassandra Horizontally scalable Very flexible data modelling (lists, sets, custom data types) Easy to operate Perfect for ingestion of real time / machine data Best of breed storage technology, huge community BUT: Simple queries only OLTP-oriented
  • 13. Apache Spark Horizontally scalable, in-memory queries Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! Huge number of connectors with every single storage technology
  • 14. Spark provides the missing fast, deep analytics piece of Cassandra!
  • 15. Spark and Cassandra OLAP Architectures
  • 16. Separate Storage and Query Layers Combine best of breed storage and query platforms Take full advantage of evolution of each Storage handles replication for availability Query can replicate data for scaling read concurrency - independent!
  • 18. Spark SQL Appeared with Spark 1.0 In-memory columnar store Parquet, Json, Cassandra connector, Avro, many more SQL as well as DataFrames (Pandas-style) API Indexing integrated into data sources (eg C* secondary indexes) Write custom functions in Scala .... take that Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 19. Connecting Spark to Cassandra Datastax's Tuplejump Spark Cassandra Connector Calliope   Get started in one line with spark-shell! bin/spark-shell --packagescom.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 --confspark.cassandra.connection.host=127.0.0.1
  • 20. Caching a SQL Table from Cassandra DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0): valsqlContext=neworg.apache.spark.sql.SQLContext(sc) valdf=sqlContext.read .format("org.apache.spark.sql.cassandra") .options(Map("table"->"gdelt","keyspace"->"test")) .load() df.registerTempTable("gdelt") sqlContext.cacheTable("gdelt") sqlContext.sql("SELECTcount(monthyear)FROMgdelt").show()   Spark does no caching by default - you will always be reading from C*!
  • 21. How Spark SQL's Table Caching Works
  • 22. Spark Cached Tables can be Really Fast GDELT dataset, 4 million rows, 60 columns, localhost Method secs Uncached 317 Cached 0.38   Almost a 1000x speedup! On an 8-node EC2 c3.XL cluster, 117 million rows, can run common queries 1-2 seconds against cached dataset.
  • 23. Tuning Connector Partitioning spark.cassandra.input.split.size Guideline: One split per partition, one partition per CPU core Much more parallelism won't speed up job much, but will starve other C* requests
  • 24. Lesson #1: Take Advantage of Spark Caching!
  • 25. Problems with Cached Tables Still have to read the data from Cassandra first, which is slow Amount of RAM: your entire data + extra for conversion to cached table Cached tables only live in Spark executors - by default tied to single context - not HA once any executor dies, must re-read data from C* Caching takes time: convert from RDD[Row] to compressed columnar format Cannot easily combine new RDD[Row] with cached tables (and keep speed)
  • 26. Problems with Cached Tables If you don't have enough RAM, Spark can cache your tables partly to disk. This is still way, way, faster than scanning an entire C* table. However, cached tables are still tied to a single Spark context/application. Also: rdd.cache()is NOT the same as SQLContext's cacheTable!
  • 27. What about C* Secondary Indexing? Spark-Cassandra Connector and Calliope can both reduce I/O by using Cassandra secondary indices. Does this work with caching? No, not really, because only the filtered rows would be cached. Subsequent queries against this limited cached table would not give you expected results.
  • 29. Intro to Tachyon Tachyon: an in-memory cache for HDFS and other binary data sources Keeps data off-heap, so multiple Spark applications/executors can share data Solves HA problem for data
  • 30. Wait, wait, wait! What am I caching exactly? Tachyon is designed for caching files or binary blobs. A serialized form of CassandraRow/CassandraRDD? Raw output from Cassandra driver? What you really want is this: Cassandra SSTable -> Tachyon (as row cache) -> CQL -> Spark
  • 31. Bad programmers worry about the code. Good programmers worry about data structures. - Linus Torvalds   Are we really thinking holistically about data modelling, caching, and how it affects the entire systems architecture?
  • 32. Efficient Columnar Storage in Cassandra Wait, I thought Cassandra was columnar?
  • 33. How Cassandra stores your CQL Tables Suppose you had this CQL table: CREATETABLE( departmenttext, empIdtext, firsttext, lasttext, ageint, PRIMARYKEY(department,empId) );
  • 34. How Cassandra stores your CQL Tables PartitionKey 01:first 01:last 01:age 02:first 02:last 02:age Sales Bob Jones 34 Susan O'Connor 40 Engineering Dilbert P ? Dogbert Dog 1   Each row is stored contiguously. All columns in row 2 come after row 1. To analyze only age, C* still has to read every field.
  • 35. Cassandra is really a row-based, OLTP-oriented datastore. Unless you know how to use it otherwise :)
  • 36. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 37. Columnar Storage (Memory) Name column 0 1 0 1   Dictionary: {0: "Barak", 1: "Hillary"}   Age column 0 1 46 66
  • 38. Columnar Storage (Cassandra) Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.   Schema CF Rowkey Type Name StringDict Age Int   Data CF Rowkey 0 1 Name 0 1 Age 46 66
  • 39. Columnar Format solves I/O Compression Dictionary compression - HUGE savings for low-cardinality string columns RLE, other techniques Reduce I/O Only columns needed for query are loaded from disk Batch multiple rows in one cell for efficiency (avoid cluster key overhead)
  • 40. Columnar Format solves Caching Use the same format on disk, in cache, in memory scan Caching works a lot better when the cached object is the same!! No data format dissonance means bringing in new bits of data and combining with existing cached data is seamless
  • 41. So, why isn't everybody doing this? No columnar storage format designed to work with NoSQL stores Efficient conversion to/from columnar format a hard problem Most infrastructure is still row oriented Spark SQL/DataFrames based on RDD[Row] Spark Catalyst is a row-oriented query parser
  • 42. All hard work leads to profit, but mere talk leads to poverty. - Proverbs 14:23
  • 43.
  • 44. Columnar Storage Performance Study   http://github.com/velvia/cassandra-gdelt
  • 45. GDELT Dataset 1979 to now 60 columns, 250 million+ rows, 250GB+ Let's compare Cassandra I/O only, no caching or Spark Global Database of Events, Language, and Tone
  • 46. The scenarios 1. Narrow table - CQL table with one row per partition key 2. Wide table - wide rows with 10,000 logical rows per partition key 3. Columnar layout - 1000 rows per columnar chunk, wide rows, with dictionary compression First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression. Compaction performed before read benchmarks.
  • 47. Query and ingest times Scenario Ingest Read all columns Read one column Narrow table 1927 sec 505 sec 504 sec Wide table 3897 sec 365 sec 351 sec Columnar 93 sec 8.6 sec 0.23 sec   On reads, using a columnar format is up to 2190x faster, while ingestion is 20-40x faster. Of course, real life perf gains will depend heavily on query, table width, etc. etc.
  • 48. Disk space usage Scenario Disk used Narrow table 2.7 GB Wide table 1.6 GB Columnar 0.34 GB The disk space usage helps explain some of the numbers.
  • 49. Towards Extreme Query Performance
  • 50. The filo project is a binary data vector library designed for extreme read performance with minimal deserialization costs. http://github.com/velvia/filo Designed for NoSQL, not a file format random or linear access on or off heap missing value support Scala only, but cross-platform support possible
  • 51. What is the ceiling? This Scala loop can read integers from a binary Filo blob at a rate of 2 billion integers per second - single threaded: defsumAllInts():Int={ vartotal=0 for{i<-0untilnumValuesoptimized}{ total+=sc(i) } total }
  • 52. Vectorization of Spark Queries The project.Tungsten Process many elements from the same column at once, keep data in L1/L2 cache. Coming in Spark 1.4 through 1.6
  • 53. Hot Column Caching in Tachyon Has a "table" feature, originally designed for Shark Keep hot columnar chunks in shared off-heap memory for fast access
  • 55. What's in the name? Rich sweet layers of distributed, versioned database goodness
  • 56. Distributed Apache Cassandra. Scale out with no SPOF. Cross-datacenter replication. Proven storage and database technology.
  • 57. Versioned Incrementally add a column or a few rows as a new version. Easily control what versions to query. Roll back changes inexpensively. Stream out new versions as continuous queries :)
  • 58. Columnar Parquet-style storage layout Retrieve select columns and minimize I/O for OLAP queries Add a new column without having to copy the whole table Vectorization and lazy/zero serialization for extreme efficiency
  • 59. 100% Reactive Built completely on the Typesafe Platform: Scala 2.10 and SBT Spark (including custom data source) Akka Actors for rational scale-out concurrency Futures for I/O Phantom Cassandra client for reactive, type-safe C* I/O Typesafe Config
  • 60. Spark SQL Queries! SELECTfirst,last,ageFROMcustomers WHERE_version>3ANDage<40LIMIT100 Read to and write from Spark Dataframes Append/merge to FiloDB table from Spark Streaming
  • 61. FiloDB vs Parquet Comparable read performance - with lots of space to improve Assuming co-located Spark and Cassandra On localhost, both subsecond for simple queries (GDELT 1979-1984) FiloDB has more room to grow - due to hot column caching and much less deserialization overhead Lower memory requirement due to much smaller block sizes Much better fit for IoT / Machine / Time-series applications Limited support for types array / set / map support not there, but will be added later
  • 62. Where FiloDB Fits In Use regular C* denormalized tables for OLTP and single-key lookups Use FiloDB for the remaining ad-hoc or more complex analytical queries Simplify your analytics infrastructure! No need to export to Hadoop/Parquet/data warehouse. Use Spark and C* for both OLAP and OLTP! Perform ad-hoc OLAP analysis of your time-series, IoT data
  • 63. Simplify your Lambda Architecture... ( )https://www.mapr.com/developercentral/lambda-architecture
  • 64. With Spark, Cassandra, and FiloDB Ma, where did all the components go? You mean I don't have to deal with Hadoop? Use Cassandra as a front end to store IoT data first
  • 65. Exactly-Once Ingestion from Kafka New rows appended via Kafka Writes are idempotent - no need to dedup! Converted to columnar chunks on ingest and stored in C* Only necessary columnar chunks are read into Spark for minimal I/O
  • 66. You can help! Send me your use cases for OLAP on Cassandra and Spark Especially IoT and Geospatial Email if you want to contribute
  • 67. Thanks... to the entire OSS community, but in particular: Lee Mighdoll, Nest/Google Rohit Rai and Satya B., Tuplejump My colleagues at Socrata   If you want to go fast, go alone. If you want to go far, go together. -- African proverb
  • 68. DEMO TIME GDELT: Regular C* Tables vs FiloDB
  • 70. When in doubt, use brute force - Ken Thompson
  • 71. Automatic Columnar Conversion using Custom Indexes Write to Cassandra as you normally do Custom indexer takes changes, merges and compacts into columnar chunks behind scenes
  • 72. Implementing Lambda is Hard Use real-time pipeline backed by a KV store for new updates Lots of moving parts Key-value store, real time sys, batch, etc. Need to run similar code in two places Still need to deal with ingesting data to Parquet/HDFS Need to reconcile queries against two different places