OLAP WITH SPARK AND 
CASSANDRA 
#CassandraSummit 
EVAN CHAN 
SEPT 2014
WHO AM I? 
Principal Engineer, Socrata, Inc. 
@evanfchan 
http://github.com/velvia 
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE 
PEOPLE. 
data.edmonton.ca finances.worldbank.org data.cityofchicago.org 
data.seattle.gov data.oregon.gov data.wa.gov 
www.metrochicagodata.org data.cityofboston.gov 
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov 
data.nola.gov data.illinois.gov data.colorado.gov 
data.austintexas.gov data.undp.org www.opendatanyc.com 
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it 
data.montgomerycountymd.gov data.cityofnewyork.us 
data.acgov.org data.baltimorecity.gov data.energystar.gov 
data.somervillema.gov data.maryland.gov data.taxpayer.net 
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT SOCRATA 
Tens of thousands of datasets, each one up to 30 million rows 
Customer demand for billion row datasets 
Want to analyze across datasets
BIG DATA AT OOYALA 
2.5 billion analytics pings a day = almost a trillion events a 
year. 
Roll up tables - 30 million rows per day
HOW CAN WE ALLOW CUSTOMERS TO QUERY A 
YEAR'S WORTH OF DATA? 
Flexible - complex queries included 
Sometimes you can't denormalize your data enough 
Fast - interactive speeds 
Near Real Time - can't make customers wait hours before 
querying new data
RDBMS? POSTGRES? 
Start hitting latency limits at ~10 million rows 
No robust and inexpensive solution for querying across shards 
No robust way to scale horizontally 
Postgres runs each query on a single thread unless you partition 
(painful!) 
Complex and expensive to improve performance (e.g. rollup 
tables, huge expensive servers)
OLAP CUBES? 
Materialize summary for every possible combination 
Too complicated and brittle 
Takes forever to compute - not for real time 
Explodes storage and memory
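To put rough numbers on the storage explosion, here is a minimal sketch of how the count of summaries grows with the number of dimensions. The dimension cardinalities are made up for illustration and do not come from the talk.

```scala
// Hypothetical example: how many summaries a full OLAP cube has to materialize.
object CubeSize {
  // Every subset of the dimensions is one group-by combination (a "cuboid").
  def cuboids(numDimensions: Int): BigInt = BigInt(2).pow(numDimensions)

  // Upper bound on summary cells across all cuboids: each dimension is either
  // grouped on (c distinct values) or rolled up (one extra "ALL" value), so the
  // total is the product of (c + 1) over all dimensions.
  def maxCells(cardinalities: Seq[Long]): BigInt =
    cardinalities.map(c => BigInt(c) + 1).product

  def main(args: Array[String]): Unit = {
    // Assumed cardinalities, for illustration only: country, device, day, video.
    val dims = Seq(200L, 10L, 365L, 100000L)
    println(s"group-by combinations: ${cuboids(dims.size)}")
    println(s"upper bound on cells:  ${maxCells(dims)}")
  }
}
```

Real cubes are sparser than this bound, but every added dimension still doubles the number of group-by combinations.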
When in doubt, use brute force 
- Ken Thompson
CASSANDRA 
Horizontally scalable 
Very flexible data modelling (lists, sets, custom data types) 
Easy to operate 
No fear of number of rows or documents 
Best of breed storage technology, huge community 
BUT: Simple queries only
APACHE SPARK 
Horizontally scalable, in-memory queries 
Functional Scala transforms - map, filter, groupBy, sort 
etc. 
SQL, machine learning, streaming, graph, R, many more plugins 
all on ONE platform - feed your SQL results to a logistic 
regression, easy! 
THE Hottest big data platform, huge community, leaving 
Hadoop in the dust 
Developers love it
SPARK PROVIDES THE MISSING FAST, DEEP 
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA 
Scala solutions: 
DataStax integration: https://github.com/datastax/spark-cassandra-connector (CQL-based) 
Calliope
A bit more work: 
Use traditional Cassandra client with RDDs 
Use an existing InputFormat, like CqlPagedInputFormat 
The only reason to go this route is probably that you are not on a CQL version of 
Cassandra, or you're using Shark/Hive.
A SPARK AND CASSANDRA 
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS 
Combine best of breed storage and query platforms 
Take full advantage of evolution of each 
Storage handles replication for availability 
Query can replicate data for scaling read concurrency - 
independent!
SCALE NODES, NOT 
DEVELOPER TIME!!
KEEPING IT SIMPLE 
Maximize row scan speed 
Columnar representation for efficiency 
Compressed bitmap indexes for fast algebra 
Functional transforms for easy memoization, testing, 
concurrency, composition
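As a rough illustration of "bitmap indexes for fast algebra", the sketch below builds one bitmap of row ids per distinct column value and answers AND / OR filters with set operations. It uses the plain, uncompressed `BitSet` from the Scala standard library rather than a compressed bitmap format, and the column data is hypothetical.

```scala
import scala.collection.immutable.BitSet

// Minimal, uncompressed bitmap index: one bitmap of row ids per distinct value.
object BitmapIndex {
  def build[A](column: Seq[A]): Map[A, BitSet] =
    column.zipWithIndex
      .groupBy { case (value, _) => value }
      .map { case (value, pairs) => value -> BitSet(pairs.map(_._2): _*) }

  def main(args: Array[String]): Unit = {
    // Hypothetical columns, for illustration only.
    val crimeType = Seq("Burglary", "Theft", "Theft", "Burglary", "Theft")
    val district  = Seq("North", "North", "South", "South", "North")

    val byType     = build(crimeType)
    val byDistrict = build(district)

    // WHERE type = 'Theft' AND district = 'North'   ->  bitmap intersection
    val both = byType("Theft") & byDistrict("North")
    // WHERE type = 'Burglary' OR district = 'South' ->  bitmap union
    val either = byType("Burglary") | byDistrict("South")

    println(s"AND rows: ${both.toList}, OR rows: ${either.toList}")
  }
}
```

Because `build` is a pure function of the column, its result can be memoized and tested in isolation, which is the point of the functional-transforms bullet.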
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
INITIAL ATTEMPTS 
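import org.apache.spark.SparkContext._   // pair-RDD operations (needed before Spark 1.3)
// sc is an existing SparkContext
val rows = Seq( 
  Seq("Burglary", "19xx Hurston", 10), 
  Seq("Theft", "55xx Floatilla Ave", 5) 
) 
sc.parallelize(rows) 
  .map { values => (values(0), values) }          // key each row by crime type
  .groupByKey() 
  .mapValues(_.map(_(2).asInstanceOf[Int]).sum)   // cast needed: rows are Seq[Any]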
No existing generic query engine for Spark when we started 
(Shark was in infancy, had no indexes, etc.), so we built our own 
For every row, need to extract out needed columns 
Ability to select arbitrary columns means using Seq[Any], no 
type safety 
Boxing makes integer aggregation very expensive and memory 
inefficient
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage 
approach is dead 
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE 
Same layout in memory and on disk: 
Name Age 
Barak 46 
Hillary 66 
Each row is stored contiguously. All columns in row 2 come after 
row 1.
COLUMNAR STORAGE (MEMORY) 
Name column 
Row:  0 1 
Code: 0 1 
Dictionary: {0: "Barak", 1: "Hillary"} 
Age column 
Row:   0  1 
Value: 46 66
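The in-memory layout above can be sketched as follows. This is a minimal illustration, not the engine from the talk: the names `DictionaryColumn`, `Encoded`, `encode` and `decode` are hypothetical.

```scala
// Dictionary-encode a string column: distinct values get small integer codes,
// and the column itself is stored as a primitive Array[Int] of those codes.
object DictionaryColumn {
  final case class Encoded(dictionary: Vector[String], codes: Array[Int]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): Encoded = {
    val dictionary = values.distinct.toVector   // a value's code is its position here
    val codeOf     = dictionary.zipWithIndex.toMap
    Encoded(dictionary, values.map(codeOf).toArray)
  }

  def main(args: Array[String]): Unit = {
    val names = encode(Seq("Barak", "Hillary"))
    val ages  = Array(46, 66)                   // ints are stored directly, no dictionary
    println(names.dictionary)
    println(names.codes.mkString(" "))
    println(s"${names.decode(1)} is ${ages(1)}") // turn codes back into strings only on demand
  }
}
```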
COLUMNAR STORAGE (CASSANDRA) 
Review: each physical row in Cassandra (i.e. a partition, identified by its partition key) 
stores its columns together on disk. 
Schema CF 
Rowkey Type 
Name StringDict 
Age Int 
Data CF 
Rowkey 0 1 
Name 0 1 
Age 46 66
ADVANTAGES OF COLUMNAR STORAGE 
Compression 
Dictionary compression - HUGE savings for low-cardinality 
string columns 
Run-length encoding (RLE) 
Reduce I/O 
Only columns needed for query are loaded from disk 
Can keep strong types in memory, avoid boxing 
Batch multiple rows in one cell for efficiency
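Run-length encoding is the simpler of the two compression schemes named above, and a small sketch shows why it suits sorted or low-cardinality columns. The object and method names are hypothetical, and this is not the storage format used in the talk.

```scala
// Run-length encoding: store each run of equal values once, with its length.
object RunLength {
  def encode[A](column: Seq[A]): Vector[(A, Int)] =
    column.foldLeft(Vector.empty[(A, Int)]) { (runs, value) =>
      runs.lastOption match {
        // Same value as the current run: extend it by one.
        case Some((last, count)) if last == value => runs.init :+ (value -> (count + 1))
        // Different value (or first element): start a new run.
        case _                                    => runs :+ (value -> 1)
      }
    }

  def decode[A](runs: Seq[(A, Int)]): Vector[A] =
    runs.toVector.flatMap { case (value, count) => Vector.fill(count)(value) }

  def main(args: Array[String]): Unit = {
    // Hypothetical column, for illustration only.
    val column = Seq("CHN", "CHN", "CHN", "USA", "USA", "CHN")
    val runs   = encode(column)
    println(runs)
    println(decode(runs) == column.toVector)
  }
}
```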
ADVANTAGES OF COLUMNAR QUERYING 
Cache locality for aggregating column of data 
Take advantage of CPU/GPU vector instructions for ints / 
doubles 
Avoid row-ifying until the last possible moment 
Easy to derive computed columns 
Use vector data / linear math libraries
COLUMNAR QUERY ENGINE VS ROW-BASED IN 
SCALA 
Custom RDD of column-oriented blocks of data 
Uses ~10x less heap 
10-100x faster for group by's on a single node 
Scan speed in excess of 150M rows/sec/core for integer 
aggregations
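The kind of loop a columnar group-by reduces to can be sketched as below. This is an assumption about the general technique, not the custom RDD from the talk: the group-by column is taken to be dictionary-encoded, and `sumByKey` and its parameters are hypothetical names.

```scala
// Group-by-sum over two columns held as primitive arrays: no per-row objects,
// no boxing, and sequential memory access.
object ColumnarGroupBy {
  // keyCodes: dictionary codes of the group-by column (one per row)
  // values:   the column to sum (one per row), as unboxed ints
  // numKeys:  number of distinct keys, i.e. the dictionary size
  def sumByKey(keyCodes: Array[Int], values: Array[Int], numKeys: Int): Array[Long] = {
    require(keyCodes.length == values.length, "columns must have equal length")
    val sums = new Array[Long](numKeys)
    var row = 0
    while (row < values.length) {   // tight loop over primitive arrays
      sums(keyCodes(row)) += values(row)
      row += 1
    }
    sums
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data, for illustration only.
    val dictionary = Vector("Burglary", "Theft")
    val keyCodes   = Array(0, 1, 1, 0)
    val counts     = Array(10, 5, 7, 3)
    val sums       = sumByKey(keyCodes, counts, dictionary.size)
    dictionary.zip(sums).foreach { case (key, sum) => println(s"$key -> $sum") }
  }
}
```

Compared with the `Seq[Any]` rows of the initial attempt, nothing here is cast or boxed, and only the two columns the query needs are touched.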
SO, GREAT, OLAP WITH CASSANDRA AND 
SPARK. NOW WHAT?
DATASTAX: CASSANDRA SPARK INTEGRATION 
DataStax Enterprise now comes with HA Spark 
HA master, that is. 
spark-cassandra-connector
SPARK SQL 
Appeared with Spark 1.0 
In-memory columnar store 
Can read from Parquet and JSON now; direct Cassandra 
integration coming 
Querying is not column-based (yet) 
No indexes 
Write custom functions in Scala .... take that Hive UDFs!! 
Integrates well with MLBase, Scala/Java/Python
CACHING A SQL TABLE FROM CASSANDRA 
import com.datastax.spark.connector._   // adds cassandraTable to SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext.createSchemaRDD        // implicit RDD -> SchemaRDD conversion
sc.cassandraTable[GDeltRow]("gdelt", "1979to2009") 
  .registerAsTable("gdelt") 
sqlContext.cacheTable("gdelt") 
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER …
Remember Spark is lazy, nothing is executed until the 
collect() 
In Spark 1.1+: registerTempTable
SOME PERFORMANCE NUMBERS 
GDELT dataset, 117 million rows, 57 columns, ~50GB 
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory 
Query                                                        Avg time (sec) 
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN'   0.49 
SELECT 4 columns Top K                                       1.51 
SELECT Top countries by Avg Tone (Group By)                  2.69
IMPORTANT - CACHING 
By default, queries will read data from source - Cassandra - 
every time 
Spark RDD Caching - much faster, but big waste of memory 
(row oriented) 
Spark SQL table caching - fastest, memory efficient
WORK STILL NEEDED 
Indexes 
Columnar querying for fast aggregation 
Tachyon support for Cassandra/CQL 
Efficient reading from columnar storage formats
LESSONS 
Extremely fast distributed querying for these use cases 
Data doesn't change much (and only bulk changes) 
Analytical queries for subset of columns 
Focused on numerical aggregations 
Small numbers of group bys 
For fast query performance, cache your data using Spark SQL 
Concurrent queries are a frontier with Spark. Use additional 
Spark contexts.
THANK YOU!
EXTRA SLIDES
EXAMPLE CUSTOM INTEGRATION USING 
ASTYANAX 
val cassRDD = sc.parallelize(rowkeys) 
  .flatMap { rowkey => 
    columnFamily.get(rowkey).execute().asScala 
  }
SOME COLUMNAR ALTERNATIVES 
MonetDB and Infobright - true columnar stores (storage + 
querying) 
Vertica and C-Store 
Google BigQuery - columnar cloud database, Dremel based 
Amazon Redshift

Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark

  • 1. OLAP WITH SPARK AND CASSANDRA #CassandraSummit EVAN CHAN SEPT 2014
  • 2. WHO AM I? Principal Engineer, Socrata, Inc. @evanfchan http://github.com/velvia Creator of Spark Job Server
  • 3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE. data.edmonton.ca finances.worldbank.org data.cityofchicago.org data.seattle.gov data.oregon.gov data.wa.gov www.metrochicagodata.org data.cityofboston.gov info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov data.nola.gov data.illinois.gov data.colorado.gov data.austintexas.gov data.undp.org www.opendatanyc.com data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it data.montgomerycountymd.gov data.cityofnewyork.us data.acgov.org data.baltimorecity.gov data.energystar.gov data.somervillema.gov data.maryland.gov data.taxpayer.net bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
  • 4. WE ARE SWIMMING IN DATA!
  • 5. BIG DATA AT SOCRATA Tens of thousands of datasets, each one up to 30 million rows Customer demand for billion row datasets Want to analyze across datasets
  • 6. BIG DATA AT OOYALA 2.5 billion analytics pings a day = almost a trillion events a year. Roll up tables - 30 million rows per day
  • 7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA? Flexible - complex queries included Sometimes you can't denormalize your data enough Fast - interactive speeds Near Real Time - can't make customers wait hours before querying new data
  • 8. RDBMS? POSTGRES? Start hitting latency limits at ~10 million rows No robust and inexpensive solution for querying across shards No robust way to scale horizontally Postgres runs a query on a single thread unless you partition (painful!) Complex and expensive to improve performance (e.g. rollup tables, huge expensive servers)
  • 9. OLAP CUBES? Materialize summary for every possible combination Too complicated and brittle Takes forever to compute - not for real time Explodes storage and memory
  • 10. When in doubt, use brute force - Ken Thompson
  • 11.
  • 12. CASSANDRA Horizontally scalable Very flexible data modelling (lists, sets, custom data types) Easy to operate No fear of number of rows or documents Best of breed storage technology, huge community BUT: Simple queries only
  • 13. APACHE SPARK Horizontally scalable, in-memory queries Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! THE Hottest big data platform, huge community, leaving Hadoop in the dust Developers love it
  • 14. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  • 15. INTEGRATING SPARK AND CASSANDRA Scala solutions: Datastax integration: https://github.com/datastax/spark-cassandra-connector (CQL-based) Calliope
  • 16. A bit more work: Use traditional Cassandra client with RDDs Use an existing InputFormat, like CqlPagedInputFormat The only reason to go this route is probably that you are not on a CQL version of Cassandra, or you're using Shark/Hive.
  • 17. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  • 18. SEPARATE STORAGE AND QUERY LAYERS Combine best of breed storage and query platforms Take full advantage of evolution of each Storage handles replication for availability Query can replicate data for scaling read concurrency - independent!
  • 19. SCALE NODES, NOT DEVELOPER TIME!!
  • 20. KEEPING IT SIMPLE Maximize row scan speed Columnar representation for efficiency Compressed bitmap indexes for fast algebra Functional transforms for easy memoization, testing, concurrency, composition
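The "bitmap indexes for fast algebra" point on slide 20 can be illustrated with a minimal sketch. It assumes one bitmap per distinct column value and uses the standard library's uncompressed `BitSet` in place of a compressed bitmap; the object name, column names, and sample data are hypothetical and not taken from the talk.

```scala
import scala.collection.immutable.BitSet

// Hypothetical illustration: one bitmap per distinct column value,
// where bit i is set when row i holds that value.
object BitmapIndexSketch {
  // Build a map from each distinct value to the set of row ids holding it.
  def buildIndex[A](column: Seq[A]): Map[A, BitSet] =
    column.zipWithIndex
      .groupBy(_._1)
      .map { case (value, pairs) => value -> BitSet(pairs.map(_._2): _*) }

  // Made-up sample columns (row ids 0 to 4).
  val crimeType = Seq("Burglary", "Theft", "Burglary", "Theft", "Arson")
  val district  = Seq("North", "North", "South", "South", "North")

  val byType     = buildIndex(crimeType)
  val byDistrict = buildIndex(district)

  // WHERE type = 'Burglary' AND district = 'North' becomes a bitwise AND.
  val burglaryNorth: BitSet = byType("Burglary") & byDistrict("North")

  // WHERE type = 'Theft' OR type = 'Arson' becomes a bitwise OR.
  val theftOrArson: BitSet = byType("Theft") | byType("Arson")
}
```

Filters combine as set operations on row ids, so no row has to be read until the final bitmap is known.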
  • 22. EVEN BETTER: TACHYON OFF-HEAP CACHING
  • 23. INITIAL ATTEMPTS
      val rows = Seq( Seq("Burglary", "19xx Hurston", 10), Seq("Theft", "55xx Floatilla Ave", 5) )
      sc.parallelize(rows)
        .map { values => (values(0), values) }          // key each row by its first column
        .groupByKey
        .mapValues(_.map(_(2).asInstanceOf[Int]).sum)   // sum the third column per key
  • 24. No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own For every row, need to extract out needed columns Ability to select arbitrary columns means using Seq[Any], no type safety Boxing makes integer aggregation very expensive and memory inefficient
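The boxing cost described on slide 24 can be seen in a small sketch that sums the same values from `Seq[Any]` rows and from a primitive `Array[Int]`. The object and method names are hypothetical; the sample rows reuse the ones from slide 23.

```scala
object BoxingSketch {
  // Row-oriented: every cell is an Any, so each Int is stored as a boxed java.lang.Integer.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5)
  )

  // Summing one column needs a runtime cast and an unboxing for every row.
  def sumRows(rs: Seq[Seq[Any]], col: Int): Long =
    rs.foldLeft(0L)((acc, row) => acc + row(col).asInstanceOf[Int])

  // Column-oriented: the same values in a primitive array, with no boxing or casts.
  val counts: Array[Int] = Array(10, 5)

  def sumColumn(values: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < values.length) { total += values(i); i += 1 }
    total
  }
}
```

Both functions return the same total; the difference is that the row version gives up type safety and pays an object allocation per stored integer.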
  • 26. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 27. TRADITIONAL ROW-BASED STORAGE Same layout in memory and on disk:
      Name    | Age
      Barak   | 46
      Hillary | 66
      Each row is stored contiguously. All columns in row 2 come after row 1.
  • 28. COLUMNAR STORAGE (MEMORY)
      Name column (dictionary codes): row 0 = 0, row 1 = 1
      Dictionary: {0: "Barak", 1: "Hillary"}
      Age column: row 0 = 46, row 1 = 66
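A dictionary-encoded column like the Name column on slide 28 could be sketched as follows. `DictColumn` and `encode` are hypothetical names for illustration, and assigning codes in order of first appearance is an assumption, not something the slides specify.

```scala
object DictionaryColumnSketch {
  // A string column stored as small integer codes plus a code -> value dictionary.
  final case class DictColumn(codes: Array[Int], dictionary: Vector[String]) {
    // Decode a single row by looking its code up in the dictionary.
    def apply(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): DictColumn = {
    val dictionary = values.distinct.toVector   // codes follow order of first appearance
    val codeOf = dictionary.zipWithIndex.toMap
    DictColumn(values.map(codeOf).toArray, dictionary)
  }

  // The two columns from the slide, each held in its own array.
  val names = encode(Seq("Barak", "Hillary"))
  val ages: Array[Int] = Array(46, 66)
}
```

A repeated string costs one `Int` per row plus a single dictionary entry, which is where the savings for low-cardinality columns come from.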
  • 29. COLUMNAR STORAGE (CASSANDRA) Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.
      Schema CF
        Rowkey | Type
        Name   | StringDict
        Age    | Int
      Data CF
        Rowkey | 0  | 1
        Name   | 0  | 1
        Age    | 46 | 66
  • 30. ADVANTAGES OF COLUMNAR STORAGE Compression Dictionary compression - HUGE savings for low-cardinality string columns RLE Reduce I/O Only columns needed for query are loaded from disk Can keep strong types in memory, avoid boxing Batch multiple rows in one cell for efficiency
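The RLE (run-length encoding) item on slide 30 can be sketched in a few lines. This is a generic textbook version with hypothetical names, not the encoding used in the talk's engine.

```scala
object RunLengthSketch {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode[A](values: Seq[A]): Vector[(A, Int)] =
    values.foldLeft(Vector.empty[(A, Int)]) { (runs, v) =>
      runs.lastOption match {
        case Some((prev, n)) if prev == v => runs.init :+ ((prev, n + 1)) // extend the current run
        case _                            => runs :+ ((v, 1))             // start a new run
      }
    }

  // Expand the runs back into the original sequence.
  def decode[A](runs: Seq[(A, Int)]): Vector[A] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector
}
```

RLE pays off when a column is sorted or clustered so that equal values sit next to each other; on a column with no repeats it stores one pair per value and saves nothing.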
  • 31. ADVANTAGES OF COLUMNAR QUERYING Cache locality for aggregating column of data Take advantage of CPU/GPU vector instructions for ints / doubles avoid row-ifying until last possible moment easy to derive computed columns Use vector data / linear math libraries
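As a rough illustration of slide 31, a group-by average over two parallel primitive columns might look like the sketch below. It assumes the grouping column is already dictionary-encoded into codes from 0 until `numKeys`; the function name and parameters are hypothetical.

```scala
object ColumnarGroupBySketch {
  // SELECT key, AVG(value) ... GROUP BY key, computed over two parallel columns.
  // keyCodes holds dictionary codes in the range 0 until numKeys (an assumption of this sketch).
  def avgByKey(keyCodes: Array[Int], values: Array[Double], numKeys: Int): Array[Double] = {
    require(keyCodes.length == values.length, "columns must have equal length")
    val sums   = new Array[Double](numKeys)
    val counts = new Array[Int](numKeys)
    var i = 0
    while (i < keyCodes.length) {   // one sequential pass over primitive arrays
      val k = keyCodes(i)
      sums(k) += values(i)
      counts(k) += 1
      i += 1
    }
    // Results are materialized only here, once per group; empty groups yield NaN.
    Array.tabulate(numKeys)(k => if (counts(k) == 0) Double.NaN else sums(k) / counts(k))
  }
}
```

The loop touches only the two columns the query needs and reads them sequentially, which is the cache-locality argument the slide makes.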
  • 32. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA Custom RDD of column-oriented blocks of data Uses ~10x less heap 10-100x faster for group by's on a single node Scan speed in excess of 150M rows/sec/core for integer aggregations
  • 33. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  • 34.
  • 35. DATASTAX: CASSANDRA SPARK INTEGRATION Datastax Enterprise now comes with HA Spark HA master, that is. spark-cassandra-connector
  • 36. SPARK SQL Appeared with Spark 1.0 In-memory columnar store Can read from Parquet and JSON now; direct Cassandra integration coming Querying is not column-based (yet) No indexes Write custom functions in Scala .... take that Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 37. CACHING A SQL TABLE FROM CASSANDRA
      import com.datastax.spark.connector._   // adds sc.cassandraTable
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext.createSchemaRDD       // implicit RDD -> SchemaRDD conversion (Spark 1.0/1.1)
      sc.cassandraTable[GDeltRow]("gdelt", "1979to2009") .registerAsTable("gdelt")
      sqlContext.cacheTable("gdelt")
      sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER
      Remember: Spark is lazy, so nothing is executed until the collect(). In Spark 1.1+, use registerTempTable.
  • 38. SOME PERFORMANCE NUMBERS GDELT dataset, 117 million rows, 57 columns, ~50GB. Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory.
      Query                                                      | Avg time (sec)
      SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN' | 0.49
      SELECT 4 columns Top K                                     | 1.51
      SELECT Top countries by Avg Tone (Group By)                | 2.69
  • 39. IMPORTANT - CACHING By default, queries will read data from source - Cassandra - every time Spark RDD Caching - much faster, but big waste of memory (row oriented) Spark SQL table caching - fastest, memory efficient
  • 40. WORK STILL NEEDED Indexes Columnar querying for fast aggregation Tachyon support for Cassandra/CQL Efficient reading from columnar storage formats
  • 41. LESSONS Extremely fast distributed querying for these use cases Data doesn't change much (and only bulk changes) Analytical queries for subset of columns Focused on numerical aggregations Small numbers of group bys For fast query performance, cache your data using Spark SQL Concurrent queries are a frontier with Spark. Use additional Spark contexts.
  • 44. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX
      val cassRDD = sc.parallelize(rowkeys)
        .flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }
  • 45. SOME COLUMNAR ALTERNATIVES MonetDB and Infobright - true columnar stores (storage + querying) Vertica and C-Store Google BigQuery - columnar cloud database, Dremel based Amazon Redshift