This document provides an overview of real-time analytics with Apache Cassandra and Apache Spark. It discusses how Spark can be used for stream processing on top of Cassandra as the storage layer. Spark Streaming ingests real-time data from sources like Kafka and processes it using DStreams, which operate on micro-batches; this also allows joining streaming and batch data. Cassandra is optimized for high write throughput and scales horizontally. Together, Spark and Cassandra enable real-time analytics over large datasets.
2. Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of several books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
4. Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? – Big Data + Real-Time = Stream Processing
5. What is Real-Time Analytics?
What is it?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and dashboards access the processed data
How does it work?
• Events → Analyze → Respond, with a short time to analyze & respond
Why do we need it?
§ Required – for new business models
§ Desired – for competitive advantage
6. Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …
8. Motivation – Why Apache Spark?
Hadoop MapReduce: data sharing on disk
• Each map/reduce step reads its input from HDFS and writes its result back to HDFS, so a chain of jobs pays a disk round-trip between every step.
Spark: speed up processing by using memory instead of disks
• Operations (op1, op2, …) pass intermediate results in memory on the way from input to output.
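The contrast above can be sketched in plain Python. This is illustrative only, not the Hadoop or Spark APIs: `mr_step` simulates a MapReduce stage that round-trips every intermediate result through storage, while `spark_pipeline` chains the same operations in memory.

```python
# Illustrative sketch, not the Hadoop or Spark APIs.
storage = {}  # stands in for HDFS

def mr_step(name, data, fn):
    # MapReduce-style: write the result to "disk", next step reads it back
    storage[name] = [fn(x) for x in data]
    return storage[name]

def spark_pipeline(data, *fns):
    # Spark-style: chain operations in memory, nothing touches "disk"
    for fn in fns:
        data = [fn(x) for x in data]
    return data

nums = [1, 2, 3]
a = mr_step("op1", nums, lambda x: x * 2)
b = mr_step("op2", a, lambda x: x + 1)
# b: [3, 5, 7] -- same result either way, but MapReduce paid
# two storage round-trips while the Spark-style chain paid none
```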
9. Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley's AMPLab
• Based on 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk
• One of the largest OSS communities in big data with over 200 contributors in 50+
organizations
• Open-sourced in 2010 – part of the Apache Software Foundation since 2014
23. Apache Kafka
distributed publish-subscribe messaging system
Designed for processing of real time activity stream data (logs, metrics collections,
social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use JMS API and standards
Kafka maintains feeds of messages in topics
[Diagram: producers publish messages to a Kafka cluster; consumers subscribe to topics.]
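A minimal pure-Python sketch of Kafka's model, illustrative only and not the Kafka API (the `Topic` class and its methods are made up for this example): a topic keeps an append-only log per partition, producers append messages, and messages with the same key land in the same partition, preserving their order.

```python
# Illustrative sketch of Kafka's topic/partition model, not the Kafka API.
class Topic:
    def __init__(self, partitions=3):
        # one append-only log per partition
        self.logs = [[] for _ in range(partitions)]

    def produce(self, key, message):
        # same key -> same partition, so per-key ordering is preserved
        self.logs[hash(key) % len(self.logs)].append(message)

    def consume(self, partition, offset=0):
        # consumers track their own offset and read from it
        return self.logs[partition][offset:]

topic = Topic(partitions=2)
topic.produce("sensor-1", "temp=21")
topic.produce("sensor-1", "temp=22")
ordered = [log for log in topic.logs if log]
# both messages share one partition log, in produce order
```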
32. Discretized Stream (DStream)
[Diagram: the input stream is divided into time intervals (time 1, time 2, … time n). The messages received in each interval form one RDD, so the Event DStream is a sequence of RDDs – RDD @time 1, RDD @time 2, etc., each holding message 1 … message n. A transformation such as map() produces a MappedDStream whose RDD at each time holds f(message 1) … f(message n), yielding result 1 … result n. DStream transformations only build up a lineage; actions such as saveAsHadoopFiles() trigger the actual Spark jobs. Time increases from batch to batch.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
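The micro-batch idea can be sketched in a few lines of plain Python, illustrative only and not the Spark API (`micro_batches` and `map_dstream` are invented names): bucket a stream of timestamped messages into time intervals, then apply the same function to every batch, just as map() on a DStream maps over each underlying RDD.

```python
from itertools import groupby

# Illustrative sketch of the DStream idea, not the Spark API.
def micro_batches(stream, interval):
    # each batch of (timestamp, message) pairs plays the role of "RDD @time i"
    keyed = sorted(stream, key=lambda tm: tm[0] // interval)
    for _, batch in groupby(keyed, key=lambda tm: tm[0] // interval):
        yield [msg for _, msg in batch]

def map_dstream(batches, f):
    # map() on a DStream = map f over every micro-batch
    return ([f(m) for m in batch] for batch in batches)

events = [(0, "a"), (1, "b"), (5, "c"), (11, "d")]
batches = micro_batches(events, interval=5)
results = list(map_dstream(batches, str.upper))
# results: [['A', 'B'], ['C'], ['D']]
```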
33. Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
Input DStreams
• Represents the stream of raw data received
from streaming sources
• Data can be ingested from many sources:
Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP
Socket, Akka actors, etc.
• Custom sources can easily be written for other data sources
Operations
• Same as Spark Core + Additional Stateful
transformations (window, reduceByWindow)
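The stateful windowed transformations mentioned above can be sketched in plain Python, illustrative only and not the Spark API (`reduce_by_window` is an invented name): keep the last few micro-batches and reduce over all of them each time a new batch arrives.

```python
from collections import deque

# Illustrative sketch of reduceByWindow, not the Spark API.
def reduce_by_window(batches, window, reduce_fn, zero):
    recent = deque(maxlen=window)  # sliding window of the last `window` batches
    for batch in batches:
        recent.append(batch)
        acc = zero
        for b in recent:          # reduce over everything still in the window
            for x in b:
                acc = reduce_fn(acc, x)
        yield acc

counts = list(reduce_by_window([[1, 2], [3], [4, 5]], window=2,
                               reduce_fn=lambda a, b: a + b, zero=0))
# counts: [3, 6, 12] -- each value covers the last two micro-batches
```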
35. Apache Cassandra
Apache Cassandra™ is a free
• Distributed…
• High performance…
• Extremely scalable…
• Fault tolerant (i.e. no single point of failure)…
post-relational database solution
Optimized for high write throughput
37. Motivation - Why NoSQL Databases?
• Dynamo Paper (2007)
• How to build a data store that is
• Reliable
• Performant
• “Always On”
• Nothing new and shiny
• 24 other papers cited
• Evolutionary
38. Motivation - Why NoSQL Databases?
• Google Big Table (2006)
• Richer data model
• 1 key and lots of values
• Fast sequential access
• 38 other papers cited
39. Motivation - Why NoSQL Databases?
• Cassandra Paper (2008)
• Distributed features of Dynamo
• Data Model and storage from BigTable
• February 2010 graduated to a top-level Apache
Project
40. Apache Cassandra – More than one server
All nodes participate in a cluster
Shared nothing
Add or remove as needed
More capacity? Add more servers
Node is a basic unit inside a cluster
Each node owns a range of partitions
Consistent Hashing
[Diagram: a four-node ring – Node 1 owns token range [0-25], Node 2 [26-50], Node 3 [51-75], Node 4 [76-100]; with replication, each range is also stored on neighbouring nodes.]
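Token-range placement can be sketched as follows. This is a simplified illustration, not Cassandra's actual partitioner (the 0-100 ring and node names follow the diagram; `token` and `owner` are invented helpers): hash the partition key onto the ring and pick the node whose range the token falls into.

```python
import hashlib
from bisect import bisect_left

# Illustrative sketch of consistent-hashing placement, simplified
# to the 0-100 ring from the diagram (not Cassandra's real partitioner).
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (100, "Node 4")]

def token(partition_key):
    # hash the partition key onto the ring
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % 101          # token in [0, 100]

def owner(partition_key):
    # pick the node whose range upper bound first covers the token
    bounds = [upper for upper, _ in RING]
    return RING[bisect_left(bounds, token(partition_key))][1]
```

Because the mapping depends only on the key's hash, every node can compute the owner locally without any coordination, which is what lets the cluster scale by just adding servers.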
41. Apache Cassandra – Fully Replicated
Client writes local
Data syncs across WAN
Replication per Data Center
[Diagram: two data centers, West and East, each with Nodes 1-4; a client writes to the local West data center and the data syncs across the WAN to East.]
42. Apache Cassandra
What is Cassandra NOT?
• A Data Ocean
• A Data Lake
• A Data Pond
• An In-Memory Database
• A Key-Value Store
• Not for Data Warehousing
What are good use cases?
• Product Catalog / Playlists
• Personalization (Ads, Recommendations)
• Fraud Detection
• Time Series (Finance, Smart Meter)
• IoT / Sensor Data
• Graph / Network data
43. How Cassandra stores data
• Model brought from Google Bigtable
• Row key and a lot of columns – up to 2 billion columns per row, billions of rows
• Column names are sorted (UTF8, Int, Timestamp, etc.)
• Each column holds a column name, a column value, a timestamp and a TTL
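The wide-row model above can be sketched with plain dictionaries. This is a simplified illustration, not Cassandra's storage engine or API (`insert` and `read_row` are invented helpers): a row key maps to columns kept in sorted column-name order, and each column carries a value, a write timestamp and an optional TTL.

```python
import time

# Illustrative sketch of the wide-row model, not Cassandra's storage engine.
table = {}

def insert(row_key, column_name, value, ttl=None):
    # each column stores value + write timestamp + optional TTL
    row = table.setdefault(row_key, {})
    row[column_name] = {"value": value, "timestamp": time.time(), "ttl": ttl}

def read_row(row_key):
    # columns come back sorted by column name
    return sorted(table.get(row_key, {}).items())

insert("station-7", "2015-06-01T10:00", 21.5)
insert("station-7", "2015-06-01T09:00", 20.9)
names = [name for name, _ in read_row("station-7")]
# names: ['2015-06-01T09:00', '2015-06-01T10:00'] -- sorted, not insert order
```

Sorted column names are what make time-series reads cheap: a slice of columns between two timestamps is a sequential scan.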
47. Spark and Cassandra Architecture
[Diagram: weather stations stream their data into the cluster; the Spark Connector links Cassandra with Spark Streaming (near real-time), SparkSQL (structured data), MLlib (machine learning) and GraphX (graph analysis).]
48. Spark and Cassandra Architecture
• Single node running Cassandra
• Spark Worker is really small
• Spark Master lives outside the node
• Spark Worker starts Spark Executors in separate JVMs
• Node local
[Diagram: the server runs the Cassandra node plus a Spark Worker that spawns Executors; the Spark Master sits outside the node.]
49. Spark and Cassandra Architecture
• Each node runs Spark and Cassandra
• Spark Master can make decisions based on token ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
[Diagram: four workers own token ranges 0-25, 26-50, 51-75 and 76-100; a node-local job will only have to analyze 25% of the data.]
51. Cassandra and Spark
                           Cassandra   Cassandra & Spark
Joins and Unions           No          Yes
Transformations            Limited     Yes
Outside Data Integration   No          Yes
Aggregations               Limited     Yes
53. Summary
Kafka
• Topics store information broken into
partitions
• Brokers store partitions
• Partitions are replicated for data
resilience
Cassandra
• Goals of Apache Cassandra are all
about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a
partition key
Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just Map and Reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources
Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the
DataStax connector
54. Lambda Architecture with Spark/Cassandra
[Diagram: data sources (e.g. social feeds) push events through a channel into data collection (messaging). The raw data is kept in a reservoir. The (analytical) batch data processing layer batch-computes over the raw data into a result store; the (analytical) real-time data processing layer runs stream/event processing into its own result store. A query engine merges the computed information from both result stores, and the data access layer serves reports, services, analytic tools and alerting tools.]
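The query engine's merge step is the heart of the Lambda Architecture and can be sketched in a few lines, illustrative only (the view names and `query` helper are made up): the batch layer's result store is complete but stale, the real-time layer's is fresh but partial, and a query combines both.

```python
# Illustrative sketch of the Lambda Architecture serving step.
batch_view = {"page-1": 100, "page-2": 40}   # recomputed periodically from raw data
realtime_view = {"page-1": 3, "page-3": 1}   # updated per event by stream processing

def query(key):
    # merge the stale-but-complete batch view with the fresh-but-partial
    # real-time view; the real-time view only covers events since the
    # last batch recompute, so the sum is the up-to-date answer
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

hits = query("page-1")
# hits: 103
```

In the architecture above, both views would live in Cassandra result stores: Spark batch jobs rewrite the batch view, while Spark Streaming keeps the real-time view current.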