This document provides an agenda for a presentation on integrating Apache Cassandra and Apache Spark. The presentation covers RDBMS vs. NoSQL databases; an overview of Cassandra, including its data model and queries; and Spark, including RDDs and running Spark over Cassandra data. Examples show joins over Cassandra data in Spark for both simple and complex queries.
2. Big Data Era
• Online applications
• Internet of Things
• Big Data:
– Data Velocity
– Data Variety
– Data Volume
– Data Complexity
10. RDBMS vs NoSQL
RDBMS relates to ACID (all four guarantees):
A: Atomicity
C: Consistency
I: Isolation
D: Durability
NoSQL relates to CAP (pick 2 out of 3):
C: Consistency
A: Availability
P: Partition Tolerance
5. NoSQL
• Shared Nothing
– remove dependency between the scaling units
– private memory and peripheral disks
• Master-less Architecture
– Most NoSQL databases offer a master-master data replication strategy
(some exceptions apply, e.g. Redis)
6. What is Apache Cassandra
• Open-source distributed DBMS
• Initially developed at Facebook; open-sourced in 2008
• Developed in Java
• Combination of
Dynamo (Amazon) – architecture principles
BigTable (Google) – SSTable design
• DataStax Enterprise commercial distribution
7. Why Cassandra?
• Storing Huge Datasets – Elastic Scalability
• Multi-Master Replication
• High Availability – No SPOF
• Eventual (Tunable) Consistency
• Flexible Data Model
• Data Locality
• Very High Write Throughput
8. Gossip & Seeds
• Gossip
– Peer-to-peer protocol
– Discovers location and state information
– Timer runs every second
– Info: onJoin, onAlive, onDead, onChange
• Seeds
– Known nodes used for bootstrapping other nodes into the ring
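As a rough illustration of how gossip converges, here is a minimal plain-Python simulation (not Cassandra's actual implementation): on each timer tick every node merges state with one random peer, and within a few rounds every node knows every heartbeat.

```python
import random

def gossip_round(states, rng):
    """Each node exchanges its known state with one random peer."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        # Merge: keep the highest heartbeat version seen for every node.
        merged = {n: max(states[node].get(n, 0), states[peer].get(n, 0))
                  for n in set(states[node]) | set(states[peer])}
        states[node] = dict(merged)
        states[peer] = dict(merged)

# Three nodes, each initially knowing only its own heartbeat.
states = {n: {n: 1} for n in ("A", "B", "C")}
rng = random.Random(42)
for _ in range(5):          # five one-second timer ticks
    gossip_round(states, rng)

# After a few rounds, every node knows every node's heartbeat.
assert all(set(s) == {"A", "B", "C"} for s in states.values())
```

In the real protocol the exchanged state also carries the onJoin/onAlive/onDead/onChange information the slide lists; the merge rule (keep the newest version) is the same idea.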
9. Data Distribution & Replication
• In Cassandra data distribution and replication go together
• Replication is affected by:
1. Consistent Hashing & Partitioners
Data partitioning methodology across the cluster
2. Replication Strategy
Determines the replicas of each row of data
3. Snitch
Defines topology information for replicas placement
4. Virtual Nodes
Assign many token ranges, and thus data ownership, to each physical machine
10. Consistent Hashing
• Distributes the data across the cluster
• Partitions the data based on the partition key of each row
Row Key | Hashed Value          | Node
John    | 7723358927203680754   | D
Andrew  | -6723372854036780875  | A
Mike    | 1168604627387940318   | C
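The node-selection rule above can be sketched in a few lines of Python. This is illustrative only: MD5 stands in for Cassandra's actual partitioner hash, and the node names and tokens are invented.

```python
import bisect
import hashlib

# Illustrative stand-in for the partitioner hash (Cassandra uses Murmur3).
def token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Hypothetical ring: each node owns the range ending at its own token.
ring = sorted((token(f"node-{n}"), n) for n in ("A", "B", "C", "D"))
tokens = [t for t, _ in ring]

def owner(row_key: str) -> str:
    """The first node clockwise from the row key's token owns the row."""
    i = bisect.bisect_right(tokens, token(row_key)) % len(ring)
    return ring[i][1]

# The same key always hashes to the same token, hence the same node.
assert owner("John") == owner("John")
assert owner("Mike") in {"A", "B", "C", "D"}
```

The modulo wrap-around is what makes the token space a ring: a key hashing past the largest token lands on the first node.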
11. Partitioners
• Defines the hash function for consistent hashing
• Computes the token for each row key
Types
Murmur3Partitioner (default)
Values: [ -2^63 … 2^63 - 1 ]
RandomPartitioner (MD5-based)
Values: [ 0 … 2^127 - 1 ]
ByteOrderedPartitioner (not recommended)
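To make the token ranges concrete, here is a hedged sketch of a RandomPartitioner-style token: the MD5 digest of the key mapped into [0, 2^127 - 1]. Cassandra's exact MD5-to-BigInteger conversion differs in detail; this only shows the idea of hashing a key into the ring's numeric range.

```python
import hashlib

# Illustrative only: hash a partition key into the RandomPartitioner
# token range [0, 2**127 - 1]; not Cassandra's exact byte handling.
def md5_token(key: bytes) -> int:
    return int(hashlib.md5(key).hexdigest(), 16) % 2**127

t = md5_token(b"I-10:2015:3")
assert 0 <= t < 2**127            # always inside the documented range
assert t == md5_token(b"I-10:2015:3")  # deterministic for a given key
```

Murmur3Partitioner works the same way but produces signed 64-bit tokens in [-2^63, 2^63 - 1].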
12. Data Replication
• Replication Factor
Number of replicas across the cluster (e.g. RF = 2, RF = 4)
• Replication Strategies
SimpleStrategy
Single data center
Replicas are placed clockwise around the ring; network topology is not taken into account
replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
NetworkTopologyStrategy
Multiple data centers
Replicas are placed in different racks
13. Snitch
• Affects where the replicas are placed
• Determines the data centers and the racks that the machines belong to
• 9 different snitches are available, for example:
Simple Snitch: Single Data Center
Gossiping Property File Snitch: automatic update for new nodes via
gossip – production recommended
Property File Snitch: Location of nodes determined by rack and data
center
EC2 Multi Region Snitch: Amazon Web Services
Google Cloud Snitch: Multiple Regions
14. Cassandra Virtual Nodes
Ring without vnodes:
Contiguous token
Contiguous data range
One large range per node
Ring with vnodes:
Non-contiguous tokens
Non-contiguous data ranges
Many smaller ranges per node
Why vnodes?
Even distribution of data
Faster rebuild after a node failure
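The "even distribution" claim can be checked with a small Monte-Carlo sketch in plain Python (node counts and token counts invented): compare one random token per node against 256 vnodes per node.

```python
import bisect
import random

def node_shares(num_nodes, tokens_per_node, rng, samples=50_000):
    """Estimate each node's share of the hash space on a [0, 1) ring."""
    ring = sorted((rng.random(), n)
                  for n in range(num_nodes)
                  for _ in range(tokens_per_node))
    points = [t for t, _ in ring]
    hits = [0] * num_nodes
    for _ in range(samples):
        # Owner = first token clockwise from the sampled point.
        i = bisect.bisect_right(points, rng.random()) % len(ring)
        hits[ring[i][1]] += 1
    return [h / samples for h in hits]

rng = random.Random(0)
single = node_shares(6, 1, rng)     # one large contiguous range per node
vnodes = node_shares(6, 256, rng)   # many smaller ranges per node

spread = lambda xs: max(xs) - min(xs)
# With vnodes, ownership clusters far more tightly around 1/6.
assert spread(vnodes) < spread(single)
```

The same effect explains the faster rebuild: when a node dies, its many small ranges are re-streamed from many peers in parallel instead of one large range from a few.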
16. Client Requests
Client
Running application, read/write requests
(JAVA, Python, C++, PHP, etc…)
Coordinator
• Handles the requests
• Finds Nodes based on Partitioner and Replica Placement Strategy
• Any Node can act as the Coordinator
17. Consistency Levels: Write and Read
• Tunable consistency
• Specifies how many replicas must respond for an operation to be considered a success

LEVEL        | WRITE | READ
ONE          |   X   |  X
TWO          |   X   |  X
THREE        |   X   |  X
QUORUM       |   X   |  X
ANY          |   X   |
ALL          |   X   |  X
EACH_QUORUM  |   X   |  X
LOCAL_QUORUM |   X   |  X
LOCAL_ONE    |   X   |  X

QUORUM = floor(RF/2) + 1
R + W > N guarantees strong consistency, where
R: number of replicas that must respond to a read
W: number of replicas that must acknowledge a write
N: replication factor
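The quorum arithmetic above can be written out directly (a minimal sketch of the formulas, not a driver API):

```python
def quorum(rf: int) -> int:
    """Cassandra's QUORUM level: a majority of the RF replicas."""
    return rf // 2 + 1

def strongly_consistent(r: int, w: int, rf: int) -> bool:
    """R + W > N guarantees the read and write replica sets overlap."""
    return r + w > rf

rf = 3
assert quorum(rf) == 2
# QUORUM writes + QUORUM reads always overlap in at least one replica,
# so a read sees the latest acknowledged write:
assert strongly_consistent(quorum(rf), quorum(rf), rf)
# ONE + ONE gives no such guarantee (eventual consistency only):
assert not strongly_consistent(1, 1, rf)
```

This is why QUORUM/QUORUM is the usual recipe for strong consistency without paying the latency of ALL.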
19. C* Data Model
The Cassandra data model is the backbone of efficient queries
Design for the queries first
First concepts:
Keyspace: similar to a relational schema; contains column families
Column Family (CF): similar to a relational table; contains the data
Super Column (SC): a column whose value is itself a group of columns (legacy concept)
20. C* Data Model
Next concepts:
– Column-based key-value store (multi-level dictionary)
– Think of it as a JSON representation, or Map[String, Map[String, Data]]
– SST: Sorted String Table

Column family → row keys → columns → values:

{"Street Monitor":
    {"Hollywood":    { "avg_speed": 75,
                       "vehicles": 45,
                       "time": "2015-03-02 09:35" },
     "Santa Monica": { "avg_speed": 35,
                       "vehicles": 50,
                       "time": "2015-03-02 10:35" }
    }
}

Here "Street Monitor" is the column family, "Hollywood" and "Santa Monica" are the row keys, and avg_speed, vehicles, and time are the columns holding the values.
21. C* PRIMARY KEY
Last concept – the Primary Key:
– Remember the JSON format
– Storage is a 2-level nested HashMap:
Level 1: keyed by the partition key
Level 2: keyed by the clustering key, holding the data
– A table/column family has a primary key consisting of:
Partition Key: responsible for hashing the data to the corresponding physical machine. Make it random to get evenly distributed datasets.
Clustering Key: responsible for ordering the data inside the table.
22. C* PRIMARY KEY
Types of C* Primary Key:
1. Compound Key
Exactly one partition key
e.g. PRIMARY KEY (partition_key1)
PRIMARY KEY (partition_key1, clustering_key1)
PRIMARY KEY (partition_key1, clustering_key1, clustering_key2, …)
2. Composite Key
Two or more partition key columns; careful with the syntax (note the extra parentheses)
e.g. PRIMARY KEY ( (partition_key1, partition_key2) )
PRIMARY KEY ( (partition_key1, partition_key2, …), clustering_key1, clustering_key2, … )
Skinny Rows & Wide Rows
Skinny rows: the primary key contains only the partition key
Wide rows: the primary key contains columns other than the partition key
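The 2-level map above can be pictured with plain Python dicts; the table below is a hypothetical slice of the street-monitoring example (all values invented):

```python
# Sketch of the 2-level nested map: partition key -> clustering key -> columns.
table = {
    # Level 1: the partition key (onstreet, year, month) locates the machine.
    ("I-10", 2015, 3): {
        # Level 2: the clustering key (day, time) orders rows in the partition.
        (1, 900): {"speed": 72, "volume": 40},
        (1, 905): {"speed": 68, "volume": 47},
        (2, 900): {"speed": 75, "volume": 38},
    },
}

partition = table[("I-10", 2015, 3)]
# Clustering keys keep the partition's rows in sorted order on disk;
# here the dict preserves the insertion order, which is already sorted.
assert sorted(partition) == list(partition)
```

A skinny row would have only the outer key; the inner clustering-key level is what makes this a wide row.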
23. BDTS Example
Keyspace: highway
client.CreateTable ("""
CREATE TABLE IF NOT EXISTS highway.street_monitoring (
ONSTREET varchar,
YEAR int,
MONTH int,
DAY int,
TIME int,
POSTMILE float,
DIRECTION int,
FROMSTREET varchar,
TOSTREET varchar,
SPEED int,
VOLUME int,
OCCUPANCY int,
HOVSPEED int,
PRIMARY KEY ( (ONSTREET, YEAR, MONTH), DAY, TIME, POSTMILE, DIRECTION )
);
""")
client.CreateIndex("highway","street_monitoring","SPEED")
client.CreateIndex("highway","street_monitoring","FROMSTREET")
client.CreateIndex("highway","street_monitoring","TOSTREET")
client.CreateTable ("""
CREATE TABLE IF NOT EXISTS highway.regional_monitoring (
REGION varchar,
SPEED int,
VOLUME int,
OCCUPANCY int,
HOVSPEED int,
YEAR int,
HH int,
MONTH int,
DAY int,
SENSOR_ID int,
PRIMARY KEY ( ( REGION,YEAR,MONTH ),HH, DAY, SENSOR_ID )
);
""")
client.CreateIndex("highway","regional_monitoring","SPEED")
Partition Key: Onstreet, Year, Month
Clustering Key: Day, Time, Postmile, Direction
Partition Key: Region, Year, Month
Clustering Key: HH, Day, Sensor_id
24. C* Queries
• Pure CQL does not support JOINs or subqueries; GROUP BY / ORDER BY apply only to clustering columns
• No aggregate functions in Cassandra 2.0.x; later versions add them
• Each part of the primary key can be restricted only if the preceding parts are restricted
Guidelines:
Partition key columns support the = operator
The last column in the partition key also supports the IN operator
Clustering columns support the =, >, >=, <, and <= operators
Secondary index columns support the = operator
Query 1 – only some parts of the partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND month=3
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Partition key part year must be restricted since preceding part is"
Query 2 – full partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2015 AND month IN (2,4)
Error Message: None
25. C* Queries
Query 3 – range on the partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2014 AND month<=1
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"
Query 4 – clustering columns without the partition key
SELECT * FROM highway.street_monitoring WHERE day>=2 AND day<=3
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Query 5 – range on a secondary index
SELECT * FROM highway.street_monitoring WHERE onstreet='47' AND year=2014 AND month=2 AND day=21 AND time>=360 AND time<=7560 AND speed>30
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
26. Cassandra Configurations
{CASSANDRA_HOME}/conf/cassandra.yaml
listen_address: internal IP address used for node-to-node communication and gossip
broadcast_address: external IP address when deployed across multiple regions
rpc_address: address client drivers connect to (internal IP or hostname)
seeds: addresses of the seed nodes
commitlog_directory: the commit log is written sequentially at all times, so it is sensitive to seek time
saved_caches_directory: where table key and row caches are stored
data_file_directories: where SSTables live; holds all data written to the node
{CASSANDRA_HOME}/conf/cassandra-env.sh
MAX_HEAP_SIZE
sets the maximum heap size for the JVM
Default 1 GB; do not set it too high (8 GB max)
HEAP_NEWSIZE
young-generation size; a good guideline is 100 MB per CPU core
27. C* Last Call
• Real-Time Data – Clustering
CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
28. Big Data Analytics Stack - BDAS
Origin Berkeley AMP Lab
Multiple Packages
Multiple Data Sources
Spark is the Kernel of Functionality
29. What is Spark
• An open-source cluster computing platform for fast, general-purpose, large-scale data analytics
• In-memory computations
• Software suite
• Written in Scala
• Highly accessible (Java, Python, Scala, SQL APIs)
• Started in 2009 at the Berkeley AMP Lab
• Master-slave architecture
• Under active development, with numerous contributors
30. Spark vs Hadoop
Hadoop MapReduce issues:
1. Data replication between jobs
2. Heavy disk I/O
3. Serialization increases execution time
Performance degrades when running:
1. Iterative jobs
2. Interactive analytics
31. Spark vs Hadoop
• Spark Characteristics against Hadoop MR
Data Reuse
Interactive data analytics
Ad-hoc queries
Iterative algorithms
( Machine Learning Algorithms - MLlib
Graph Processing Algorithms – GraphX )
Real-time data flow processing
( Spark Streaming )
Faster
up to 100x in memory
up to 10x on disk
32. Daytona Competition 2014
Goal: sort 100 TB of data
Hadoop: generated 3100 GB/s of disk I/O; took roughly 3x as long as Spark
Spark: generated 500 GB/s of disk I/O; all the sorting took place on disk (HDFS), without using Spark's in-memory cache
http://sortbenchmark.org/
33. RDDs
• Stands for Resilient Distributed Dataset
Definition
An RDD is an immutable, distributed collection of objects. Each RDD is split into multiple partitions, which in turn are computed on different nodes of the cluster.
RDDs can be created by:
(1) loading an external dataset, or
(2) parallelizing an in-memory collection
34. RDDs Operations
• Two Distinct Important Operations
transformations: return a new RDD
actions: return final value
Scala code:
val visits = sc.textFile("hdfs:// … ")   // one visit record per line
/* transformation */
val counts = visits.map(line => (line.split(",")(0), 1))   // key by the URL field (assuming comma-separated records)
                   .reduceByKey((a, b) => a + b)
/* action */
counts.collect()
• Lazy Evaluation
Spark starts execution only when an action is called; internally it stores metadata describing how to compute each transformation.
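The lazy-evaluation idea can be mimicked in plain Python with generators (an analogy only, not Spark's implementation): the generator pipeline plays the role of the transformations, and consuming it plays the role of an action.

```python
# Pretend log of page visits; the "transformation" below is lazy.
visits = ["/home", "/about", "/home", "/home"]

pairs = ((url, 1) for url in visits)   # map-like step: nothing runs yet

def reduce_by_key(kv_pairs):
    """Consume the pipeline and sum counts per key (the 'action')."""
    counts = {}
    for key, n in kv_pairs:
        counts[key] = counts.get(key, 0) + n
    return counts

counts = reduce_by_key(pairs)          # evaluation happens only here
assert counts == {"/home": 3, "/about": 1}
```

As in Spark, deferring work lets the whole pipeline be planned before anything touches the data.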
35. Spark Fault Tolerance
• RDD lineage
Spark logs information for different RDDs
Information is derived from transformations (e.g. map, filter,
join)
Crucial for data recovery upon partition failure
36. Spark Runtime
Driver
• Central Coordinator
• Creates a Spark Context
• Where the main() method runs
• Creates RDDS
• Performs RDDs transformations
and actions
Cluster Manager
Manages Cluster Resources
Cluster Worker
Contains the Spark worker, Executor
Driver + Executors = Spark Application
37. Spark Driver Duties
1. Converts the user program into tasks
2. Schedules tasks on the Executors
• The Driver is responsible for coordinating the individual tasks on the Executors
• It checks the Executors and delivers tasks to the appropriate location
• Tasks from different applications run in different JVMs
38. Executors
• Properties
Worker processes
Launched when the application starts
Die when the application ends
• Mission
1. Run individual tasks and return results to the Spark Driver
2. Provide in-memory storage for RDDs [ .cache() | .persist() ]
39. Spark Cluster Managers
• Spark Standalone Scheduler
FIFO scheduling
• For multi-tenancy systems:
Hadoop YARN
recommended when dealing with HDFS
for fast access due to nodes locality
Apache Mesos
fine-grained: static memory, dynamic cores
coarse-grained: static memory and cores
40. SPARK Installation
Key Concepts:
1. Spark versions: 1.2.1 (stable), 1.3.1 (latest release), plus earlier versions
2. Apache Maven
3. Scala Build Tool (sbt)
4. Hadoop version
HDFS protocol compatible across versions (e.g. 2.2.x, 2.3.x; default 1.0.4)
Apache Maven and sbt are used to configure any dependencies or plugins the project build requires (adding new plugins, handling exceptions)
Examples:
Maven changes apply to:
{SPARK_HOME}/pom.xml
SBT changes apply to:
{SPARK_HOME}/sbt/sbt
44. Cassandra JOINS Spark
From Simple to Very Complex
SIMPLE:
val config = sc.cassandraTable("highway", "highway_config")
               .select("config_id", "agency", "link_id")
               .where("onstreet = ?", "I-605")
val joined = config.joinWithCassandraTable("highway", "highway_history")
                   .select("speed")
                   .on(SomeColumns("config_id", "agency", "link_id"))
joined.collect()
COMPLEX:
val hists = sc.cassandraTable("highway", "highway_history")
              .where("config_id = ?", "85")
              .where("agency = ?", "Caltrans-D7")
              .where("event_time = ?", "2014-04-03 02:47")
val configs = sc.cassandraTable("highway", "highway_config_metrics")
                .where("onstreet = ?", "SR-60")
                .where("fromstreet = ?", "EUCLID")
// Key both RDDs by (config_id, agency) so they can be joined
val histsKeyBy = hists.map(f =>
  ((f.getInt("config_id"), f.getString("agency")),
   (f.getString("event_time"), f.getInt("occupancy"), f.getInt("speed"),
    f.getInt("volume"), f.getInt("hovspeed"))))
val configsKeyBy = configs.map(f =>
  ((f.getInt("config_id"), f.getString("agency")),
   (f.getString("event_time"), f.getString("onstreet"),
    f.getString("fromstreet"), f.getInt("direction"))))
val joined = histsKeyBy.join(configsKeyBy)
joined.collect()
// average speed: the third field of the history-side value tuple
val speed_avg = joined.map(x => x._2._1._3).mean()
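To make the pair-RDD join semantics above concrete, here is the same keying-and-joining done with plain Python collections (data invented; in Spark this runs distributed across the cluster):

```python
# Two "RDDs" as plain lists of row dicts (values invented for illustration).
hists = [
    {"config_id": 85, "agency": "Caltrans-D7", "speed": 62},
    {"config_id": 85, "agency": "Caltrans-D7", "speed": 58},
]
configs = [
    {"config_id": 85, "agency": "Caltrans-D7", "onstreet": "SR-60"},
]

# Key both sides by (config_id, agency), mirroring the keyBy maps above.
hists_by_key = [((h["config_id"], h["agency"]), h["speed"]) for h in hists]
configs_by_key = {(c["config_id"], c["agency"]): c["onstreet"] for c in configs}

# Inner join: keep history rows whose key also appears in configs.
joined = [(k, (v, configs_by_key[k])) for k, v in hists_by_key
          if k in configs_by_key]

# Equivalent of joined.map(...).mean() on the speed field.
speed_avg = sum(v for _, (v, _) in joined) / len(joined)
assert speed_avg == 60.0
```

The Spark version differs only in scale: `join` shuffles both RDDs by key so that matching keys meet on the same partition.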