AGENDA
RDBMS vs NoSQL
Cassandra
Spark
Cassandra & Spark Integration
Presenter: Dimitris Stripelis
Contact: d.stripelis@hotmail.com
Big Data Era
• Online applications
• Internet of Things
• Big Data:
– Data Velocity
– Data Variety
– Data Volume
– Data Complexity
RDBMS vs NoSQL
RDBMS relates to ACID (all four):
A: Atomicity
C: Consistency
I: Isolation
D: Durability
NoSQL relates to CAP (pick 2 out of 3):
C: Consistency
A: Availability
P: Partition Tolerance
RDBMS vs NoSQL
• RDBMS vertical scale
• NoSQL horizontal scale
NoSQL
• Shared Nothing
– remove dependency between the scaling units
– private memory and peripheral disks
• Master-less Architecture
– Many NoSQL databases offer a master-master data replication strategy
(with some exceptions, e.g. Redis)
What is Apache Cassandra
• Open Source DDBMS
• Initially developed at Facebook, 2008
• Developed in Java
• Combination of
Dynamo (Amazon): architecture principles
Bigtable (Google): SSTable (Sorted String Table) storage design
• DataStax Enterprise commercial distribution
Why Cassandra?
• Storing Huge Datasets – Elastic Scalability
• Multi – Master Replication
• High Availability – No SPOF
• Eventual Consistency
• Flexible Data Model
• Locality
• High Write Throughput
Gossip & Seeds
• Gossip
Protocol
Discovers location, state information
Timer runs every second
Info: onJoin, onAlive, onDead, onChange
• Seeds
Bootstrapping other nodes
Data Distribution & Replication
• In Cassandra data distribution and replication go together
• Replication is affected by:
1. Consistent Hashing & Partitioners
Data partitioning methodology across the cluster
2. Replication Strategy
Determines the replicas of each row of data
3. Snitch
Defines topology information for replicas placement
4. Virtual Nodes
Assign data ownership to physical machines
Consistent Hashing
• Distributes the data across the cluster
• Partitions the data based on the partition key of each row
Row Key    Hashed value            Node
John       7723358927203680754     D
Andrew     -6723372854036780875    A
Mike       1168604627387940318     C
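The lookup in the table above can be sketched in a few lines of Python. This is only an illustration: MD5 stands in for the partitioner's hash function, and the four-node ring with evenly spaced tokens (`RING`) is a made-up example, not the BDTS cluster.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # assumed token range [0, 2**127 - 1]


def token_for(partition_key):
    """Hash a partition key to a token (MD5 used as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def owner(ring, token):
    """Return the node holding the first token >= the given token, wrapping around."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token) % len(ring)
    return ring[idx][1]


# Hypothetical ring: four nodes with evenly spaced tokens.
RING = sorted((i * (TOKEN_SPACE // 4), name) for i, name in enumerate("ABCD"))

for row_key in ("John", "Andrew", "Mike"):
    print(row_key, token_for(row_key), owner(RING, token_for(row_key)))
```

The same key always hashes to the same token, so any node can work out which machine owns a row without a central lookup table.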
Partitioners
• Defines the Hash function for Consistent Hashing
• Compute the token for each row key
Types
Murmur3Partitioner
Values: [ -2^63 … 0 … +2^63 - 1 ]
Random Partitioner
Values: [ 0 … 2^127 - 1 ]
ByteOrderedPartitioner (Not Recommended)
Data Replication
• Replication Factor
Number of replicas across the cluster (e.g. RF = 2, RF = 4)
• Replication Strategies
Simple Strategy
Single Data Center
Replicas are placed clockwise around the ring, without taking network topology into account
replication = {'class' : 'SimpleStrategy', 'replication_factor':3};
Network Topology Strategy
Multiple Data Centers
Replicas are placed in different racks
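As a rough sketch of the "clockwise" placement described for Simple Strategy: start at the node that owns the row's token and keep walking the ring until the replication factor is reached. The ring, tokens and node names below are invented for illustration, and rack or data center awareness (Network Topology Strategy) is deliberately left out.

```python
import bisect


def simple_strategy_replicas(ring, token, rf):
    """ring: sorted list of (token, node). Walk clockwise collecting rf distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas


# Hypothetical ring with a tiny token space, for readability only.
RING = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]

print(simple_strategy_replicas(RING, 60, 3))  # first replica on D, then A and B
```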
Snitch
• Affects where the replicas are placed
• Determines the data centers and the racks that the Machines belong to
• 9 Different Snitches
Simple Snitch: Single Data Center
Gossiping Property File Snitch: automatic update for new nodes via
gossip – production recommended
Property File Snitch: Location of nodes determined by rack and data
center
EC2 Multi Region Snitch: Amazon Web Services
Google Cloud Snitch: Multiple Regions
Cassandra Virtual Nodes
Ring without VNodes
Contiguous token
Contiguous data range
One large range
Ring with Vnodes
Non-contiguous token
Non-contiguous data range
Many smaller ranges
Why Vnodes
Even distribution of data
Faster rebuild of a node failure
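A small sketch of the vnode idea, under stated assumptions: each physical node is given `num_tokens` pseudo-random tokens (derived here from MD5 of a made-up `node-i` label, not the way Cassandra actually picks tokens), and a node owns the range that ends at each of its tokens. With one token per node the shares can be very uneven; with many tokens per node they tend to even out.

```python
import hashlib

RING_SIZE = 2 ** 64  # assumed token space for this sketch


def vnode_tokens(node, num_tokens):
    """Derive num_tokens pseudo-random tokens for a node (illustrative only)."""
    return [
        int.from_bytes(hashlib.md5(f"{node}-{i}".encode()).digest()[:8], "big")
        for i in range(num_tokens)
    ]


def ownership(nodes, num_tokens):
    """Return the fraction of the ring owned by each node."""
    ring = sorted((t, n) for n in nodes for t in vnode_tokens(n, num_tokens))
    share = {n: 0 for n in nodes}
    for i, (token, node) in enumerate(ring):
        previous = ring[i - 1][0]  # i == 0 wraps to the last token
        share[node] += (token - previous) % RING_SIZE
    return {n: s / RING_SIZE for n, s in share.items()}


print(ownership(["A", "B", "C"], 1))    # one large range per node
print(ownership(["A", "B", "C"], 256))  # many smaller ranges per node
```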
BDTS Example
BDTS Cassandra Ring
Client Requests
Client
Running application, read/write requests
(JAVA, Python, C++, PHP, etc…)
Coordinator
• Handles the requests
• Finds Nodes based on Partitioner and Replica Placement Strategy
• Any Node can act as the Coordinator
Consistency Levels Write and Read
• Tunable consistency
• Specify how many replicas must respond to consider an operation a success
LEVEL          WRITE   READ
One            X       X
Two            X       X
Three          X       X
Quorum         X       X
Any            X
All            X       X
Each_Quorum    X       X
Local_Quorum   X       X
Local_One      X       X
QUORUM = floor(RF/2) + 1
R + W > N
R: needed #replicas for read
W: needed #replicas for write
N: replication factor
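The arithmetic behind tunable consistency is small enough to check by hand. The sketch below takes a quorum to be a majority of replicas, floor(RF/2) + 1, and tests the R + W > N condition; the function names are chosen here for illustration.

```python
def quorum(rf):
    """Majority of replicas for a given replication factor."""
    return rf // 2 + 1


def reads_see_latest_write(r, w, n):
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


rf = 3
print(quorum(rf))                                          # replicas needed for QUORUM
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # QUORUM reads + QUORUM writes
print(reads_see_latest_write(1, 1, rf))                    # ONE reads + ONE writes
```

With RF = 3, QUORUM on both sides gives 2 + 2 > 3, so a read always touches at least one replica that acknowledged the latest write; ONE on both sides gives 1 + 1 <= 3 and only eventual consistency.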
The Write Path
Write (Client) → CommitLog (Durability)
Write (Client) → Memtable → (flushed to) SSTable → (Compaction)
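The write path can be mimicked with a toy in-memory model. This is a simplified sketch, not Cassandra's implementation: the `Node` class, the `memtable_limit` threshold and the "newest value wins" compaction rule are assumptions made for the example.

```python
class Node:
    """Toy model of the write path: commit log, memtable, SSTables, compaction."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only log, written first for durability
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, sorted tables produced by flushes
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def compact(self):
        """Merge all SSTables into one; later tables overwrite earlier ones."""
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


node = Node(memtable_limit=2)
for key, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()
print(node.sstables)
```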
C* Data Model
The data model is the backbone of efficient Cassandra queries
Look at the queries first, then design the tables
First Concepts
Keyspace: similar to a relational schema, contains Column Families
Column Family(CF): similar to a relational table, contains the data
Super Column (SC): a column that groups multiple sub-columns
C* Data Model
Next Concepts
– Column based key value store (multi level dictionary)
– Think of it as a JSON representation or Map [String, Map [String, Data] ]
– SST: Sorted String Table
Column Family: "Street Monitor"
Keys: "Hollywood", "Santa Monica"
Columns: "avg_speed", "vehicles", "time" (each column holds a value)
{"Street Monitor": {
    "Hollywood":    { "avg_speed": 75, "vehicles": 45, "time": "2015-03-02 09:35" },
    "Santa Monica": { "avg_speed": 35, "vehicles": 50, "time": "2015-03-02 10:35" }
  }
}
C* PRIMARY KEY
Last Concept – Primary Key:
– Remember JSON format
– Storage is a 2-level nested HashMap
– A table/column family has a Primary Key, which consists of a partition key and optional clustering keys
Level 1: Partition key → clustering keys
Level 2: Clustering key → data
Partition Key: responsible for hashing the data to the corresponding physical machine. Choose values that spread evenly so the datasets are evenly distributed.
Clustering Key: responsible for ordering the data inside each partition.
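The two-level map can be sketched with plain dictionaries: the outer key is the partition key, the inner key is the clustering key, and rows come back ordered by clustering key. The helper names and sample values below are hypothetical; they borrow the (onstreet, year, month) and (day, time) columns from the BDTS example purely for illustration.

```python
from collections import defaultdict

# Level 1: partition key -> Level 2: clustering key -> row data
table = defaultdict(dict)


def insert(partition_key, clustering_key, data):
    table[partition_key][clustering_key] = data


def read_partition(partition_key):
    """Return the rows of one partition, ordered by clustering key."""
    rows = table.get(partition_key, {})
    return [(ck, rows[ck]) for ck in sorted(rows)]


# Partition key: (onstreet, year, month); clustering key: (day, time)
insert(("I-10", 2015, 3), (2, 935), {"speed": 75})
insert(("I-10", 2015, 3), (1, 1035), {"speed": 35})
insert(("I-605", 2015, 3), (1, 600), {"speed": 60})

print(read_partition(("I-10", 2015, 3)))
```

Rows inserted out of order still read back sorted by (day, time), which is why range queries on clustering columns are cheap once the partition is known.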
C* PRIMARY KEY
Types of C* Primary Key:
1. Compound Key
Exactly one Partition Key
e.g. PK(Partition_key1)
PK(Partition_key1, Clustering_key1)
PK(Partition_key1, Clustering_key1, Clustering_key2,…)
2. Composite Key
Two or more Partition Keys, careful with the syntax
e.g. PK( ( Partition_key1, Partition_key2 ) )
PK( ( Partition_key1, Partition_key2,…), Clustering_key1, Clustering_key2,…)
Skinny Rows & Wide Rows
Skinny Rows: if the Primary Key contains only the partition Key
Wide rows: if the Primary Key contains columns other than the partition key
BDTS Example
Keyspace: highway
client.CreateTable ("""
CREATE TABLE IF NOT EXISTS highway.street_monitoring (
ONSTREET varchar,
YEAR int,
MONTH int,
DAY int,
TIME int,
POSTMILE float,
DIRECTION int,
FROMSTREET varchar,
TOSTREET varchar,
SPEED int,
VOLUME int,
OCCUPANCY int,
HOVSPEED int,
PRIMARY KEY ( ( ONSTREET,YEAR,MONTH ),DAY,TIME,POSTMILE,DIRECTION )
);
""")
client.CreateIndex("highway","street_monitoring","SPEED")
client.CreateIndex("highway","street_monitoring","FROMSTREET")
client.CreateIndex("highway","street_monitoring","TOSTREET")
client.CreateTable ("""
CREATE TABLE IF NOT EXISTS highway.regional_monitoring (
REGION varchar,
SPEED int,
VOLUME int,
OCCUPANCY int,
HOVSPEED int,
YEAR int,
HH int,
MONTH int,
DAY int,
SENSOR_ID int,
PRIMARY KEY ( ( REGION,YEAR,MONTH ),HH, DAY, SENSOR_ID )
);
""")
client.CreateIndex("highway","regional_monitoring","SPEED")
Partition Key: Onstreet, Year, Month
Clustering Key: Day, Time, Postmile, Direction
Partition Key: Region, Year, Month
Clustering Key: HH, Day, Sensor_id
C* Queries
• Pure CQL does not support:
JOINs and subqueries; GROUP BY and ORDER BY are restricted to clustering columns
• Aggregate functions (e.g. SUM, AVG) are not supported in Cassandra 2.0.x; later versions add them
• A key column can be restricted only when the key columns preceding it are restricted too
Guidelines:
Partition key columns support the = operator
The last column in the partition key supports the IN operator
Clustering columns support the =, >, >=, <, and <= operators
Secondary index columns support the = operator
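The partition key guidelines above can be expressed as a small checker. This is a sketch of the rules as stated on the slide, not of Cassandra's actual query validator: the function, its messages and the dictionary representation of a WHERE clause are all invented for illustration.

```python
PARTITION_KEY = ["onstreet", "year", "month"]  # from highway.street_monitoring


def check_partition_restrictions(where):
    """where maps column name -> operator ('=', 'IN', '<=', ...).

    Returns 'ok' or a short reason why the partition key is badly restricted.
    """
    last = len(PARTITION_KEY) - 1
    for position, column in enumerate(PARTITION_KEY):
        operator = where.get(column)
        if operator is None:
            return f"partition key part {column} must be restricted"
        if operator == "IN" and position != last:
            return f"IN is only allowed on the last partition key column, not {column}"
        if operator not in ("=", "IN"):
            return f"only = and IN are supported on partition key column {column}"
    return "ok"


print(check_partition_restrictions({"onstreet": "=", "month": "="}))
print(check_partition_restrictions({"onstreet": "=", "year": "=", "month": "IN"}))
print(check_partition_restrictions({"onstreet": "=", "year": "=", "month": "<="}))
```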
Query1 – Some parts of Partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND month=3
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Partition key part year must be restricted since preceding part is"
Query2 – Full partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2015 AND month IN(2,4)
Error Message: None
C* Queries
Query3 – Range on Partition Key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2014 AND month<=1
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"
Query4 – Only a clustering column, no partition key
SELECT * FROM highway.street_monitoring WHERE day>=2 AND day<=3
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Query5 – Range on Secondary Index
SELECT * FROM highway.street_monitoring WHERE onstreet='47' AND year=2014 AND month=2 AND day=21 AND time>=360 AND time<=7560 AND speed>30
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
Cassandra Configurations
{CASSANDRA_HOME}/conf/cassandra.yaml
listen_address: < internal IP address for rest of the nodes – communication, gossip />
broadcast_address: < external IP when deployed in multiple regions />
rpc_address: < address for drivers access – internal IP, hostname/>
seeds: < addresses of seed nodes />
commitlog_directory: < commitLog is written sequentially all the time, affected by the seek time/>
saved_caches_directory: < tables keys and row caches are stored />
data_file_directories: < SST tables, holds all data written to the nodes />
{CASSANDRA_HOME}/conf/cassandra-env.sh
MAX_HEAP_SIZE
sets Maximum Heap Size for JVM
Default 1gb, do not set it too high, max 8gb
HEAP_NEWSIZE
new generation size, good guideline is 100 MB per CPU core
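As a rough aid for sizing, the sketch below combines the two guidelines above (cap the heap at 8 GB, allow about 100 MB of new generation per core). The exact formula is an assumption modelled on the automatic calculation commonly described for cassandra-env.sh, so verify it against the script shipped with your version before relying on it.

```python
def default_heap_sizes(system_memory_mb, cpu_cores):
    """Return (MAX_HEAP_SIZE, HEAP_NEWSIZE) in MB under the assumed formula."""
    half = system_memory_mb // 2
    quarter = system_memory_mb // 4
    # Assumed rule: max(min(1/2 RAM, 1 GB), min(1/4 RAM, 8 GB))
    max_heap = max(min(half, 1024), min(quarter, 8192))
    # Assumed rule: 100 MB per core, but no more than a quarter of the heap
    new_size = min(100 * cpu_cores, max_heap // 4)
    return max_heap, new_size


# A server like the ones in the BDTS cluster: 15 GB RAM, 4 cores
print(default_heap_sizes(15 * 1024, 4))
```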
C* Last Call
• Real-Time Data – Clustering
CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Big Data Analytics Stack - BDAS
Origin Berkeley AMP Lab
Multiple Packages
Multiple Data Sources
Spark is the Kernel of Functionality
What is Spark
• An open-source cluster computing platform for fast
and general purpose large-scale data analytics
• In-memory computations
• Software suite
• Built in Scala
• Highly Accessible (Java, Python, Scala, SQL APIs)
• Started in 2009 at the Berkeley AMP Lab
• Master-Slave Architecture
• Still developing, numerous contributors
Spark vs Hadoop
Hadoop Issues
1. Data Replication
2. Disk I/O
3. Serialization increases execution time
Performance Degradation when applying:
1. Iterative Jobs
2. Interactive Analytics
Spark vs Hadoop
• Spark Characteristics against Hadoop MR
Data Reuse
Interactive data analytics
Ad-hoc queries
Iterative algorithms
( Machine Learning Algorithms - MLlib
Graph Processing Algorithms – GraphX )
Real-time data flow processing
( Spark Streaming )
Faster
x100 in memory
x10 on disk
Daytona Competition 2014
Goal: sort 100 TB of data
Hadoop: generates 3100 GB/s of disk I/O; time: 300% of Spark
Spark: generates 500 GB/s of disk I/O
All the sorting took place on disk (HDFS), without using Spark's in-memory cache
http://sortbenchmark.org/
RDDs
• Stands for Resilient Distributed Datasets
Definition
An RDD is an immutable, in-memory collection of objects. Each RDD is split into
multiple partitions, which in turn are computed on different nodes of the cluster.
RDDs can be created from:
(1) External datasets or
(2) Parallelized collections
RDDs Operations
• Two Distinct Important Operations
transformations: return a new RDD
actions: return final value
Scala code:
val visits = spark.hadoopFile("hdfs:// … ")
/* transformation */
val counts = visits.map( v => (v.url,1))
.reduceByKey((a,b) => a + b)
/* action */
counts.collect()
• Lazy Evaluation
Spark starts execution only when an action is called. Until then it only
records metadata describing how to compute each transformation.
Spark Fault Tolerance
• RDD lineage
Spark logs information for different RDDs
Information is derived from transformations (e.g. map, filter,
join)
Crucial for data recovery upon partition failure
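Lazy evaluation and lineage can be imitated in a few lines of plain Python. `MiniRDD` below is a made-up teaching class, not part of Spark: transformations only append a step to the recorded lineage, and `collect()` replays that lineage from the source, which is also how a lost partition could be recomputed.

```python
class MiniRDD:
    """Toy RDD: transformations are recorded, work happens only in collect()."""

    def __init__(self, source, lineage=()):
        self._source = source          # callable that (re)loads the base data
        self.lineage = tuple(lineage)  # recorded (name, function) steps

    def map(self, fn):
        return MiniRDD(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self.lineage + (("filter", fn),))

    def collect(self):
        """Action: load the data and replay every recorded transformation."""
        data = list(self._source())
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


def load():
    print("reading source")  # shows when the data is actually touched
    return [1, 2, 3, 4]


rdd = MiniRDD(load).map(lambda x: x * 10).filter(lambda x: x > 15)
print([name for name, _ in rdd.lineage])  # nothing has been read yet
print(rdd.collect())
```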
Spark Runtime
Driver
• Central Coordinator
• Creates a Spark Context
• Where the main() method runs
• Creates RDDs
• Performs RDD transformations
and actions
Cluster Manager
Manages Cluster Resources
Cluster Worker
Contains the Spark worker, Executor
Driver + Executors = Spark Application
Spark Driver Duties
1. Convert User Program into Tasks
2. Schedule Tasks on Executors
• The Driver is responsible for
coordinating the individual tasks
on the Executors
• Checks the Executors and delivers
the tasks to the appropriate
location
• Tasks from different Applications
run on different JVMs
Executors
• Properties
Worker Processes
Launch at the start
Die when application ends
• Mission
1. Run individual Tasks &
return results to the Spark Driver
2. In memory storage for the RDDs
[ .cache() | .persist() ]
Spark Cluster Managers
• Spark Standalone Scheduler
FIFO scheduling
• For multi-tenancy systems:
Hadoop YARN
recommended when dealing with HDFS
for fast access due to nodes locality
Apache Mesos
fine-grained: static memory, dynamic cores
coarse-grained: static memory and cores
SPARK Installation
Key Concepts:
1. Spark versions: 1.2.1 (stable), 1.3.1 (latest release), and earlier releases
2. Apache Maven
3. Scala Build Tools (sbt)
4. Hadoop Version
HDFS protocol compatible across versions (e.g. 2.2.x, 2.3.x, default 1.0.4)
Apache Maven and SBT are required for configuring any dependencies or plugins required
for project installation (add new plugins, handle exceptions)
Examples:
Maven changes apply to:
{SPARK_HOME}/pom.xml
SBT changes apply to:
{SPARK_HOME}/sbt/sbt
Spark Configurations
Master -- spark-defaults.conf
spark.driver.cores #num of cores
spark.driver.memory 2156m
spark.driver.host instance-trans1
spark.driver.port 3389
spark.fileserver.port #driver HTTP file server port
spark.broadcast.port 4045
spark.blockManager.port 7080
spark.executor.port 8089
spark.executor.memory 9024m
spark.eventLog.enabled true
spark.master spark://instance-trans1.c.bubbly-operator-90323.internal:7077
spark.cassandra.connection.host #Internal IP Cassandra Spark intercommunication
spark.ui.port #Web UI port
C* SPARK Integration
Why?
• Leverage RDDs functionality
• Aggregation (SUM, AVG)
• Cross-table Operations (JOIN, UNION)
• Real-time Batch Processing
• Complex Analytics (MLlib, GraphX)
• Really Fast Locality Awareness
C* SPARK BDTS
Specifications: 6 Servers, 4 Core CPU, 15GB RAM
Cassandra JOINS Spark
From Simple to Very Complex
SIMPLE:
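// Assumes an existing SparkContext `sc` and the DataStax spark-cassandra-connector on the classpath
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sc)
val config = sc.cassandraTable("highway","highway_config").select("config_id","agency","link_id").where("onstreet = ?","I-605")
val joined = config.joinWithCassandraTable("highway","highway_history").select("speed")
// Join on the listed columns, then collect the result of that join
val specified_joined = joined.on(SomeColumns("config_id","agency","link_id"))
specified_joined.collect()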
COMPLEX:
val hists = sc.cassandraTable("highway", "highway_history").where("config_id = ?","85").where("agency = ?","Caltrans-D7").where("event_time = ?","2014-04-03 02:47")
val configs = sc.cassandraTable("highway","highway_config_metrics").where("onstreet = ?","SR-60").where("fromstreet = ?","EUCLID")
// Key both RDDs by (config_id, agency) so they can be joined
val histsKeyBy = hists.map(f =>
  ((f.getInt("config_id"), f.getString("agency")), (f.getString("event_time"), f.getInt("occupancy"), f.getInt("speed"), f.getInt("volume"), f.getInt("hovspeed"))))
val configsKeyBy = configs.map(f =>
  ((f.getInt("config_id"), f.getString("agency")), (f.getString("event_time"), f.getString("onstreet"), f.getString("fromstreet"), f.getInt("direction"))))
val joined = histsKeyBy.join(configsKeyBy)
joined.collect()
// speed is the third field of the history tuple (event_time, occupancy, speed, volume, hovspeed)
val speed_avg = joined.map(x => x._2._1._3).mean()
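For readers without a cluster at hand, the keyed join and the average can be traced in plain Python. The rows below are invented sample values, and `join_by_key` is a hypothetical helper that mimics an inner join on (config_id, agency); it is not the connector's API.

```python
from statistics import mean


def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs, like a pair-RDD join."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


# Values: (event_time, occupancy, speed, volume, hovspeed) -- made-up data
hists = [
    ((85, "Caltrans-D7"), ("2014-04-03 02:47", 12, 61, 30, 65)),
    ((85, "Caltrans-D7"), ("2014-04-03 02:48", 14, 59, 28, 64)),
    ((99, "Caltrans-D7"), ("2014-04-03 02:47", 10, 70, 20, 72)),
]
# Values: (onstreet, fromstreet, direction) -- made-up data
configs = [((85, "Caltrans-D7"), ("SR-60", "EUCLID", 1))]

joined = join_by_key(hists, configs)
# row = (key, (hist_value, config_value)); speed is index 2 of hist_value
speed_avg = mean(row[1][0][2] for row in joined)
print(len(joined), speed_avg)
```

Only the two history rows whose key also appears in `configs` survive the join, so the average is taken over their speeds alone.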
The End
NEXT
Cassandra Performance Tuning
Spark RDD Functional Programming
Thank you

Weitere ähnliche Inhalte

Was ist angesagt?

Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014dhiguero
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionPatrick McFadin
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Modelebenhewitt
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataVictor Coustenoble
 
Apache Cassandra Data Modeling with Travis Price
Apache Cassandra Data Modeling with Travis PriceApache Cassandra Data Modeling with Travis Price
Apache Cassandra Data Modeling with Travis PriceDataStax Academy
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
Geospatial and bitemporal search in cassandra with pluggable lucene index
Geospatial and bitemporal search in cassandra with pluggable lucene indexGeospatial and bitemporal search in cassandra with pluggable lucene index
Geospatial and bitemporal search in cassandra with pluggable lucene indexAndrés de la Peña
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...DataStax
 
Cassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingVassilis Bekiaris
 
Scaling Twitter with Cassandra
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with CassandraRyan King
 
Building a fully-automated Fast Data Platform
Building a fully-automated Fast Data PlatformBuilding a fully-automated Fast Data Platform
Building a fully-automated Fast Data PlatformManuel Sehlinger
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax Academy
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchPatricia Gorla
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 

Was ist angesagt? (20)

Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Apache Cassandra Data Modeling with Travis Price
Apache Cassandra Data Modeling with Travis PriceApache Cassandra Data Modeling with Travis Price
Apache Cassandra Data Modeling with Travis Price
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Geospatial and bitemporal search in cassandra with pluggable lucene index
Geospatial and bitemporal search in cassandra with pluggable lucene indexGeospatial and bitemporal search in cassandra with pluggable lucene index
Geospatial and bitemporal search in cassandra with pluggable lucene index
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
 
Cassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series Modeling
 
Scaling Twitter with Cassandra
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with Cassandra
 
Building a fully-automated Fast Data Platform
Building a fully-automated Fast Data PlatformBuilding a fully-automated Fast Data Platform
Building a fully-automated Fast Data Platform
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and Analytics
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 

Ähnlich wie Presentation

Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScyllaDB
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseAll Things Open
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
 
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftSRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftAmazon Web Services
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016DataStax
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
N07_RoundII_20220405.pptx
N07_RoundII_20220405.pptxN07_RoundII_20220405.pptx
N07_RoundII_20220405.pptxNguyễn Thái
 
Working Experience_V5.0
Working Experience_V5.0Working Experience_V5.0
Working Experience_V5.0Danny Lai
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introductionfardinjamshidi
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...Facultad de Informática UCM
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaJose Mº Muñoz
 
Azure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAzure Cosmos DB - Technical Deep Dive
Azure Cosmos DB - Technical Deep DiveAndre Essing
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Databasejavier ramirez
 

Ähnlich wie Presentation (20)

Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
SRV405 Ancestry's Journey to Amazon Redshift
Presentation

  • 1. AGENDA RDBMS vs NoSQL Cassandra Spark Cassandra & Spark Integration Presenter: Dimitris Stripelis Contact: d.stripelis@hotmail.com
  • 2. Big Data Era • Online applications • Internet of Things • Big Data: – Data Velocity – Data Variety – Data Volume – Data Complexity
  • 3. RDBMS vs NoSQL RDBMS relates to ACID NoSQL relates to CAP A: Atomicity C: Consistency I: Isolation D: Durability All four C: Consistency A: Availability P: Partition Tolerance Pick 2 out of 3
  • 4. RDBMS vs NoSQL • RDBMS vertical scale • NoSQL horizontal scale
  • 5. NoSQL • Shared Nothing – removes dependencies between the scaling units – each node has private memory and its own disks • Master-less Architecture – most NoSQL databases offer a master-master data replication strategy (some exceptions apply, e.g. Redis)
  • 6. What is Apache Cassandra • Open Source DDBMS • Initially developed at Facebook, 2008 • Developed in Java • Combination of Dynamo (Amazon) – architecture principles BigTable (Google) – SSTable design • DataStax Enterprise commercial distribution
  • 7. Why Cassandra? • Storing Huge Datasets – Elastic Scalability • Multi-Master Replication • High Availability – No SPOF • Eventual Consistency • Flexible Data Model • Locality • High Write Throughput
  • 8. Gossip & Seeds • Gossip Protocol Discovers location, state information Timer runs every second Info: onJoin, onAlive, onDead, onChange • Seeds Bootstrapping other nodes
  • 9. Data Distribution & Replication • In Cassandra data distribution and replication go together • Replication is affected by: 1. Consistent Hashing & Partitioners Data partitioning methodology across the cluster 2. Replication Strategy Determines the replicas of each row of data 3. Snitch Defines topology information for replicas placement 4. Virtual Nodes Assign data ownership to physical machines
  • 10. Consistent Hashing • Distributes the data across the cluster • Partitions the data based on the partition key of each row Row Key / Hashed value / Node: John / 7723358927203680754 / D; Andrew / -6723372854036780875 / A; Mike / 1168604627387940318 / C
  • 11. Partitioners • Define the hash function for Consistent Hashing • Compute the token for each row key Types: Murmur3Partitioner, values [ -2^63 … 0 … +2^63 - 1 ]; RandomPartitioner, values [ 0 … 2^127 - 1 ]; ByteOrderedPartitioner (Not Recommended)
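The token lookup described on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the four-node ring and its token boundaries are assumptions, and `token_for` uses an MD5-based stand-in rather than the real Murmur3 hash, so its tokens will not match a real cluster.

```python
import bisect
import hashlib

# Hypothetical four-node ring: each node owns every token up to and
# including its own token (the ranges are an assumption for illustration).
RING = [
    (-4611686018427387904, "A"),
    (-1, "B"),
    (4611686018427387903, "C"),
    (9223372036854775807, "D"),
]


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Stand-in hash (first 8 bytes of MD5), NOT the Murmur3 function that
    Murmur3Partitioner uses.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner_of(token: int) -> str:
    """Return the node whose token range contains the given token."""
    tokens = [t for t, _ in RING]
    idx = bisect.bisect_left(tokens, token)
    # Wrap around to the first node if the token is past the last boundary.
    return RING[idx % len(RING)][1]


if __name__ == "__main__":
    for key in ("John", "Andrew", "Mike"):
        tok = token_for(key)
        print(key, tok, owner_of(tok))
```

Because ownership depends only on the sorted token list, adding a node changes the owner of one token range and leaves the rest of the ring untouched.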
  • 12. Data Replication • Replication Factor Number of replicas across the cluster (e.g. RF = 2, RF = 4) • Replication Strategies Simple Strategy Single Data Center Replicas are placed clockwise – no network topology into account replication = {'class' : 'SimpleStrategy', 'replication_factor':3}; Network Topology Strategy Multiple Data Centers Replicas are placed in different racks
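The clockwise placement that the slide attributes to Simple Strategy can be sketched as follows. This is a simplified model under stated assumptions: the ring is a plain list of node names in token order, and `simple_strategy_replicas` is a hypothetical helper, not a Cassandra API.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Return the node owning the token plus the next nodes clockwise.

    ring_nodes: node names in token order (assumed layout).
    primary_index: position of the node that owns the row's token.
    """
    if replication_factor > len(ring_nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return [
        ring_nodes[(primary_index + step) % len(ring_nodes)]
        for step in range(replication_factor)
    ]


if __name__ == "__main__":
    # With RF = 3 and the row owned by node D, the walk wraps around the ring.
    print(simple_strategy_replicas(["A", "B", "C", "D"], 3, 3))
```

Network Topology Strategy differs in that the walk also takes rack and data center information from the snitch into account, which this sketch deliberately ignores.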
  • 13. Snitch • Affects where the replicas are placed • Determines the data centers and the racks that the Machines belong to • 9 Different Snitches Simple Snitch: Single Data Center Gossiping Property File Snitch: automatic update for new nodes via gossip – production recommended Property File Snitch: Location of nodes determined by rack and data center EC2 Multi Region Snitch: Amazon Web Services Google Cloud Snitch: Multiple Regions
  • 14. Cassandra Virtual Nodes Ring without VNodes Contiguous token Contiguous data range One large range Ring with Vnodes Non-contiguous token Non-contiguous data range Many smaller ranges Why Vnodes Even distribution of data Faster rebuild of a node failure
  • 16. Client Requests Client Running application, read/write requests (JAVA, Python, C++, PHP, etc…) Coordinator • Handles the requests • Finds Nodes based on Partitioner and Replica Placement Strategy • Any Node can act as the Coordinator
  • 17. Consistency Levels Write and Read • Tunable consistency • Specify how many replicas must respond to consider an operation a success Levels valid for both WRITE and READ: One, Two, Three, Quorum, All, Each_Quorum, Local_Quorum, Local_One. Any: WRITE only. QUORUM = floor(RF/2) + 1 R + W > N gives strong consistency, where R: #replicas needed for a read, W: #replicas needed for a write, N: replication factor
  • 18. The Write Path Write (Client) → CommitLog (Durability) and Memtable → (flushed to) SSTable → (Compaction)
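The quorum arithmetic on this slide is easy to check with a short sketch. It assumes the usual definition of a quorum as a majority of replicas, floor(RF/2) + 1; the function names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for the given RF."""
    return replication_factor // 2 + 1


def reads_see_latest_write(r: int, w: int, n: int) -> bool:
    """True when read and write replica sets must overlap (R + W > N)."""
    return r + w > n


if __name__ == "__main__":
    for rf in (2, 3, 4, 5):
        print("RF", rf, "-> quorum", quorum(rf))
    # QUORUM reads plus QUORUM writes with RF = 3: 2 + 2 > 3, so they overlap.
    print(reads_see_latest_write(quorum(3), quorum(3), 3))
    # ONE reads plus ONE writes with RF = 3: 1 + 1 > 3 is false.
    print(reads_see_latest_write(1, 1, 3))
```

The overlap condition is why QUORUM for both reads and writes is a common default: at least one replica in every read set has seen the latest acknowledged write.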
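The write path can be modelled with a toy in-memory node. This is a sketch under strong assumptions: `ToyNode` and its `memtable_limit` are hypothetical, everything lives in Python lists and dicts rather than on disk, and real Cassandra resolves conflicts by write timestamp rather than by SSTable order.

```python
class ToyNode:
    """Toy model of commit log -> memtable -> SSTable -> compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory, mutable
        self.sstables = []        # immutable, each sorted by key
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, value))
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new sorted, immutable table."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        """Merge all SSTables into one; newer tables win on key conflicts."""
        merged = {}
        for table in self.sstables:   # oldest first
            merged.update(table)
        self.sstables = [sorted(merged.items())]


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    for key, value in [("b", 1), ("a", 2), ("a", 3), ("c", 4)]:
        node.write(key, value)
    node.compact()
    print(node.sstables)
```

Writes never modify an existing SSTable, which keeps disk access sequential; compaction is where superseded values are finally dropped.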
  • 19. C* Data Model Cassandra Backbone for efficient queries First look at the queries First Concepts Keyspace: similar to a relational schema, contains Column Families Column Family(CF): similar to a relational table, contains the data Super Column (SC): contains multiple CFs
  • 20. C* Data Model Next Concepts – Column based key value store (multi level dictionary) – Think of it as a JSON representation or Map [String, Map [String, Data] ] – SST: Sorted String Table Column Family: "Street Monitor"; Keys: "Hollywood", "Santa Monica"; Columns and Values: {"Street Monitor": {"Hollywood": {"avg_speed": 75, "vehicles": 45, "time": "2015-03-02 09:35"}, "Santa Monica": {"avg_speed": 35, "vehicles": 50, "time": "2015-03-02 10:35"}}}
  • 21. C* PRIMARY KEY Last Concept – Primary Key: – Remember JSON format – Storage is a 2-level nested HashMap – A table/column family has a Primary Key which consists of Level 1: Partition key and clustering key Level 2: Clustering key and data Partition Key: responsible for hashing the data to the corresponding physical machine; make it random to have evenly distributed datasets. Clustering Key: responsible for ordering the data inside the table.
  • 22. C* PRIMARY KEY Types of C* Primary Key: 1. Compound Key Exactly one Partition Key e.g. PK(Partition_key1) PK(Partition_key1, Clustering_key1) PK(Partition_key1, Clustering_key1, Clustering_key2,…) 2. Composite Key Two or more Partition Keys, careful with the syntax e.g. PK( ( Partition_key1, Partition_Key2 ) ) PK( ( Partition_key1, Partition_Key2,…), Clustering_Key1, Clustering_Key2,…) Skinny Rows & Wide Rows Skinny Rows: if the Primary Key contains only the partition Key Wide rows: if the Primary Key contains columns other than the partition key
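The two-level nested map described on the primary key slides can be sketched with plain dictionaries. `ToyTable` is a hypothetical class and the sample rows are made up for illustration; the outer key plays the role of a composite partition key and the inner key the role of the clustering key.

```python
class ToyTable:
    """Outer dict: partition key -> inner dict: clustering key -> row."""

    def __init__(self):
        self._partitions = {}

    def insert(self, partition_key, clustering_key, row):
        # A second insert with the same full primary key overwrites the row.
        self._partitions.setdefault(partition_key, {})[clustering_key] = row

    def read_partition(self, partition_key):
        """Return the partition's rows ordered by clustering key."""
        rows = self._partitions.get(partition_key, {})
        return [row for _, row in sorted(rows.items(), key=lambda kv: kv[0])]


if __name__ == "__main__":
    table = ToyTable()
    # Partition key (onstreet, year, month); clustering key (day, time).
    table.insert(("I-10", 2015, 3), (21, 540), {"speed": 62})
    table.insert(("I-10", 2015, 3), (2, 360), {"speed": 55})
    table.insert(("I-405", 2015, 3), (1, 0), {"speed": 40})
    print(table.read_partition(("I-10", 2015, 3)))
```

Looking up a partition needs the whole partition key, which mirrors why CQL requires every partition key column to be restricted before a query can be routed to a node.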
  • 23. BDTS Example Keyspace: highway client.CreateTable (""" CREATE TABLE IF NOT EXISTS highway.street_monitoring ( ONSTREET varchar, YEAR int, MONTH int, DAY int, TIME int, POSTMILE float, DIRECTION int, FROMSTREET varchar, TOSTREET varchar, SPEED int, VOLUME int, OCCUPANCY int, HOVSPEED int, PRIMARY KEY ( ( ONSTREET,YEAR,MONTH ),DAY,TIME,POSTMILE,DIRECTION ) ); """) client.CreateIndex("highway","street_monitoring","SPEED") client.CreateIndex("highway","street_monitoring","FROMSTREET") client.CreateIndex("highway","street_monitoring","TOSTREET") client.CreateTable (""" CREATE TABLE IF NOT EXISTS highway.regional_monitoring ( REGION varchar, SPEED int, VOLUME int, OCCUPANCY int, HOVSPEED int, YEAR int, HH int, MONTH int, DAY int, SENSOR_ID int, PRIMARY KEY ( ( REGION,YEAR,MONTH ),HH, DAY, SENSOR_ID ) ); """) client.CreateIndex("highway","regional_monitoring","SPEED") street_monitoring: Partition Key: Onstreet, Year, Month Clustering Key: Day, Time, Postmile, Direction regional_monitoring: Partition Key: Region, Year, Month Clustering Key: HH, Day, Sensor_id
  • 24. C* Queries • Pure CQL does not support: JOINS and sub-queries || GroupBy and OrderBy only on clustering columns • No aggregate functions are supported in Cassandra 2.0.x; later versions add them • Always need to restrict the preceding part of the key Guidelines: Partition key columns support the = operator The last column in the partition key supports the IN operator Clustering columns support the =, >, >=, <, and <= operators Secondary index columns support the = operator Query1 – Some parts of Partition key SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND month=3 Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Partition key part year must be restricted since preceding part is" Query2 – Full partition key SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2015 AND month IN(2,4) Error Message: None
  • 25. C* Queries Query3 – Range on Partition Key SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2014 AND month<=1 Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)" Query4 – Only secondary index SELECT * FROM highway.street_monitoring WHERE day>=2 AND day<=3 Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING Query5 – Range on Secondary Index SELECT * FROM highway.street_monitoring WHERE onstreet='47' AND year=2014 AND month=2 AND day=21 AND time>=360 AND time<=7560 AND speed>30 Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
  • 26. Cassandra Configurations {CASSANDRA_HOME}/conf/cassandra.yaml listen_address: < internal IP address for rest of the nodes – communication, gossip /> broadcast_address: < external IP when deployed in multiple regions /> rpc_address: < address for drivers access – internal IP, hostname/> seeds: < addresses of seed nodes /> commitlog_directory: < commitLog is written sequentially all the time, affected by the seek time/> saved_caches_directory: < tables keys and row caches are stored /> data_file_directories: < SST tables, holds all data written to the nodes /> {CASSANDRA_HOME}/conf/cassandra-env.sh MAX_HEAP_SIZE sets Maximum Heap Size for JVM Default 1gb, do not set it too high, max 8gb HEAP_NEWSIZE new generation size, good guideline is 100 MB per CPU core
  • 27. C* Last Call • Real-Time Data – Clustering CREATE TABLE latest_temperatures ( weatherstation_id text, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id, event_time) ) WITH CLUSTERING ORDER BY (event_time DESC);
  • 28. Big Data Analytics Stack - BDAS Origin Berkeley AMP Lab Multiple Packages Multiple Data Sources Spark is the Kernel of Functionality
  • 29. What is Spark • An open-source cluster computing platform for fast, general-purpose, large-scale data analytics • In-memory computations • Software suite • Built in Scala • Highly Accessible (Java, Python, Scala, SQL APIs) • Started in 2009 at the Berkeley AMP Lab • Master-Slave Architecture • Still developing, numerous contributors
  • 30. Spark vs Hadoop Hadoop Issues 1. Data Replication 2. Disk I/O 3. Serialization increases execution time Performance Degradation when applying: 1. Iterative Jobs 2. Interactive Analytics
  • 31. Spark vs Hadoop • Spark Characteristics against Hadoop MR Data Reuse Interactive data analytics Ad-hoc queries Iterative algorithms ( Machine Learning Algorithms – MLlib, Graph Processing Algorithms – GraphX ) Real-time data flow processing ( Spark Streaming ) Up to 100x faster in memory, 10x faster on disk
  • 32. Daytona Competition 2014 Goal: sort 100 TB of data Hadoop: generates 3100 GB/s of disk I/O time: 300% of Spark Spark: Generates 500 GB/s of disk I/O All the sorting took place on disk (HDFS), without using Spark’s in-memory cache http://sortbenchmark.org/
  • 33. RDDs • Stands for Resilient Distributed Datasets Definition An RDD is an immutable, in-memory collection of objects. Each RDD is split into multiple partitions, which in turn are computed on different nodes of the cluster. RDDs can be: (1) External Datasets or (2) Parallelizing Collections
  • 34. RDDs Operations • Two Distinct Important Operations transformations: return a new RDD actions: return a final value Scala code: val visits = spark.hadoopFile("hdfs:// … ") /* transformation */ val counts = visits.map( v => (v.url,1)) .reduceByKey((a,b) => a + b) /* action */ counts.collect() • Lazy Evaluation Spark starts the execution only when an action is called. Until then, Spark internally stores metadata describing how to compute the transformed data.
  • 35. Spark Fault Tolerance • RDD lineage Spark logs information for different RDDs Information is derived from transformations (e.g. map, filter, join) Crucial for data recovery upon partition failure
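Lazy evaluation and lineage, as described on the two slides above, can be imitated in plain Python. This is a conceptual sketch only: `LazyDataset` is a hypothetical class, it runs on a single machine, and it is not the Spark API. It shows that transformations merely record a step, while the action replays the recorded lineage over the source data.

```python
class LazyDataset:
    """Records transformations as lineage; computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage   # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: likewise only extends the lineage.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the lineage over the source and return the result.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


if __name__ == "__main__":
    doubled = LazyDataset([1, 2, 3]).map(lambda x: x * 2)
    large = doubled.filter(lambda x: x > 2)
    print(large.collect())
```

Because each dataset keeps the full recipe from the source, a lost result can be rebuilt by calling `collect()` again, which is the idea behind lineage-based recovery.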
  • 36. Spark Runtime Driver • Central Coordinator • Creates a Spark Context • Where the main() method runs • Creates RDDS • Performs RDDs transformations and actions Cluster Manager Manages Cluster Resources Cluster Worker Contains the Spark worker, Executor Driver + Executors = Spark Application
  • 37. Spark Driver Duties 1. Convert User Program into Tasks 2. Schedule Tasks on Executors • The Driver is responsible for coordinating the individual tasks on the Executors • Checks the Executors and delivers the tasks to the appropriate location • Tasks from different Applications run on different JVMs
  • 38. Executors • Properties Worker Processes Launch at the start Die when application ends • Mission 1. Run individual Tasks & return results to the Spark Driver 2. In memory storage for the RDDs [ .cache() | .persist() ]
  • 39. Spark Cluster Managers • Spark Standalone Scheduler FIFO scheduling • For multi-tenancy systems: Hadoop YARN recommended when dealing with HDFS for fast access due to nodes locality Apache Mesos fine-grained: static memory, dynamic cores coarse-grained: static memory and cores
  • 40. SPARK Installation Key Concepts: 1. Spark versions: 1.2.1(stable), 1.3.1(latest release), of course previous 2. Apache Maven 3. Scala Build Tools (sbt) 4. Hadoop Version HDFS protocol compatible across versions (e.g. 2.2.x, 2.3.x, default 1.0.4) Apache Maven and SBT are required for configuring any dependencies or plugins required for project installation (add new plugins, handle exceptions) Examples: Maven changes apply to: {SPARK_HOME}/pom.xml SBT changes apply to: {SPARK_HOME}/sbt/sbt
  • 41. Spark Configurations Master -- spark-defaults.conf spark.driver.cores #num of cores spark.driver.memory 2156m spark.driver.host instance-trans1 spark.driver.port 3389 spark.fileserver.port #Web UI port spark.broadcast.port 4045 spark.blockManager.port 7080 spark.executor.port 8089 spark.executor.memory 9024m spark.eventLog.enabled true spark.master spark://instance-trans1.c.bubbly-operator-90323.internal:7077 spark.cassandra.connection.host #Internal IP Cassandra Spark intercommunication spark.ui.port #Web UI port
  • 42. C* SPARK Integration Why? • Leverage RDDs functionality • Aggregation (SUM, AVG) • Cross-table Operations (JOIN, UNION) • Real-time Batch Processing • Complex Analytics (MLlib, GraphX) • Really Fast Locality Awareness
  • 43. C* SPARK BDTS Specifications: 6 Servers, 4 Core CPU, 15GB RAM
  • 44. Cassandra JOINS Spark From Simple to Very Complex SIMPLE: val cc = new CassandraSQLContext(sc) val config = sc.cassandraTable("highway","highway_config").select("config_id","agency","link_id").where("onstreet = ?","I-605") val joined = config.joinWithCassandraTable("highway","highway_history").select("speed") val specified_joined = joined.on(SomeColumns("config_id","agency","link_id")) joined.collect() COMPLEX: val hists = sc.cassandraTable("highway", "highway_history").where("config_id = ?","85").where("agency = ?","Caltrans-D7").where("event_time = ?","2014-04-03 02:47") val configs = sc.cassandraTable("highway","highway_config_metrics").where("onstreet = ?","SR-60").where("fromstreet = ?","EUCLID") val histsKeyBy = hists.map(f => (((f.getInt("config_id"),f.getString("agency"))),(f.getString("event_time"),f.getInt("occupancy"),f.getInt("speed"),f.getInt("volume"),f.getInt("hovspeed")))) val configsKeyBy = configs.map(f => ((f.getInt("config_id"),f.getString("agency")),(f.getString("event_time"),f.getString("onstreet"),f.getString("fromstreet"),f.getInt("direction")))) val joined = histsKeyBy.join(configsKeyBy) joined.collect() val speed_avg = joined.map(x => x._2._1._2).mean()
  • 45. The End NEXT Cassandra Performance Tuning Spark RDD Functional Programming Thank you