1. From 0 to Streaming
Cassandra and Spark Streaming
Russell Spitzer
2. Who am I?
• Bioinformatics Ph.D. from UCSF
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Writes FAQs for Spark troubleshooting
http://www.datastax.com/dev/blog/common-spark-troubleshooting
3-5. From 0 to Streaming
Spark: How does it work? What are the main components? Cluster layout, Spark Submit
Connecting Cassandra To Spark: Spark Cassandra Connector, Spark SQL, RDD Basics
Spark Streaming: Streaming Basics, Writing Streaming Applications, Custom Receivers
7. Spark is a Distributed Analytics Platform
• Generalized DAG execution
• Integrated SQL queries
• Streaming
• Easy abstraction for datasets
• Support in lots of languages
All in one package!
8. Spark Provides a Simple and Efficient Framework for Distributed Computations
Node roles: 2 (Master, Worker)
In-memory caching: Yes!
Generic DAG execution: Yes!
Great abstraction for datasets? RDD!
[Diagram: a Spark Master and several Spark Workers; each Worker runs a Spark Executor holding Spark Partitions of a Resilient Distributed Dataset]
9-11. Spark Provides a Simple and Efficient Framework for Distributed Computations
[Diagram: Spark Master, Spark Workers, and Spark Executors holding partitions of a Resilient Distributed Dataset]
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by the Worker; the workhorse of the Spark application
12. RDDs Can be Generated
from a Variety of Sources
Textfiles
Parallelized Collections
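Both sources come straight off the SparkContext; a minimal shell sketch (assuming a live sc and a local num.txt like the one used on the next slide):

val fromFile = sc.textFile("num.txt")    // RDD[String], one element per line
val fromColl = sc.parallelize(1 to 100)  // RDD[Int] from a local collection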
14-20. Transformations and Actions
RDDs are immutable
New RDDs are created with transforms
Only when we call an action are the transforms applied

val rdd = sc.textFile("num.txt")
val rdd2 = rdd.map( x => x.toInt * 2 )
val rdd3 = rdd2.filter( _ > 4 )
rdd3.collect

[Diagram: Create → rdd → Transform → rdd2 → Transform → rdd3; nothing runs until the collect ACTION walks the chain back through rdd2 and rdd]
21-26. Application of Transformations is done one Partition per Executor
[Diagram: an RDD with partitions 1-9; each Executor transforms one partition at a time (1 → 1', 2 → 2', …) until all nine partitions of RDD' exist]
27-29. Failed Transformations Can be Redone by Reapplying the Transformation to the Old Partition
[Diagram: a node failure loses partition 5'; reapplying the transformation to the surviving partition 5 regenerates 5']
Because the operations on any partition can be traced backwards, we can recover from a failure without recomputing the entire RDD.
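You can inspect the lineage Spark keeps for this recovery directly in the shell; a quick sketch reusing rdd3 from the earlier slides:

scala> rdd3.toDebugString  // prints the chain of parent RDDs (filter <- map <- textFile)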
30. Use the Spark Shell to quickly try out code samples
Available in Scala (spark-shell) and Python (pyspark)
31. SparkContext is the Core API for all Communication with Spark
val conf = new SparkConf()
.setAppName(appName)
.setMaster(master)
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
new SparkContext(conf)
Almost all options can also be set as environment
variables or on the command line during spark-submit!
32. Deploy Compiled Jars using Spark Submit
https://spark.apache.org/docs/1.1.0/submitting-applications.html
Some of the commonly used options are:
--class: The entry point for your application
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--conf: Arbitrary Spark configuration property in key=value format.

spark-submit --class MainClass JarYouWantDistributedToExecutor.jar

[Diagram: spark-submit ships the jar to the Spark Master, which distributes it to the Workers]
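Putting those options together with the connector, a hypothetical invocation might look like the following (the master address and connection host are placeholders for your own cluster):

spark-submit \
  --class MainClass \
  --master spark://127.0.0.1:7077 \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  JarYouWantDistributedToExecutor.jar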
34. Co-locate Spark and C* for Best Performance
[Diagram: Spark Master and Spark Workers running on the same nodes as the C* ring]
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
35. Use a Separate Datacenter for your Analytics Workloads
[Diagram: an OLTP C* datacenter replicating to a second C* datacenter (OLAP) that also runs the Spark Master and Workers]
37. DataStax OSS Connector: Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector
[Diagram: Cassandra keyspaces and tables map to Spark RDD[CassandraRow] and RDD[Tuples]]
Bundled and supported with DSE > 4.5!
38. Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*
Each Executor maintains a connection to the C* cluster through the DataStax Java Driver.
RDDs are read as splits built from sets of tokens (Tokens 1-1000, Tokens 1001-2000, …) that together cover the full token range.
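How many tokens land in each split is tunable; a sketch extending the SparkConf from the earlier slide (the property name is an assumption taken from the connector 1.x ReadConf, so verify it against your connector version's docs):

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "100000") // approx. C* partitions per Spark split (assumed 1.x property)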
39. Setting up C* and Spark
DSE > 4.5.0
Just start your nodes with
dse cassandra -k
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
40. Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
41. Requirements for Following Code Examples
The following examples are targeted at
Spark 1.1.x
Cassandra 2.0.x
or, if you are using DataStax Enterprise,
DSE 4.6.x
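If you are following along outside DSE you also need the connector on your classpath; a build.sbt sketch matched to Spark 1.1.x (the version numbers are an assumption, so pick the release matching your cluster):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",      // provided by the cluster at runtime
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"   // match to your Spark/C* versions
)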
42. Basics: Getting a Table and Counting
CREATE KEYSPACE candy WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE candy;
CREATE TABLE inventory ( brand text, name text, amount int, PRIMARY KEY (brand, name) );
CREATE TABLE requests ( user text, name text, amount int, PRIMARY KEY (user, name) );
INSERT INTO inventory (brand, name, amount) VALUES ('Wonka', 'Gobstopper', 10);
INSERT INTO inventory (brand, name, amount) VALUES ('Wonka', 'WonkaBar', 3);
INSERT INTO inventory (brand, name, amount) VALUES ('CandyTown', 'SugarMountain', 2);
INSERT INTO inventory (brand, name, amount) VALUES ('CandyTown', 'ChocoIsland', 5);
INSERT INTO requests (user, name, amount) VALUES ('Russ', 'WonkaBar', 2);
INSERT INTO requests (user, name, amount) VALUES ('Russ', 'ChocoIsland', 1);
43-44. Basics: Getting a Table and Counting
With the schema above in place:

scala> val rdd = sc.cassandraTable("candy","inventory")
scala> rdd.count
res13: Long = 4

[Diagram: cassandraTable → count → 4]
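Outside the DSE shell, cassandraTable only appears on the SparkContext after importing the connector's implicits; a minimal sketch:

import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra to SparkContext and RDDs
val rdd = sc.cassandraTable("candy", "inventory")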
49. Getting Values From Cassandra Rows
scala> sc.cassandraTable("candy","inventory").take(1)(0).get[Int]("amount")
res5: Int = 10
50-56. Getting Values From Cassandra Rows
[Diagram: cassandraTable → take(1) → Array of CassandraRows (Wonka, Gob, 10) → get[Int] → 10]
A case class lets the connector map rows straight to typed fields:

scala> case class invRow(brand: String, name: String, amount: Integer)
scala> sc.cassandraTable[invRow]("candy","inventory").take(1)(0).amount

[Diagram: cassandraTable → take(1) → Array of invRows (brand, name, amount) → amount → 10]
Supported type mappings:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSupportedTypes.html
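CassandraRow also carries named typed accessors if you prefer them to get[T]; a shell sketch (the outputs shown assume the same first row as above):

scala> val row = sc.cassandraTable("candy", "inventory").first
scala> row.getInt("amount")    // 10
scala> row.getString("brand")  // Wonka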
57. Saving Back to Cassandra
CREATE TABLE low ( brand text, name text, amount int, PRIMARY KEY ( brand, name ));
58-61. Saving Back to Cassandra
sc.cassandraTable[invRow]("candy","inventory")
  .filter( _.amount < 5 )
  .saveToCassandra("candy","low")

[Diagram: cassandraTable → filter (anonymous param _, keeps rows with amount < 5) → saveToCassandra → C* cluster]
Under the hood this is done via the Cassandra Java Driver
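saveToCassandra can also write plain tuples when given an explicit column mapping; a sketch using the connector's SomeColumns selector (the sample row is made up):

sc.parallelize(Seq(("Wonka", "Gobstopper", 4)))                    // RDD of (brand, name, amount) tuples
  .saveToCassandra("candy", "low", SomeColumns("brand", "name", "amount"))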
62. Several Easy Ways To Use the
Spark Cassandra Connector
• SparkSQL
• Scala
• Java
• RDD Manipulation
• Scala
• Java
• Python
63. Spark SQL Provides a Fast SQL-Like Syntax For Cassandra!
[Diagram: HQL/SQL queries go through Catalyst to a Query Plan (grab data, filter, group, return results) and come out as a SchemaRDD]
SQL in, RDDs out
64. Building a Context Object For Interacting with Spark SQL
In the DSE Spark Shell both the HiveContext and the CassandraSQLContext are created automatically on startup.

import org.apache.spark.sql.cassandra.CassandraSQLContext
val sc: SparkContext = ...
val csc = new CassandraSQLContext(sc)

JavaSparkContext jsc = new JavaSparkContext(conf);
// create a Cassandra Spark SQL context
CassandraSQLContext csc = new CassandraSQLContext(jsc.sc());

Since the HiveContext requires the Hive driver to access C* directly, the HiveContext is only available in DSE.
Workaround: get SchemaRDDs with the CassandraSQLContext, then register them with the HiveContext.
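A sketch of that workaround (assumes an HiveContext hc is already in scope, as in the DSE shell):

val inventory = csc.sql("SELECT * FROM candy.inventory") // SchemaRDD from the Cassandra context
inventory.registerTempTable("inventory")                 // now visible to hc.sql queries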
65-67. Reading Data From Cassandra With SQL Syntax
scala> csc.sql("SELECT * FROM candy.inventory").collect
Array[org.apache.spark.sql.Row] = Array(
  [Wonka,Gobstopper,10],
  [Wonka,WonkaBar,3],
  [CandyTown,ChocoIsland,5],
  [CandyTown,SugarMountain,2]
)
[Diagram: QueryPlan → SchemaRDD]
68-69. Counting Data From Cassandra With SQL Syntax
scala> csc.sql("SELECT COUNT(*) FROM candy.inventory").collect
res5: Array[org.apache.spark.sql.Row] = Array([4])
70-71. Joining Data From Cassandra With SQL Syntax
scala> csc.sql("
  SELECT * FROM candy.inventory as inventory
  JOIN candy.requests as requests
  WHERE inventory.name = requests.name").collect
res12: Array[org.apache.spark.sql.Row] = Array(
  [Wonka,WonkaBar,3,Russ,WonkaBar,2],
  [CandyTown,ChocoIsland,5,Russ,ChocoIsland,1]
)
72-73. Insert to another Cassandra Table
csc.sql("
  INSERT INTO candy.low
  SELECT * FROM candy.inventory as inv
  WHERE inv.amount < 5
").collect
75-78. Streaming is Cool
and if you like Streaming you will be cool too
Your Data is Delicious, Like a Candy. You want it right now!
Batch Analytics: Waiting to do analysis after data has accumulated means data may be out of date or unimportant by the time we process it.
Streaming Analytics: We do our analytics on the data as it arrives. The data won't be stale and neither will our analytics.
79. DStreams: the Basic Unit of Spark Streaming
[Diagram: events flow into a Receiver, which publishes a DStream]
Streaming involves a receiver or set of receivers, each of which publishes a DStream.
80. DStreams: the Basic Unit of Spark Streaming
[Diagram: the Receiver's DStream is cut into batches, each batch made up of RDDs]
The DStream is discretized into batches, the timing of which is set in the Spark Streaming Context. Each batch is made up of RDDs.
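Each batch surfaces to your code as an RDD; foreachRDD is the generic hook, shown here as a sketch against the requests stream built later in this deck:

requests.foreachRDD { rdd =>
  println(s"events this batch: ${rdd.count}") // runs once per batch interval
}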
83. Demo Streaming Application: Analyze HttpRequests with Spark Streaming
[Diagram: HttpServer traffic flows into Spark Executors and on into Cassandra]
Source included in DSE 4.6.0
84. Spark Receivers only really need to describe how to publish to a DStream

case class HttpRequest(
  timeuuid: UUID,
  method: String,
  headers: Map[String, List[String]],
  uri: URI,
  body: String)

First we need to define a case class to make moving HttpRequest information around easier. This type will be used to specify what type of DStream we are creating.
85. Spark Receivers only really need to describe how to publish to a DStream

class HttpReceiver(port: Int)
  extends Receiver[HttpRequest](StorageLevel.MEMORY_AND_DISK_2)
  with Logging {

  def onStart(): Unit = {}
  def onStop(): Unit = {}
}

Now we just need to write the code for a receiver to actually publish these HttpRequest objects.
[Diagram: Receiver[HttpRequest]]
86. Spark Receivers only really need to describe how to publish to a DStream

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

def onStart(): Unit = {
  val s = HttpServer.create(new InetSocketAddress(port), 0)
  s.createContext("/", new StreamHandler())
  s.start()
  server = Some(s) // server: an Option[HttpServer] field on the receiver
}

def onStop(): Unit = server map (_.stop(0))

This will start up our server and direct all HTTP traffic to be handled by StreamHandler.
[Diagram: Receiver[HttpRequest] → HttpServer]
87. Spark Receivers only really need to describe how to publish to a DStream

class StreamHandler extends HttpHandler {
  override def handle(transaction: HttpExchange): Unit = {
    val dataReader = new BufferedReader(new InputStreamReader(transaction.getRequestBody))
    val data = Stream.continually(dataReader.readLine).takeWhile(_ != null).mkString("\n")
    val headers: Map[String, List[String]] =
      transaction.getRequestHeaders.toMap.map { case (k, v) => (k, v.toList) }
    store(HttpRequest(  // Receiver.store publishes the event to the DStream
      UUIDs.timeBased(),
      transaction.getRequestMethod,
      headers,
      transaction.getRequestURI,
      data))
    transaction.sendResponseHeaders(200, 0)
    val response = transaction.getResponseBody
    response.close()    // Empty response body
    transaction.close() // Finish transaction
  }
}

StreamHandler actually does the work of publishing events to the DStream.
[Diagram: Receiver[HttpRequest] → HttpServer → StreamHandler]
88. Streaming Context sets Batch Timing

val ssc = new StreamingContext(conf, Seconds(5))
val multipleStreams = (1 to config.numDstreams).map { i =>
  ssc.receiverStream[HttpRequest](new HttpReceiver(config.port))
}
val requests = ssc.union(multipleStreams)
89-90. Create One Receiver Per Node, then Merge the Separate DStreams into One
[Diagram: several Receiver[HttpRequest] instances (each an HttpServer + StreamHandler) are unioned into a single requests[HttpRequest] stream]
91-93. Cassandra Tables to Store HttpEvents
Persist every event that comes into the system:

CREATE TABLE IF NOT EXISTS timeline (
  timesegment bigint,
  url text,
  t_uuid timeuuid,
  method text,
  headers map<text, text>,
  body text,
  PRIMARY KEY ((url, timesegment), t_uuid))

Table for counting the number of accesses to each url over time:

CREATE TABLE IF NOT EXISTS method_agg (
  url text,
  method text,
  time timestamp,
  count bigint,
  PRIMARY KEY ((url, method), time))

Table for finding the most popular url in each batch:

CREATE TABLE IF NOT EXISTS sorted_urls (
  url text,
  time timestamp,
  count bigint,
  PRIMARY KEY (time, count))
94. Persist the events without doing any manipulation

requests.map { request =>
  timelineRow(
    timesegment = UUIDs.unixTimestamp(request.timeuuid) / 10000L,
    url = request.uri.toString,
    t_uuid = request.timeuuid,
    method = request.method,
    headers = request.headers.map { case (k, v) => (k, v.mkString("#")) },
    body = request.body)
}.saveToCassandra("requests_ks", "timeline")
[Diagram: each HttpRequest is mapped to a timelineRow (timesegment, url, t_uuid, method, headers, body) and saved to the C* cluster]
97. Aggregate Requests by URI and Method

requests.map(request => (request.method, request.uri.toString))
  .countByValue()
  .transform((rdd, time) => rdd.map { case ((m, u), c) => ((m, u), c, time.milliseconds) })
  .map { case ((m, u), c, t) => methodAggRow(time = t, url = u, method = m, count = c) }
  .saveToCassandra("requests_ks", "method_agg")
98-101. Aggregate Requests by URI and Method
[Diagram: (method, uri) pairs → countByValue → (method, uri, count) → transform stamps each record with the batch time → saveToCassandra writes (method, uri, count, time) rows to the C* cluster]
102. Sort Aggregates by Batch

requests.map(request => request.uri.toString)
  .countByValue()
  .transform((rdd, time) => rdd.map { case (u, c) => (u, c, time.milliseconds) })
  .map { case (u, c, t) => sortedUrlRow(time = t, url = u, count = c) }
  .saveToCassandra("requests_ks", "sorted_urls")
103-105. Sort Aggregates by Batch
[Diagram: uri → countByValue → (uri, count) → transform adds the batch time → saveToCassandra → C* cluster]
Let Cassandra do the sorting! PRIMARY KEY (time, count)
106. Start the application!
ssc.start()
ssc.awaitTermination()
This will start the streaming application
piping all incoming data to Cassandra!
107. Live Demo
Demo run Script
#Start Streaming Application
echo "Starting Streaming Receiver(s): Logging to http_receiver.log"
cd HttpSparkStream
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES > ../http_receiver.log 2>&1 &
cd ..
echo "Waiting for 60 Seconds for streaming to come online"
sleep 60
#Start Http Requester
echo "Starting to send requests against streaming receivers: Logging to http_requester.log"
cd HttpRequestGenerator
./sbt/sbt "run -i $SPARK_NODE_IPS " > ../http_requester.log 2>&1 &
cd ..
#Monitor Results Via Cqlsh
watch -n 5 './monitor_queries.sh'
109. I hope this gives you some
exciting ideas for your
applications!
Questions?
110. Thanks for coming to the meetup!!
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on
PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly!
Email us: Community@DataStax.com!
Getting started with Cassandra?!
In production?!
Tweet us: @PlanetCassandra!