Powerful big data processing and storage combined: this presentation walks through the basics of integrating Apache Spark and Apache Cassandra. Presented by Alex Thompson at the Sydney Cassandra Meetup.
1. Cassandra and Spark
powerful big data processing and storage combined
Alex Thompson @ Datastax Sydney - Spark Meetup May 2016 @ Macquarie Bank
2. Parallel execution*
The core of any parallel programming framework or parallel programming language is the function, plus some very simple rules:
1. A function must take at least one argument and return a value.
2. Variables used within a function or passed to a function can have no other scope than that of the function.
The passing of arguments to a function is the "Message Passing" part you will often see referred to in parallel programming languages and frameworks.
* You will see parallel execution referred to as parallel programming, distributed computing, or cluster computing. You require a functional programming language to work within this paradigm, or a language that has been retrofitted to work in parallel environments.
3. Functional programming example
function myFunction(arg1, arg2)
{
    var1 = arg1;        // locals are scoped to the function only
    var2 = arg2;
    var3 = var1 * var2;
    return var3;        // a value is always returned
}
Because the above function is self-encapsulated (it receives a message, performs some work, and returns a value), it can be run on any core or any compute node asynchronously, in isolation from other processes.
A parallel programming framework can throw this function at an empty or underutilised core or node for processing, thus distributing its workload across many servers.
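A minimal Scala sketch of the same idea (Scala is the language used for the Spark examples later in this deck; the function and values here are illustrative): because each call is independent, a parallel collection can schedule the calls on whatever cores are free.

// A pure function: takes arguments, returns a value, touches no outside state.
def myFunction(arg1: Int, arg2: Int): Int = arg1 * arg2

// A parallel map hands each call to an idle core; no call depends on another.
val results = (1 to 8).par.map(n => myFunction(n, 2))
println(results)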
4. Functional programming execution
[Diagram: myFunction(arg1, arg2), function_1(), function_2(), function_3(), ... being pushed to compute nodes 1, 2 and 3]
Parallel framework distributes functions to compute nodes for execution.
Examples of parallel-capable frameworks: Akka (Scala), the Erlang VM, MPI (C), Linda (C), etc.
Functions can be pushed to nodes in a variety of ways: based on load, availability, simple round-robin, etc.
Many patterns exist for the design of parallel systems, e.g. MPI and the Actor pattern, but all involve some form of message passing.
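As a hedged sketch of the Actor pattern named above, using Akka's classic actor API (the class and message names here are invented for illustration):

import akka.actor._

// The work request travels as an immutable message: the "message passing"
// referred to above. The actor shares no state with its caller.
case class Multiply(a: Int, b: Int)

class Multiplier extends Actor {
  def receive = {
    case Multiply(a, b) => println(s"result: ${a * b}")
  }
}

object ActorDemo extends App {
  val system = ActorSystem("demo")
  val worker = system.actorOf(Props[Multiplier], "worker")
  worker ! Multiply(3, 4) // fire-and-forget: the framework decides where it runs
}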
5. Hadoop split/job architecture
Hadoop was originally a file-based, general-purpose distributed file system and parallel execution environment, developed to pull apart large web search logs by splitting up files, distributing the splits to compute nodes, and passing a Java application to those nodes to work over each split.
[Diagram: a master M splits the data file and ships each split, together with a .jar job, to compute nodes 1-4]
6. C* and Spark split/job architecture
You can do the same thing with Spark, but when you introduce C* into the equation your splits are already done: the splits are the partitioning of data across the nodes in a C* ring:
[Diagram: the master M ships only a .jar to each of C* nodes 1-4; each node already holds its own partition of the table data]
8. Spark workers and executors are deployed directly on C* nodes; they run in a separate JVM but have direct access (no network hops) to the C*-resident data on the local node.
[Diagram: The Execution Stack. On Node 1, Spark is installed alongside Cassandra; a Spark Worker hosts Executors, each running Tasks against the local table data, while a Driver and Cluster Manager coordinate Node 1, Node 2, ...]
9. Spark, Scala, C* and SparkC*Driver versions
Apache Spark, Typesafe Scala, Apache Cassandra and the SparkCassandraDriver* are all very fast-moving projects; you will see major point releases every couple of months on some of them, so you have to be very aware of the versions in the software stack when producing a solution.
[Diagram: The Solution Stack. Scala code, Scala libraries, Spark libraries and the driver* are built with the Scala build tool (sbt) and run on Spark]
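One way to keep the stack aligned is to pin every version in your sbt build. A sketch (versions chosen to match the stack used on the following slides; check the connector's version compatibility table before copying):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.2.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
)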
10. Memory allocation and Spark
Memory settings are required at each level:
● Driver application
● Spark master
● Workers
● Executors
The defaults are sufficient for 'hello world' type applications and lightweight processing; in the real world you will usually need more memory allocated to workers and executors down at node level. All processing should be pushed down to the nodes where possible: if you find your driver application is spiralling upward in RAM requirements or timing out, you are probably doing something wrong, such as calling take() or collect() at driver level.
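For example (the values below are illustrative assumptions, not recommendations; tune them for your workload and hardware), driver and executor memory can be raised in spark-defaults.conf:

spark.driver.memory    2g
spark.executor.memory  4g

and the per-node worker limit in conf/spark-env.sh:

export SPARK_WORKER_MEMORY=8g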
11. Set up your development environment 1
The following is based on:
spark-cassandra-connector 1.2
spark 1.2
hadoop 1
scala version 2.10
Download the spark-cassandra-connector .zip file from the GitHub project:
https://github.com/datastax/spark-cassandra-connector
Unpack it, place it in the /opt directory, and cd into it.
Download sbt-launch.jar from:
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/{version}/sbt-launch.jar
and place it in the spark-cassandra-connector/sbt directory.
12. Set up your development environment 2
Build the connector:
/opt/spark-cassandra-connector-master> java -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m -jar sbt/sbt-launch.jar "assembly"
This will build a standard Scala connector with all dependencies in:
/opt/spark-cassandra-connector-master/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-*.jar
and will build a Java connector with all dependencies in:
/opt/spark-cassandra-connector-master/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.3.0-SNAPSHOT.jar
Then add the jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:
spark.executor.extraClassPath spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar
13. Set up your development environment 3
Install sbt on your OS (here Ubuntu):
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-get update
sudo apt-get install sbt
Create a hello world project:
http://www.scala-sbt.org/0.13/tutorial/Hello.html
and run it:
>cd /opt/spark-cassandra-connector-project
>sbt
>run
Hi!
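The tutorial's one-file program, which produces the "Hi!" above, is essentially:

object Hi {
  def main(args: Array[String]) = println("Hi!")
}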
14. Create a C* project
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object CassandraTest {
  def main(args: Array[String]) {
    // Point the connector at a C* node and the context at the Spark master.
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
    // Expose the C* table test.users as an RDD.
    val rdd = sc.cassandraTable("test", "users")
    println(rdd.count)
    println(rdd.first)
  }
}
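Writes go the other way with the same connector import. A hedged sketch (it assumes test.users has user_id and name columns; no schema is given in this deck):

// Build a small RDD locally and push it down into the C* table.
val users = sc.parallelize(Seq((1, "anna"), (2, "bob")))
users.saveToCassandra("test", "users", SomeColumns("user_id", "name"))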
15. Start your Spark Master and Workers
Start a standalone master server by executing:
>cd /opt/spark-1.2.1-bin-hadoop1
>./sbin/start-master.sh
(Note: Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to
SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.)
Start one or more workers and connect them to the master via:
>./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
(./bin/spark-class org.apache.spark.deploy.worker.Worker spark://dse-vm:7077) - spark://dse-vm:7077 is the address of the master
(Note: Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its
number of CPUs and memory (minus one gigabyte left for the OS)).
Submit your job:
>cd /opt/spark-cassandra-connector-project
>sbt
>run
sbt will list the main classes it has found; choose the one you want, submit it, and it will run.
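An alternative not used in this walkthrough is to build an assembly jar (this requires the sbt-assembly plugin) and hand it to spark-submit; the jar name below is illustrative:

>sbt assembly
>./bin/spark-submit --class CassandraTest --master spark://dse-vm:7077 /opt/spark-cassandra-connector-project/target/scala-2.10/cassandra-test-assembly.jar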
16. Joins in SparkSQL:
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.{SparkConf, SparkContext}

// CassandraCapable is a helper trait defined elsewhere (presumably in the
// example project linked on the next slide), not in this deck.
object SparkSqlQuery extends CassandraCapable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setJars(Array("target/scala-2.10/spark_bulk_ops-assembly.jar"))
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
    val connector = CassandraConnector(conf)
    //inserts...
    // CassandraSQLContext lets SparkSQL join across C* tables.
    val cassandraContext = new CassandraSQLContext(sc)
    val rdd = cassandraContext.sql("select t.tag, count(*) as cnt from activity_stream_api.activity a " +
      "join activity_stream_api.tag_activity t on t.activity_id = a.activity_id group by t.tag order by cnt")
    rdd.collect().foreach(f => println(f))
  }
}
17. Joins in SparkSQL, bulk import/export into C*: for everything you wanted to do with C* and Spark but were too afraid to try, visit the following github example code site. It also includes a full C* application template you can use for your own systems:
https://github.com/rssvihla/spark_commons