Cassandra and Spark
powerful big data processing and storage combined
Alex Thompson @ Datastax Sydney - Spark Meetup May 2016 @ Macquarie Bank
Parallel execution*
The core of any parallel programming framework or parallel programming language is the function, governed by
two very simple rules:
1. A function must take at least one argument and return a value.
2. Variables used within a function or passed to a function can have no scope other than that of the
function.
The passing of arguments to a function is the “Message Passing” part you will often see referred to in
parallel programming languages and frameworks.
* You will see parallel execution referred to as parallel programming, distributed computing or cluster computing. To work within this paradigm you need
a functional programming language, or a language that has been retrofitted to work in parallel environments.
Functional programming example
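function myFunction(arg1, arg2)
{
    // every variable is local to the function: nothing outside it is read or written
    var var1 = arg1;
    var var2 = arg2;
    var var3 = var1 * var2;
    return var3;
}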
Because the above function is self-contained (it receives a message, performs some work and returns a value),
it can be run asynchronously on any core or any compute node, in isolation from other processes.
A parallel programming framework can throw this function at an idle or underutilised core or node for processing, thus
distributing the workload across many servers.
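As a rough sketch of the same idea in Scala, a self-contained function can be handed to a thread pool, which here only stands in for the cores or nodes of a cluster. The names multiply and runInParallel are illustrative and do not come from the slides.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PureFunctionDemo {
  // Self-contained function: the result depends only on the arguments,
  // and no state outside the function is read or written.
  def multiply(a: Int, b: Int): Int = {
    val product = a * b
    product
  }

  // Hand each pair of arguments to whichever thread is free.
  // The global thread pool is a stand-in for cluster nodes (an assumption of this sketch).
  def runInParallel(inputs: Seq[(Int, Int)]): Seq[Int] = {
    val futures = inputs.map { case (a, b) => Future(multiply(a, b)) }
    // Future.sequence keeps the results in input order, whatever order they finish in.
    Await.result(Future.sequence(futures), 10.seconds)
  }

  def main(args: Array[String]): Unit =
    println(runInParallel(Seq((1, 2), (3, 4), (5, 6))).mkString(", "))
}
```

Because multiply touches nothing outside its own scope, it does not matter which thread runs which call; the results are the same either way.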
Functional programming execution
[Diagram: the parallel framework pushes myFunction(arg1, arg2), function_1(), function_2(), function_3(), ... to compute nodes 1, 2 and 3.]
The parallel framework distributes functions to compute nodes for execution.
Examples of parallel-capable frameworks: Akka (Scala), the Erlang VM, C-MPI, C-Linda, etc.
Functions can be pushed to nodes in a variety of ways: based on load, availability, simple round-robin, etc.
Many patterns exist for the design of parallel systems, e.g. MPI and the Actor pattern,
but all involve some form of message passing.
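Of the distribution strategies mentioned, simple round-robin is the easiest to sketch. The following is a toy illustration only, not how any particular framework schedules work; the assign function and the node and function names are hypothetical.

```scala
object RoundRobinDemo {
  // Assign each task to a node in turn: task i goes to node (i mod number of nodes).
  // Nodes that receive no task do not appear in the result.
  def assign(tasks: Seq[String], nodes: Seq[String]): Map[String, Seq[String]] = {
    require(nodes.nonEmpty, "at least one node is required")
    tasks.zipWithIndex
      .groupBy { case (_, index) => nodes(index % nodes.size) }
      .map { case (node, indexedTasks) => node -> indexedTasks.map(_._1) }
  }

  def main(args: Array[String]): Unit = {
    val plan = assign(
      Seq("function_1", "function_2", "function_3", "function_4"),
      Seq("node1", "node2", "node3"))
    // Sort by node name so the printout is stable.
    plan.toSeq.sortBy(_._1).foreach {
      case (node, tasks) => println(s"$node <- ${tasks.mkString(", ")}")
    }
  }
}
```

A load-based or availability-based scheduler would replace the modulo step with a lookup of each node's current state.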
Hadoop split/job architecture
Hadoop was originally a file-based, general-purpose distributed file system and parallel execution
environment, developed to pull apart large web search logs by splitting up files, distributing the splits to
compute nodes and passing a Java application (.jar) to those nodes to work over the split files.
[Diagram: a master (M) splits the data and sends each of nodes 1 to 4 a data split together with the .jar.]
C* and Spark split/job architecture
You can do the same thing with Spark, but when you introduce C* (Cassandra) into the equation your splits are
already done: they are the partitions of data spread across the nodes in a C* ring.
[Diagram: the master (M) sends the .jar to each of nodes 1 to 4; each node already holds its own share of the table data.]
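To make "the partitioning is the split" concrete, here is a deliberately simplified sketch of a token ring. It assumes a four-node ring with hand-picked token ranges and uses String.hashCode as a toy token function; Cassandra's default partitioner is Murmur3-based and works over a far larger token range. All names are hypothetical.

```scala
object TokenRingDemo {
  // Hypothetical four-node ring: each node owns the tokens up to and including its upper bound.
  val ring: Seq[(Int, String)] =
    Seq(25 -> "node1", 50 -> "node2", 75 -> "node3", 100 -> "node4")

  // Toy token function mapping a partition key to 1..100.
  // Assumption for illustration only; this is not Cassandra's hashing.
  def token(partitionKey: String): Int =
    math.abs(partitionKey.hashCode % 100) + 1

  // The node whose token range contains the key's token.
  def ownerOf(partitionKey: String): String = {
    val t = token(partitionKey)
    ring.collectFirst { case (upperBound, node) if t <= upperBound => node }.get
  }

  // Group keys by owning node: each group is a ready-made "split".
  def splits(keys: Seq[String]): Map[String, Seq[String]] =
    keys.groupBy(ownerOf)

  def main(args: Array[String]): Unit =
    splits(Seq("alice", "bob", "carol", "dave")).foreach {
      case (node, keys) => println(s"$node holds ${keys.mkString(", ")}")
    }
}
```

The point of the sketch is that no extra splitting step is needed: the placement of rows on the ring already decides which node works on which rows.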
The Spark stack
Spark workers and executors are deployed directly on the C* nodes. They run in a separate JVM but have
direct access (no network hops) to the C* data resident on the local node.
[Diagram: The Execution Stack. Node 1 runs Cassandra with its table data and, installed on the same node, a Spark Worker hosting two Executors with two Tasks each. The diagram also shows the Driver and the Cluster Manager; Node 2 ... follows the same layout.]
Spark, Scala, C* and SparkC*Driver versions
Apache Spark, Typesafe Scala, Apache Cassandra and the SparkCassandraDriver* are all very fast-moving
projects. You will see major point releases every couple of months on some of them, so
you have to be very aware of the versions in your software stack when producing a solution.
[Diagram: The Solution Stack. Labels: scala code, scala libraries, spark libraries, driver*, scala build tool (sbt), spark.]
Memory allocation and Spark
Memory settings are required at each level:
● Driver application
● Spark master
● Workers
● Executors
The defaults are sufficient for ‘hello world’ type applications and lightweight processing; in the real world you will usually
need more memory allocated to workers and executors down at node level. All processing should be pushed down to the nodes
where possible. If you find your Driver application spiralling upward in RAM requirements or timing out, you are probably
doing something wrong, such as using take() or collect() at driver level.
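The cost of collecting at driver level can be illustrated without Spark. In this sketch plain Scala collections stand in for the rows held on each node, and each function returns its result together with the number of values that had to travel to the driver. The names are hypothetical and nothing here is Spark API.

```scala
object PushDownDemo {
  // Each inner Seq stands in for the rows held on one node.
  type Partition = Seq[Int]

  // Anti-pattern: ship every row to the driver, then aggregate there.
  // Returns (sum, number of values moved to the driver).
  def sumAtDriver(partitions: Seq[Partition]): (Long, Int) = {
    val allRows = partitions.flatten // plays the role of collect()
    (allRows.map(_.toLong).sum, allRows.size)
  }

  // Preferred: aggregate on each node and ship one partial result per partition.
  def sumOnNodes(partitions: Seq[Partition]): (Long, Int) = {
    val partials = partitions.map(_.map(_.toLong).sum)
    (partials.sum, partials.size)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    println(sumAtDriver(partitions))
    println(sumOnNodes(partitions))
  }
}
```

Both approaches give the same sum, but the first moves every row to the driver while the second moves one number per partition, which is why driver memory stays flat when work is pushed down.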
Set up your development environment 1
The following is based on:
spark-cassandra-connector 1.2
spark 1.2
hadoop 1
scala version 2.10
Download the spark-cassandra-connector .zip file from the GitHub project:
https://github.com/datastax/spark-cassandra-connector
Unpack it, place it in the /opt directory and cd into it.
Download sbt-launch.jar from:
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/{version}/sbt-launch.jar
and place it in the spark-cassandra-connector/sbt directory.
Set up your development environment 2
Build the connector from /opt/spark-cassandra-connector-master:
>java -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m -jar sbt/sbt-launch.jar "assembly"
This will build a standard Scala connector with all dependencies in:
/opt/spark-cassandra-connector-master/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-*.jar
and a Java connector with all dependencies in:
/opt/spark-cassandra-connector-master/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.3.0-SNAPSHOT.jar
Then add the Scala connector jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:
spark.executor.extraClassPath spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar
Set up your development environment 3
Install sbt on your OS (here Ubuntu):
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-get update
sudo apt-get install sbt
Create a hello world project:
http://www.scala-sbt.org/0.13/tutorial/Hello.html
and run it:
>cd /opt/spark-cassandra-connector-project
>sbt
>run
Hi!
Create a C* project
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object CassandraTest {
  def main(args: Array[String]): Unit = {
    // point the connector at a C* node
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    // connect to the standalone Spark master
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
    // expose table "users" in keyspace "test" as an RDD
    val rdd = sc.cassandraTable("test", "users")
    println(rdd.count)
    println(rdd.first)
    // release the cluster resources when done
    sc.stop()
  }
}
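For sbt to compile a project like CassandraTest, the Spark and connector libraries must be declared as dependencies. The following build.sbt is a sketch only: the project name is made up, and the version numbers are assumptions chosen to line up with the Spark 1.2 / Scala 2.10 setup described earlier, so check them against the releases you actually use.

```scala
// build.sbt: illustrative sketch, not taken from the slides
name := "spark-cassandra-connector-project"

version := "0.1"

// assumed to match the Scala 2.10 line used in this setup
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark core, assumed to match the spark-1.2.1 installation
  "org.apache.spark" %% "spark-core" % "1.2.1",
  // DataStax connector; a 1.2.x release is assumed to pair with Spark 1.2
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
)
```

The %% operator appends the Scala binary version to the artifact name, which is one reason the Scala, Spark and connector versions have to be kept in step.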
Start your Spark Master and Workers
Start a standalone master server by executing:
>cd /opt/spark-1.2.1-bin-hadoop1
>./sbin/start-master.sh
(Note: Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to
SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.)
Start one or more workers and connect them to the master via:
>./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
For example, with spark://dse-vm:7077 as the address of the master:
>./bin/spark-class org.apache.spark.deploy.worker.Worker spark://dse-vm:7077
(Note: Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the new node listed there, along with its
number of CPUs and memory (minus one gigabyte left for the OS)).
Submit your job:
>cd /opt/spark-cassandra-connector-project
>sbt
At the sbt prompt, type run. sbt will list the main classes (jobs) it has found; choose the one you want and it will be submitted and run.
Joins in SparkSQL:
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.{SparkConf, SparkContext}

// CassandraCapable is not defined in these slides; it is assumed to come from the example project this deck draws on
object SparkSqlQuery extends CassandraCapable {
  def main(args: Array[String]): Unit = {
    // setJars ships the application jar to the executors
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setJars(Array("target/scala-2.10/spark_bulk_ops-assembly.jar"))
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
    val connector = CassandraConnector(conf)
    //inserts...
    val cassandraContext = new CassandraSQLContext(sc)
    // count activities per tag by joining the two tables
    val rdd = cassandraContext.sql(
      "select t.tag, count(*) as cnt from activity_stream_api.activity a " +
      "join activity_stream_api.tag_activity t on t.activity_id = a.activity_id group by t.tag order by cnt")
    // collect() brings the grouped result back to the driver; this assumes the number of tags is small
    rdd.collect().foreach(f => println(f))
  }
}
For joins in SparkSQL, bulk import/export into C*, and everything else you wanted to do with C* and Spark but were
too afraid to try, visit the following GitHub example code site. It also includes a full C* application template
you can use for your own systems:
https://github.com/rssvihla/spark_commons
Latest Versions and compatibility
Connector   Spark      Cassandra           Cassandra Java Driver
1.6         1.6        2.1.5*, 2.2, 3.0    3.0
1.5         1.5, 1.6   2.1.5*, 2.2, 3.0    3.0
1.4         1.4        2.1.5*              2.1
1.3         1.3        2.1.5*              2.1
1.2         1.2        2.1, 2.0            2.1
1.1         1.1, 1.0   2.1, 2.0            2.1
1.0         1.0, 0.9   2.0                 2.0
* Compatible with 2.1.X where X >= 5

Weitere ähnliche Inhalte

Was ist angesagt?

Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Shirshanka Das
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark InternalsKnoldus Inc.
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark StreamingKnoldus Inc.
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internalsSigmoid
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 

Was ist angesagt? (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 

Ähnlich wie Apache Cassandra and Apche Spark

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Clusterphanleson
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsRavindra kumar
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
maXbox Starter87
maXbox Starter87maXbox Starter87
maXbox Starter87Max Kleiner
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 

Ähnlich wie Apache Cassandra and Apche Spark (20)

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Spark core
Spark coreSpark core
Spark core
 
Final Report - Spark
Final Report - SparkFinal Report - Spark
Final Report - Spark
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
Module01
 Module01 Module01
Module01
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Cluster
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Spark 101
Spark 101Spark 101
Spark 101
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Scala+data
Scala+dataScala+data
Scala+data
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
maXbox Starter87
maXbox Starter87maXbox Starter87
maXbox Starter87
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 

Mehr von Alex Thompson

The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystemAlex Thompson
 
Apache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep diveApache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep diveAlex Thompson
 
Apache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoringApache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoringAlex Thompson
 
Deconstructing Apache Cassandra
Deconstructing Apache CassandraDeconstructing Apache Cassandra
Deconstructing Apache CassandraAlex Thompson
 
Apache Cassandra - Data modelling
Apache Cassandra - Data modellingApache Cassandra - Data modelling
Apache Cassandra - Data modellingAlex Thompson
 
Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleAlex Thompson
 

Mehr von Alex Thompson (6)

The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Apache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep diveApache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep dive
 
Apache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoringApache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoring
 
Deconstructing Apache Cassandra
Deconstructing Apache CassandraDeconstructing Apache Cassandra
Deconstructing Apache Cassandra
 
Apache Cassandra - Data modelling
Apache Cassandra - Data modellingApache Cassandra - Data modelling
Apache Cassandra - Data modelling
 
Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
 

Kürzlich hochgeladen

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 

Kürzlich hochgeladen (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...

Apache Cassandra and Apache Spark

  • 1. Cassandra and Spark: powerful big data processing and storage combined
    Alex Thompson @ DataStax Sydney - Spark Meetup May 2016 @ Macquarie Bank
  • 2. Parallel execution*
    The core of any parallel programming framework or parallel programming language is the function, plus two very simple rules:
    1. A function must take at least one argument and return a value.
    2. Variables used within a function or passed to a function can have no scope other than that of the function.
    The passing of arguments to a function is the “message passing” you will often see referred to in parallel programming languages and frameworks.
    * You will see parallel execution referred to as parallel programming, distributed computing or cluster computing. You require a functional programming language to work within this paradigm, or a language that has been retrofitted to work in parallel environments.
  • 3. Functional programming example
        function myFunction(arg1, arg2)
        {
            var1 = 1;
            var2 = 2;
            var3 = var1 * var2;
            return var3;
        }
    Because the above function is self-encapsulated (it receives a message, performs some work and returns a value), it can be run on any core or any compute node, asynchronously and in isolation from other processes.
    A parallel programming framework can throw this function at an empty or underutilised core or node for processing, thus distributing its workload across many servers.
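The self-encapsulation idea on slide 3 can be sketched in Scala with the standard library alone. This is only an illustration under assumptions: `PureFunctionDemo`, `multiply` and `runAll` are hypothetical names, and a local thread pool stands in for the compute nodes a real framework would target.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PureFunctionDemo {
  // A self-encapsulated function: the result depends only on its arguments,
  // so it can run on any thread (or, in a cluster framework, any node).
  def multiply(a: Int, b: Int): Int = a * b

  // Hypothetical helper: hand each call to the global thread pool,
  // then gather the results in the original input order.
  def runAll(inputs: Seq[(Int, Int)]): Seq[Int] = {
    val futures = inputs.map { case (a, b) => Future(multiply(a, b)) }
    Await.result(Future.sequence(futures), 10.seconds)
  }

  def main(args: Array[String]): Unit =
    println(runAll(Seq((1, 2), (3, 4), (5, 6))))
}
```

Because `multiply` touches no shared state, the order in which the pool schedules the calls does not affect the result.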
  • 4. Functional programming execution
    [Diagram: the parallel framework pushes myFunction(arg1, arg2), function_1(), function_2(), function_3(), ... to compute nodes 1, 2 and 3.]
    The parallel framework distributes functions to compute nodes for execution.
    Examples of parallel-capable frameworks: Scala Akka, Erlang VM, C-MPI, C-Linda etc.
    Functions can be pushed to nodes in a variety of ways: based on load, availability, simple round-robin etc.
    Many patterns exist for the design of parallel systems, e.g. MPI and the Actor pattern, but all involve some form of message passing.
  • 5. Hadoop split/job architecture
    Hadoop was originally a file-based, general-purpose distributed file system and parallel execution environment. It was developed to pull apart large web search logs by splitting up files, distributing them to compute nodes and passing a Java application to those nodes to work over the split file.
    [Diagram: a master (M) splits the data and sends one data split plus the .jar to each of compute nodes 1 to 4.]
  • 6. C* and Spark split/job architecture
    You can do the same thing with Spark, but when you introduce C* into the equation your splits are already complete: they are the partitions of data spread across the nodes in a C* ring.
    [Diagram: the master (M) sends only the .jar to nodes 1 to 4; each node already holds its own table data.]
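As a rough illustration of why the splits on slide 6 already exist, the sketch below assigns partition keys to nodes. It is a deliberate simplification: `RingSketch`, `nodeFor` and `splits` are hypothetical names, and a plain hash modulo the node count stands in for Cassandra's real partitioner and token ranges.

```scala
object RingSketch {
  // Simplified stand-in for a partitioner: map a partition key to a node index.
  // Real Cassandra hashes keys to tokens and assigns token ranges to nodes.
  def nodeFor(partitionKey: String, nodeCount: Int): Int =
    java.lang.Math.floorMod(partitionKey.hashCode, nodeCount)

  // Group keys by owning node: these groups play the role of the "splits"
  // that a job shipped to each node would work over locally.
  def splits(keys: Seq[String], nodeCount: Int): Map[Int, Seq[String]] =
    keys.groupBy(key => nodeFor(key, nodeCount))

  def main(args: Array[String]): Unit =
    splits(Seq("alice", "bob", "carol", "dave"), 4).foreach(println)
}
```

The point of the sketch is that placement is decided when the data is written, so no separate split step is needed before a job runs.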
  • 8. The Execution Stack
    Spark workers and executors are deployed directly on C* nodes. They run in a separate JVM but have direct access (no network hops) to the C*-resident data on the local node.
    [Diagram: the Driver and Cluster Manager connect to Node 1, Node 2, ...; on each node Spark is installed alongside Cassandra, with a Spark Worker running Executors, each running Tasks over the local table data.]
  • 9. Spark, Scala, C* and SparkC*Driver versions
    Apache Spark, Typesafe Scala, Apache Cassandra and the SparkCassandraDriver* are all very fast-moving projects. You will see major point releases every couple of months on some of them, so you have to be very aware of the versions in your software stack when producing a solution.
    [Diagram: The Solution Stack: scala code, scala libraries, spark libraries, driver*, scala build tool (sbt), spark.]
  • 10. Memory allocation and Spark
    Memory settings are required at each level:
    ● Driver application
    ● Spark master
    ● Workers
    ● Executors
    The defaults are sufficient for ‘hello world’ type applications and lightweight processing. In the real world you will usually need to allocate more memory to workers and executors down at node level.
    All processing should be pushed down to the nodes where possible. If you find your Driver application is spiralling upward in RAM requirements or timing out, you are probably doing something wrong, like using take() or collect() at driver level.
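A minimal sketch of the two points on slide 10: raising executor memory and keeping work on the nodes. The 2g value, the master URL, the application name and the test.users table are assumptions for illustration, and the job needs Spark and the connector on the classpath plus a running cluster.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object MemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("spark://127.0.0.1:7077")                 // assumed master URL
      .setAppName("memory-sketch")                         // hypothetical app name
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed C* host
      .set("spark.executor.memory", "2g")                  // assumed value; size to your nodes
    val sc = new SparkContext(conf)

    val users = sc.cassandraTable("test", "users")         // assumed keyspace and table

    // count() is computed on the executors; only one number returns to the driver.
    println(users.count())

    // By contrast, users.collect() would pull every row into driver memory,
    // which is the pattern the slide warns against for large tables.

    sc.stop()
  }
}
```

Worker memory is not set here. In standalone mode it is usually configured on each node, for example through SPARK_WORKER_MEMORY in spark-env.sh.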
  • 11. Set up your development environment 1
    The following is based on:
    spark-cassandra-connector 1.2
    spark 1.2
    hadoop 1
    scala version 2.10
    Download the spark-cassandra-connector .zip file from the GitHub project: https://github.com/datastax/spark-cassandra-connector
    Unpack it, place it in the /opt directory and cd into it.
    Download sbt-launch.jar from: http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/{version}/sbt-launch.jar
    and place it in the spark-cassandra-connector/sbt directory.
  • 12. Set up your development environment 2
    Build the connector by running:
        /opt/spark-cassandra-connector-master> java -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m -jar sbt/sbt-launch.jar "assembly"
    This will build a standard Scala connector with all dependencies in:
        /opt/spark-cassandra-connector-master/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-*.jar
    and a Java connector with all dependencies in:
        /opt/spark-cassandra-connector-master/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-java-assembly-1.3.0-SNAPSHOT.jar
    Then add this jar to your Spark executor classpath by adding the following line to your spark-defaults.conf:
        spark.executor.extraClassPath spark-cassandra-connector/spark-cassandra-connector/target/scala-{binary.version}/spark-cassandra-connector-assembly-$CurrentVersion-SNAPSHOT.jar
  • 13. Set up your development environment 3
    Install sbt on your OS (here Ubuntu):
        echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
        sudo apt-get update
        sudo apt-get install sbt
    Create a hello world project by following: http://www.scala-sbt.org/0.13/tutorial/Hello.html
    and run it:
        >cd /opt/spark-cassandra-connector-project
        >sbt
        >run
        Hi!
  • 14. Create a C* project
        import com.datastax.spark.connector._
        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._

        object CassandraTest {
          def main(args: Array[String]): Unit = {
            // Point the connector at a Cassandra node
            val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
            // Connect to the standalone Spark master with application name "test"
            val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
            // Expose the table "users" in keyspace "test" as an RDD
            val rdd = sc.cassandraTable("test", "users")
            println(rdd.count)
            println(rdd.first)
          }
        }
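To compile the project on slide 14, sbt also needs Spark and the connector as dependencies. The build.sbt below is a sketch of one way to declare them from published artifacts, as an alternative to the locally built assembly jar. The project name and the exact version numbers are assumptions chosen to match the 1.2 / Scala 2.10 stack listed earlier in the deck, so adjust them to your own stack.

```scala
// build.sbt (sketch; project name and versions are assumptions)
name := "spark-cassandra-connector-project"

version := "0.1"

// Must match the Scala binary version your Spark build uses
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark core, needed for SparkConf and SparkContext
  "org.apache.spark" %% "spark-core" % "1.2.1",
  // DataStax connector, provides sc.cassandraTable(...)
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"
)
```

Keeping the connector and Spark versions on the same minor release follows the compatibility table at the end of the deck.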
  • 15. Start your Spark Master and Workers
    Start a standalone master server by executing:
        >cd /opt/spark-1.2.1-bin-hadoop1
        >./sbin/start-master.sh
    (Note: once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default.)
    Start one or more workers and connect them to the master via:
        >./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
    For example, where spark://dse-vm:7077 is the address of the master:
        >./bin/spark-class org.apache.spark.deploy.worker.Worker spark://dse-vm:7077
    (Note: once you have started a worker, look at the master’s web UI, http://localhost:8080 by default. You should see the new node listed there, along with its number of CPUs and memory, minus one gigabyte left for the OS.)
    Submit your job:
        >cd /opt/spark-cassandra-connector-project
        >sbt
    You will now be given a list of jobs that sbt has found. Choose the one you want and submit it, and it will run.
  • 16. Joins in SparkSQL:
        import com.datastax.spark.connector.cql.CassandraConnector
        import org.apache.spark.sql.cassandra.CassandraSQLContext
        import org.apache.spark.{SparkConf, SparkContext}

        object SparkSqlQuery extends CassandraCapable {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf(true)
              .set("spark.cassandra.connection.host", "127.0.0.1")
              .setJars(Array("target/scala-2.10/spark_bulk_ops-assembly.jar"))
            val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
            val connector = CassandraConnector(conf)
            //inserts...
            val cassandraContext = new CassandraSQLContext(sc)
            // Join two C* tables and count activities per tag
            val rdd = cassandraContext.sql(
              "select t.tag, count(*) as cnt from activity_stream_api.activity a " +
              "join activity_stream_api.tag_activity t on t.activity_id = a.activity_id " +
              "group by t.tag order by cnt")
            rdd.collect().foreach(f => println(f))
          }
        }
  • 17. Joins in SparkSQL, bulk import / export into C*: everything you wanted to do with C* and Spark but were too afraid to try. Visit the following GitHub example code site, which also includes a full C* application template you can use for your own systems: https://github.com/rssvihla/spark_commons
  • 18. Latest Versions and compatibility
        Connector   Spark       Cassandra           Cassandra Java Driver
        1.6         1.6         2.1.5*, 2.2, 3.0    3.0
        1.5         1.5, 1.6    2.1.5*, 2.2, 3.0    3.0
        1.4         1.4         2.1.5*              2.1
        1.3         1.3         2.1.5*              2.1
        1.2         1.2         2.1, 2.0            2.1
        1.1         1.1, 1.0    2.1, 2.0            2.1
        1.0         1.0, 0.9    2.0                 2.0
    * Compatible with 2.1.X where X >= 5