Apache Cassandra and Spark, when combined, provide powerful OLTP and OLAP functionality for your data. We'll walk through the basics of both platforms before diving into applications that combine the two. Joins, changing a partition key, and importing data are usually difficult in Cassandra, but we'll see how to do these and other operations with a set of simple Spark shell one-liners!
1. Escape From Hadoop:
Spark One Liners for C* Ops
Kurt Russell Spitzer
DataStax
2. Who am I?
• Bioinformatics Ph.D. from UCSF
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!!
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Developing new ways to make sure that C* scales
3. Why escape from Hadoop?
HADOOP
Many Moving Pieces
Map Reduce
Single Points of Failure
Lots of Overhead
And there is a way out!
4. Spark Provides a Simple and Efficient Framework for Distributed Computations
Node Roles: 2
In-Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction for Datasets? RDD!
(Diagram: a Spark Master coordinating Spark Workers; each worker runs a Spark Executor holding a Resilient Distributed Dataset.)
6. Spark is Compatible with HDFS, Parquet, CSVs, …
AND APACHE CASSANDRA
7. Apache Cassandra is a Linearly Scaling and Fault Tolerant NoSQL Database
Linearly Scaling: the power of the database increases linearly with the number of machines. 2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down
8. Apache Cassandra Architecture is Very Simple
Node Roles: 1
Replication: Tunable
Consistency: Tunable
(Diagram: a client talking to a ring of C* nodes.)
9. DataStax OSS Connector: Spark to Cassandra
https://github.com/datastax/spark-cassandra-connector
A Cassandra keyspace/table maps to a Spark RDD[CassandraRow] or RDD[Tuples].
Bundled and supported with DSE 4.5!
10. Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*
Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver.
RDDs are read into different splits based on sets of tokens from the full token range (Tokens 1-1000, Tokens 1001-2000, …).
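The split-by-token-range idea can be pictured with a small plain-Scala sketch. The numbers and split size below are hypothetical, purely for illustration, and are not the connector's actual defaults:

```scala
// Hypothetical sketch: carving a token range into fixed-size splits,
// mirroring how the connector groups sets of tokens into RDD partitions.
val start = 1L
val end = 3000L
val splitSize = 1000L

// Each split covers a contiguous slice of the token range.
val splits = (start to end by splitSize)
  .map(s => (s, math.min(s + splitSize - 1, end)))

println(splits.mkString(", "))
```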
11. Co-locate Spark and C* for Best Performance
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.
12. Setting up C* and Spark
DSE > 4.5.0: just start your nodes with dse cassandra -k
Apache Cassandra: follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
13. We need a Distributed System for Analytics and Batch Jobs
But it doesn't have to be complicated!
14. Even count needs to be distributed
Ask me to write a Map Reduce for word count, I dare you.
You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or we could just do one-liners in the Spark shell.
15. Basics: Getting a Table and Counting
CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
use newyork;
CREATE TABLE presidentlocations (
  time int,
  location text,
  PRIMARY KEY (time)
);
INSERT INTO presidentlocations (time, location) VALUES (1, 'White House');
INSERT INTO presidentlocations (time, location) VALUES (2, 'White House');
INSERT INTO presidentlocations (time, location) VALUES (3, 'White House');
INSERT INTO presidentlocations (time, location) VALUES (4, 'White House');
INSERT INTO presidentlocations (time, location) VALUES (5, 'Air Force 1');
INSERT INTO presidentlocations (time, location) VALUES (6, 'Air Force 1');
INSERT INTO presidentlocations (time, location) VALUES (7, 'Air Force 1');
INSERT INTO presidentlocations (time, location) VALUES (8, 'NYC');
INSERT INTO presidentlocations (time, location) VALUES (9, 'NYC');
INSERT INTO presidentlocations (time, location) VALUES (10, 'NYC');
17. Basics: Getting a Table and Counting
scala> sc.cassandraTable("newyork","presidentlocations").count
res3: Long = 10
21. Basics: take() and toArray
scala> sc.cassandraTable("newyork","presidentlocations").take(1)

res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC})

scala> sc.cassandraTable("newyork","presidentlocations").toArray

res3: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 3, location: White House},
  …,
  CassandraRow{time: 6, location: Air Force 1})
22. Basics: Getting Row Values out of a CassandraRow
scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")

res5: Int = 9

Typed getters: get[Int], get[String], …, get[Any]
Got null? Use get[Option[Int]]
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
25. Copy A Table
Say we want to restructure our table or add a new column?
CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time,character)
);
sc.cassandraTable("newyork","presidentlocations")
  .map( row => (
    row.get[Int]("time"),
    "president",
    row.get[String]("location")
  )).saveToCassandra("newyork","characterlocations")
cqlsh:newyork> SELECT * FROM characterlocations;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president |         NYC
  …
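The per-row transform in the copy can be tried without a cluster; in this sketch a plain Map is a hypothetical stand-in for a CassandraRow, not the connector's actual type:

```scala
// A Map standing in for one CassandraRow from presidentlocations.
val row = Map("time" -> "1", "location" -> "White House")

// Same shape as the map() in the one-liner: build the
// (time, character, location) tuple saved into characterlocations.
val out = (row("time").toInt, "president", row("location"))

println(out)
```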
31. Filter a Table
What if we want to filter based on a non-clustering key column?
scala> sc.cassandraTable("newyork","presidentlocations")
  .filter( _.get[Int]("time") > 7 )
  .toArray

res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 10, location: NYC},
  CassandraRow{time: 8, location: NYC}
)
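Note that this filter runs in Spark, not in Cassandra: the rows are read first and the predicate is applied on the executors. The predicate itself is ordinary Scala; here is a sketch on plain (time, location) tuples standing in for CassandraRows:

```scala
// Hypothetical (time, location) tuples standing in for CassandraRows.
val rows = Seq((1, "White House"), (5, "Air Force 1"), (8, "NYC"), (9, "NYC"), (10, "NYC"))

// The same predicate as the one-liner: keep rows with time > 7.
val late = rows.filter { case (time, _) => time > 7 }

println(late)
```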
37. Backfill a Table with a Different Key!
If we actually want to have quick access to timelines we need a C* table with a different structure.
CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
)
sc.cassandraTable("newyork","characterlocations")
  .saveToCassandra("newyork","timelines")
cqlsh:newyork> select * from timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 |         NYC
 president |    9 |         NYC
 president |   10 |         NYC
41. Import a CSV
I have some data in another source which I could really use in my Cassandra table.
sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line => (line(0), line(1), line(2)) )
  .saveToCassandra("newyork","timelines")
cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 |           Court
  plissken |    5 |           Court
  plissken |    6 |           Court
  plissken |    7 |           Court
  plissken |    8 |  Stealth Glider
  plissken |    9 |             NYC
  plissken |   10 |             NYC
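The split-then-tuple steps are plain Scala and can be tried on a single line of input; when saving, the connector converts the string fields to the column types (as the cqlsh output above shows). The sample line here is hypothetical, in the same shape as PlisskenLocations.csv:

```scala
// One hypothetical line in the same shape as PlisskenLocations.csv.
val line = "plissken,1,Federal Reserve"

// Same steps as the one-liner: split on commas, then pick the
// (character, time, location) fields into a tuple.
val fields = line.split(",")
val row = (fields(0), fields(1), fields(2))

println(row)
```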
47. Perform a Join with MySQL
Maybe a little more than one line …
MySQL table "quotes" in "escape_from_ny":
import java.sql._
import org.apache.spark.rdd.JdbcRDD
Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark Shell classpath
val quotes = new JdbcRDD(
  sc,
  () => { DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root") },
  "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  0,
  100,
  5,
  (r: ResultSet) => { (r.getInt(2), r.getString(3)) }
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
48. Perform a Join with MySQL
Maybe a little more than one line …
quotes.join(
  sc.cassandraTable("newyork","timelines")
    .filter( _.get[String]("character") == "plissken" )
    .map( row => (row.get[Int]("time"), row.get[String]("location")) ))
  .take(1)
  .foreach(println)

(5, (Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board. Snake Plissken: The president of what?, Court))

The cassandraTable side needs to be in the form of RDD[K,V] to join against the JdbcRDD: plissken,5,court becomes 5,court, which joins with 5,'Bob Hauk: …' to give 5,('Bob Hauk: …',court).
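join on pair RDDs matches elements by key, which is why both sides are massaged into RDD[(Int, String)] first. The matching itself works like this plain-collection sketch (the quote text is abbreviated here):

```scala
// (id, quote) pairs, in the shape produced by the JdbcRDD.
val quotes = Seq((5, "Bob Hauk: The President was on board."))

// (time, location) pairs, in the shape produced from the cassandraTable.
val timeline = Seq((5, "Court"), (9, "NYC"))

// Inner join by key: every key present on both sides
// yields one (key, (left, right)) pair.
val joined = for {
  (t1, q)   <- quotes
  (t2, loc) <- timeline
  if t1 == t2
} yield (t1, (q, loc))

println(joined)
```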
52. Easy Objects with Case Classes
We have the technology to make this even easier!
case class timelineRow(character: String, time: Int, location: String)
sc.cassandraTable[timelineRow]("newyork","timelines")
  .filter( _.character == "plissken" )
  .filter( _.time == 8 )
  .toArray

res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))

The case class fields (character, time, location) are mapped to the table's columns.
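The typed filters on the case class work the same on a plain collection, so they are easy to try without a cluster. In this sketch the class is capitalized TimelineRow per Scala convention, and the rows are a hypothetical subset of the table:

```scala
case class TimelineRow(character: String, time: Int, location: String)

// A hypothetical subset of the timelines table.
val rows = Seq(
  TimelineRow("plissken", 7, "Court"),
  TimelineRow("plissken", 8, "Stealth Glider"),
  TimelineRow("president", 8, "NYC")
)

// Same two filters as the slide, on typed fields instead of get[T] calls.
val hits = rows.filter(_.character == "plissken").filter(_.time == 8)

println(hits)
```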
59. A Map Reduce for Word Count …
scala> sc.cassandraTable("newyork","presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" ") )
  .map( (_,1) )
  .reduceByKey( _ + _ )
  .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
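The same pipeline runs on plain Scala collections, with groupBy plus a sum standing in for reduceByKey's shuffle-and-combine. This sketch uses a hypothetical four-row subset of the locations:

```scala
// A hypothetical subset of the location column.
val locations = Seq("White House", "White House", "Air Force 1", "NYC")

val counts = locations
  .flatMap(_.split(" "))                               // words
  .map((_, 1))                                         // (word, 1) pairs
  .groupBy(_._1)                                       // gather pairs by word
  .map { case (word, ps) => (word, ps.map(_._2).sum) } // combine the counts

println(counts)
```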
64. Stand Alone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv
Car, Model, Color
Dodge, Caravan, Red
Ford, F150, Black
Toyota, Prius, Green
CSV → Spark SCC → RDD[CassandraRow] → Column Mapping → FavoriteCars Table in Cassandra
66. Getting Started with Cassandra?
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra's free virtual office hours, running weekly!
Email us: Community@DataStax.com
Thanks for coming to the meetup!
In production? Tweet us: @PlanetCassandra
67. Thanks for your Time, and Come to C* Summit!
SEPTEMBER 10-11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL
Cassandra Summit Link