Escape From Hadoop: 
Spark One Liners for C* Ops 
Kurt Russell Spitzer 
DataStax
Who am I? 
• Bioinformatics Ph.D. from UCSF 
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!! 
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, … 
  http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time 
• Developing new ways to make sure that C* scales
Why escape from Hadoop? 
HADOOP 
• Many Moving Pieces 
• Map Reduce 
• Single Points of Failure 
• Lots of Overhead 
And there is a way out!
Spark Provides a Simple and Efficient Framework for Distributed Computations 
• Node Roles: 2 
• In-Memory Caching: Yes! 
• Generic DAG Execution: Yes! 
• Great Abstraction For Datasets? RDD! 
[Diagram: a Spark Master coordinating Spark Workers; each worker runs a Spark Executor over a Resilient Distributed Dataset]
Spark is Compatible with HDFS, Parquet, CSVs, … 
AND APACHE CASSANDRA
Apache Cassandra is a Linearly Scaling and Fault Tolerant NoSQL Database 

Linearly Scaling: the power of the database increases linearly with the number of machines. 
2x machines = 2x throughput 
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html 

Fault Tolerant: 
Nodes down != Database down 
Datacenter down != Database down
Apache Cassandra Architecture is Very Simple 
• Node Roles: 1 
• Replication: Tunable 
• Consistency: Tunable 
[Diagram: a client talking to a ring of C* nodes]
DataStax OSS Connector: Spark to Cassandra 
https://github.com/datastax/spark-cassandra-connector 

Cassandra → Spark 
• Keyspace/Table → RDD[CassandraRow] 
• Keyspace/Table → RDD[Tuples] 

Bundled and supported with DSE 4.5!
The Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C* 
• Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver. 
• The full token range is divided up, and RDDs are read into different splits based on sets of tokens (e.g. tokens 1-1000, tokens 1001-2000, …).
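The split idea above can be sketched in plain Scala without a cluster. Note this is only an illustration: `splitRange` is a hypothetical helper, not part of the connector's API — the real connector derives its splits from the cluster's ring metadata.

```scala
// Sketch: carve a token range into contiguous splits.
// Hypothetical helper for illustration only — not the connector's real logic.
def splitRange(start: Long, end: Long, splitCount: Int): Seq[(Long, Long)] = {
  val size = (end - start) / splitCount
  (0 until splitCount).map { i =>
    val lo = start + i * size
    // The last split absorbs any remainder so the whole range is covered.
    val hi = if (i == splitCount - 1) end else lo + size
    (lo, hi)
  }
}

val splits = splitRange(1, 3001, 3)
splits.foreach(println) // prints (1,1001), (1001,2001), (2001,3001), one per line
```

Each such split becomes one Spark partition, which is what lets the executors read disjoint slices of the table in parallel.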
Co-locate Spark and C* for Best Performance 
Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing. 
[Diagram: Spark Workers co-located with C* nodes, plus a separate Spark Master]
Setting up C* and Spark 

DSE > 4.5.0: just start your nodes with 
    dse cassandra -k 

Apache Cassandra: follow the excellent guide by Al Tobey 
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
We need a Distributed System 
For Analytics and Batch Jobs 
But it doesn’t have to be complicated!
Even count needs to be distributed 
Ask me to write a Map Reduce for word count, I dare you. 
You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or we could just do one-liners on the Spark shell.
Basics: Getting a Table and Counting 

CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; 
use newyork; 
CREATE TABLE presidentlocations ( 
    time int, 
    location text, 
    PRIMARY KEY (time) 
); 
INSERT INTO presidentlocations (time, location) VALUES (1, 'White House'); 
INSERT INTO presidentlocations (time, location) VALUES (2, 'White House'); 
INSERT INTO presidentlocations (time, location) VALUES (3, 'White House'); 
INSERT INTO presidentlocations (time, location) VALUES (4, 'White House'); 
INSERT INTO presidentlocations (time, location) VALUES (5, 'Air Force 1'); 
INSERT INTO presidentlocations (time, location) VALUES (6, 'Air Force 1'); 
INSERT INTO presidentlocations (time, location) VALUES (7, 'Air Force 1'); 
INSERT INTO presidentlocations (time, location) VALUES (8, 'NYC'); 
INSERT INTO presidentlocations (time, location) VALUES (9, 'NYC'); 
INSERT INTO presidentlocations (time, location) VALUES (10, 'NYC');
scala> sc.cassandraTable("newyork","presidentlocations").count 
res3: Long = 10
Basics: take() and toArray 

scala> sc.cassandraTable("newyork","presidentlocations").take(1) 
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) 

scala> sc.cassandraTable("newyork","presidentlocations").toArray 
res3: Array[com.datastax.spark.connector.CassandraRow] = Array( 
    CassandraRow{time: 9, location: NYC}, 
    CassandraRow{time: 3, location: White House}, 
    …, 
    CassandraRow{time: 6, location: Air Force 1})
Basics: Getting Row Values out of a CassandraRow 

scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time") 
res5: Int = 9 

get[Int], get[String], … get[Any] — and if you might get null, use get[Option[Int]]. 
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
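To see why get[Option[Int]] matters for nullable columns, here is a plain-Scala analog. The Map-based row and the getOpt helper are hypothetical stand-ins for the connector's CassandraRow API, just to show the Option pattern:

```scala
// A stand-in for a row whose "location" column happens to be null.
val row: Map[String, Any] = Map("time" -> 9, "location" -> null)

// Option(...) turns null into None instead of blowing up on a cast.
// Hypothetical helper — CassandraRow.get[Option[T]] does this for you.
def getOpt[T](row: Map[String, Any], col: String): Option[T] =
  Option(row(col)).map(_.asInstanceOf[T])

val time     = getOpt[Int](row, "time")        // Some(9)
val location = getOpt[String](row, "location") // None

println(time)     // Some(9)
println(location) // None
```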
Copy A Table 
Say we want to restructure our table or add a new column? 

CREATE TABLE characterlocations ( 
    time int, 
    character text, 
    location text, 
    PRIMARY KEY (time, character) 
); 

sc.cassandraTable("newyork","presidentlocations") 
  .map( row => ( 
      row.get[Int]("time"), 
      "president", 
      row.get[String]("location") 
  )).saveToCassandra("newyork","characterlocations") 

cqlsh:newyork> SELECT * FROM characterlocations; 

 time | character | location 
------+-----------+------------- 
    5 | president | Air Force 1 
   10 | president | NYC 
  …
Filter a Table 
What if we want to filter based on a non-clustering key column? 

scala> sc.cassandraTable("newyork","presidentlocations") 
         .filter( _.get[Int]("time") > 7 ) 
         .toArray 

res9: Array[com.datastax.spark.connector.CassandraRow] = Array( 
    CassandraRow{time: 9, location: NYC}, 
    CassandraRow{time: 10, location: NYC}, 
    CassandraRow{time: 8, location: NYC} 
) 

The _ (anonymous parameter) stands for each row; get[Int]("time") pulls out the time, and only rows with time > 7 pass the filter.
Backfill a Table with a Different Key! 
If we actually want quick access to timelines we need a C* table with a different structure. 

CREATE TABLE timelines ( 
    time int, 
    character text, 
    location text, 
    PRIMARY KEY ((character), time) 
); 

sc.cassandraTable("newyork","characterlocations") 
  .saveToCassandra("newyork","timelines") 

cqlsh:newyork> select * from timelines; 

 character | time | location 
-----------+------+------------- 
 president |    1 | White House 
 president |    2 | White House 
 president |    3 | White House 
 president |    4 | White House 
 president |    5 | Air Force 1 
 president |    6 | Air Force 1 
 president |    7 | Air Force 1 
 president |    8 | NYC 
 president |    9 | NYC 
 president |   10 | NYC
Import a CSV 
I have some data in another source which I could really use in my Cassandra table. 

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv") 
  .map(_.split(",")) 
  .map( line => (line(0), line(1), line(2))) 
  .saveToCassandra("newyork","timelines") 

Each line (e.g. plissken,1,Federal Reserve) is split into fields and saved straight back to C*. 

cqlsh:newyork> select * from timelines where character = 'plissken'; 

 character | time | location 
-----------+------+----------------- 
  plissken |    1 | Federal Reserve 
  plissken |    2 | Federal Reserve 
  plissken |    3 | Federal Reserve 
  plissken |    4 | Court 
  plissken |    5 | Court 
  plissken |    6 | Court 
  plissken |    7 | Court 
  plissken |    8 | Stealth Glider 
  plissken |    9 | NYC 
  plissken |   10 | NYC
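The per-line transform above can be tried without Spark on a plain List. The sample lines here are illustrative, and time is converted explicitly with .toInt so the tuple matches the table's int column:

```scala
// The same per-line transform the Spark job applies, on a plain List.
// saveToCassandra would then write each tuple as (character, time, location).
val lines = List(
  "plissken,1,Federal Reserve",
  "plissken,8,Stealth Glider"
)

val rows = lines
  .map(_.split(","))                              // "a,b,c" -> Array(a, b, c)
  .map(cols => (cols(0), cols(1).toInt, cols(2))) // fields -> typed tuple

rows.foreach(println)
// (plissken,1,Federal Reserve)
// (plissken,8,Stealth Glider)
```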
Perform a Join with MySQL 
Maybe a little more than one line … 
MySQL table "quotes" in "escape_from_ny" 

import java.sql._ 
import org.apache.spark.rdd.JdbcRDD 

Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark Shell classpath 
val quotes = new JdbcRDD( 
    sc, 
    () => { DriverManager.getConnection("jdbc:mysql://Localhost/escape_from_ny?user=root") }, 
    "SELECT * FROM quotes WHERE ? <= ID and ID <= ?", 
    0, 100, 5, 
    (r: ResultSet) => { (r.getInt(2), r.getString(3)) } 
) 

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
quotes.join( 
    sc.cassandraTable("newyork","timelines") 
      .filter( _.get[String]("character") == "plissken") 
      .map( row => (row.get[Int]("time"), row.get[String]("location")))) 
  .take(1) 
  .foreach(println) 

(5,(Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board. Snake Plissken: The president of what?,Court)) 

The cassandraTable side needs to be in the form of RDD[K,V] before the join — here (time, location) pairs keyed by time.
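What join does to the two keyed datasets can be sketched with local collections (the quote text is abbreviated here): every pair of elements sharing a key is combined into (key, (left, right)).

```scala
// Local sketch of RDD.join semantics on two keyed sequences.
val quotes    = Seq(5 -> "Bob Hauk: There was an accident...")
val locations = Seq(5 -> "Court", 9 -> "NYC")

// Pair up values that share a key (time, here) — what join does under the hood.
val joined = for {
  (qk, quote) <- quotes
  (lk, loc)   <- locations
  if qk == lk
} yield (qk, (quote, loc))

println(joined.head) // (5,(Bob Hauk: There was an accident...,Court))
```

The unmatched key 9 simply drops out, just as with an inner join.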
Easy Objects with Case Classes 
We have the technology to make this even easier! 

case class timelineRow(character: String, time: Int, location: String) 

sc.cassandraTable[timelineRow]("newyork","timelines") 
  .filter( _.character == "plissken") 
  .filter( _.time == 8) 
  .toArray 

res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider)) 

The case class fields (character, time, location) are mapped onto the table's columns by name.
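The by-name column mapping that cassandraTable[timelineRow] performs can be mimicked locally. The Map-based rows and the manual construction below are stand-ins for the connector's own mapping, shown only to make the idea concrete:

```scala
// Stand-in for rows read from C*: column name -> raw value.
case class timelineRow(character: String, time: Int, location: String)

val rawRows = Seq(
  Map("character" -> "plissken",  "time" -> "8", "location" -> "Stealth Glider"),
  Map("character" -> "president", "time" -> "1", "location" -> "White House")
)

// Build the case class from each row by column name (what the connector
// does for you via reflection).
val typed = rawRows.map(r =>
  timelineRow(r("character"), r("time").toInt, r("location")))

val hit = typed.filter(_.character == "plissken").filter(_.time == 8)
println(hit) // List(timelineRow(plissken,8,Stealth Glider))
```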
A Map Reduce for Word Count … 

scala> sc.cassandraTable("newyork","presidentlocations") 
         .map( _.get[String]("location") ) 
         .flatMap( _.split(" ")) 
         .map( (_,1)) 
         .reduceByKey( _ + _ ) 
         .toArray 

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3)) 

Step by step: get[String] pulls out each location, split breaks it into words, (_,1) pairs each word with a count of one, and reduceByKey(_ + _) sums the counts per word.
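The same word count runs on a local List if you swap reduceByKey for groupBy plus a sum — a handy way to sanity-check the logic without a cluster (the sample locations here are a small illustrative subset):

```scala
// Local equivalent of the Spark word count above.
val locations = List("White House", "White House", "Air Force 1", "NYC")

val counts = locations
  .flatMap(_.split(" "))                                    // sentences -> words
  .map((_, 1))                                              // word -> (word, 1)
  .groupBy(_._1)                                            // stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts per word

println(counts("White")) // 2
println(counts("1"))     // 1
```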
Stand Alone App Example 
https://github.com/RussellSpitzer/spark-cassandra-csv 

Car, Model, Color 
Dodge, Caravan, Red 
Ford, F150, Black 
Toyota, Prius, Green 

[Diagram: CSV → Spark + SCC → RDD[CassandraRow] → column mapping → FavoriteCars table in Cassandra]
Thanks for listening! 
There is plenty more we can do with Spark but … 
Questions?
Getting started with Cassandra?! 
DataStax Academy offers free online Cassandra training! 
Planet Cassandra has resources for learning the basics from ‘Try Cassandra’ tutorials to in depth 
language and migration pages! 
Find a way to contribute back to the community: talk at a meetup, or share your story on 
PlanetCassandra.org! 
Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly! 
Email us: Community@DataStax.com! 
Thanks for coming to the meetup!! 
In production?! 
Tweet us: @PlanetCassandra!
Thanks for your Time and Come to C* Summit! 
SEPTEMBER 10 - 11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL 
Cassandra Summit Link

Tugas power point mengenai cara mengusir nyamuk dengan metode ilmiah
 
Chapter 01 powerpoint
Chapter 01 powerpointChapter 01 powerpoint
Chapter 01 powerpoint
 
รถยนต์ลอยฟ้า พลังงานแม่เหล็ก
รถยนต์ลอยฟ้า พลังงานแม่เหล็กรถยนต์ลอยฟ้า พลังงานแม่เหล็ก
รถยนต์ลอยฟ้า พลังงานแม่เหล็ก
 
Vc2014 final report_g
Vc2014 final report_gVc2014 final report_g
Vc2014 final report_g
 
CEITON Workflow + Scheduling @ IBC 2014
CEITON Workflow + Scheduling @ IBC 2014CEITON Workflow + Scheduling @ IBC 2014
CEITON Workflow + Scheduling @ IBC 2014
 
VicHealth Physical Activity Futures Jam Presentation: Homaxi Irani, HeathWall...
VicHealth Physical Activity Futures Jam Presentation: Homaxi Irani, HeathWall...VicHealth Physical Activity Futures Jam Presentation: Homaxi Irani, HeathWall...
VicHealth Physical Activity Futures Jam Presentation: Homaxi Irani, HeathWall...
 
Experimental investigation on circular hollow steel
Experimental investigation on circular hollow steelExperimental investigation on circular hollow steel
Experimental investigation on circular hollow steel
 
Through managing municipal waste public –private partnership in gwalior, m.p....
Through managing municipal waste public –private partnership in gwalior, m.p....Through managing municipal waste public –private partnership in gwalior, m.p....
Through managing municipal waste public –private partnership in gwalior, m.p....
 
VicHealth Physical Activity Futures Jam Presentation: Jeremy Kann, Tough Mudder
VicHealth Physical Activity Futures Jam Presentation: Jeremy Kann, Tough MudderVicHealth Physical Activity Futures Jam Presentation: Jeremy Kann, Tough Mudder
VicHealth Physical Activity Futures Jam Presentation: Jeremy Kann, Tough Mudder
 
20140908 bpv nieuwe_media
20140908 bpv nieuwe_media20140908 bpv nieuwe_media
20140908 bpv nieuwe_media
 
Effectual citizen relationship management with data mining techniques
Effectual citizen relationship management with data mining techniquesEffectual citizen relationship management with data mining techniques
Effectual citizen relationship management with data mining techniques
 
Indian spices and its antifungal activity
Indian spices and its antifungal activityIndian spices and its antifungal activity
Indian spices and its antifungal activity
 
7.ilma asking questions
7.ilma asking questions7.ilma asking questions
7.ilma asking questions
 
Review VLC Media Player + SDLC
Review VLC Media Player + SDLCReview VLC Media Player + SDLC
Review VLC Media Player + SDLC
 
Driving organizational growth through structured
Driving organizational growth through structuredDriving organizational growth through structured
Driving organizational growth through structured
 
motograde
motogrademotograde
motograde
 
Design and analysis of reduced size conical shape
Design and analysis of reduced size conical shapeDesign and analysis of reduced size conical shape
Design and analysis of reduced size conical shape
 
As reformas pombalinas da instrução publica - Laerte Ramos de Carvalho
As reformas pombalinas da instrução publica - Laerte Ramos de CarvalhoAs reformas pombalinas da instrução publica - Laerte Ramos de Carvalho
As reformas pombalinas da instrução publica - Laerte Ramos de Carvalho
 

Ähnlich wie Escape From Hadoop: Spark One Liners for C* Ops

Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
Holden Karau
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
Patrick McFadin
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Steve Watt
 

Ähnlich wie Escape From Hadoop: Spark One Liners for C* Ops (20)

Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
Ss dotnetcodexmpl
Ss dotnetcodexmplSs dotnetcodexmpl
Ss dotnetcodexmpl
 
Shrug2017 arcpy data_and_you
Shrug2017 arcpy data_and_youShrug2017 arcpy data_and_you
Shrug2017 arcpy data_and_you
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Converting a Rails application to Node.js
Converting a Rails application to Node.jsConverting a Rails application to Node.js
Converting a Rails application to Node.js
 
Toying with spark
Toying with sparkToying with spark
Toying with spark
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache Cassandra and Go
Apache Cassandra and GoApache Cassandra and Go
Apache Cassandra and Go
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Mindmap: Oracle to Couchbase for developers
Mindmap: Oracle to Couchbase for developersMindmap: Oracle to Couchbase for developers
Mindmap: Oracle to Couchbase for developers
 

Kürzlich hochgeladen

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 

Kürzlich hochgeladen (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 

Escape From Hadoop: Spark One Liners for C* Ops

  • 10. Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C* Spark C* Full Token Range Each Executor Maintains a connection to the C* Cluster Spark Executor DataStax Java Driver Tokens 1001-2000 Tokens 1-1000 Tokens … RDDs are read into different splits based on sets of tokens
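The split-by-token-range idea above can be sketched in plain Scala with no cluster. This is a simplified illustration, not the connector's actual algorithm (the real split logic also accounts for vnodes and size estimates):

```scala
// Hedged sketch: carve a token range into equal sub-ranges, roughly how
// the connector turns the full ring into per-executor Spark splits.
object TokenSplits {
  def split(start: Long, end: Long, n: Int): Seq[(Long, Long)] = {
    val step = (end - start + 1) / n
    (0 until n).map(i => (start + i * step, start + (i + 1) * step - 1))
  }
}

val ranges = TokenSplits.split(1L, 2000L, 2)
// two splits, matching the slide: (1,1000) and (1001,2000)
```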
  • 11. Co-locate Spark and C* for Best Performance C* C* C* C* Spark Worker Spark Worker Spark Worker Spark Master Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing
  • 12. Setting up C* and Spark DSE > 4.5.0 Just start your nodes with dse cassandra -k Apache Cassandra Follow the excellent guide by Al Tobey http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
  • 13. We need a Distributed System For Analytics and Batch Jobs But it doesn’t have to be complicated!
  • 14. Even count needs to be distributed Ask me to write a Map Reduce for word count, I dare you. You could make this easier by adding yet another technology to your Hadoop stack (Hive, Pig, Impala), or you could just write one-liners in the Spark shell.
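The word-count one-liner this slide alludes to has the same shape whether the data is an RDD or a local collection. A minimal sketch on a plain Scala Seq (on a real cluster you would start from sc.textFile(...) and use reduceByKey instead of groupBy):

```scala
// Hedged sketch: word count as a chain of one-liner transformations,
// run locally so it needs no Spark cluster.
val lines = Seq("the quick brown fox", "the lazy dog")

val wordCounts = lines
  .flatMap(_.split("\\s+"))                              // lines -> words
  .groupBy(identity)                                      // word -> occurrences
  .map { case (word, occurrences) => (word, occurrences.size) }
// "the" appears in both lines, so wordCounts("the") is 2
```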
  • 15. Basics: Getting a Table and Counting CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; use newyork; CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) ); INSERT INTO presidentlocations (time, location ) VALUES ( 1 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 2 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 3 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 4 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 5 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 6 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 7 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 8 , 'NYC' ); INSERT INTO presidentlocations (time, location ) VALUES ( 9 , 'NYC' ); INSERT INTO presidentlocations (time, location ) VALUES ( 10 , 'NYC' );
  • 16. Basics: Getting a Table and Counting CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; use newyork; CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) ); INSERT INTO presidentlocations (time, location ) VALUES ( 1 , 'White House' ); … INSERT INTO presidentlocations (time, location ) VALUES ( 10 , 'NYC' ); scala> sc.cassandraTable("newyork","presidentlocations") cassandraTable
  • 17. Basics: Getting a Table and Counting CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; use newyork; CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) ); INSERT INTO presidentlocations (time, location ) VALUES ( 1 , 'White House' ); … INSERT INTO presidentlocations (time, location ) VALUES ( 10 , 'NYC' ); scala> sc.cassandraTable("newyork","presidentlocations") .count res3: Long = 10 cassandraTable count 10
  • 18. Basics: take() and toArray scala> sc.cassandraTable("newyork","presidentlocations") cassandraTable
  • 19. Basics: take() and toArray scala> sc.cassandraTable("newyork","presidentlocations").take(1) ! res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) cassandraTable take(1) Array of CassandraRows 9 NYC
  • 20. Basics: take() and toArray scala> sc.cassandraTable("newyork","presidentlocations").take(1) ! res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) cassandraTable take(1) Array of CassandraRows 9 NYC scala> sc.cassandraTable("newyork","presidentlocations") cassandraTable
  • 21. Basics: take() and toArray scala> sc.cassandraTable("newyork","presidentlocations").take(1) ! res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) cassandraTable take(1) Array of CassandraRows 9 NYC scala> sc.cassandraTable("newyork","presidentlocations").toArray ! res3: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 3, location: White House}, …, CassandraRow{time: 6, location: Air Force 1}) cassandraTable toArray Array of CassandraRows 9 NYC 9 NYC 9 NYC
  • 22. Basics: Getting Row Values out of a CassandraRow scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time") ! res5: Int = 9 cassandraTable http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
  • 23. Basics: Getting Row Values out of a CassandraRow scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time") ! res5: Int = 9 cassandraTable take(1) Array of CassandraRows 9 NYC http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
  • 24. Basics: Getting Row Values out of a CassandraRow scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time") ! res5: Int = 9 cassandraTable take(1) Array of CassandraRows 9 NYC 9 get[Int] get[Int] get[String] … get[Any] Got Null ? get[Option[Int]] http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
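The typed-getter pattern on slides 22-24 can be mimicked on a plain Map to show why get[Option[Int]] matters for nulls. This is a hypothetical stand-in for CassandraRow, not the connector's implementation:

```scala
// Hedged sketch: a row-like Map with typed reads. CassandraRow.get[T]
// is analogous; get[Option[T]] is the safe read for a possibly-null column.
val row: Map[String, Any] = Map("time" -> 9, "location" -> "NYC")

def getAs[T](r: Map[String, Any], col: String): T =
  r(col).asInstanceOf[T]                      // throws if the column is absent

def getOptionAs[T](r: Map[String, Any], col: String): Option[T] =
  r.get(col).map(_.asInstanceOf[T])           // None instead of an exception

val time = getAs[Int](row, "time")            // 9
val missing = getOptionAs[Int](row, "age")    // None: no such column
```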
  • 25. Copy A Table Say we want to restructure our table or add a new column? CREATE TABLE characterlocations ( time int, character text, location text, PRIMARY KEY (time,character) );
  • 26. Copy A Table Say we want to restructure our table or add a new column? CREATE TABLE characterlocations ( time int, character text, location text, PRIMARY KEY (time,character) ); sc.cassandraTable("newyork","presidentlocations") .map( row => ( row.get[Int]("time"), "president", row.get[String]("location") )).saveToCassandra("newyork","characterlocations") cassandraTable 1 white house
  • 27. Copy A Table Say we want to restructure our table or add a new column? CREATE TABLE characterlocations ( time int, character text, location text, PRIMARY KEY (time,character) ); sc.cassandraTable("newyork","presidentlocations") .map( row => ( row.get[Int]("time"), "president", row.get[String]("location") )).saveToCassandra("newyork","characterlocations") cassandraTable 1 white house
  • 28. Copy A Table Say we want to restructure our table or add a new column? CREATE TABLE characterlocations ( time int, character text, location text, PRIMARY KEY (time,character) ); sc.cassandraTable("newyork","presidentlocations") .map( row => ( row.get[Int]("time"), "president", row.get[String]("location") )).saveToCassandra("newyork","characterlocations") cassandraTable get[Int] get[String] 1 white house 1,president,white house
  • 29. get[Int] get[String] C* Copy A Table Say we want to restructure our table or add a new column? CREATE TABLE characterlocations ( time int, character text, location text, PRIMARY KEY (time,character) ); sc.cassandraTable("newyork","presidentlocations") .map( row => ( row.get[Int]("time"), "president", row.get[String]("location") )).saveToCassandra("newyork","characterlocations") cassandraTable 1 white house 1,president,white house saveToCassandra
  • 30. get[Int] get[String] C* Copy A Table Say we want to restructure our table or add a new column? CREATE TABLE characterlocations ( time int, character text, location text, PRIMARY KEY (time,character) ); sc.cassandraTable("newyork","presidentlocations") .map( row => ( row.get[Int]("time"), "president", row.get[String]("location") )).saveToCassandra("newyork","characterlocations") cqlsh:newyork> SELECT * FROM characterlocations ; ! time | character | location ------+-----------+------------- 5 | president | Air Force 1 10 | president | NYC … … cassandraTable 1 white house 1,president,white house saveToCassandra
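The restructuring step in slides 25-30 is just a map from one tuple shape to another; a local sketch of the same transformation (only the final saveToCassandra call actually needs a cluster):

```scala
// Hedged sketch: turn (time, location) rows into (time, character, location)
// triples, the shape the new characterlocations table expects.
val presidentLocations = Seq((1, "White House"), (5, "Air Force 1"), (8, "NYC"))

val characterLocations = presidentLocations.map {
  case (time, location) => (time, "president", location)
}
// In Spark the same map runs on the RDD and is followed by
// .saveToCassandra("newyork", "characterlocations")
```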
  • 31. Filter a Table What if we want to filter based on a non-clustering key column? scala> sc.cassandraTable("newyork","presidentlocations") .filter( _.get[Int]("time") > 7 ) .toArray ! res9: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 10, location: NYC}, CassandraRow{time: 8, location: NYC} ) cassandraTable
  • 32. Filter a Table What if we want to filter based on a non-clustering key column? scala> sc.cassandraTable("newyork","presidentlocations") .filter( _.get[Int]("time") > 7 ) .toArray ! res9: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 10, location: NYC}, CassandraRow{time: 8, location: NYC} ) cassandraTable Filter
  • 33. Filter a Table What if we want to filter based on a non-clustering key column? scala> sc.cassandraTable("newyork","presidentlocations") .filter( _.get[Int]("time") > 7 ) .toArray ! res9: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 10, location: NYC}, CassandraRow{time: 8, location: NYC} ) cassandraTable Filter _ (Anonymous Param) 1 white house
  • 34. Filter a Table What if we want to filter based on a non-clustering key column? scala> sc.cassandraTable("newyork","presidentlocations") .filter( _.get[Int]("time") > 7 ) .toArray ! res9: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 10, location: NYC}, CassandraRow{time: 8, location: NYC} ) cassandraTable Filter 1 white house get[Int] 1 _ (Anonymous Param)
  • 35. Filter a Table What if we want to filter based on a non-clustering key column? scala> sc.cassandraTable("newyork","presidentlocations") .filter( _.get[Int]("time") > 7 ) .toArray ! res9: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 10, location: NYC}, CassandraRow{time: 8, location: NYC} ) cassandraTable _ (Anonymous Param) >7 1 white house get[Int] 1 Filter
  • 36. Filter a Table What if we want to filter based on a non-clustering key column? scala> sc.cassandraTable("newyork","presidentlocations") .filter( _.get[Int]("time") > 7 ) .toArray ! res9: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 10, location: NYC}, CassandraRow{time: 8, location: NYC} ) cassandraTable _ (Anonymous Param) >7 1 white house get[Int] 1 Filter
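The filter in slides 31-36 runs row by row in the executors, which is why it works on a non-clustering-key column with no secondary index. The same predicate, sketched on local rows:

```scala
// Hedged sketch: the filter predicate from the slide applied to plain
// (time, location) tuples instead of CassandraRows.
val rows = Seq((9, "NYC"), (3, "White House"), (8, "NYC"), (6, "Air Force 1"))

val recent = rows.filter { case (time, _) => time > 7 }
// keeps only the rows with time 9 and 8, in their original order
```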
  • 37. Backfill a Table with a Different Key! CREATE TABLE timelines ( time int, character text, location text, PRIMARY KEY ((character), time) ) If we actually want to have quick access to timelines we need a C* table with a different structure.
  • 38. Backfill a Table with a Different Key! CREATE TABLE timelines ( time int, character text, location text, PRIMARY KEY ((character), time) ) If we actually want to have quick access to timelines we need a C* table with a different structure. sc.cassandraTable("newyork","characterlocations") .saveToCassandra("newyork","timelines") 1 white house cassandraTable president
  • 39. Backfill a Table with a Different Key! CREATE TABLE timelines ( time int, character text, location text, PRIMARY KEY ((character), time) ) If we actually want to have quick access to timelines we need a C* table with a different structure. sc.cassandraTable("newyork","characterlocations") .saveToCassandra("newyork","timelines") 1 white house cassandraTable saveToCassandra president C*
  • 40. Backfill a Table with a Different Key! CREATE TABLE timelines ( time int, character text, location text, PRIMARY KEY ((character), time) ) If we actually want to have quick access to timelines we need a C* table with a different structure. sc.cassandraTable("newyork","characterlocations") .saveToCassandra("newyork","timelines") cqlsh:newyork> select * from timelines; ! character | time | location -----------+------+------------- president | 1 | White House president | 2 | White House president | 3 | White House president | 4 | White House president | 5 | Air Force 1 president | 6 | Air Force 1 president | 7 | Air Force 1 president | 8 | NYC president | 9 | NYC president | 10 | NYC 1 white house cassandraTable saveToCassandra president C*
• 41–46. Import a CSV
I have some data in another source which I could really use in my Cassandra table.

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map( _.split(",") )
  .map( line => (line(0), line(1), line(2)) )
  .saveToCassandra("newyork", "timelines")

cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 | Court
  plissken |    5 | Court
  plissken |    6 | Court
  plissken |    7 | Court
  plissken |    8 | Stealth Glider
  plissken |    9 | NYC
  plissken |   10 | NYC

(Slide diagram: textFile reads lines like "plissken,1,Federal Reserve", split breaks them into fields, and the resulting tuples are written out with saveToCassandra.)
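The two map steps are ordinary string handling, so they can be checked on a local collection before pointing the job at a real file. A sketch with hypothetical lines standing in for the CSV contents:

```scala
// Hypothetical lines like those in PlisskenLocations.csv.
val lines = Seq("plissken,1,Federal Reserve", "plissken,8,Stealth Glider")

// Split each line on commas, then build a (character, time, location)
// tuple; saveToCassandra matches these positions to the columns
// of the target table.
val tuples = lines.map(_.split(",")).map(line => (line(0), line(1), line(2)))

println(tuples.head) // (plissken,1,Federal Reserve)
```

Note the fields stay Strings here; the connector handles the conversion of "1" into the int `time` column when writing.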
• 47. Perform a Join with MySQL
Maybe a little more than one line …
MySQL table "quotes" in "escape_from_ny":

import java.sql._
import org.apache.spark.rdd.JdbcRDD

Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark Shell classpath

val quotes = new JdbcRDD(
  sc,
  () => { DriverManager.getConnection("jdbc:mysql://Localhost/escape_from_ny?user=root") },
  "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  0, 100, 5,
  (r: ResultSet) => { (r.getInt(2), r.getString(3)) }
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
• 48–51. Perform a Join with MySQL
Both sides of the join need to be in the form of RDD[K,V]:

quotes.join(
    sc.cassandraTable("newyork", "timelines")
      .filter( _.get[String]("character") == "plissken" )
      .map( row => (row.get[Int]("time"), row.get[String]("location")) ))
  .take(1)
  .foreach(println)

(5, (Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board. Snake Plissken: The president of what?, Court))

(Slide diagram: the JdbcRDD row 5,'Bob Hauk: …' pairs with the Cassandra row 5,court to produce 5,('Bob Hauk: …', court).)
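Spark's join on two RDD[K,V] datasets pairs up the values that share a key. A minimal local sketch of those semantics, using made-up stand-ins for the quotes and timeline data:

```scala
// Hypothetical keyed datasets, both keyed by time, standing in for
// the JdbcRDD and the mapped cassandraTable from the slide.
val quotes    = Seq((5, "Bob Hauk: There was an accident..."))
val locations = Seq((5, "Court"), (8, "Stealth Glider"))

// Pair up values that share a key, as RDD.join would:
// the result is (key, (leftValue, rightValue)).
val joined = for {
  (qTime, quote)    <- quotes
  (lTime, location) <- locations
  if qTime == lTime
} yield (qTime, (quote, location))

println(joined.head) // (5,(Bob Hauk: There was an accident...,Court))
```

Keys present on only one side (time 8 here) drop out, just as in the RDD inner join.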
• 52–57. Easy Objects with Case Classes
We have the technology to make this even easier!

case class timelineRow(character: String, time: Int, location: String)

sc.cassandraTable[timelineRow]("newyork", "timelines")
  .filter( _.character == "plissken" )
  .filter( _.time == 8 )
  .toArray

res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))

(Slide diagram: the character, time, and location columns are mapped onto timelineRow, then filtered by character == plissken and time == 8, leaving character: plissken, time: 8, location: Stealth Glider.)
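The typed read gives back case-class instances, so the filters become plain field accesses instead of get[String]/get[Int] lookups. A local sketch with hand-made rows (the data is illustrative, not read from C*):

```scala
case class TimelineRow(character: String, time: Int, location: String)

// Hand-made rows standing in for what the typed cassandraTable call
// would deserialize from the timelines table.
val rows = Seq(
  TimelineRow("plissken", 8, "Stealth Glider"),
  TimelineRow("president", 8, "NYC"),
  TimelineRow("plissken", 9, "NYC"))

// Field access replaces the stringly-typed column lookups.
val hit = rows.filter(_.character == "plissken").filter(_.time == 8)

println(hit) // List(TimelineRow(plissken,8,Stealth Glider))
```

The compiler now checks the column names and types for you, which is the whole point of the case-class mapping.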
• 58–63. A Map Reduce for Word Count …

scala> sc.cassandraTable("newyork", "presidentlocations")
         .map( _.get[String]("location") )
         .flatMap( _.split(" ") )
         .map( (_, 1) )
         .reduceByKey( _ + _ )
         .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))

(Slide diagram: a row like "1 white house" becomes the location "white house", split into words, mapped to (word, 1) pairs, and summed by key with _ + _.)
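The same pipeline can be checked on a local collection: plain Scala has no reduceByKey, but groupBy plus a sum gives the same result on a small sample (the location strings below are made up to mirror the slide's data):

```scala
// Hypothetical location strings, as produced by the get[String] map step.
val locations = Seq("White House", "White House", "Air Force 1", "NYC")

val counts = locations
  .flatMap(_.split(" "))   // break locations into words
  .map((_, 1))             // (word, 1) pairs, as in the slide
  .groupBy(_._1)           // local stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts("White")) // 2
```

On an RDD, reduceByKey does this shuffle-and-sum across the cluster instead of in one process.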
• 64. Stand Alone App Example
https://github.com/RussellSpitzer/spark-cassandra-csv
A CSV of cars (Car, Model, Color: Dodge Caravan Red; Ford F150 Black; Toyota Prius Green) is read by Spark, turned into an RDD[CassandraRow] via the Spark Cassandra Connector's column mapping, and written to the FavoriteCars table in Cassandra.
  • 65. Thanks for listening! There is plenty more we can do with Spark but … Questions?
• 66. Thanks for coming to the meetup!
Getting started with Cassandra? DataStax Academy offers free online Cassandra training. Planet Cassandra has resources for learning the basics, from 'Try Cassandra' tutorials to in-depth language and migration pages.
Need help? Get questions answered with Planet Cassandra's free virtual office hours, running weekly. Email us: Community@DataStax.com
In production? Tweet us: @PlanetCassandra
Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org.
• 67. Thanks for your Time and Come to C* Summit!
SEPTEMBER 10 - 11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL
Cassandra Summit Link