SlideShare ist ein Scribd-Unternehmen logo
1 von 46
Downloaden Sie, um offline zu lesen
Interactive Analytics With
Spark And Cassandra
Evan Chan
Ooyala, Inc.
April 7th, 2014
Monday, April 7, 14
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of
C*, Kafka, Storm, etc.
• Scala/Akka guy
• Very excited by open source, big data projects
• @evanfchan
Who is this guy?
2
Monday, April 7, 14
• Cassandra at Ooyala
• What problem are we trying to solve?
• Spark and Shark
• Integrating Cassandra and Spark
• Our Spark/Cassandra Architecture
Agenda
3
Monday, April 7, 14
CASSANDRA AT
OOYALA
4
Monday, April 7, 14
OOYALA
Powering personalized video
experiences across all
screens.
Monday, April 7, 14
CONFIDENTIAL—DO NOT DISTRIBUTE 6CONFIDENTIAL—DO NOT DISTRIBUTE
Founded in 2007
Commercially launch in 2009
230+ employees in Silicon Valley, LA, NYC,
London, Paris, Tokyo, Sydney & Guadalajara
Global footprint, 200M unique users,
110+ countries, and more than 6,000 websites
Over 1 billion videos played per month
and 2 billion analytic events per day
25% of U.S. online viewers watch video
powered by Ooyala
COMPANY OVERVIEW
Monday, April 7, 14
CONFIDENTIAL—DO NOT DISTRIBUTE 7
TRUSTED VIDEO PARTNER
STRATEGIC PARTNERS
CUSTOMERS
CONFIDENTIAL—DO NOT DISTRIBUTE
Monday, April 7, 14
TITLE TEXT GOES HERE
• 12 clusters ranging in size from 3 to 107
nodes
• Total of 28TB of data managed over
~220 nodes
• Powers all of our analytics infrastructure
• Traditional analytics aggregations
• Recommendations and trends
• DSE/C* 1.0.x, 1.1.x, 1.2.6
We are a large Cassandra user
8
Monday, April 7, 14
TITLE TEXT GOES HERE
• Started investing in Spark beginning of 2013
• 2 teams of developers doing stuff with Spark
• Actively contributing to Spark developer community
• Deploying Spark to a large (>100 node) production
cluster
• Spark community very active, huge amount of interest
Becoming a big Spark user...
9
Monday, April 7, 14
WHAT PROBLEM ARE
WE TRYING TO SOLVE?
10
Monday, April 7, 14
From mountains of raw data...
Monday, April 7, 14
• Quickly
• Painlessly
• At scale?
To nuggets of truth...
Monday, April 7, 14
Today: Precomputed Aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
• Top content for mobile users in France
• Engagement curves for users who watched recommendations
• Data mining, trends, machine learning
Monday, April 7, 14
The Static - Dynamic Continuum
• Super fast lookups
• Inflexible,
wasteful
• Best for 80% most
common queries
• Always compute results
from raw data
• Flexible but slow
100% Precomputation 100% Dynamic
Monday, April 7, 14
Where We Want To Be
Partly dynamic
• Pre-aggregate most
common queries
• Flexible, fast
dynamic queries
• Easily generate
many materialized
views
Monday, April 7, 14
WHY SPARK?
16
Monday, April 7, 14
Introduction To Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
–Iterative algorithms (machine learning)
–Interactive data mining
• More general purpose than Hadoop MR
• Top level Apache project
• Active contributions from Intel, Yahoo, lots of
companies -- huge momentum
Monday, April 7, 14
Spark Vs Hadoop
HDFS
Map
Reduce
Map
Reduce
Data
map()
join()
Source 2
cache()
transform
Monday, April 7, 14
Throughput: Memory Is King
0 37500 75000 112500 150000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Spark cached RDD 10-50x faster than raw Cassandra
Monday, April 7, 14
Developers Love It
• “I wrote my first aggregation job in 30
minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
Monday, April 7, 14
Spark Vs Hadoop Word Count
file = spark.textFile("hdfs://...")
 
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
1 package org.myorg;
2
3 import java.io.IOException;
4 import java.util.*;
5
6 import org.apache.hadoop.fs.Path;
7 import org.apache.hadoop.conf.*;
8 import org.apache.hadoop.io.*;
9 import org.apache.hadoop.mapreduce.*;
10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
14
15 public class WordCount {
16
17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
18 private final static IntWritable one = new IntWritable(1);
19 private Text word = new Text();
20
21 public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
22 String line = value.toString();
23 StringTokenizer tokenizer = new StringTokenizer(line);
24 while (tokenizer.hasMoreTokens()) {
25 word.set(tokenizer.nextToken());
26 context.write(word, one);
27 }
28 }
29 }
30
31 public static class Reduce extends Reducer<Text, IntWritable, Text,
IntWritable> {
32
33 public void reduce(Text key, Iterable<IntWritable> values, Context context)
34 throws IOException, InterruptedException {
35 int sum = 0;
36 for (IntWritable val : values) {
37 sum += val.get();
38 }
39 context.write(key, new IntWritable(sum));
40 }
41 }
42
43 public static void main(String[] args) throws Exception {
44 Configuration conf = new Configuration();
45
46 Job job = new Job(conf, "wordcount");
47
48 job.setOutputKeyClass(Text.class);
49 job.setOutputValueClass(IntWritable.class);
50
51 job.setMapperClass(Map.class);
52 job.setReducerClass(Reduce.class);
53
54 job.setInputFormatClass(TextInputFormat.class);
55 job.setOutputFormatClass(TextOutputFormat.class);
56
57 FileInputFormat.addInputPath(job, new Path(args[0]));
58 FileOutputFormat.setOutputPath(job, new Path(args[1]));
59
60 job.waitForCompletion(true);
61 }
62
63 }
Monday, April 7, 14
One Platform To Rule Them All
HIVE on Spark
Spark Streaming -
discretized stream
processing
• SQL, Graph, ML, Streaming all in one
framework
• Much higher code sharing/reuse
• Easy integration between components
• Fewer platforms == lower TCO
• Integration with Mesos, YARN helps
share resources
Monday, April 7, 14
Shark - HIVE On Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
Monday, April 7, 14
INTEGRATING
CASSANDRA & SPARK
24
Monday, April 7, 14
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node2
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node3
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Spark Master Job Server
Monday, April 7, 14
OPTIONS FOR READING FROM C*
• Hadoop InputFormat
– ColumnFamilyInputFormat - reads all rows from 1
CF
– CqlPagingInputFormat, etc. - CQL3, 2-dary indexes
– Roll your own (join multiple CFs, etc)
• Spark native RDD
– sc.parallelize(rowkeys).flatMap(readColum
ns(_))
– JdbcRdd + Cassandra JDBC driver
Monday, April 7, 14
ColumnFamilyInputFormat
video type
Record1 10 1
Record2 11 5
id Video Type
Record1 10 1
Record2 11 5
• Must read from all rows
• One CF only, not very flexible
Monday, April 7, 14
Node 2Node 1
Spark RDD
• RDD = Resilient Distributed Dataset
• Multiple partitions living on different
nodes
• Each partition has records
Partition 1 Partition 2 Partition 3 Partition 4
Monday, April 7, 14
Inputformat Vs RDD
InputFormat RDD
Supports Hadoop, HIVE, Spark,
Shark
Spark / Shark only
Have to implement multiple classes
- InputFormat, RecordReader,
Writeable, etc. Clunky API.
One class - simple API.
Two APIs, and often need to
implement both (HIVE needs older...)
Just one API.
• You can easily use InputFormats in
Spark using newAPIHadoopRDD().
• Writing a custom RDD could have
saved us lots of time.
Monday, April 7, 14
Node 2Node 1
Skipping The InputFormat
Row 1 data Row 2 data Row 3 data Row 4 data
Rowkey1 Rowkey2 Rowkey3 Rowkey4
sc.parallelize(rowkeys).flatMap(readColumns(_))
Monday, April 7, 14
OUR CASSANDRA /
SPARK ARCHITECTURE
31
Monday, April 7, 14
From Raw Logs To Fast Queries
Process
C*
columnar
storage
Raw Logs
Raw Logs
Raw Logs
Spark
Spark
Spark
OLAP Table
1
OLAP Table
2
OLAP Table
3
Spark
Shark
Predefined
queries
Ad-hoc
HiveQL
Monday, April 7, 14
Why Cassandra Alone Isn’t Enough
• Over 30 million multi-dimensional
fact table rows per day
• Materializing every possible answer
isn’t close to possible
• Multi dimensional filtering and
grouping alone leads to many
billions of possible answers
• Querying fact tables in Cassandra is
too slow
• Reading millions or billion
random rows
• CQL doesn’t support grouping
Monday, April 7, 14
Our Approach
• Use Cassandra to store the raw
fact tables
• Optimize the schema for OLAP
workloads
• Fast full table reads
• Easily read fewer columns
• Use Spark for fast random row
access and fast distributed
computation
Monday, April 7, 14
uuid-
part-0
uuid-
part-1
2013-04-05T
00:00Z#id1
Section 0 Section 1 Section 2
country rows 0-9999 rows 10000-19999 rows ....
city rows 0-9999 rows 10000-19999 rows ....
Index CF
Columns CF
An OLAP Schema for Cassandra
Metadata
2013-04-05T
00:00-part-0
{columns:
[“country”,
“city”,
“plays” ]}
Metadata CF
•Optimized for: selective column loading, maximum throughput and compression
Monday, April 7, 14
OLAP Workflow
Dataset
Aggregation
Job
Query
Job
Spark
Executors
Cassandra
REST Job Server
Query
Job
Aggregate Query
Result
Query
Result
Monday, April 7, 14
Querying Data In Spark
Node 2Node 1
Partition 1
record1
record2
record3
Partition 2
record4
record5
record6
Partition 3
record7
record8
record9
Partition 4
record10
record11
record12
Convert to Spark SQL /
Shark Table
Shark / Spark SQL
rdd.group / .map / .sort /
.join etc
rdd.reduce / .collect /
.top
Monday, April 7, 14
Fault Tolerance
• Cached dataset lives in Java Heap only - what if
process dies?
• Spark lineage - automatic recomputation from
source, but this is expensive!
• Can also replicate cached dataset to survive single
node failures
• Persist materialized views back to C*, then load into
cache -- now recovery path is much faster
Monday, April 7, 14
DEMO
39
Monday, April 7, 14
Creating a Shark Table
Monday, April 7, 14
Creating a Cached Table
Monday, April 7, 14
Querying a Cached Table
Monday, April 7, 14
THANK YOU
And YES, We’re HIRING!!
ooyala.com/careers
@evanfchan
Monday, April 7, 14
Industry Trends
• Fast execution frameworks
• Impala
• In-memory databases
• VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data
frameworks
Monday, April 7, 14
PERFORMANCE #’S
Spark: C* -> OLAP aggregates
cold cache, 1.4 million events
130 seconds
C* -> OLAP aggregates
warmed cache
20-30 seconds
OLAP aggregate query via
Spark
(56k records)
60 ms
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Monday, April 7, 14
EXAMPLE: OLAP PROCESSING
t0
2013-0
4-05T0
0:00Z#i
d1
{vide
o:
10,
type:
5}
2013-0
4-05T0
0:00Z#i
{vide
o:
20,
C* events
OLAP
Aggregates
OLAP
Aggregates
OLAP
Aggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Query 1: Plays by
Provider
Query 2: Top content
for mobile
Monday, April 7, 14

Weitere ähnliche Inhalte

Was ist angesagt?

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSpark Summit
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 

Was ist angesagt? (20)

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 

Andere mochten auch

Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Migration from MySQL to Cassandra for millions of active users
Migration from MySQL to Cassandra for millions of active usersMigration from MySQL to Cassandra for millions of active users
Migration from MySQL to Cassandra for millions of active usersAndrey Panasyuk
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
 
Amazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBS
Amazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBSAmazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBS
Amazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBSJean-Paul Azar
 
NoSQL - Cassandra & MongoDB.pptx
NoSQL -  Cassandra & MongoDB.pptxNoSQL -  Cassandra & MongoDB.pptx
NoSQL - Cassandra & MongoDB.pptxNaveen Kumar
 

Andere mochten auch (10)

Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Migration from MySQL to Cassandra for millions of active users
Migration from MySQL to Cassandra for millions of active usersMigration from MySQL to Cassandra for millions of active users
Migration from MySQL to Cassandra for millions of active users
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Amazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBS
Amazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBSAmazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBS
Amazon Cassandra Basics & Guidelines for AWS/EC2/VPC/EBS
 
NoSQL - Cassandra & MongoDB.pptx
NoSQL -  Cassandra & MongoDB.pptxNoSQL -  Cassandra & MongoDB.pptx
NoSQL - Cassandra & MongoDB.pptx
 

Ähnlich wie Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
SF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup SlidesSF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup SlidesYash Ranadive
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyone20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyoneAmanda Casari
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology Avinash Gautam
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupHyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentationRamesh Mudunuri
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsXiao Li
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 

Ähnlich wie Cassandra Day 2014: Interactive Analytics with Cassandra and Spark (20)

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
SF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup SlidesSF Hadoop Users Group August 2014 Meetup Slides
SF Hadoop Users Group August 2014 Meetup Slides
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyone20160512 apache-spark-for-everyone
20160512 apache-spark-for-everyone
 
Spark and Hadoop Technology
Spark and Hadoop Technology Spark and Hadoop Technology
Spark and Hadoop Technology
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 
SparkFramework
SparkFrameworkSparkFramework
SparkFramework
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 

Mehr von Evan Chan

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureEvan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 

Mehr von Evan Chan (7)

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 

Kürzlich hochgeladen

Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoordharasingh5698
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086anil_gaur
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf203318pmpc
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 

Kürzlich hochgeladen (20)

Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf22-prompt engineering noted slide shown.pdf
22-prompt engineering noted slide shown.pdf
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 

Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

  • 1. Interactive Analytics With Spark And Cassandra Evan Chan Ooyala, Inc. April 7th, 2014 Monday, April 7, 14
  • 2. • Staff Engineer, Compute and Data Services, Ooyala • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc. • Scala/Akka guy • Very excited by open source, big data projects • @evanfchan Who is this guy? 2 Monday, April 7, 14
  • 3. • Cassandra at Ooyala • What problem are we trying to solve? • Spark and Shark • Integrating Cassandra and Spark • Our Spark/Cassandra Architecture Agenda 3 Monday, April 7, 14
  • 5. OOYALA Powering personalized video experiences across all screens. Monday, April 7, 14
  • 6. CONFIDENTIAL—DO NOT DISTRIBUTE 6CONFIDENTIAL—DO NOT DISTRIBUTE Founded in 2007 Commercially launch in 2009 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara Global footprint, 200M unique users, 110+ countries, and more than 6,000 websites Over 1 billion videos played per month and 2 billion analytic events per day 25% of U.S. online viewers watch video powered by Ooyala COMPANY OVERVIEW Monday, April 7, 14
  • 7. CONFIDENTIAL—DO NOT DISTRIBUTE 7 TRUSTED VIDEO PARTNER STRATEGIC PARTNERS CUSTOMERS CONFIDENTIAL—DO NOT DISTRIBUTE Monday, April 7, 14
  • 8. TITLE TEXT GOES HERE • 12 clusters ranging in size from 3 to 107 nodes • Total of 28TB of data managed over ~220 nodes • Powers all of our analytics infrastructure • Traditional analytics aggregations • Recommendations and trends • DSE/C* 1.0.x, 1.1.x, 1.2.6 We are a large Cassandra user 8 Monday, April 7, 14
  • 9. TITLE TEXT GOES HERE • Started investing in Spark beginning of 2013 • 2 teams of developers doing stuff with Spark • Actively contributing to Spark developer community • Deploying Spark to a large (>100 node) production cluster • Spark community very active, huge amount of interest Becoming a big Spark user... 9 Monday, April 7, 14
  • 10. WHAT PROBLEM ARE WE TRYING TO SOLVE? 10 Monday, April 7, 14
  • 11. From mountains of raw data... Monday, April 7, 14
  • 12. • Quickly • Painlessly • At scale? To nuggets of truth... Monday, April 7, 14
  • 13. Today: Precomputed Aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change • Most computed aggregates are never read • What if we need more dynamic queries? • Top content for mobile users in France • Engagement curves for users who watched recommendations • Data mining, trends, machine learning Monday, April 7, 14
  • 14. The Static - Dynamic Continuum • Super fast lookups • Inflexible, wasteful • Best for 80% most common queries • Always compute results from raw data • Flexible but slow 100% Precomputation 100% Dynamic Monday, April 7, 14
  • 15. Where We Want To Be Partly dynamic • Pre-aggregate most common queries • Flexible, fast dynamic queries • Easily generate many materialized views Monday, April 7, 14
  • 17. Introduction To Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: –Iterative algorithms (machine learning) –Interactive data mining • More general purpose than Hadoop MR • Top level Apache project • Active contributions from Intel, Yahoo, lots of companies -- huge momentum Monday, April 7, 14
  • 19. Throughput: Memory Is King 0 37500 75000 112500 150000 C*, cold cache C*, warm cache Spark RDD 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Spark cached RDD 10-50x faster than raw Cassandra Monday, April 7, 14
  • 20. Developers Love It • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell Monday, April 7, 14
  • 21. Spark Vs Hadoop Word Count file = spark.textFile("hdfs://...")   file.flatMap(line => line.split(" "))     .map(word => (word, 1))     .reduceByKey(_ + _) 1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 } Monday, April 7, 14
  • 22. One Platform To Rule Them All HIVE on Spark Spark Streaming - discretized stream processing • SQL, Graph, ML, Streaming all in one framework • Much higher code sharing/reuse • Easy integration between components • Fewer platforms == lower TCO • Integration with Mesos, YARN helps share resources Monday, April 7, 14
  • 23. Shark - HIVE On Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers • Can use DSE / CassandraFS for Metastore Monday, April 7, 14
  • 26. OPTIONS FOR READING FROM C* • Hadoop InputFormat – ColumnFamilyInputFormat - reads all rows from 1 CF – CqlPagingInputFormat, etc. - CQL3, 2-dary indexes – Roll your own (join multiple CFs, etc) • Spark native RDD – sc.parallelize(rowkeys).flatMap(readColum ns(_)) – JdbcRdd + Cassandra JDBC driver Monday, April 7, 14
  • 27. ColumnFamilyInputFormat video type Record1 10 1 Record2 11 5 id Video Type Record1 10 1 Record2 11 5 • Must read from all rows • One CF only, not very flexible Monday, April 7, 14
  • 28. Node 2Node 1 Spark RDD • RDD = Resilient Distributed Dataset • Multiple partitions living on different nodes • Each partition has records Partition 1 Partition 2 Partition 3 Partition 4 Monday, April 7, 14
  • 29. Inputformat Vs RDD InputFormat RDD Supports Hadoop, HIVE, Spark, Shark Spark / Shark only Have to implement multiple classes - InputFormat, RecordReader, Writeable, etc. Clunky API. One class - simple API. Two APIs, and often need to implement both (HIVE needs older...) Just one API. • You can easily use InputFormats in Spark using newAPIHadoopRDD(). • Writing a custom RDD could have saved us lots of time. Monday, April 7, 14
  • 30. Node 2Node 1 Skipping The InputFormat Row 1 data Row 2 data Row 3 data Row 4 data Rowkey1 Rowkey2 Rowkey3 Rowkey4 sc.parallelize(rowkeys).flatMap(readColumns(_)) Monday, April 7, 14
  • 31. OUR CASSANDRA / SPARK ARCHITECTURE 31 Monday, April 7, 14
  • 32. From Raw Logs To Fast Queries Process C* columnar storage Raw Logs Raw Logs Raw Logs Spark Spark Spark OLAP Table 1 OLAP Table 2 OLAP Table 3 Spark Shark Predefined queries Ad-hoc HiveQL Monday, April 7, 14
  • 33. Why Cassandra Alone Isn’t Enough • Over 30 million multi-dimensional fact table rows per day • Materializing every possible answer isn’t close to possible • Multi dimensional filtering and grouping alone leads to many billions of possible answers • Querying fact tables in Cassandra is too slow • Reading millions or billion random rows • CQL doesn’t support grouping Monday, April 7, 14
  • 34. Our Approach • Use Cassandra to store the raw fact tables • Optimize the schema for OLAP workloads • Fast full table reads • Easily read fewer columns • Use Spark for fast random row access and fast distributed computation Monday, April 7, 14
  • 35. uuid- part-0 uuid- part-1 2013-04-05T 00:00Z#id1 Section 0 Section 1 Section 2 country rows 0-9999 rows 10000-19999 rows .... city rows 0-9999 rows 10000-19999 rows .... Index CF Columns CF An OLAP Schema for Cassandra Metadata 2013-04-05T 00:00-part-0 {columns: [“country”, “city”, “plays” ]} Metadata CF •Optimized for: selective column loading, maximum throughput and compression Monday, April 7, 14
  • 36. OLAP Workflow Dataset Aggregation Job Query Job Spark Executors Cassandra REST Job Server Query Job Aggregate Query Result Query Result Monday, April 7, 14
  • 37. Querying Data In Spark Node 2Node 1 Partition 1 record1 record2 record3 Partition 2 record4 record5 record6 Partition 3 record7 record8 record9 Partition 4 record10 record11 record12 Convert to Spark SQL / Shark Table Shark / Spark SQL rdd.group / .map / .sort / .join etc rdd.reduce / .collect / .top Monday, April 7, 14
  • 38. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures • Persist materialized views back to C*, then load into cache -- now recovery path is much faster Monday, April 7, 14
  • 40. Creating a Shark Table Monday, April 7, 14
  • 41. Creating a Cached Table Monday, April 7, 14
  • 42. Querying a Cached Table Monday, April 7, 14
  • 43. THANK YOU And YES, We’re HIRING!! ooyala.com/careers @evanfchan Monday, April 7, 14
  • 44. Industry Trends • Fast execution frameworks • Impala • In-memory databases • VoltDB, Druid • Streaming and real-time • Higher-level, productive data frameworks Monday, April 7, 14
  • 45. PERFORMANCE #’S Spark: C* -> OLAP aggregates cold cache, 1.4 million events 130 seconds C* -> OLAP aggregates warmed cache 20-30 seconds OLAP aggregate query via Spark (56k records) 60 ms 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Monday, April 7, 14
  • 46. EXAMPLE: OLAP PROCESSING t0 2013-0 4-05T0 0:00Z#i d1 {vide o: 10, type: 5} 2013-0 4-05T0 0:00Z#i {vide o: 20, C* events OLAP Aggregates OLAP Aggregates OLAP Aggregates Cached Materialized Views Spark Spark Spark Union Query 1: Plays by Provider Query 2: Top content for mobile Monday, April 7, 14