SlideShare ist ein Scribd-Unternehmen logo
1 von 55
Downloaden Sie, um offline zu lesen
© 2014 DataStax, All Rights Reserved
Large Scale Data Analytics with DSE
Analytics
Ryan Knight
Solutions Engineer
@knight_cloud
© 2014 DataStax, All Rights Reserved
Hadoop?
© 2014 DataStax, All Rights Reserved
Hadoop Limitations
• Master / Slave Architecture
• Every Processing Step requires Disk IO
• Difficult API and Programming Model
• Designed for batch-mode jobs
• No even-streaming / real-time
• Complex Ecosystem
© 2014 DataStax, All Rights Reserved
Introduction to Spark
5
Apps in the early 2000s
were written for
Apps today
are written for
Single machines Clusters of machines
Single core processors Multicore processors
Expensive RAM Cheap RAM
Expensive disk Cheap disk
Slow networks Fast networks
Few concurrent users Lots of concurrent users
Small data sets Large data sets
Latency in seconds Latency in milliseconds
© 2014 Typesafe, All Rights Reserved. - Copied from Jonas Boner
What is Spark?
• Fast and general compute engine for large-scale data
processing
• Fault Tolerant Distributed Datasets
• Distributed Transformation on Datasets
• Integrated Batch, Iterative and Streaming Analysis
• In Memory Storage with Spill-over to Disk
© 2014 DataStax, All Rights Reserved
Advantages of Spark
• Improves efficiency through:
• In-memory data sharing
• General computation graphs - Lazy Evaluates Data
• 10x faster on disk, 100x faster in memory than
Hadoop MR
• Improves usability through:
• Rich APIs in Java, Scala, Py..??
• 2 to 5x less code
• Interactive shell
© 2014 DataStax, All Rights Reserved
Application

(Spark Driver)
Spark Master
Worker
Spark Components
You application code
which creates the SparkContext
A process which shells out to create
a Executor JVM
A Process which Manages the 

Resources of the Spark Cluster
These processes are all separate and require networking
to communicate
Hosting
Application UI
:4040
Hosting
Spark Master UI
:7080
WorkerWorkerWorkerWorker
© 2014 DataStax, All Rights Reserved
DataStax Analytics
Spark is about Data Analytics
• How do we get data into Spark?
• How can we work with large datasets?
• What do we do with the results of the analytics?
Spark Cassandra Connector
Spark Cassandra Connector
• Data locality-aware (speed)
• Read from and Write to Cassandra
• Cassandra Tables Exposed as RDD and DataFrames
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Mapping of Java Types to Cassandra Types
© 2014 DataStax, All Rights Reserved ●14
Spark Cassandra Connector uses the DataStax Java Driver to
Read from and Write to C*
Spark C*
Full Token
Range
Each Executor Maintains
a connection to the C*
Cluster
Spark
Executor
DataStax
Java Driver
Tokens 1-1000
Tokens 1001 -2000
Tokens …
RDD’s read into different
splits based on sets of tokens
Spark Cassandra Connector
15© 2015. All Rights Reserved.
•Simplified Deployment and Management
•Analytic Nodes configured to run Spark
•dse cassandra -k
•HA Spark Master with automatic leader election
•Stores Spark Worker metadata in Cassandra
DSE Analytics with Spark
© 2014 DataStax, All Rights Reserved
DSE Spark Architecture
Cassandr
a
Executor
ExecutorSpark
Worker

(JVM)
Cassandr
a
Executor
ExecutorSpark
Worker

(JVM)
Node 1
Node 2
Node 3
Node 4
Cassandr
a
Executor
ExecutorSpark
Worker

(JVM)
Cassandr
a
Executor
ExecutorSpark
Worker

(JVM)
Spark
Master

(JVM)
App

Driver
© 2014 DataStax, All Rights Reserved.
Confidential
Mixed Workload In One Cluster
17
Cassandra Mode
OLTP Database
Search Mode
All Data Searchable
Analytics Mode
Streaming and Analytics
C*
C
C
S A
Don’t build and maintain these yourself,
especially on top of a distributed data store.
AS
© 2014 DataStax, All Rights Reserved.
Confidential 18
Mixed Workload Cluster
DSE 4.7 Analytics + Search
• Allows Analytics Jobs to use Solr Queries
• Allows searching for data across partitions
• Example:
val table = sc.cassandraTable("music","solr")
val result = table.select("id","artist_name").where("solr_query='artist_name:Miles*'").collect
© 2014 DataStax, All Rights Reserved
Spark SQL and DataFrames
© 2014 DataStax, All Rights Reserved
• Creating and Running Spark Programs Faster
• Write less code
• Read less data
• Let the optimizer do the hard work
• Spark SQL Catalyst optimizer
Why Spark SQL?
© 2014 DataStax, All Rights Reserved
• Distributed collection of data
• Similar to a Table in a RDBMS
• Common API for reading/writing data
• API for selecting, filtering, aggregating 

and plotting structured data
• Similar to a Table in a RDBMS
DataFrame
© 2014 DataStax, All Rights Reserved
• Sources such as Cassandra, structured data
files, tables in Hive, external databases, or
existing RDDs.
• Optimization and code generation through the
Spark SQL Catalyst optimizer
• Decorator around RDD
• Previously SchemaRDD
DataFrame Part 2
© 2014 DataStax, All Rights Reserved
• Unified interface to reading/writing data in a
variety of formats
• Spark Notebook Example
Write Less Code: Input & Output
Scala for Large Scale Data Analytics
25© 2015. All Rights Reserved.
•Functional Paradigm is ideal for Data Analytics
•Strongly Typed - Enforce Schema at Every Later
•Immutable by Default - Event Logging
•Declarative instead of Imperative - Focus on
Transformation not Implementation
Spark Notebook
26© 2015. All Rights Reserved.
C*
C
C A
AA
Notebook
Notebook
Notebook
Spark Notebook Server
Cassandra Cluster with Spark Connector
Apache Spark Notebook
27© 2015. All Rights Reserved.
•Interactive Data Analytics in Browser
•Reactive / Dynamic Graphs based on Scala, SQL and
DataFrames
•Spark Streaming
•Examples notebooks covering visualization, machine
learning, streaming, graph analysis, genomics analysis
•Tune and Configure Each Notebook Separately
•https://github.com/andypetrella/spark-notebook
© 2014 DataStax, All Rights Reserved
Spark Streaming
© 2014 DataStax, All Rights Reserved
Spark Components
© 2014 DataStax, All Rights Reserved
Spark Versus Spark Streaming
© 2014 DataStax, All Rights Reserved
Spark Streaming General Architecture
© 2014 DataStax, All Rights Reserved
Spark Streaming General Architecture
© 2014 DataStax, All Rights Reserved
DStream Micro Batches
© 2014 DataStax, All Rights Reserved
Windowing
© 2014 DataStax, All Rights Reserved
Windowing
© 2014 DataStax, All Rights Reserved
Streaming Resiliency
• Streaming uses aggressive checkpointing and in-memory data
replication to improve resiliency.
• Frequent checkpointing keeps RDD lineages down to a
reasonable size.
• Checkpointing and replication mandatory since streams don’t
have source data files to reconstruct lost RDD partitions (except
for the directory ingest case).
© 2014 DataStax, All Rights Reserved
KillrWeather Architecture
© 2014 DataStax, All Rights Reserved
Spark Development
Imperative Code
final List<Integer> numbers =
Arrays.asList(1, 2, 3);
final List<Integer> numbersPlusOne =
Collections.emptyList();
for (Integer number : numbers) {
final Integer numberPlusOne = number + 1;
numbersPlusOne.add(numberPlusOne);
}
We Want Declarative Code
• Remove temporary lists - List<Integer> numbersPlusOne =
Collections.emptyList();
• Remove looping - for (Integer number : numbers)
• Focus on what - x+1
© 2014 DataStax, All Rights Reserved
Functions as Values
• Similar to a method - Expression with 0 or more
input arguments
• Simple Expressions f(x) = x+1 f(y)=y*y
• Avoid side effects and mutable state
• Output only depends on input
• Functions can be passed similar to other variables
Map, FlatMap and Filter
1 2 3 4 5 6 7 8
2 4 6 8 10 12 14 16
map (x*2)
4 6 8 102
filter ( x < 11)
30
reduce( x+nxt)
1,2 8,9 4,1 5,7
2 4 16 18 8 2 10 14
flatMap (x*2)
4 8 22
filter ( x < 10)
16
reduce( x+nxt)
Java 8
final List<Integer> numbers =
Arrays.asList(1, 2, 3);
final List<Integer> numbersPlusOne =
numbers.stream().map(number -> number + 1).
collect(Collectors.toList());
λ
Scala
val numbers = 1 to 20


val incFunc = (x:Int) => x+1

numbers.map(incFunc)



numbers.map(x => x+1)





numbers.map(_+1)
λ
SQL - Declarative or Imperative?
• SQL is Declarative
• What operation to perform and not how
to perform it
• Select doesn’t define how just what data
we want
Closures vs Functions?
• Closure is a Function which closes over
the surrounding context
• Closures can access variables in
surrounding context
• Spark Job passes closures to operate on
the data
© 2014 DataStax, All Rights Reserved
Spark Development
• Write programs in terms of parallel
transformations on distributed datasets
• Programming at a higher level of abstraction
Why Functions with Spark?
• Declarative Programming - Define What and Not
How
• Define what operations to perform and Spark
figures out how to operate on the data
• Easy to handle Events and Async Results with
Functional Callbacks
• Avoid Inner Classes
© 2014 DataStax, All Rights Reserved
Spark RDD
© 2014 DataStax, All Rights Reserved
• The primary abstraction in Spark
• Collection of data stored in the Spark Cluster
• Fault-tolerant
• Enables parallel processing on data sets
• In-Memory or On-Disk
Resilient Distributed Datasets (RDD)
© 2014 DataStax, All Rights Reserved
• Parallelized Collections
• Take an existing collection and runs functions
on it in parallel
• PairRDD
• UnionRDD
• JsonRDD
• ShuffledRDD
• CassandraRDD
Examples RDDs
© 2014 DataStax, All Rights Reserved
Spark Data Model
A1 A2 A3 A4 A5 A6 A7 A8
B1 B2 B3 B4 B5 B6 B7 B8
map
B2 B5 B7 B8B1
filter
C
reduce
Resilient Distributed Dataset
A collection:
● immutable
● iterable
● serializable
● distributed
● parallel
● lazy
© 2014 DataStax, All Rights Reserved
• RDDs are immutable - Each stage of a
transformation will create a new RDD.
• RDDs are lazy
• A DAG (directed acyclic graph) of computation
is constructed.
• The actual data is processed only when
results are requested.
Resilient Distributed Datasets (RDD)
© 2014 DataStax, All Rights Reserved
• RDDs know their “parents” and transitively, all
ancestors.
• RDDs are resilient - A lost partition is
reconstructed from ancestors.
• Transformation history / Lineage of the Data for
Re-computation when needed
Resilient Distributed Datasets (RDD)
© 2014 DataStax, All Rights Reserved
RDD Operations - Not Only Map & Reduce

Weitere ähnliche Inhalte

Was ist angesagt?

Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructuremattlieber
 
Scylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the DatabaseScylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the DatabaseScyllaDB
 
OpenStack Best Practices and Considerations - terasky tech day
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech dayArthur Berezin
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesYousun Jeong
 
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...ScyllaDB
 
Cassandra on Docker
Cassandra on DockerCassandra on Docker
Cassandra on DockerInstaclustr
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL storeEdward Capriolo
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 
Apache Cassandra Management
Apache Cassandra ManagementApache Cassandra Management
Apache Cassandra ManagementInstaclustr
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
February 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerYahoo Developer Network
 
Guaranteeing Storage Performance by Mike Tutkowski
Guaranteeing Storage Performance by Mike TutkowskiGuaranteeing Storage Performance by Mike Tutkowski
Guaranteeing Storage Performance by Mike Tutkowskibuildacloud
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit
 
Scaling DataStax in Docker
Scaling DataStax in DockerScaling DataStax in Docker
Scaling DataStax in DockerDataStax
 
Building Distributed Systems With Riak and Riak Core
Building Distributed Systems With Riak and Riak CoreBuilding Distributed Systems With Riak and Riak Core
Building Distributed Systems With Riak and Riak CoreAndy Gross
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInGuozhang Wang
 
Scalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and Mesosnelsonadpresent
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosHeiko Loewe
 
Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflixVinay Kumar Chella
 
Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Janos Matyas
 

Was ist angesagt? (20)

Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Scylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the DatabaseScylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the Database
 
OpenStack Best Practices and Considerations - terasky tech day
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech day
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on Kubernetes
 
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
 
Cassandra on Docker
Cassandra on DockerCassandra on Docker
Cassandra on Docker
 
Building your own NSQL store
Building your own NSQL storeBuilding your own NSQL store
Building your own NSQL store
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Apache Cassandra Management
Apache Cassandra ManagementApache Cassandra Management
Apache Cassandra Management
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
February 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with Docker
 
Guaranteeing Storage Performance by Mike Tutkowski
Guaranteeing Storage Performance by Mike TutkowskiGuaranteeing Storage Performance by Mike Tutkowski
Guaranteeing Storage Performance by Mike Tutkowski
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Scaling DataStax in Docker
Scaling DataStax in DockerScaling DataStax in Docker
Scaling DataStax in Docker
 
Building Distributed Systems With Riak and Riak Core
Building Distributed Systems With Riak and Riak CoreBuilding Distributed Systems With Riak and Riak Core
Building Distributed Systems With Riak and Riak Core
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Scalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and Mesos
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
 
Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflix
 
Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014
 

Andere mochten auch

Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraDataStax Academy
 
Continuous Deployment with Cassandra
Continuous Deployment with CassandraContinuous Deployment with Cassandra
Continuous Deployment with CassandraDataStax Academy
 
Becoming Friends with Cassandra
Becoming Friends with Cassandra Becoming Friends with Cassandra
Becoming Friends with Cassandra DataStax
 
Webinar: Continuous Deployment with MongoDB at Kitchensurfing
Webinar: Continuous Deployment with MongoDB at KitchensurfingWebinar: Continuous Deployment with MongoDB at Kitchensurfing
Webinar: Continuous Deployment with MongoDB at KitchensurfingMongoDB
 
Ops manager webinar mar 5, 2015
Ops manager webinar   mar 5, 2015Ops manager webinar   mar 5, 2015
Ops manager webinar mar 5, 2015MongoDB
 
«CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС
«CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС «CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС
«CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС DevDay
 
Continuous DB migration based on carbon5 framework
Continuous DB migration based on carbon5 frameworkContinuous DB migration based on carbon5 framework
Continuous DB migration based on carbon5 frameworkb0ris_1
 
MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介kwatch
 
Scaling Crashlytics: Building Analytics on Redis 2.6
Scaling Crashlytics: Building Analytics on Redis 2.6Scaling Crashlytics: Building Analytics on Redis 2.6
Scaling Crashlytics: Building Analytics on Redis 2.6Crashlytics
 
Battery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra DeploymentsBattery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra DeploymentsDataStax Academy
 
DataStax: How to Roll Cassandra into Production Without Losing your Health, M...
DataStax: How to Roll Cassandra into Production Without Losing your Health, M...DataStax: How to Roll Cassandra into Production Without Losing your Health, M...
DataStax: How to Roll Cassandra into Production Without Losing your Health, M...DataStax Academy
 
DataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetch
DataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetchDataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetch
DataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetchDataStax Academy
 
Crash course intro to cassandra
Crash course   intro to cassandraCrash course   intro to cassandra
Crash course intro to cassandraJon Haddad
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core ConceptsJon Haddad
 
DataStax: 7 Deadly Sins for Cassandra Ops
DataStax: 7 Deadly Sins for Cassandra OpsDataStax: 7 Deadly Sins for Cassandra Ops
DataStax: 7 Deadly Sins for Cassandra OpsDataStax Academy
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax Academy
 
Cassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessCassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessJon Haddad
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraJon Haddad
 
Enter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy SparkEnter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy SparkJon Haddad
 

Andere mochten auch (20)

Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in Cassandra
 
Continuous Deployment with Cassandra
Continuous Deployment with CassandraContinuous Deployment with Cassandra
Continuous Deployment with Cassandra
 
Becoming Friends with Cassandra
Becoming Friends with Cassandra Becoming Friends with Cassandra
Becoming Friends with Cassandra
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
Webinar: Continuous Deployment with MongoDB at Kitchensurfing
Webinar: Continuous Deployment with MongoDB at KitchensurfingWebinar: Continuous Deployment with MongoDB at Kitchensurfing
Webinar: Continuous Deployment with MongoDB at Kitchensurfing
 
Ops manager webinar mar 5, 2015
Ops manager webinar   mar 5, 2015Ops manager webinar   mar 5, 2015
Ops manager webinar mar 5, 2015
 
«CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС
«CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС «CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС
«CI. Jenkins. 2GIS» — Игорь Павлов, 2ГИС
 
Continuous DB migration based on carbon5 framework
Continuous DB migration based on carbon5 frameworkContinuous DB migration based on carbon5 framework
Continuous DB migration based on carbon5 framework
 
MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介MySQLドライバの改良と軽量O/Rマッパーの紹介
MySQLドライバの改良と軽量O/Rマッパーの紹介
 
Scaling Crashlytics: Building Analytics on Redis 2.6
Scaling Crashlytics: Building Analytics on Redis 2.6Scaling Crashlytics: Building Analytics on Redis 2.6
Scaling Crashlytics: Building Analytics on Redis 2.6
 
Battery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra DeploymentsBattery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
 
DataStax: How to Roll Cassandra into Production Without Losing your Health, M...
DataStax: How to Roll Cassandra into Production Without Losing your Health, M...DataStax: How to Roll Cassandra into Production Without Losing your Health, M...
DataStax: How to Roll Cassandra into Production Without Losing your Health, M...
 
DataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetch
DataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetchDataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetch
DataStax: Old Dogs, New Tricks. Teaching your Relational DBA to fetch
 
Crash course intro to cassandra
Crash course   intro to cassandraCrash course   intro to cassandra
Crash course intro to cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
DataStax: 7 Deadly Sins for Cassandra Ops
DataStax: 7 Deadly Sins for Cassandra OpsDataStax: 7 Deadly Sins for Cassandra Ops
DataStax: 7 Deadly Sins for Cassandra Ops
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
 
Cassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessCassandra 3.0 Awesomeness
Cassandra 3.0 Awesomeness
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
Enter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy SparkEnter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy Spark
 

Ähnlich wie Large Scale Data Analytics with Spark and Cassandra on the DSE Platform

Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; pythonMaloy Manna, PMP®
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...DataStax
 
Cassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction GuideCassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction GuideMohammed Fazuluddin
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersKnoldus Inc.
 

Ähnlich wie Large Scale Data Analytics with Spark and Cassandra on the DSE Platform (20)

Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Cassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction GuideCassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction Guide
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All Developers
 

Mehr von DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and DriversDataStax Academy
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph DatabasesDataStax Academy
 

Mehr von DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 

Kürzlich hochgeladen

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Kürzlich hochgeladen (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Large Scale Data Analytics with Spark and Cassandra on the DSE Platform

  • 1. © 2014 DataStax, All Rights Reserved Large Scale Data Analytics with DSE Analytics Ryan Knight Solutions Engineer @knight_cloud
  • 2. © 2014 DataStax, All Rights Reserved Hadoop?
  • 3. © 2014 DataStax, All Rights Reserved Hadoop Limitations • Master / Slave Architecture • Every Processing Step requires Disk IO • Difficult API and Programming Model • Designed for batch-mode jobs • No even-streaming / real-time • Complex Ecosystem
  • 4. © 2014 DataStax, All Rights Reserved Introduction to Spark
  • 5. 5 Apps in the early 2000s were written for Apps today are written for Single machines Clusters of machines Single core processors Multicore processors Expensive RAM Cheap RAM Expensive disk Cheap disk Slow networks Fast networks Few concurrent users Lots of concurrent users Small data sets Large data sets Latency in seconds Latency in milliseconds © 2014 Typesafe, All Rights Reserved. - Copied from Jonas Boner
  • 6. What is Spark? • Fast and general compute engine for large-scale data processing • Fault Tolerant Distributed Datasets • Distributed Transformation on Datasets • Integrated Batch, Iterative and Streaming Analysis • In Memory Storage with Spill-over to Disk
  • 7. © 2014 DataStax, All Rights Reserved Advantages of Spark • Improves efficiency through: • In-memory data sharing • General computation graphs - Lazy Evaluates Data • 10x faster on disk, 100x faster in memory than Hadoop MR • Improves usability through: • Rich APIs in Java, Scala, Py..?? • 2 to 5x less code • Interactive shell
  • 8. © 2014 DataStax, All Rights Reserved
  • 9. Application
 (Spark Driver) Spark Master Worker Spark Components You application code which creates the SparkContext A process which shells out to create a Executor JVM A Process which Manages the 
 Resources of the Spark Cluster These processes are all separate and require networking to communicate Hosting Application UI :4040 Hosting Spark Master UI :7080 WorkerWorkerWorkerWorker
  • 10. © 2014 DataStax, All Rights Reserved DataStax Analytics
  • 11. Spark is about Data Analytics • How do we get data into Spark? • How can we work with large datasets? • What do we do with the results of the analytics?
  • 13. Spark Cassandra Connector • Data locality-aware (speed) • Read from and Write to Cassandra • Cassandra Tables Exposed as RDD and DataFrames • Server-Side filters (where clauses) • Cross-table operations (JOIN, UNION, etc.) • Mapping of Java Types to Cassandra Types
  • 14. © 2014 DataStax, All Rights Reserved ●14 Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C* Spark C* Full Token Range Each Executor Maintains a connection to the C* Cluster Spark Executor DataStax Java Driver Tokens 1-1000 Tokens 1001 -2000 Tokens … RDD’s read into different splits based on sets of tokens Spark Cassandra Connector
  • 15. 15© 2015. All Rights Reserved. •Simplified Deployment and Management •Analytic Nodes configured to run Spark •dse cassandra -k •HA Spark Master with automatic leader election •Stores Spark Worker metadata in Cassandra DSE Analytics with Spark
  • 16. © 2014 DataStax, All Rights Reserved DSE Spark Architecture Cassandr a Executor ExecutorSpark Worker
 (JVM) Cassandr a Executor ExecutorSpark Worker
 (JVM) Node 1 Node 2 Node 3 Node 4 Cassandr a Executor ExecutorSpark Worker
 (JVM) Cassandr a Executor ExecutorSpark Worker
 (JVM) Spark Master
 (JVM) App
 Driver
  • 17. © 2014 DataStax, All Rights Reserved. Confidential Mixed Workload In One Cluster 17 Cassandra Mode OLTP Database Search Mode All Data Searchable Analytics Mode Streaming and Analytics C* C C S A Don’t build and maintain these yourself, especially on top of a distributed data store. AS
  • 18. © 2014 DataStax, All Rights Reserved. Confidential 18 Mixed Workload Cluster
  • 19. DSE 4.7 Analytics + Search • Allows Analytics Jobs to use Solr Queries • Allows searching for data across partitions • Example: val table = sc.cassandraTable("music","solr") val result = table.select("id","artist_name").where("solr_query='artist_name:Miles*'").collect
  • 20. © 2014 DataStax, All Rights Reserved Spark SQL and DataFrames
  • 21. © 2014 DataStax, All Rights Reserved • Creating and Running Spark Programs Faster • Write less code • Read less data • Let the optimizer do the hard work • Spark SQL Catalyst optimizer Why Spark SQL?
  • 22. © 2014 DataStax, All Rights Reserved • Distributed collection of data • Similar to a Table in a RDBMS • Common API for reading/writing data • API for selecting, filtering, aggregating 
 and plotting structured data • Similar to a Table in a RDBMS DataFrame
  • 23. © 2014 DataStax, All Rights Reserved • Sources such as Cassandra, structured data files, tables in Hive, external databases, or existing RDDs. • Optimization and code generation through the Spark SQL Catalyst optimizer • Decorator around RDD • Previously SchemaRDD DataFrame Part 2
  • 24. © 2014 DataStax, All Rights Reserved • Unified interface to reading/writing data in a variety of formats • Spark Notebook Example Write Less Code: Input & Output
  • 25. Scala for Large Scale Data Analytics 25© 2015. All Rights Reserved. •Functional Paradigm is ideal for Data Analytics •Strongly Typed - Enforce Schema at Every Later •Immutable by Default - Event Logging •Declarative instead of Imperative - Focus on Transformation not Implementation
  • 26. Spark Notebook 26© 2015. All Rights Reserved. C* C C A AA Notebook Notebook Notebook Spark Notebook Server Cassandra Cluster with Spark Connector
  • 27. Apache Spark Notebook 27© 2015. All Rights Reserved. •Interactive Data Analytics in Browser •Reactive / Dynamic Graphs based on Scala, SQL and DataFrames •Spark Streaming •Examples notebooks covering visualization, machine learning, streaming, graph analysis, genomics analysis •Tune and Configure Each Notebook Separately •https://github.com/andypetrella/spark-notebook
  • 28. © 2014 DataStax, All Rights Reserved Spark Streaming
  • 29. © 2014 DataStax, All Rights Reserved Spark Components
  • 30. © 2014 DataStax, All Rights Reserved Spark Versus Spark Streaming
  • 31. © 2014 DataStax, All Rights Reserved Spark Streaming General Architecture
  • 32. © 2014 DataStax, All Rights Reserved Spark Streaming General Architecture
  • 33. © 2014 DataStax, All Rights Reserved DStream Micro Batches
  • 34. © 2014 DataStax, All Rights Reserved Windowing
  • 35. © 2014 DataStax, All Rights Reserved Windowing
  • 36. © 2014 DataStax, All Rights Reserved Streaming Resiliency • Streaming uses aggressive checkpointing and in-memory data replication to improve resiliency. • Frequent checkpointing keeps RDD lineages down to a reasonable size. • Checkpointing and replication mandatory since streams don’t have source data files to reconstruct lost RDD partitions (except for the directory ingest case).
  • 37. © 2014 DataStax, All Rights Reserved KillrWeather Architecture
  • 38. © 2014 DataStax, All Rights Reserved Spark Development
  • 39. Imperative Code final List<Integer> numbers = Arrays.asList(1, 2, 3); final List<Integer> numbersPlusOne = Collections.emptyList(); for (Integer number : numbers) { final Integer numberPlusOne = number + 1; numbersPlusOne.add(numberPlusOne); }
  • 40. We Want Declarative Code • Remove temporary lists - List<Integer> numbersPlusOne = Collections.emptyList(); • Remove looping - for (Integer number : numbers) • Focus on what - x+1
  • 41. © 2014 DataStax, All Rights Reserved Functions as Values • Similar to a method - Expression with 0 or more input arguments • Simple Expressions f(x) = x+1 f(y)=y*y • Avoid side effects and mutable state • Output only depends on input • Functions can be passed similar to other variables
  • 42. Map, FlatMap and Filter 1 2 3 4 5 6 7 8 2 4 6 8 10 12 14 16 map (x*2) 4 6 8 102 filter ( x < 11) 30 reduce( x+nxt) 1,2 8,9 4,1 5,7 2 4 16 18 8 2 10 14 flatMap (x*2) 4 8 22 filter ( x < 10) 16 reduce( x+nxt)
  • 43. Java 8 final List<Integer> numbers = Arrays.asList(1, 2, 3); final List<Integer> numbersPlusOne = numbers.stream().map(number -> number + 1). collect(Collectors.toList()); λ
  • 44. Scala val numbers = 1 to 20 
 val incFunc = (x:Int) => x+1
 numbers.map(incFunc)
 
 numbers.map(x => x+1)
 
 
 numbers.map(_+1) λ
  • 45. SQL - Declarative or Imperative? • SQL is Declarative • What operation to perform and not how to perform it • Select doesn’t define how just what data we want
  • 46. Closures vs Functions? • Closure is a Function which closes over the surrounding context • Closures can access variables in surrounding context • Spark Job passes closures to operate on the data
  • 47. © 2014 DataStax, All Rights Reserved Spark Development • Write programs in terms of parallel transformations on distributed datasets • Programming at a higher level of abstraction
  • 48. Why Functions with Spark? • Declarative Programming - Define What and Not How • Define what operations to perform and Spark figures out how to operate on the data • Easy to handle Events and Async Results with Functional Callbacks • Avoid Inner Classes
  • 49. © 2014 DataStax, All Rights Reserved Spark RDD
  • 50. © 2014 DataStax, All Rights Reserved • The primary abstraction in Spark • Collection of data stored in the Spark Cluster • Fault-tolerant • Enables parallel processing on data sets • In-Memory or On-Disk Resilient Distributed Datasets (RDD)
  • 51. © 2014 DataStax, All Rights Reserved • Parallelized Collections • Take an existing collection and runs functions on it in parallel • PairRDD • UnionRDD • JsonRDD • ShuffledRDD • CassandraRDD Examples RDDs
  • 52. © 2014 DataStax, All Rights Reserved Spark Data Model A1 A2 A3 A4 A5 A6 A7 A8 B1 B2 B3 B4 B5 B6 B7 B8 map B2 B5 B7 B8B1 filter C reduce Resilient Distributed Dataset A collection: ● immutable ● iterable ● serializable ● distributed ● parallel ● lazy
  • 53. © 2014 DataStax, All Rights Reserved • RDDs are immutable - Each stage of a transformation will create a new RDD. • RDDs are lazy • A DAG (directed acyclic graph) of computation is constructed. • The actual data is processed only when results are requested. Resilient Distributed Datasets (RDD)
  • 54. © 2014 DataStax, All Rights Reserved • RDDs know their “parents” and transitively, all ancestors. • RDDs are resilient - A lost partition is reconstructed from ancestors. • Transformation history / Lineage of the Data for Re-computation when needed Resilient Distributed Datasets (RDD)
  • 55. © 2014 DataStax, All Rights Reserved RDD Operations - Not Only Map & Reduce