Real-Time Analytics with Apache Cassandra and Apache Spark
Time series data is everywhere: IoT, sensor data, financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now commonplace. However, data is pointless without being able to process it in near real time. That's where Spark combined with Cassandra comes in! What was once just your storage system (Cassandra) can be transformed into an analytics system, and it's really surprising how easy it is!
Published in: Software
Slide 1: Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
BÂLE BERNE BRUGG DUSSELDORF FRANCFORT S.M. FRIBOURG E.BR. GENÈVE HAMBOURG COPENHAGUE LAUSANNE MUNICH STUTTGART VIENNE ZURICH

Slide 2: Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz

Slide 3: Agenda
1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary

Slide 4: Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? – Big Data + Real-Time = Stream Processing

Slide 5: What is Real-Time Analytics?
What is it? Why do we need it? A short time to analyze & respond is required for new business models and desired for competitive advantage.
How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and dashboards access processed data

Slide 6: Real-Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion Detection Systems
• Traffic Management
• Recommendations
• Churn Detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feeds
• …
Slide 7: Apache Spark

Slide 8: Motivation – Why Apache Spark?
Hadoop MapReduce shares data on disk: Input → HDFS read → map → HDFS write → HDFS read → reduce → HDFS write → Output.
Spark speeds up processing by using memory instead of disks: Input → op1 → op2 → … → Output.

Slide 9: Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala; supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010; part of the Apache Software Foundation since 2014

Slide 10: Apache Spark
• Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib / SparkR (Machine Learning), GraphX (Graph Processing)
• Core Runtime: Spark Core API and Execution Model
• Cluster Resource Managers: Spark Standalone, Mesos, YARN
• Data Stores: HDFS, Elasticsearch, NoSQL, S3

Slide 11: Resilient Distributed Dataset (RDD)
RDDs are:
• Immutable
• Re-computable
• Fault tolerant
• Reusable
RDDs have transformations, which:
• Produce a new RDD
• Come in a rich set: filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), …
RDDs have actions, which:
• Start cluster computing operations
• Come in a rich set: collect(), count(), fold(), reduce(), …
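To make the transformation/action split concrete, here is a toy model in plain Python (illustrative only, not the real Spark API; the `ToyRDD` class is invented for this sketch): transformations merely record a step in the lineage, and nothing runs until an action is called.

```python
# Toy illustration: transformations are lazy and only build a lineage;
# an action walks the lineage and actually computes a result.
class ToyRDD:
    def __init__(self, data=None, parent=None, op=None):
        self._data = data      # only set on the source RDD
        self._parent = parent  # lineage: the parent this RDD derives from
        self._op = op          # deferred function applied to parent's items

    # --- transformations: return a new ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(parent=self, op=lambda items: [f(x) for x in items])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda items: [x for x in items if pred(x)])

    # --- actions: walk the lineage and execute the whole pipeline ---
    def collect(self):
        if self._parent is None:
            return list(self._data)
        return self._op(self._parent.collect())

    def count(self):
        return len(self.collect())

numbers = ToyRDD(data=range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # lazy
print(evens_squared.collect())  # -> [0, 4, 16, 36, 64]: action runs the plan
print(evens_squared.count())    # -> 5
```

Because the lineage (not the data) is what a ToyRDD stores, recomputing a lost result is just re-running `collect()`, which mirrors why Spark RDDs are re-computable and fault tolerant.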
Slide 12: RDD
An RDD is built from an input source (file, database, stream or collection) and holds the data; an action such as .count() -> 100 computes a result from it.

Slide 13: Partitions
The RDD's data is split into partitions (Partition 0 … Partition 9) distributed across servers (Server 1 … Server 5).

Slide 14: Partitions
(Same diagram as slide 13: the ten partitions spread over the five servers.)

Slide 15: Partitions
(Same diagram with Server 1 gone: the ten partitions are redistributed over Server 2 … Server 5.)

Slide 16: Spark Workflow
Word count over an HDFS file:
• Stage 1 – flatMap() + map(): sc.hadoopFile() reads the input HDFS file into a HadoopRDD; flatMap() and map() produce a MappedRDD
• Stage 2 – reduceByKey(): produces a ShuffledRDD
• sc.saveAsTextFile() writes the text file output
Transformations are lazy; the action executes the accumulated transformations. The DAG Scheduler on the Master splits the lineage into stages.
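The Stage 1 / Stage 2 word-count pipeline above can be sketched in plain Python (a single-machine stand-in with made-up input lines; real Spark runs the stages over distributed partitions and shuffles between them):

```python
from collections import defaultdict

# Plain-Python sketch of the flatMap() + map() + reduceByKey() word count.
lines = ["spark and cassandra", "spark streaming", "cassandra and spark"]

# Stage 1: flatMap() splits lines into words, map() emits (word, 1) pairs
words = [w for line in lines for w in line.split()]   # flatMap()
pairs = [(w, 1) for w in words]                       # map()

# Stage 2: reduceByKey() sums the counts per key
# (this boundary is where Spark shuffles data between executors)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'spark': 3, 'and': 2, 'cassandra': 2, 'streaming': 1}
```

Stage 1 is a narrow transformation (each partition can be processed independently), while Stage 2 needs all pairs with the same key in one place, which is why it starts a new stage.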
Slide 17: Spark Workflow
Joining two inputs:
• HDFS File Input 1: SparkContext.hadoopFile() → HadoopRDD; filter() → FilteredRDD; map() → MappedRDD
• HDFS File Input 2: SparkContext.hadoopFile() → HadoopRDD; map() → MappedRDD
• join() → ShuffledRDD; SparkContext.saveAsHadoopFile() writes the HDFS file output
Again, the transformations are lazy and the action executes them.

Slide 18: Spark Execution Model
A Master coordinates Workers; a Worker starts Executors on its server, next to the data storage.

Slide 19: Spark Execution Model
Stage 1 – flatMap() + map(): narrow transformations such as filter(), map(), sample() and flatMap() work on partitions (P0, P1, P3) independently, so each Executor processes its local data.

Slide 20: Spark Execution Model
Stage 2 – reduceByKey(): wide transformations such as join(), reduceByKey(), union() and groupByKey() require a shuffle of data between Executors.

Slide 21: Batch vs. Real-Time Processing
Batch: petabytes of data. Real-time: gigabytes per second.
Slide 22: Various Input Sources

Slide 23: Apache Kafka
A distributed publish-subscribe messaging system.
• Designed for processing real-time activity stream data (logs, metrics collection, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not use the JMS API and standards
• Kafka maintains feeds of messages in topics; producers write to a Kafka cluster, consumers read from it

Slide 24: Apache Kafka
A Kafka broker hosts topics (Temperature Topic, Rainfall Topic); a weather station produces numbered messages (1 2 3 4 5 6) into each topic, and a Temperature Processor and a Rainfall Processor consume them.

Slide 25: Apache Kafka
A topic can be split into partitions: the Temperature Topic now has Partition 0 and Partition 1, so two Temperature Processor instances can consume in parallel.

Slide 26: Apache Kafka
Partitions are spread and replicated across brokers: each Kafka broker holds partitions (P 0, P 1) of the Temperature and Rainfall topics.
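How a keyed message ends up in one particular partition can be sketched as follows (illustrative: `crc32` is a stand-in hash and `station-42` an invented key; Kafka's default partitioner actually uses murmur2 on the key bytes):

```python
import zlib

# Sketch of keyed partition assignment: hash the message key, take it
# modulo the partition count, and the message lands in that partition.
NUM_PARTITIONS = 2  # e.g. the two partitions of the Temperature Topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always hashes to the same partition, so all readings from
# one weather station stay in order for the consumer of that partition.
p1 = partition_for("station-42")
p2 = partition_for("station-42")
assert p1 == p2
print(f"station-42 -> partition {p1}")
```

This is also why adding partitions (as on the slide) increases parallelism: each consumer instance in a group is assigned a subset of the partitions.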
Slide 27: Discretized Stream (DStream)
Weather stations publish their events into Kafka.

Slide 28: Discretized Stream (DStream)
(Diagram build-up continues: more events flow from the weather stations into Kafka.)

Slide 29: Discretized Stream (DStream)
(Diagram build-up continues.)

Slide 30: Discretized Stream (DStream)
The individual events are made discrete by time: each time slice of the stream becomes one RDD, and the sequence of these RDDs is the DStream.

Slide 31: Discretized Stream (DStream)
Every X seconds a new RDD is produced; transformations such as .countByValue(), .reduceByKey(), .join and .map turn one DStream into another.

Slide 32: Discretized Stream (DStream)
At each batch interval (time 1, time 2, time 3, … time n) the messages received (message 1 … message n) form the RDD @time t of the input event DStream; map() yields a MappedDStream whose RDD @time t contains f(message 1) … f(message n), giving result 1 … result n. The DStream transformation lineage is applied to every batch as time increases, and actions such as saveAsHadoopFiles() trigger Spark jobs. (Adapted from Chris Fregly: http://slidesha.re/11PP7FV)
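The micro-batching idea can be sketched in a few lines of plain Python (a toy stand-in: the timestamps, readings and 10-second batch size are invented, and each batch plays the role of one RDD of the DStream):

```python
from collections import defaultdict

# Toy micro-batching: chop a stream of (timestamp, value) events into
# fixed X-second batches, then apply a countByValue()-style function
# to every batch independently, as Spark Streaming does per RDD.
BATCH_SECONDS = 10

events = [(1, "rain"), (3, "sun"), (9, "rain"),   # batch @ time 0
          (12, "sun"), (14, "sun"),               # batch @ time 10
          (25, "rain")]                           # batch @ time 20

batches = defaultdict(list)
for ts, value in events:
    batches[ts // BATCH_SECONDS].append(value)   # discretize by time

# f(batch): count occurrences of each value within one micro-batch
counts_per_batch = {
    t * BATCH_SECONDS: {v: vals.count(v) for v in set(vals)}
    for t, vals in batches.items()
}
print(counts_per_batch)
# {0: {'rain': 2, 'sun': 1}, 10: {'sun': 2}, 20: {'rain': 1}}
```

The same transformation lineage is applied to each batch as time moves on, which is exactly the "DStream = sequence of RDDs" picture on the slide.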
Slide 33: Apache Spark Streaming – Core Concepts
Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDDs
Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP sockets, Akka actors, etc.
• Custom sources can easily be written for custom data sources
Operations
• Same as Spark Core, plus additional stateful transformations (window, reduceByWindow)
Slide 34: Apache Cassandra

Slide 35: Apache Cassandra
Apache Cassandra™ is a free
• Distributed…
• High-performance…
• Extremely scalable…
• Fault-tolerant (i.e. no single point of failure)…
post-relational database solution, optimized for high write throughput.

Slide 36: Apache Cassandra – History
Bigtable + Dynamo

Slide 37: Motivation – Why NoSQL Databases?
• Dynamo paper (2007): how to build a data store that is reliable, performant and "always on"
• Nothing new and shiny: 24 other papers cited
• Evolutionary

Slide 38: Motivation – Why NoSQL Databases?
• Google Bigtable paper (2006): a richer data model, with 1 key and lots of values, and fast sequential access
• 38 other papers cited

Slide 39: Motivation – Why NoSQL Databases?
• Cassandra paper (2008): the distributed features of Dynamo with the data model and storage of Bigtable
• Graduated to a top-level Apache project in February 2010

Slide 40: Apache Cassandra – More than one server
• All nodes participate in a cluster; shared nothing
• Add or remove nodes as needed; more capacity? Add more servers
• A node is the basic unit inside a cluster; each node owns a range of partitions, assigned by consistent hashing
• Example ring: Node 1 [0-25], Node 2 [26-50], Node 3 [51-75], Node 4 [76-100], with each range also replicated on other nodes
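The token-ring ownership on this slide can be sketched as follows (a toy model: the 0-100 ring matches the slide, while the hash is a stand-in for Cassandra's Murmur3 partitioner, and real clusters use a much larger token space, virtual nodes and replication):

```python
# Sketch of consistent-hashing ownership: four nodes, each owning a
# quarter of a 0-100 token ring, as drawn on the slide.
RING = {"Node 1": range(0, 26), "Node 2": range(26, 51),
        "Node 3": range(51, 76), "Node 4": range(76, 101)}

def owner(token: int) -> str:
    """Return the node whose token range contains the given token."""
    for node, tokens in RING.items():
        if token in tokens:
            return node
    raise ValueError(f"token {token} outside the ring")

def token_for(partition_key: str) -> int:
    # Stand-in hash for illustration; Cassandra actually uses Murmur3.
    return sum(partition_key.encode("utf-8")) % 101

key = "station-42"
print(f"{key} -> token {token_for(key)} -> {owner(token_for(key))}")
```

Because the key deterministically maps to a token and the token to a node, any coordinator can route a read or write without a central lookup, which is what lets nodes be added or removed as needed.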
Slide 41: Apache Cassandra – Fully Replicated
• Clients write locally
• Data syncs across the WAN
• Replication is configured per data center (e.g. a West and an East data center, each with Nodes 1-4)

Slide 42: Apache Cassandra
What is Cassandra NOT?
• A data ocean, data lake or data pond
• An in-memory database
• A key-value store
• Not for data warehousing
What are good use cases?
• Product catalogs / playlists
• Personalization (ads, recommendations)
• Fraud detection
• Time series (finance, smart meters)
• IoT / sensor data
• Graph / network data

Slide 43: How Cassandra stores data
• Model brought from Google Bigtable
• A row key and a lot of columns: from 1 up to 2 billion columns per row, and billions of rows
• Column names are sorted (UTF8, Int, Timestamp, etc.)
• Each column carries a name, a value, a timestamp and a TTL
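A toy model of this storage layout (the `ToyRow` class and the sample weather readings are invented for illustration; real Cassandra persists rows in memtables and SSTables) shows why sorted column names make range slices cheap:

```python
from bisect import insort

# Toy Bigtable-style row: one row key, many columns whose names are
# kept sorted, each column holding (value, timestamp).
class ToyRow:
    def __init__(self, row_key):
        self.row_key = row_key
        self._names = []    # column names, kept sorted on insert
        self._values = {}   # column name -> (value, timestamp)

    def put(self, name, value, timestamp):
        if name not in self._values:
            insort(self._names, name)   # maintain sorted column order
        self._values[name] = (value, timestamp)

    def slice(self, start, end):
        """Columns with start <= name < end, in sorted (sequential) order."""
        return [(n, self._values[n]) for n in self._names if start <= n < end]

row = ToyRow("station-42")
row.put("2015-06-01T10:00", 21.5, 1000)
row.put("2015-06-01T12:00", 23.0, 1002)
row.put("2015-06-01T11:00", 22.1, 1001)  # inserted out of order, stored sorted
print(row.slice("2015-06-01T10:30", "2015-06-01T12:30"))
```

Using timestamps as column names is exactly the time-series pattern from the use-case slide: a slice by time range becomes a fast sequential read within one row.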
Slide 44: Combining Spark & Cassandra

Slide 45: Spark and Cassandra Architecture – Great Combo
Spark is good at analyzing a huge amount of data; Cassandra is good at storing a huge amount of data.

Slide 46: Spark and Cassandra Architecture
Spark Streaming (Near Real-Time), SparkSQL (Structured Data), MLlib (Machine Learning), GraphX (Graph Analysis)

Slide 47: Spark and Cassandra Architecture
Weather stations feed their data in; the Spark Connector links Cassandra to Spark Streaming (Near Real-Time), SparkSQL (Structured Data), MLlib (Machine Learning) and GraphX (Graph Analysis).

Slide 48: Spark and Cassandra Architecture
• A single node running Cassandra
• The Spark Worker is really small
• The Spark Master lives outside the node
• The Spark Worker starts Spark Executors in separate JVMs
• Node-local

Slide 49: Spark and Cassandra Architecture
• Each node runs Spark and Cassandra
• The Spark Master can make decisions based on token ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster
• With token ranges 0-25, 26-50, 51-75 and 76-100 spread over four Workers, a job on one range will only have to analyze 25% of the data!
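The "only have to analyze 25% of the data" point can be sketched as follows (a toy scheduler: the worker names and token ranges are invented to match the slide's four-node ring):

```python
# Sketch of token-aware scheduling: when a query only touches part of
# the token ring, the master only needs the co-located workers whose
# local Cassandra token range overlaps the queried range.
WORKERS = {"worker-1": (0, 25), "worker-2": (26, 50),
           "worker-3": (51, 75), "worker-4": (76, 100)}

def workers_for_query(lo: int, hi: int):
    """Workers whose local token range overlaps [lo, hi]."""
    return sorted(w for w, (a, b) in WORKERS.items() if a <= hi and lo <= b)

print(workers_for_query(0, 25))    # one worker: 25% of the cluster's data
print(workers_for_query(30, 60))   # spans two workers' ranges
print(workers_for_query(0, 100))   # a full scan needs all four workers
```

Data locality is the payoff of co-deploying Spark and Cassandra: tasks run where their partitions already live, so little or no data crosses the network.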
Slide 50: Spark and Cassandra Architecture
Transactional and Analytics workloads are separated, each side covering the full token range (0-25, 26-50, 51-75, 76-100); the Spark Master and Workers sit on the Analytics side.

Slide 51: Cassandra and Spark
• Joins and unions: No in Cassandra alone; Yes with Cassandra & Spark
• Transformations: Limited in Cassandra alone; Yes with Cassandra & Spark
• Outside data integration: No in Cassandra alone; Yes with Cassandra & Spark
• Aggregations: Limited in Cassandra alone; Yes with Cassandra & Spark

Slide 52: Summary

Slide 53: Summary
Kafka
• Topics store information broken into partitions
• Brokers store partitions
• Partitions are replicated for data resilience
Cassandra
• The goals of Apache Cassandra are all about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a partition key
Spark
• A replacement for Hadoop MapReduce
• In memory
• More operations than just map and reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources
Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector

Slide 54: Lambda Architecture with Spark/Cassandra
Data sources (channel, social, …) feed Data Collection via messaging. An (analytical) batch data processing path runs batch compute over the raw data reservoir, and an (analytical) real-time data processing path runs stream/event processing; both write computed information to result stores, which a query engine exposes to data access consumers (reports, services, analytic tools, alerting tools).

Slide 55: Lambda Architecture with Spark/Cassandra
(Same diagram as slide 54.)

Slide 56: Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com
