Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
Modernizing Infrastructures for Fast Data
Spark, Kafka, Cassandra, Reactive Platform and Mesos
by Dean Wampler, Ph.D. (@de...
Outline
•Reactive Enterprise Architectures: The Lightbend Perspective
•Big Data and the Emergence of Apache Spark
•An Arch...
Reactive Enterprise Applications:
The Lightbend Perspective
Reactive Manifesto
The Lightbend Reactive Platform
7
Online Services
IoT
Retail
Education
Technology
Social
Media
Finance
7
Big Data and the Emergence of
Apache Spark
Distributed compute frameworks: MapReduce
9
• Distribution computation over that data.
Hadoop
YARN
HDFS
MR	job	#1
MR	job	#2
Flume Sqoop
DBs
Slave	Node
DiskDiskDiskDiskDisk
Node	Mgr
Data	Node
Master
Resource	
M...
Hadoop Strengths
• Lowest CapEx system for Big Data.
• Excellent for ingesting and integrating diverse datasets.
• Flexibl...
Hadoop Weaknesses
• Complex administration.
• YARN can’t manage all distributed services.
• MapReduce:
•Has poor performan...
Why Apache Spark?
YARN
HDFS
MR	job	#1
MR	job	#2
Flume Sqoop
DBs
Slave	Node
DiskDiskDiskDiskDisk
Node	Mgr
Data	Node
Master
Resource	
Manager
...
Spark vs. MapReduce Performance
15
100x better for
many algorithms.
Spark: Major Performance Improvements
16
Sort 100TB
One of the Fastest Growing OS Projects
17
Modules
18
Spark	Streaming	
(~Real	Time)
MLlib	
(Machine	Learning)
SQL/DataFrames	
(Structured	Data)
GraphX	
(Graphs)
Spar...
The Core - Resilient Distributed Datasets
19
Spark	Streaming	
(~Real	Time)
MLlib	
(Machine	Learning)
SQL/DataFrames	
(Stru...
“Inverted Index” in Spark
20
sparkContext.textFile("/path/to/input")
.map { line =>
val array = line.split(",", 2)
(array(...
SQL queries and a “DataFrame” DSL
21
Spark	Streaming	
(~Real	Time)
MLlib	
(Machine	Learning)
SQL/DataFrames	
(Structured	D...
Use SQL or the Idiomatic DataFrame API
22
# SQL:
sqlContext.sql("""
SELECT state, age, COUNT(*) AS cnt
FROM people
GROUP B...
Spark Streaming: “Mini-batch” Processing
23
Spark	Streaming	
(~Real	Time)
MLlib	
(Machine	Learning)
SQL/DataFrames	
(Struc...
Streaming Inverted Index
24
val kafkaBrokers = "host1:port1,host2:port2,..."
val kafkaTopics = Set("topic1", "topic2", ......
25
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with kafkaBrokers and kafkaTopics
v...
An Architecture
for Fast Data
•Update a search engine in real time as web page or
documents change.
•Train a SPAM filter with every email.
•Detect anoma...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
Mesos,	YARN
on	
Bare	Metal,	Cloud
HDFS,	S3,	CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/
RE...
•Next Steps
•Learn - Fast Data: Big Data Evolved
•Watch - Using Spark, Kafka, Cassandra and Akka on
Mesos for Real-Time Pe...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos
Nächste SlideShare
Wird geladen in …5
×

Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

11.377 Aufrufe

Veröffentlicht am

The Big Data industry emerged in response to the unprecedented sizes of data sets collected by Internet companies and the particular needs they had to store and use that data.

Today, the need to process that data more quickly is morphing Big Data architectures into Fast Data architectures. This session discusses the forces driving this trend and the most popular tools that have emerged to address particular design challenges:
Spark - For sophisticated processing of data streams, as well as traditional batch-mode processing.
Kafka - For durable and scalable ingestion and distribution of data streams.
Cassandra - For scalable, flexible persistence.
Reactive Platform: Lagom, Akka, and Play - For integration of other components and building microservices.
Mesos - For cluster resource management.
---

About the presenter:

Dean Wampler, Ph.D. is the Architect for Big Data Products and Services and a member of the office of the CTO at Lightbend. He is designing the product strategy and technical architecture for Lightbend's Spark on Mesos products and emerging streaming tools built around Spark and Lightbend’s ConductR and Akka products. Dean has written books on Scala, Functional Programming, and Hive for O'Reilly. He speaks at and co-organizes many industry conferences. He also organizes several Chicago-area user groups and contributes to many open-source projects, including Apache Spark. Dean has a Ph.D. in Physics from the University of Washington.

Veröffentlicht in: Software
  • Als Erste(r) kommentieren

Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

  1. 1. Modernizing Infrastructures for Fast Data Spark, Kafka, Cassandra, Reactive Platform and Mesos by Dean Wampler, Ph.D. (@deanwampler)
  2. 2. Outline •Reactive Enterprise Architectures: The Lightbend Perspective •Big Data and the Emergence of Apache Spark •An Architecture for Fast Data 2
  3. 3. Reactive Enterprise Applications: The Lightbend Perspective
  4. 4. Reactive Manifesto
  5. 5. The Lightbend Reactive Platform
  6. 6. 7 Online Services IoT Retail Education Technology Social Media Finance 7
  7. 7. Big Data and the Emergence of Apache Spark
  8. 8. Distributed compute frameworks: MapReduce 9 • Distribution computation over that data.
  9. 9. Hadoop YARN HDFS MR job #1 MR job #2 Flume Sqoop DBs Slave Node DiskDiskDiskDiskDisk Node Mgr Data Node Master Resource Manager Name Node
  10. 10. Hadoop Strengths • Lowest CapEx system for Big Data. • Excellent for ingesting and integrating diverse datasets. • Flexible: from classic analytics (aggregations and data warehousing) to machine learning. 11
  11. 11. Hadoop Weaknesses • Complex administration. • YARN can’t manage all distributed services. • MapReduce: •Has poor performance. •A difficult programming model. •Doesn’t support stream processing. 12
  12. 12. Why Apache Spark?
  13. 13. YARN HDFS MR job #1 MR job #2 Flume Sqoop DBs Slave Node DiskDiskDiskDiskDisk Node Mgr Data Node Master Resource Manager Name Node Spark job #1 Spark job #2 Hadoop 2013: Embrace Spark
  14. 14. Spark vs. MapReduce Performance 15 100x better for many algorithms.
  15. 15. Spark: Major Performance Improvements 16 Sort 100TB
  16. 16. One of the Fastest Growing OS Projects 17
  17. 17. Modules 18 Spark Streaming (~Real Time) MLlib (Machine Learning) SQL/DataFrames (Structured Data) GraphX (Graphs) Spark RDD (Core)
  18. 18. The Core - Resilient Distributed Datasets 19 Spark Streaming (~Real Time) MLlib (Machine Learning) SQL/DataFrames (Structured Data) GraphX (Graphs) Spark RDD (Core) Cluster Node Node Node Node RDD Partition 1 RDD Partition 2 RDD Partition 3 RDD Partition 4
  19. 19. “Inverted Index” in Spark 20 sparkContext.textFile("/path/to/input") .map { line => val array = line.split(",", 2) (array(0), array(1)) }.flatMap { case (id, contents) => toWords(contents).map(w=>((w,id),1)) }.reduceByKey { (count1, count2) => count1 + count2 }.map { case ((word, path), n) => (word, (path, n))} .groupByKey .map { case (word, list) => (word, sortByCount(list)) }.saveAsTextFile("/path/to/output") reduceByKey flatMap textFile map map groupByKey map saveAsTextFile
  20. 20. SQL queries and a “DataFrame” DSL 21 Spark Streaming (~Real Time) MLlib (Machine Learning) SQL/DataFrames (Structured Data) GraphX (Graphs) Spark RDD (Core) • For data with a fixed schema... • Write SQL queries (currently a subset of HiveQL). • Use equivalent Python-inspired DataFrame API.
  21. 21. Use SQL or the Idiomatic DataFrame API 22 # SQL: sqlContext.sql(""" SELECT state, age, COUNT(*) AS cnt FROM people GROUP BY state, age ORDER BY cnt DESC, state ASC, age ASC """) // DataFrame (Scala): people.state($"state", $"age") .groupBy($"state", $"age").count() .orderBy($"count".desc, $"state".asc, $"age".asc)
  22. 22. Spark Streaming: “Mini-batch” Processing 23 Spark Streaming (~Real Time) MLlib (Machine Learning) SQL/DataFrames (Structured Data) GraphX (Graphs) Spark RDD (Core) DStream RDD #2 Event Event Event Event Event … Windows (2 batches) t0 RDD #1 Event Event RDD #3 Event Event Event t1 = t0 + ∆ t2 = t0 + 2∆ t3 = t0 + 3∆
  23. 23. Streaming Inverted Index 24 val kafkaBrokers = "host1:port1,host2:port2,..." val kafkaTopics = Set("topic1", "topic2", ...) val sparkConf = new SparkConf().setAppName("...") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with kafkaBrokers and kafkaTopics val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers) val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder]( ssc, kafkaParams, kafkaTopics) messages.flatMap {case (topic,text) => toWords(text).map(w=>((w,topic),1L))} .reduceByKey (_ + _) .map {case ((word, topic), n) => (word, (path, n))} .groupByKey .map {case (word, list) => (word, sortByCount(list))}
  24. 24. 25 val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with kafkaBrokers and kafkaTopics val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers) val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder]( ssc, kafkaParams, kafkaTopics) messages.flatMap {case (topic,text) => toWords(text).map(w=>((w,topic),1L))} .reduceByKey (_ + _) .map {case ((word, topic), n) => (word, (path, n))} .groupByKey .map {case (word, list) => (word, sortByCount(list))} .saveAsTextFiles("/path/to/output") ssc.start() ssc.awaitTermination()
  25. 25. An Architecture for Fast Data
  26. 26. •Update a search engine in real time as web page or documents change. •Train a SPAM filter with every email. •Detect anomalies as they happen through processing of logs and monitoring data. Fast as in Streaming. Why?
  27. 27. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services
  28. 28. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Core of Spark, Kafka, and Cassandra
  29. 29. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services “SMACK” Stack
  30. 30. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Data Sources
  31. 31. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Event Event Event Event Event Event Producer Consumer bounded queue feedbackfeedback Reactive Streams
  32. 32. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Lightbend Reactive Platform
  33. 33. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Kafka for Stream Storage
  34. 34. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Service 1 Log & Other Files Internet Services Service 2 Service 3 Services Services N * M links ConsumersProducers
  35. 35. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Service 1 Log & Other Files Internet Services Service 2 Service 3 Services Services N + M links ConsumersProducers
  36. 36. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Minibatch Processing
  37. 37. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Short and Long- term Storage
  38. 38. Mesos, YARN on Bare Metal, Cloud HDFS, S3, CFSv2SQL/NoSQL Core Streaming SQL MLlib GraphX Fast Data Architecture HTTP/ REST Internet ReacHve Services Logs and Other Files Actors Cluster …Persist Akka Streams Web Services Infrastructure
  39. 39. •Next Steps •Learn - Fast Data: Big Data Evolved •Watch - Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization •Review - Spark success stories by Lightbend clients

×