Apache Spark has emerged over the past year as the likely successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk when required. Spark's powerful yet flexible API lets users write complex applications easily, without worrying about the internal workings of how the data gets processed on the cluster.
Spark also comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems such as Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it arrives.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help the audience understand best practices to follow when writing such an application. Hari will conclude by discussing how to write a custom application and a custom receiver to receive data from other systems.
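To illustrate the kind of code sharing the talk covers, the sketch below writes word-count logic once against RDDs and reuses it from both a batch Spark job and a Spark Streaming job. This is a minimal, hypothetical sketch, not the talk's actual example: the object name, input/output paths, socket source, and 10-second batch interval are all illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountShared {
  // Core logic written once against RDDs, so it can be shared
  // by a batch job and a streaming job unchanged.
  def countWords(lines: RDD[String]): RDD[(String, Long)] =
    lines.flatMap(_.split("\\s+"))
         .map(word => (word, 1L))
         .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountShared")

    // Batch: apply the shared logic to a static file (paths are illustrative).
    val sc = new SparkContext(conf)
    countWords(sc.textFile("hdfs:///input/logs"))
      .saveAsTextFile("hdfs:///output/batch-counts")

    // Streaming: apply the same shared logic to each micro-batch
    // arriving on a socket, every 10 seconds.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)
       .foreachRDD(rdd => countWords(rdd).foreach(println))
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because Spark Streaming exposes each micro-batch as an ordinary RDD (here via `foreachRDD`), the same `countWords` function serves both execution modes.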
The Narrative:
Vast quantities of streaming data are being generated, and more will be generated thanks to phenomena such as the Internet of Things. The motivation for real-time stream processing is to turn all of this data into valuable insights and actions as soon as the data is generated.
Instant processing of the data also opens the door to new use cases that were not possible before.
NOTE:
Feel free to remove the image of "The Flash" if it feels unprofessional or overly cheesy.
The Narrative:
As you can see from the previous slides, lots of streaming data will be generated. Making this data actionable in real time is very valuable across industries.
Our very own Hadoop is all you need.
Previously, Hadoop was associated just with "big unstructured data". That was Hadoop's selling point.
But now, Hadoop can also handle real-time data (in addition to big unstructured data). So think Hadoop when you think real-time streaming.
Purpose of the slide:
The goal is to associate Hadoop with real-time processing, so that people think of Hadoop when they think of real-time streaming data.
Purpose of this Slide:
Make sure to associate Spark Streaming with Apache Spark, so folks know it is a part of THE Apache Spark that everyone is talking about.
List the key properties and attributes that make Spark Streaming a good platform for stream processing.
Note:
If required, we can mention low latency as well.
Spark Streaming – a Storm-like streaming solution
BlinkDB – approximate query results for Hive
Shark – a Spark-based SQL engine, but Impala is better
GraphX – an alternative to GraphLab and Giraph that provides graph processing
MLbase and MLlib – ML algorithm implementations
Tachyon – a shared filesystem cache so multiple users/frameworks can share Spark data
For now, we will only support Spark and Spark Streaming.
Over time, Pig will move to run natively on Spark, and Spark will provide MapReduce APIs.