Apache Spark streaming and HBase

© 2015 MapR Technologies 1
Overview of Apache Spark Streaming
Carol McDonald

Agenda
•  Why Apache Spark Streaming?
•  What is Apache Spark Streaming?
–  Key Concepts and Architecture
•  How it works, by example

Why Spark Streaming?
•  Process time series data:
–  Results in near real time
•  Use cases
–  Social network trends
–  Website statistics, monitoring
–  Fraud detection
–  Advertising click monetization
•  Time-stamped data
–  Sensors, system metrics, events, log files
–  Stock tickers, user activity
–  High volume, high velocity
[Figure: time-stamped data is written with puts and becomes data for real-time monitoring]

What is time series data?
•  Stuff with timestamps
–  Sensor data
–  Log files
–  Phones
[Figure: sources of time series data: credit card transactions, web user behaviour, social media, log files, geodata, sensors]

Why Spark Streaming?
What if you want to analyze data as it arrives?
•  For example, time series data: sensors, clicks, logs, stats

Batch Processing
"It's 6:01 and 72 degrees"
"It was hot at 6:05 yesterday!"
Batch processing may be too late for some events.

Event Processing
"It's 6:05 and 90 degrees"
"Someone should open a window!"
Streaming: it's becoming important to process events as they arrive.

What is Spark Streaming?
•  Extension of the core Spark API
•  Enables scalable, high-throughput, fault-tolerant stream processing of live data
[Figure: data sources -> Spark Streaming -> data sinks]

Stream Processing Architecture
[Figure: streaming sources/apps -> data ingest (MapR-FS, topics) -> stream processing -> data storage (MapR-DB, MapR-FS) -> apps]

Key Concepts
•  Data Sources:
–  File based: HDFS
–  Network based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor
•  Transformations
•  Output Operations
[Figure labels: MapR-FS, topics]

Spark Streaming Architecture
•  Divide data stream into batches of X seconds
–  Called DStream = sequence of RDDs
[Figure: input data stream -> Spark Streaming -> DStream RDD batches, one RDD per batch interval: data from time 0 to 1 (RDD @ time 1), data from time 1 to 2 (RDD @ time 2), data from time 2 to 3 (RDD @ time 3)]
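The batching idea can be tried out without a cluster. The following is a minimal plain-Scala sketch, not Spark code: `Event`, `toBatches`, and the sample values are hypothetical names and data introduced only for illustration. It groups time-stamped events into consecutive intervals, which is roughly what a DStream does when it cuts the input stream into one RDD per batch interval.

```scala
object MicroBatching {
  // Hypothetical event type: an arrival time in seconds plus a payload.
  case class Event(timeSec: Double, payload: String)

  // Group events into consecutive batches of `intervalSec` seconds.
  // Batch n holds events with n * intervalSec <= timeSec < (n + 1) * intervalSec.
  def toBatches(events: Seq[Event], intervalSec: Double): Seq[(Int, Seq[Event])] =
    events
      .groupBy(e => (e.timeSec / intervalSec).toInt)
      .toSeq
      .sortBy(_._1)

  // Hypothetical sample stream.
  val events = Seq(
    Event(0.2, "a"), Event(0.9, "b"), Event(1.1, "c"), Event(2.5, "d"))

  // With a 1-second interval: batch 0 = (a, b), batch 1 = (c), batch 2 = (d).
  val batches = toBatches(events, intervalSec = 1.0)
}
```

Unlike this sketch, which only produces batches that contain events, Spark Streaming builds its batches continuously as data arrives rather than from a finished collection.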

Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
•  Read-only collection of elements
•  Operated on in parallel
•  Cached in memory
–  Or on disk
•  Fault tolerant

Working With RDDs
[Figure: transformations turn an RDD into new RDDs; an action returns a value]
linesRDD = sc.textFile("SomeFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()
# 6
linesWithErrorRDD.first()
# Error line

Process DStream
•  Process using transformations
–  Creates new RDDs
[Figure: each RDD of the input DStream (RDD @ time 1, 2, 3, holding data from time 0 to 1, 1 to 2, 2 to 3) is transformed into an RDD of a new DStream; example operations: transform, map, reduceByKey, count]

Key Concepts
•  Data Sources
•  Transformations: create new DStream
–  Standard RDD operations: map, filter, union, reduce, join, …
–  Stateful operations: updateStateByKey(function), countByValueAndWindow, …
•  Output Operations

Spark Streaming Architecture
•  Processed results are pushed out in batches
[Figure: input data stream -> Spark Streaming -> DStream RDD batches (data from time 0 to 1, 1 to 2, 2 to 3) -> Spark -> batches of processed results]

Key Concepts
•  Data Sources
•  Transformations
•  Output Operations: trigger computation
–  saveAsHadoopFiles: save to HDFS
–  saveAsHadoopDataset: save to HBase
–  saveAsTextFiles
–  foreachRDD: do anything with each batch of RDDs
[Figure labels: MapR-DB, MapR-FS]

Learning Goals
•  How it works by example

Use Case: Time Series Data
[Figure: oil pump sensor data is read by Spark Streaming, passed to Spark processing, and becomes data for real-time monitoring]

Convert Line of CSV data to Sensor Object
// one sensor reading, as parsed from a line of CSV data
case class Sensor(resid: String, date: String, time: String,
  hz: Double, disp: Double, flo: Double, sedPPM: Double,
  psi: Double, chlPPM: Double)
// split a comma-separated line and convert the numeric fields to Double
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
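To see what the parser above produces, here is a small self-contained sketch that repeats the slide's `Sensor` and `parseSensor` definitions inside a wrapper object and applies them to one line. The wrapper object `SensorParsing` and the sample line are assumptions: only the pump name, date, time, hz, and psi values echo the schema slide, and the remaining fields are made-up numbers.

```scala
object SensorParsing {
  // Definitions repeated from the slide so the sketch runs on its own.
  case class Sensor(resid: String, date: String, time: String,
    hz: Double, disp: Double, flo: Double, sedPPM: Double,
    psi: Double, chlPPM: Double)

  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }

  // Hypothetical input line: resid,date,time,hz,disp,flo,sedPPM,psi,chlPPM
  val line = "COHUTTA,3/10/14,1:01,10.37,1.5,17.0,1.2,84,1.4"

  // Nine comma-separated fields become one Sensor object.
  val sensor = parseSensor(line)
}
```

Note that `parseSensor` throws an exception on lines with fewer than nine fields or non-numeric values, so a production job would probably want to guard against malformed input.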

Schema
•  All events stored; data CF could be set to expire data
•  Filtered alerts put in alerts CF
•  Daily summaries put in stats CF
Row key              | CF data        | CF alerts | CF stats
                     | hz    …  psi   | psi  …    | hz_avg  …  psi_min
COHUTTA_3/10/14_1:01 | 10.37    84    | 0         |
COHUTTA_3/10/14      |                |           | 10         0
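The two kinds of row key in the schema above (one per event, one per pump and day) can be derived from each other. The sketch below is an assumption about how such keys might be built, with hypothetical helpers `eventRowKey` and `dailyRowKey`. It separates date and time with a space, because the Read HBase code later in the deck recovers the daily key with `split(" ")(0)`; the schema table itself renders the separator as an underscore.

```scala
object RowKeys {
  // Hypothetical helper: event key = pump id, underscore, date, space, time.
  def eventRowKey(resid: String, date: String, time: String): String =
    s"${resid}_$date $time"

  // Daily key used for summaries: everything before the first space.
  def dailyRowKey(eventKey: String): String =
    eventKey.split(" ")(0)

  val eventKey = eventRowKey("COHUTTA", "3/10/14", "1:01")
  val dailyKey = dailyRowKey(eventKey)
}
```

Keeping the pump id and date as the key prefix means all events for one pump and day sort next to each other, which suits range scans over a day.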

Basic Steps for Spark Streaming code
1.  Create a DStream
–  Apply transformations
–  Apply output operations
2.  Start receiving data and processing it
–  using streamingContext.start()
3.  Wait for the processing to be stopped
–  using streamingContext.awaitTermination()

Create a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")
[Figure: linesDStream as successive batches (time 0-1, time 1-2, ...), each stored in memory as an RDD]
DStream: a sequence of RDDs representing a stream of data

Process DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)
[Figure: map is applied to the RDD of each batch in linesDStream (time 0-1, time 1-2, ...); new RDDs are created for every batch and form sensorDStream]

Process DStream
// for each RDD
sensorDStream.foreachRDD { rdd =>
  // filter sensor data for low psi
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}

DataFrame and SQL Operations
// for each RDD: parse into a Sensor object, filter
. . .
  alertRDD.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val res = sqlContext.sql(
    """select s.resid, s.psi, p.pumpType
      |from alert s join pump p on s.resid = p.resid
      |join maint m on p.resid = m.resid""".stripMargin)
  . . .
}

Save to HBase
// for each RDD: parse into a Sensor object, filter
. . .
  // convert each alert to a Put object and write it to the HBase alerts column family
  alertRDD.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}

Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Figure: for each batch (time 0-1, time 1-2, ...), map turns the sensor RDD into Put objects, which are saved to HBase]
Output operation: persist data to external storage

Start Receiving Data
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()

Using HBase as a Source and Sink
[Figure: a Spark application reads from and writes to an HBase database]
Example: calculate and store summaries (pre-computed, materialized view)

HBase Read and Write
val hBaseRDD = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }
  .saveAsHadoopDataset(jobConfig)
[Figure: newAPIHadoopRDD scans HBase and returns (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs to HBase]

Read HBase
// Load an RDD of (rowkey, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// get the Result of each tuple
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// transform into an RDD of (rowkey, column value) pairs
val keyValueRDD = resultRDD.map(
  result => (Bytes.toString(result.getRow()).split(" ")(0),
    Bytes.toDouble(result.value)))
// group by rowkey, get statistics for the column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list =>
  StatCounter(list))
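The group-then-summarize step above can be illustrated on an ordinary Scala collection. This is a sketch under stated assumptions, not the deck's code: `Stats` and `summarize` are hypothetical stand-ins for the kind of summary Spark's `StatCounter` provides, and the (row key, psi) pairs are made-up sample data.

```scala
object DailyStats {
  // Hypothetical summary type standing in for Spark's StatCounter.
  case class Stats(count: Int, min: Double, max: Double, mean: Double)

  // Summarize a non-empty sequence of values.
  def summarize(values: Seq[Double]): Stats =
    Stats(values.size, values.min, values.max, values.sum / values.size)

  // Hypothetical (daily row key, psi) pairs, as keyValueRDD might hold them.
  val keyValues: Seq[(String, Double)] = Seq(
    ("COHUTTA_3/10/14", 84.0),
    ("COHUTTA_3/10/14", 80.0),
    ("COHUTTA_3/10/14", 76.0),
    ("OTHERPUMP_3/10/14", 90.0))

  // Group by key, then reduce each group's values to one Stats record.
  val keyStats: Map[String, Stats] =
    keyValues.groupBy(_._1).map { case (key, pairs) =>
      key -> summarize(pairs.map(_._2))
    }
}
```

Each key ends up with a single summary record, which is the shape that the write step then converts to one Put per daily row key.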

Write HBase
// set up the job configuration for writing to the HBase table
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// convert psi stats to Put objects and write them to the stats column family
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }
  .saveAsHadoopDataset(jobConfig)

MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
•  https://www.mapr.com/blog/spark-streaming-hbase

Free HBase On Demand Training
(includes Hive and MapReduce with HBase)
•  https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training

Soon to Come
•  Spark On Demand Training
–  https://www.mapr.com/services/mapr-academy/

References
•  Spark web site: http://spark.apache.org/
•  https://databricks.com/
•  Spark on MapR:
–  http://www.mapr.com/products/apache-spark
•  Spark SQL and DataFrame Guide
•  Apache Spark vs. MapReduce – Whiteboard Walkthrough
•  Learning Spark - O'Reilly Book
•  Apache Spark

Q&A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies
