2. About Us
Nan Zhu
Software Engineer @ Azure HDInsight. Works on the Spark Streaming/Structured Streaming
service in Azure. Committee member of XGBoost@DMLC and Apache MXNet (incubating).
Spark contributor. Known as CodingCat on GitHub.
Arijit Tarafdar
Software Engineer @ Azure HDInsight. Works on Spark/Spark Streaming on Azure.
Previously worked with other distributed platforms such as DryadLINQ and MPI. Also
worked on graph coloring algorithms that were contributed to ADOL-C
(https://projects.coin-or.org/ADOL-C).
3. Continuous Application Architecture and the Role of Spark Connectors
[Diagram: a processing engine (Spark Streaming, Structured Streaming) produces real-time data analytics results; a continuous data source control manager and a continuous data source API deliver real-time data to Spark at scale, exposing a real-time view of the data (a message queue, or files filtered by timestamp); a persistent data storage layer (Blobs/Queues/Tables/Files) sits underneath.]
Not only is the size of data increasing, but also its velocity
◦ Sensors, IoT devices, social networks and online transactions are all generating
data that needs to be monitored constantly and acted upon quickly.
4. Outline
• Recap of Spark Streaming
• Introduction to Event Hubs
• Connecting Azure Event Hubs and Spark Streaming
• Design Considerations for the Spark Streaming Connector
• Contributions Back to the Community
• Future Work
5. Spark Streaming - Background
[Diagram: the micro-batch model. Each of the N streams (Stream 1 … Stream N) is a sequence of RDDs over time (RDD i @ 0 … RDD i @ t-1, RDD i @ t). The micro-batch at time t launches one set of tasks per stream (Task 1 … Task L for Stream 1, Task 1 … Task M for Stream N). The window duration spans a multiple of the batch duration.]
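As a minimal sketch of the two durations in the diagram, assuming a socket source as a stand-in for any input DStream: the batch duration fixes how often one RDD is generated per stream, and a window covers a multiple of it.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))     // batch duration: one RDD per second
    val lines = ssc.socketTextStream("localhost", 9999)  // stand-in input DStream
    val windowed = lines.window(Seconds(10), Seconds(2)) // window/slide durations: multiples of the batch duration
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}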
6. Azure Event Hubs - Introduction
[Diagram: the Event Hubs hierarchy. An Event Hubs namespace (Namespace 1 … Namespace M) contains multiple Event Hubs (Event Hubs 1 … Event Hubs N), and each Event Hub consists of multiple partitions (Partition 1 … Partition J/K/P).]
8. Data Flow – Event Hubs
• Proactive message delivery
• Efficient in terms of communication cost
• The data source is treated as a commit log of events
• Events are read in batches per receive() call
[Diagram: events flow from an Event Hubs partition (Event Hubs server) into a prefetch queue (Event Hubs client), ordered from old to new, from which the streaming application reads.]
9. Event Hubs – Offset Management
• Event Hubs expects offset management to be performed on the receiver side
• Spark Streaming uses a DFS-based persistent store (HDFS, ADLS, etc.)
• An offset is stored per consumer group, per partition, per event hub, per Event Hubs namespace
/* An interface to read/write the offset for a given Event Hubs
   namespace/name/partition */
@SerialVersionUID(1L)
trait OffsetStore extends Serializable {
  def open(): Unit
  def write(offset: String): Unit
  def read(): String
  def close(): Unit
}
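A minimal sketch of a DFS-backed implementation of this trait, assuming a Hadoop-compatible file system; the class name, path layout, and the "-1" start-of-stream sentinel are illustrative assumptions, not the connector's actual implementation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical DFS-backed store: one file per
// namespace/eventhub/consumer group/partition (the layout is an assumption).
class DfsOffsetStore(checkpointDir: String,
                     namespace: String,
                     eventHubName: String,
                     consumerGroup: String,
                     partitionId: Int) extends OffsetStore {

  private val path = new Path(
    s"$checkpointDir/$namespace/$eventHubName/$consumerGroup/$partitionId")
  private var fs: FileSystem = _

  override def open(): Unit = {
    fs = path.getFileSystem(new Configuration())
  }

  override def write(offset: String): Unit = {
    val out = fs.create(path, true) // overwrite; a single writer per partition is assumed
    try out.writeUTF(offset) finally out.close()
  }

  override def read(): String = {
    if (!fs.exists(path)) "-1" // assumed sentinel meaning "start of stream"
    else {
      val in = fs.open(path)
      try in.readUTF() finally in.close()
    }
  }

  override def close(): Unit = () // FileSystem handles are cached and shared
}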
19. Bridging Spark Streaming and Event Hubs WITHOUT a Receiver
How does the idea extend to other data sources (in Azure & your IT infrastructure)?
20. Extra Resources
Requirements in Event Hubs | Receiver-based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault tolerance mechanism | WAL / Spark checkpoint | Perf. loss and data loss due to Spark bugs; no recovery from code update
Client-side offset management | Offset store | Looks fine…
21. From Event Hubs to General Data Sources (1)
• Communication pattern
◦ Azure Event Hubs: long-running receiver, proactive data delivery
◦ Kafka: receivers start and shut down freely, passive data delivery
The communication pattern is the most critical factor in designing a resource-efficient Spark Streaming connector!
22. Tackling the Extra Resource Requirement
[Diagram: partitions P1 … PN of EventHub-1 (in EvH-Namespace-1) map directly into an EventHubsRDD that runs the customized receiver logic; user-defined lambdas are then applied via .map() as a MapPartitionsRDD; both stages run inside the same Spark tasks.]
Reduce resource requirements: compact data receiving and processing into the same task.
Inspired by Kafka's Direct DStream, but more challenging with a different communication pattern!
23. Bridging the Spark Execution Model and the Communication Pattern Expectation
[Diagram: the same pipeline (Event Hubs partitions P1 … PN, customized receiver logic in an EventHubsRDD, user-defined lambdas via .map() into a MapPartitionsRDD), now with a passive message delivery layer between the Event Hubs client and the Spark task, exposing recv(expectedMsgNum: Int) as a blocking API.]
A long-running, proactive receiver is expected by Event Hubs, vs. transient tasks started for each batch by Spark.
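A minimal sketch of such a passive delivery layer, assuming a push-style client thread that fills a bounded queue; every name except recv's signature is illustrative:

import java.util.concurrent.LinkedBlockingQueue

// Wraps a proactive (push-style) client behind a blocking, pull-style API,
// so a transient Spark task can ask for exactly its batch's messages.
class PassiveDeliveryLayer[T](capacity: Int = 10000) {
  private val queue = new LinkedBlockingQueue[T](capacity)

  // Called by the long-running client for each prefetched message; blocks
  // when the queue is full, which back-pressures the client.
  def onMessage(msg: T): Unit = queue.put(msg)

  // Blocking API used by the Spark task: waits until exactly
  // expectedMsgNum messages have been delivered.
  def recv(expectedMsgNum: Int): Seq[T] =
    (1 to expectedMsgNum).map(_ => queue.take())
}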
24. Takeaways (1)
Requirements in Event Hubs | Receiver-based Connection | Problems | Solution
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements | Compact data receiving/processing, with the facilities of passive message delivery

The communication pattern of the data source plays the key role in the resource-efficient design of a Spark Streaming connector.
26. Fault Tolerance
Requirements in Event Hubs | Receiver-based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault tolerance mechanism | WAL / Spark checkpoint | Perf. loss and data loss due to Spark bugs; no recovery from code update
Client-side offset management | Offset store | Looks fine…
27. From Event Hubs to General Data Sources (2)
• Fault tolerance
◦ Capability: guarantee graceful recovery (no data loss, resume from where you stopped, etc.) when the application stops for various reasons
◦ Efficiency: minimal impact on application performance and user deployment
28. Capability – Recover from an Unexpected Stop
[Diagram: Stream L is the sequence RDD L-0 … RDD L-(t-1), RDD L-t; the application stops unexpectedly after a checkpoint; on recovery, RDD L-(t-1) and RDD L-t are restored from the checkpoint, or re-evaluated.]
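A minimal sketch of the standard recovery pattern (the checkpoint path and app name are illustrative): StreamingContext.getOrCreate rebuilds the context and its DStream graph from the checkpoint directory if one exists, and otherwise creates it fresh.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs:///checkpoints/eventhubs-app" // illustrative path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("eventhubs-app"), Seconds(1))
    ssc.checkpoint(checkpointDir)
    // ... define the DStream graph here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint after an unexpected stop, or start fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}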
29. Capability – Recover from a Planned Stop
[Diagram: Stream L runs up to RDD L-(t-1), the application stops for an upgrade, and then resumes at RDD L-(2t) with the updated implementation, fetching the latest offset from the offset store.]
Spark's checkpoint mechanism serializes everything and does not recognize a re-compiled class, so recovery across an application upgrade must come from the offset store. Your connector shall maintain this!!!
30. Efficiency - What Should Be Contained in Checkpoint Files?
• Checkpointing takes your computing resources!!!
• The received event data? Too large.
• The range of messages to be processed in each batch? Small enough to persist quickly.
[Diagram: the Event Hubs partitions, EventHubsRDD, .map(), MapPartitionsRDD pipeline again, with the passive message delivery layer and its blocking recv(expectedMsgNum: Int) API; the checkpoint records the batch-to-message-range mapping.]
Persist this mapping relationship, i.e. use Event Hubs itself as the data backup.
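A minimal sketch of what such a checkpointed mapping could look like; the case class and its fields are illustrative assumptions (Event Hubs offsets are opaque strings, so a range pairs a starting offset with a planned message count):

// One small record per partition per batch: cheap to persist, while the
// events themselves can be re-fetched from Event Hubs on recovery.
case class OffsetRange(
    namespace: String,
    eventHubName: String,
    partitionId: Int,
    fromOffset: String, // where this batch starts (an offset string)
    messageCount: Int   // how many messages this batch covers
)

// e.g. batch t plans 1000 messages from each of two partitions
val batchT = Seq(
  OffsetRange("ns-1", "hub-1", 0, "0", 1000),
  OffsetRange("ns-1", "hub-1", 1, "0", 1000)
)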
31. Efficiency - Checkpoint Cleanup
• Connectors for data sources that require client-side offset management generate data/files for each batch
• You have to clean up SAFELY
◦ Keep recovery feasible
◦ Coordinate with Spark's checkpoint process
• Override clearCheckpointData() in EventHubsDStream (our implementation of DStream), as sketched below
◦ Triggered by batch completion
◦ Deletes all offset records outside the remembering window
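A minimal sketch of that hook; it assumes a class placed under the org.apache.spark.streaming package (needed to override the private[streaming] method) and an illustrative offset store with a deleteRecordsOlderThan helper, with start/stop/compute stubbed:

package org.apache.spark.streaming.eventhubs

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

trait SimpleOffsetStore extends Serializable {
  def deleteRecordsOlderThan(time: Time): Unit
}

class EventHubsDStreamSketch(ssc: StreamingContext, offsetStore: SimpleOffsetStore)
  extends InputDStream[Array[Byte]](ssc) {

  override def start(): Unit = {}
  override def stop(): Unit = {}
  override def compute(validTime: Time): Option[RDD[Array[Byte]]] = None // stubbed

  // Spark walks the DStream graph and calls this after a batch completes;
  // piggyback on it to delete offset records outside the remembering window.
  override private[streaming] def clearCheckpointData(time: Time): Unit = {
    super.clearCheckpointData(time)
    offsetStore.deleteRecordsOlderThan(time - rememberDuration)
  }
}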
32. Takeaways (2)
Requirements in Event Hubs | Receiver-based Connection | Problems | Solution
Fault tolerance mechanism | WAL / Spark checkpoint | Perf. loss and data loss due to Spark bugs; no recovery from code update | Checkpoint the mapping relationship instead of the data; self-managed offset store; coordinated checkpoint cleanup

Fault tolerance design is about the interaction with the Spark Streaming checkpoint.
34. Offset Management
Requirements in Event Hubs | Receiver-based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault tolerance mechanism | WAL / Spark checkpoint | Data loss due to Spark bugs
Client-side offset management | Offset store | Looks fine… Is it really fine???
35. From Event Hubs to General Data Sources (3)
• Message addressing
• Rate control
36. Message Addressing
• Why message addressing?
◦ When creating a client instance of the data source in a Spark task, where should it start receiving?
◦ Without this info, you have to replay the stream for every newly created client
[Diagram: a first data source client starts from the first message; after a fault, or for the next batch, a newly created client must answer "start from where?"]
• Design options:
◦ The Xth message (X: 0, 1, 2, 3, 4…): requires server-side metadata to map the message ID to the offset in the storage system
◦ The actual offset: simpler server-side design
37. Rate Control
• Why rate control?
◦ Prevent messages from flooding the processing pipeline
◦ e.g. when you have just started processing a queued-up data source
• Design options (see the sketch after this slide):
◦ Number of messages: "I want to consume 1000 messages in the next batch"; assumes homogeneous processing overhead per message
◦ Size of messages: "I want to receive at most 1000 bytes in the next batch"; complicated server-side logic (the server must track the delivered size), and larger messages do not always mean longer processing time
[Diagram: after a long stop, a client that consumes all queued messages at once may crash your processing engine!!!]
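A minimal sketch of message-count rate control; the function and parameter names are illustrative, not the connector's API:

// Cap how many messages a partition contributes to the next batch, so a
// backlog accumulated during a long stop cannot flood the pipeline.
def messagesForNextBatch(maxRatePerPartition: Long,
                         batchIntervalSec: Long,
                         backlog: Long): Long =
  math.min(backlog, maxRatePerPartition * batchIntervalSec)

// After a long stop with 1,000,000 queued messages, a 1000 msg/s cap and
// a 2 s batch interval admit only 2000 messages into the next batch.
val planned = messagesForNextBatch(1000L, 2L, 1000000L) // = 2000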
38. Kafka's Choice
• Message addressing: the Xth message (0, 1, 2, 3, 4, …)
• Rate control: number of messages (0, 1, 2, 3, 4, …)
[Diagram: the driver alone answers, per batch, "how many messages are to be processed in the next batch, and where to start?" (batch 0: messages 0-999; batch 1: messages 1000-1999) and hands the ranges to the executors reading from Kafka.]
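Because addressing and rate control share the same unit (a message count), the driver can plan each batch's range by pure arithmetic, with no cluster round-trip. A minimal sketch (the types are illustrative):

// Message indices; `until` is exclusive.
case class BatchRange(from: Long, until: Long)

// With count-based addressing, the next range follows from the previous
// range and the planned message count alone.
def nextRange(prev: BatchRange, count: Long): BatchRange =
  BatchRange(prev.until, prev.until + count)

val batch0 = BatchRange(0L, 1000L)    // messages 0-999
val batch1 = nextRange(batch0, 1000L) // messages 1000-1999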
39. Azure Event Hubs' Choice
• Message addressing: the byte offset of a message (0, size of msg 0, size of (msg 0 + msg 1), …)
• Rate control: number of messages (0, 1, 2, 3, 4, …)
This mismatch brings a totally different connector design/implementation!!!
40. Distributed Information for Rate Control and Message Addressing
[Diagram: the driver plans each batch against Azure Event Hubs ("how many messages are to be processed in the next batch, and where to start?"; batch 0: messages 0-999; batch 1: messages 1000-1999), but it cannot answer "what's the offset of the 1000th message???"; the answer appears only on the executor side, when a task receives the message (the offset arrives as part of the message metadata).]
Build a channel to pass information from the executors to the driver!!!
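A minimal sketch of one way to build such a channel: each task reports the last offset it actually read, and the driver collects the reports when the batch's RDD is materialized. All names are illustrative, not the connector's implementation:

import org.apache.spark.SparkContext

case class PartitionProgress(partitionId: Int, lastOffset: String, lastSeqNo: Long)

// Runs one task per planned partition; each task returns the offset of the
// last message it read, and collect() carries that back to the driver,
// which can then plan the next batch's starting offsets.
def runBatch(sc: SparkContext, plannedPartitions: Seq[Int]): Map[Int, PartitionProgress] =
  sc.parallelize(plannedPartitions, plannedPartitions.size)
    .map { pid =>
      // ... receive this partition's planned messages on the executor ...
      // the offset arrives with each message's metadata
      PartitionProgress(pid, lastOffset = "12345", lastSeqNo = 999L) // placeholder values
    }
    .collect() // the executor-to-driver channel
    .map(p => p.partitionId -> p)
    .toMap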
45. Takeaways (3)
• There are multiple options for the server-side design of message addressing and rate control
• To design and implement a Spark Streaming connector, you have to understand which options the server side has adopted
The key is the combination!!!
46. Contribute Back to the Community
Failed recovery from checkpoint, caused by a multi-threading issue in the Spark Streaming scheduler
https://issues.apache.org/jira/browse/SPARK-19280
One realistic example of its impact: you potentially get wrong data when you use Kafka with reduceByWindow and recover from a failure
Data loss caused by improper post-batch-completed processing
https://issues.apache.org/jira/browse/SPARK-18905
Inconsistent behavior of the Spark Streaming checkpoint
https://issues.apache.org/jira/browse/SPARK-19233
47. Summary
• The Spark Streaming connector for Azure Event Hubs enables users to perform various types of analytics over streaming data from a fully managed, cloud-scale message/telemetry ingestion service
◦ https://github.com/hdinsight/spark-eventhubs
• Design and implementation of Spark Streaming connectors
◦ Coordinate the execution model and the communication pattern
◦ Fault tolerance (Spark Streaming checkpoint vs. self-managed fault tolerance facilities)
◦ Message addressing and rate control (server & connector co-design)
• Contributing back to the community
◦ Microsoft is the organization with the most open source contributors in 2016!!!
◦ http://www.businessinsider.com/microsoft-github-open-source-2016-9
48. If you do not want to handle this complexity
Move to Azure HDInsight…
49. Future Work
Structured Streaming integration with Event Hubs (to be released at the end of the month)
Streaming data visualization with Power BI (released in alpha)
Streaming ETL solutions on Azure HDInsight!
50. Thank You!!!
Build a powerful & robust data analytics pipeline with Spark @ Azure HDInsight!!!
Editor's Notes
Two types of datasets:
Bounded: finite, unchanging datasets
Unbounded: infinite datasets that are appended to continuously
Unbounded: data is generated all the time, and we want to know about it now
The connector is the glue between an unbounded data source like Event Hubs and a powerful processing engine like Spark
The goal is to deliver near-real-time analysis or views.
The micro-batching mechanism processes continuous and infinite data sources
A batch is scheduled at a regular time interval or after a certain number of events has been received
A discretized stream (DStream) is the highest-level abstraction over the continuous creation and expiration of RDDs
Batch duration: a single RDD is generated
Window duration: a multiple of the batch duration; may use multiple RDDs
RDDs contain partitions; one task per partition
High throughput and low latency, offered as platform as a service on Azure
No cluster setup required, no monitoring required
Users can concentrate only on the ingress and egress of data
An Event Hubs namespace is a collection of event hubs, an event hub is a collection of partitions, and a partition is a sequential collection of events
Up to 32 partitions per event hub, but this can be increased if required
- HTTP or AMQP, with transport-level security (TLS/SSL)
- HTTP has higher message transmission overhead
- AMQP has higher connection setup overhead
- A consumer group gives a logical view of the event hub's partitions, including addressing the same partition at different offsets
- Up to 20 consumer groups per event hub
- 1 receiver per consumer group
Each partition can be viewed as a commit log
The Event Hubs client maintains a prefetch queue to proactively get messages from the server
A receive call by the application gets messages in a batch from the prefetch queue to the caller.
No support from the Event Hubs server yet
The offset is managed by the Event Hubs connector on the Spark application side
It uses a distributed file system like HDFS, ADLS, etc.
The offset is stored per consumer group, per partition, per event hub, per event hub namespace
Event Hubs clients are initialized with an initial offset from which Event Hubs will start sending data
The offset is determined in one of three ways: start of stream, previously saved offset, or enqueue time
- How do we bridge the two?
Reliable receivers: received data is backed up in a reliable persistent store (the WAL), so no data is lost between application restarts
Reliable receivers: the offset is saved after the data is saved to the persistent store and pushed to the block manager
Both the executors and the driver use the WAL
On application restart, data is processed from the WAL first, up to the offset saved before the previous application stop
Receiver tasks then start the Event Hubs clients, one per partition, each with the last offset saved for that partition.
Describe each parameter.
Extends the Spark-provided Receiver class with the specific type Array[Byte], which is the exact content of the user data per event (a sketch follows below).
The storage level controls whether to spill to disk when memory usage reaches capacity.
On start, it establishes connections to Event Hubs
On stop, it cleans up the connections
Reliably stores data to the block manager
Restart calls stop and then start.
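A minimal sketch of the receiver these notes describe; the connection and fetch logic are stubbed, and everything except the Receiver API itself is illustrative:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Receives Array[Byte] events; MEMORY_AND_DISK spills to disk when memory
// usage reaches capacity.
class EventHubsReceiverSketch(connectionString: String)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK) {

  @volatile private var stopped = false

  override def onStart(): Unit = {
    stopped = false
    // establish the connections to Event Hubs here, then receive on a thread
    new Thread("eventhubs-receive-loop") {
      override def run(): Unit = {
        while (!stopped) {
          val batch: Seq[Array[Byte]] = Seq.empty // stub: fetch from the prefetch queue
          batch.foreach(e => store(e)) // reliably store each event to the block manager
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    stopped = true // clean up the Event Hubs connections here
  }
  // Receiver.restart(...) invokes onStop() and then onStart().
}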