As we continue to push the boundaries of pipeline throughput and data-serving tiers, new methodologies and techniques keep emerging to handle ever-larger workloads.
Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends
1.
2. Building a Streaming Microservice Architecture: With Spark Structured Streaming and Friends
Scott Haines
Senior Principal Software Engineer
3. Introductions
▪ I work at Twilio
▪ Over 10 years working on streaming architectures
▪ Helped bring a streaming-first Spark architecture to Voice & Voice Insights
▪ Leads Spark Office Hours @ Twilio
▪ Loves distributed systems
About Me
Scott Haines: Senior Principal Software Engineer @newfront
4. Agenda
The Big Picture
What the Architecture looks like
Protocol Buffers
What they are. Why they rule!
gRPC / Protocol Streams
Versioned Data Lineage as a Service
How this fits into Spark
Structured Streaming with Protobuf support
9. Protocol Buffers
▪ Strict Types
  ▪ Enforce structure at compile time
  ▪ Similar to StructType in Apache Spark
  ▪ Interoperable with Spark via an ExpressionEncoder extension
▪ Versioning the API / Data Pipeline
  ▪ Compiled protobuf (*.proto) can be released like normal code
▪ Interoperable
  ▪ Pick your favorite programming language, compile, and release
  ▪ Supports Java, Scala, C++, Go, Objective-C, Node.js, Python, and more
Why use them?
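A minimal `.proto` sketch of the strict, versioned structure described above (the package, message, and field names are illustrative, not from the deck):

```protobuf
// click_track/v1/ad_impression.proto (hypothetical path)
syntax = "proto3";

// Putting the version in the package lets a v2 evolve independently,
// released like normal code.
package click_track.v1;

message AdImpression {
  string ad_id     = 1;  // field numbers are the wire contract
  string user_id   = 2;
  int64  timestamp = 3;  // epoch millis
}
```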
10. Protocol Buffers
▪ Code Gen
  ▪ Automatically generate Builder classes
  ▪ Being lazy is okay!
▪ Optimized
  ▪ Messages are optimized and ship with their own serialization/deserialization mechanics (SerDe)
Why use them?
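The code-gen and built-in SerDe points can be sketched like this, assuming a ScalaPB-generated `AdImpression` case class (class, import path, and fields are illustrative):

```scala
// Hypothetical ScalaPB-generated message: construction, serialization,
// and deserialization all come from the generated code.
import click_track.v1.ad_impression.AdImpression // illustrative generated path

val impression = AdImpression(adId = "ad-42", userId = "user-7", timestamp = 1700000000000L)

// Built-in SerDe: no hand-written encoder or decoder needed.
val bytes: Array[Byte] = impression.toByteArray
val roundTripped       = AdImpression.parseFrom(bytes)

assert(roundTripped == impression)
```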
12. gRPC
▪ High Performance
  ▪ Compact binary exchange format
  ▪ Make API calls to the server as if they were local to the client
▪ Cross-Language / Cross-Platform
  ▪ Auto-generate API definitions for idiomatic clients and servers; just implement the interfaces
▪ Bi-Directional Streaming
  ▪ Pluggable support for streaming over HTTP/2 transport
What is it?
[Diagram: a gRPC client calling a pool of gRPC servers over HTTP/2]
14. gRPC
▪ Define Messages
  ▪ What kind of data are you sending?
  ▪ Example: click tracking / impression tracking
  ▪ What is necessary for the public interface?
  ▪ Example: AdImpression and Response
How it works
15. gRPC
▪ Service Definition
  ▪ Compile your RPC definition to generate service interfaces
  ▪ Uses the same protobuf definition (service.proto) as your client/server request and response objects
  ▪ Can be used to create a binding service contract within your organization or publicly
How it works
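The message and service definitions from the last two slides might look like this; the `adTrack` / `ClickTrackService` names follow the deck's later architecture slide, while the field layout and response message are illustrative:

```protobuf
// service.proto (illustrative)
syntax = "proto3";

package click_track.v1;

message AdImpression {
  string ad_id     = 1;
  string user_id   = 2;
  int64  timestamp = 3;
}

message AdTrackResponse {
  bool accepted = 1;
}

// Compiling this generates client stubs and a server interface
// in every supported language -- the binding service contract.
service ClickTrackService {
  rpc adTrack(AdImpression) returns (AdTrackResponse);
}
```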
16. gRPC
▪ Implement the Service
  ▪ Compilation of the service auto-generates your interfaces
  ▪ Just implement the service contracts
How it works
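Implementing the generated contract could look like this sketch. It assumes ScalaPB's gRPC code generation; `ClickTrackServiceGrpc`, `AdImpression`, and `AdTrackResponse` mirror the illustrative service.proto and are not taken verbatim from the deck:

```scala
// Sketch: implement the generated service trait, then register it with a server.
import scala.concurrent.{ExecutionContext, Future}
import io.grpc.ServerBuilder

class ClickTrackServiceImpl extends ClickTrackServiceGrpc.ClickTrackService {
  // The only code you write: the business logic behind the contract.
  override def adTrack(trackedAd: AdImpression): Future[AdTrackResponse] = {
    // ... validate, enrich, emit to Kafka (next slide) ...
    Future.successful(AdTrackResponse(accepted = true))
  }
}

val server = ServerBuilder
  .forPort(9090) // illustrative port
  .addService(ClickTrackServiceGrpc.bindService(new ClickTrackServiceImpl, ExecutionContext.global))
  .build()
  .start()
```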
17. gRPC
▪ Protocol Streams
  ▪ Messages (protobuf) are emitted to Kafka topic(s) from the server layer
  ▪ Protocol streams are then available from the Kafka topics bound to a given service / collection of messages
  ▪ Sets up Spark for the hand-off
How it works
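Emitting the protobuf bytes from the server layer can be sketched as follows; the topic name comes from the deck's architecture slide, while the broker address and message fields are placeholders:

```scala
// Sketch: publish a compiled protobuf message to a Kafka topic from the gRPC server.
// Assumes kafka-clients on the classpath and a ScalaPB-generated AdImpression.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker-1:9092") // placeholder address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

val producer = new KafkaProducer[String, Array[Byte]](props)

def emit(impression: AdImpression): Unit =
  // The topic carries raw protobuf bytes: the "protocol stream".
  producer.send(new ProducerRecord("ads.click.stream", impression.adId, impression.toByteArray))
```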
18. gRPC
System Architecture
[Diagram: the gRPC client calls service.adTrack(trackedAd) over HTTP/2; on the server layer, ClickTrackService.adTrack(trackedAd) handles the request and emits the message to Kafka brokers on topic ads.click.stream]
20. Structured Streaming with Protobuf
▪ Expression Encoding
  ▪ Native interop with protobuf in Apache Spark
  ▪ Protobuf-to-case-class conversion via scalapb
  ▪ Product encoding comes for free via import sparkSession.implicits._
From Protocol Buffer to StructType through ExpressionEncoders
21. Structured Streaming with Protobuf
▪ Native is Better
  ▪ Strict, native Kafka-to-DataFrame conversion with no need for transformation to intermediary types
  ▪ Mutations and joins can be done across the DataFrame or Dataset APIs
  ▪ Create real-time data pipelines, machine learning pipelines, and more
  ▪ Rest at night knowing the pipelines are safe!
From Protocol Buffer to StructType through ExpressionEncoders
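A sketch of the Kafka-to-Dataset hand-off, assuming the sparksql-scalapb integration and the ScalaPB-generated `AdImpression` class; the topic name is from the deck, while the broker address and paths are placeholders:

```scala
// Sketch: read protobuf bytes from Kafka and decode straight into the
// generated case class -- no intermediary types.
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.col
import scalapb.spark.Implicits._ // ExpressionEncoders for ScalaPB messages

val spark = SparkSession.builder.appName("click-stream").getOrCreate()

val impressions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092") // placeholder
  .option("subscribe", "ads.click.stream")
  .load()
  .select(col("value")).as[Array[Byte]](Encoders.BINARY) // raw protobuf payload
  .map(AdImpression.parseFrom) // Dataset[AdImpression] via the ScalaPB encoder

impressions.writeStream
  .format("parquet")
  .option("path", "/data/ads/impressions")               // placeholder
  .option("checkpointLocation", "/chk/ads/impressions")  // placeholder
  .start()
```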
22. Structured Streaming with Protobuf
▪ Strict Data Writer
  ▪ Compiled / versioned protobuf can be used to strictly enforce the format of your writers as well
  ▪ Use protobuf to define the StructType used in your conversions to Parquet* (*must abide by Parquet nesting rules)
  ▪ Declarative input/output means that streaming applications don't go down due to incompatible data streams
  ▪ Can also be used with Delta so that the version of the schema lines up with the compiled protobuf
From Protocol Buffer to StructType through ExpressionEncoders
23. Structured Streaming with Protobuf
▪ Real-World Use Case
  ▪ Close-of-books data lineage job
  ▪ Uses protobuf end to end
  ▪ Enables teams to move quickly, with guarantees about what data is being published and at what frequency
  ▪ Data can be emitted at different speeds to different locations based on configuration
Example: Streaming Transformation Pipeline
24. Streaming Microservice Architecture
[Diagram: numbered end-to-end flow — gRPC client → gRPC servers over HTTP/2 → Kafka brokers → Spark application → HDFS and S3]
26. What We Learned

Protobuf
▪ Language-agnostic structured data
▪ Compile-time guarantees
▪ Lightning-fast serialization/deserialization

gRPC
▪ Language-agnostic binary services
▪ Low latency
▪ Compile-time guarantees
▪ Smart framework

Kafka
▪ Highly available
▪ Native connector for Spark
▪ Topic-based binary protobuf store
▪ Used to pass records to one or more downstream services

Structured Streaming
▪ Handles data reliably
▪ Protobuf to Dataset / DataFrames is awesome
▪ Parquet / Delta plays nicely as a columnar data-exchange format