Timothy will introduce Apache Pulsar, an open-source distributed messaging and streaming platform. He will discuss how to build real-time applications using Pulsar with various libraries, schemas, languages, frameworks and tools. The presentation will cover what Pulsar is, its functions and components, how it compares to other technologies like Apache Kafka, its advantages, and how to integrate it with tools like Apache Flink, Apache Spark, Apache NiFi and more. A demo and Q&A will follow.
2. ● Introduction
● What is Apache Pulsar?
● Pulsar Functions
● Apache NiFi
● Apache Flink
● Apache Spark
● Demo
● Q&A
In this session, Timothy
will introduce you to the
world of Apache Pulsar
and how to build real-time
messaging and streaming
applications with a variety
of OSS libraries, schemas,
languages, frameworks,
and tools.
3. Tim Spann
Developer Advocate
Tim Spann
Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big
Data, Cloud, MXNet, IoT, Python and more.
○ Today, he helps to grow the Pulsar community sharing rich technical knowledge and experience
at both global conferences and through individual conversations.
4. This week in Apache Flink, Apache
Pulsar, Apache NiFi, Apache Spark
and open source friends.
https://bit.ly/32dAJft
FLiP Stack Weekly
7. Apache Pulsar is built to support legacy applications, handle the
needs of modern apps, and supports NextGen applications
Support legacy workloads.
Compatible with popular
messaging and streaming tools.
Legacy
Built for today's real-time
event driven applications.
Modern
Scalable, adaptive architecture
ready for the future of real-time
streaming.
NextGen
8.
9. Apache Pulsar has a vibrant community
560+
Contributors
10,000+
Commits
7,000+
Slack Members
1,000+
Organizations
Using Pulsar
10. It is often assumed that Pulsar and Kafka have equal capabilities. In reality,
Pulsar offers a superset of Kafka.
● Pulsar is streaming and queuing together
● Pulsar is cloud-native with stateless brokers
● Natively includes geo-replication, multi-tenancy, and end-to-end
security out of the box
● Pulsar provides automated rebalancing
● Pulsar offers 100X lower latency w/ 2.5 greater throughput than Kafka
Advantages of Apache Pulsar
11. Apache Pulsar features
Cloud native with decoupled
storage and compute layers.
Built-in compatibility with your
existing code and messaging
infrastructure.
Geographic redundancy and high
availability included.
Centralized cluster management
and oversight.
Elastic horizontal and vertical
scalability.
Seamless and instant partitioning
rebalancing with no downtime.
Flexible subscription model
supports a wide array of use cases.
Compatible with the tools you use
to store, analyze, and process data.
12. ● “Bookies”
● Stores messages and cursors
● Messages are grouped in
segments/ledgers
● A group of bookies form an
“ensemble” to store a ledger
● “Brokers”
● Handles message routing and
connections
● Stateless, but with caches
● Automatic load-balancing
● Topics are composed of
multiple segments
●
● Stores metadata for both
Pulsar and BookKeeper
● Service discovery
Store
Messages
Metadata &
Service Discovery
Metadata &
Service Discovery
Pulsar Cluster
Metadata
Storage
Pulsar Cluster
13. Component Description
Value /
Data payload
The data carried by the message. All Pulsar messages contain raw bytes,
although message data can also conform to data schemas.
Key Messages are optionally tagged with keys, used in partitioning and also is useful
for things like topic compaction.
Properties An optional key/value map of user-defined properties.
Producer name The name of the producer who produces the message. If you do not specify a
producer name, the default name is used.
Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The
sequence ID of the message is its order in that sequence.
Messages - the Basic Unit of Apache Pulsar
14. Different subscription modes
have different semantics:
Exclusive/Failover -
guaranteed order, single active
consumer
Shared - multiple active
consumers, no order
Key_Shared - multiple active
consumers, order for given key
Producer 1
Producer 2
Pulsar Topic
Subscription D
Consumer D-1
Consumer D-2
Key-Shared
<
K
1,
V
10
>
<
K
1,
V
11
>
<
K
1,
V
12
>
<
K
2
,V
2
0
>
<
K
2
,V
2
1>
<
K
2
,V
2
2
>
Subscription C
Consumer C-1
Consumer C-2
Shared
<
K
1,
V
10
>
<
K
2
,V
2
1>
<
K
1,
V
12
>
<
K
2
,V
2
0
>
<
K
1,
V
11
>
<
K
2
,V
2
2
>
Subscription A Consumer A
Exclusive
Subscription B
Consumer B-1
Consumer B-2
In case of failure in
Consumer B-1
Failover
Apache Pulsar Subscription Modes
16. Unified Messaging Model
Simplify your data infrastructure and
enable new use cases with queuing and
streaming capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the
same cluster, either via access control, or
in entirely different namespaces.
Scalability
Decoupled data computing and storage
enable horizontal scaling to handle data
scale and management complexity.
Geo-replication
Support for multi-datacenter replication
with both asynchronous and
synchronous replication for built-in
disaster recovery.
Tiered storage
Enable historical data to be offloaded to
cloud-native storage and store event
streams for indefinite periods of time.
Apache Pulsar Benefits
17. Messaging Use Cases Streaming Use Cases
Service x commands service y to make some
change.
Example: order service removing item from
inventory service
Moving large amounts of data to another service
(real-time ETL).
Example: logs to elasticsearch
Distributing messages that represent work
among n workers.
Example: order processing not in main “thread”
Periodic jobs moving large amounts of data and
aggregating to more traditional stores.
Example: logs to s3
Sending “scheduled” messages.
Example: notification service for marketing emails
or push notifications
Computing a near real-time aggregate of a message
stream, split among n workers, with order being
important.
Example: real-time analytics over page views
Messaging vs Streaming
18. Messaging Use Case Streaming Use Case
Retention The amount of data retained is
relatively small - typically only a day
or two of data at most.
Large amounts of data are retained,
with higher ingest volumes and
longer retention periods.
Throughput Messaging systems are not designed
to manage big “catch-up” reads.
Streaming systems are designed to
scale and can handle use cases
such as catch-up reads.
Differences in Consumption
19. byte[] msgIdBytes = // Some byte
array
MessageId id =
MessageId.fromByteArray(msgIdBytes);
Reader<byte[]> reader =
pulsarClient.newReader()
.topic(topic)
.startMessageId(id)
.create();
Create a reader that will read from
some message between earliest and
latest.
Reader
Apache Pulsar Reader Interface
20. ● New Consumer type added in Pulsar 2.10 that provides a
continuously updated key-value map view of compacted topic data.
● An abstraction of a changelog stream from a primary-keyed table,
where each record in the changelog stream is an update on the
primary-keyed table with the record key as the primary key.
● READ ONLY DATA STRUCTURE!
Apache Pulsar TableView
21.
22.
23. Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers
Schema Registry
24. ● Utilizing JSON Data with a JSON Schema
● Consistency, Contracts, Clean Data
● This enables easy SQL:
○ Pulsar SQL (Presto SQL)
○ Flink SQL
○ Spark Structured Streaming
Use Schemas
31. ● Lightweight computation similar
to AWS Lambda.
● Specifically designed to use
Apache Pulsar as a message
bus.
● Function runtime can be
located within Pulsar Broker.
● Java Functions
A serverless event
streaming framework
Apache Pulsar Functions
32. ● Consume messages from one or
more Pulsar topics.
● Apply user-supplied processing
logic to each message.
● Publish the results of the
computation to another topic.
● Support multiple programming
languages (Java, Python, Go)
● Can leverage 3rd-party libraries
to support the execution of ML
models on the edge.
Apache Pulsar Functions
33. ● Visual Question and Answer
● Natural Language Processing
● Sentiment Analysis
● Text Classification
● Named Entity Recognition
● Content-based
Recommendations
• Predictive
Maintenance
• Fault Detection
• Fraud Detection
• Time-Series
Predictions
• Naive Bayes
Apache Pulsar Functions for ML Models
44. val dfPulsar = spark.readStream.format("pulsar")
.option("service.url", "pulsar://pulsar1:6650")
.option("admin.url", "http://pulsar1:8080")
.option("topic", "persistent://public/default/pi-sensors")
.load()
dfPulsar.printSchema()
val pQuery = dfPulsar.selectExpr("*")
.writeStream.format("console")
.option("truncate", false)
.start()
https://github.com/tspannhw/FLiP-Pi-BreakoutGarden
Building Spark SQL View
45. ● Java, Scala, Python Support
● Strong ETL/ELT
● Diverse ML support
● Scalable Distributed compute
● Apache Zeppelin and Jupyter Notebooks
● Fast connector for Apache Pulsar
Why Apache Spark?