Infinite event streams are the core data abstraction in Apache Pulsar. Pulsar provides two-level reading APIs for accessing events in Pulsar topics, one is pub/sub and the other one is segment readers. The pub/sub API provides a unified messaging API for accessing events in a streaming way. People can choose different subscription modes for consuming events. The segment API provides a way to access events directly from Apache BookKeeper and tiered storage, which is more suitable for batch-oriented workloads. You can combine both pub/sub API and segment API to create a unified data processing experience as well.
In the past year, we at StreamNative have been helping with many customers running Pulsar for different use cases from online queuing, event sourcing to stream and batch processing. We also worked on integrating Pulsar with different components in the big data ecosystem. In this talk, we will share our experiences and best practices of choosing the right API for accessing your event streams in Pulsar for different use cases.
2. Who Am I
❏ StreamNative Software Engineer
❏ Ex-Twitter
❏ Contributed to Apache Projects - Heron, Pulsar
❏ Interested in event streaming technologies
9. Data Processing Categories
❏ Batch
❏ The amount of data is huge
❏ Can run on a huge cluster
❏ Fine-grained fault tolerance
10. Data Processing Categories
❏ Batch
❏ The amount of data is huge
❏ Can run on a huge cluster
❏ Fine-grained fault tolerance
❏ Streaming
❏ Long running jobs
❏ Time critical
❏ scalability as well as fault tolerant
11. Data Processing Categories
❏ Interactive
❏ Time critical
❏ Medium data size
❏ Rerun on failures
❏ Batch
❏ The amount of data is huge
❏ Can run on a huge cluster
❏ Fine-grained fault tolerance
❏ Streaming
❏ Long running jobs
❏ Time critical
❏ scalability as well as fault tolerant
12. Data Processing Categories
❏ Interactive
❏ Time critical
❏ Medium data size
❏ Rerun on failures
❏ Batch
❏ The amount of data is huge
❏ Can run on a huge cluster
❏ Fine-grained fault tolerance
❏ Streaming
❏ Long running jobs
❏ Time critical
❏ scalability as well as fault tolerant
❏ Serverless
❏ Simple, light-weight processing
❏ Processing data with high
velocity
14. Pulsar Messaging API
❏ Read data from brokers with different Subscription Modes
❏ Consume / Seek / Receive
❏ Reprocessing data by rewinding (seeking) the cursors
16. Pulsar Segment API
❏ Read data from storage (bookkeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
17. Segment Centric Storage
❏ Topic Partition (Managed Ledger)
❏ The storage layer for a single topic
partition
❏ Segment (Ledger)
❏ Single writer, append-only
❏ Replicated to multiple bookies
26. Benefits
❏ Unlimited Topic Partition Storage
❏ Instant Scaling without Data Rebalancing
❏ Broker Failure Recovery
❏ Bookie Failure Recovery
❏ Cluster Expansion
❏ Low latency reading for messaging data
❏ High throughput reading for batch data
❏ Reduced cost for whole data storage
29. Conclusion
❏ Apache Pulsar is a cloud-native messaging streaming system
❏ Multi layered architecture
❏ Segment centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data