At the Apache Pulsar Beijing Meetup, Sijie Guo and Yong Zhang gave a preview of transaction support in Pulsar 2.5.0. Sijie Guo started with the current state of messaging semantics in Pulsar and covered the implementation of message deduplication introduced by PIP-6. He then explained why transactions are needed and how they are implemented in Pulsar. Finally, Yong Zhang walked through the whole transaction execution flow.
21. • Producer Name - Identify who is producing the messages
• Sequence ID - Identify the message
• Producer Name + Sequence ID: The unique identifier for a message
Idempotent Producer
22. • Broker maintains a map between Producer Name and Last-Produced-Sequence-ID
• Broker accepts a message if its sequence id is larger than the last produced sequence id of its producer
• Broker treats messages whose sequence id is smaller than or equal to the last produced sequence id as duplicates
• Broker keeps the map in a de-duplication cursor (stored in BookKeeper)
Guaranteed Message Deduplication
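The broker-side check described on this slide can be sketched in a few lines. This is a toy model, not Pulsar's actual broker code: the class name `DedupBroker` and its `publish` method are hypothetical, and the map lives only in memory, whereas the slide says the real map is kept in a de-duplication cursor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the broker-side deduplication state; not Pulsar's actual class.
class DedupBroker {
    // Producer name -> last sequence id accepted from that producer.
    private final Map<String, Long> lastSequenceIds = new HashMap<>();

    // Returns true if the message is accepted, false if it is treated as a duplicate.
    boolean publish(String producerName, long sequenceId) {
        Long last = lastSequenceIds.get(producerName);
        if (last != null && sequenceId <= last) {
            return false; // not larger than the last produced sequence id: drop it
        }
        lastSequenceIds.put(producerName, sequenceId);
        return true;
    }
}
```

A producer that retries a send with the same sequence id after a lost acknowledgement gets `false` on the retry, so the message is stored only once. Sequence ids are tracked per producer name, so two producers can reuse the same numbers without colliding.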
31. • `bin/pulsar-admin namespaces set-deduplication -e tenant/namespace`
• Set producer name when creating a Producer
• Specify increasing sequence id when producing messages
Enable Exactly Once
32. • It only works when producing messages to one partition
• It only works for producing one message
• There is no atomicity when producing multiple messages to one partition or many partitions
• Consumers are required to store the MessageId along with their state and seek back to the MessageId when restoring the state
Limitations
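The last limitation is easier to see in code. The sketch below is a simulation under stated assumptions: a `List<Integer>` stands in for a topic, an integer index stands in for a Pulsar `MessageId`, and the class names `Checkpoint` and `SummingConsumer` are hypothetical. The point is that the state and the position must be saved together, and that the consumer must resume right after the saved position.

```java
import java.util.List;

// Hypothetical checkpoint: application state plus the position of the last message folded into it.
final class Checkpoint {
    final long sum;
    final int lastProcessedIndex; // stands in for a MessageId; -1 means nothing processed yet

    Checkpoint(long sum, int lastProcessedIndex) {
        this.sum = sum;
        this.lastProcessedIndex = lastProcessedIndex;
    }
}

class SummingConsumer {
    private long sum;
    private int lastProcessedIndex;

    // Restoring the state also restores the position to seek back to.
    SummingConsumer(Checkpoint restored) {
        this.sum = restored.sum;
        this.lastProcessedIndex = restored.lastProcessedIndex;
    }

    // Processes at most maxMessages messages, starting right after the restored position.
    Checkpoint process(List<Integer> topic, int maxMessages) {
        int end = Math.min(topic.size(), lastProcessedIndex + 1 + maxMessages);
        for (int i = lastProcessedIndex + 1; i < end; i++) {
            sum += topic.get(i);
            lastProcessedIndex = i;
        }
        return new Checkpoint(sum, lastProcessedIndex);
    }
}
```

If the consumer saved only the sum and restarted from the beginning of the topic, every message before the crash would be counted twice. Without transactions, this bookkeeping is left to the application.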
36. • Transfer Topic : record the transfer requests
• Cash Transfer Function: perform the cash transfer action
• BalanceUpdate Topic: record the balance-update requests
PulsarCash, powered by Apache Pulsar
42. • Atomic writes across multiple partitions
• Atomic acknowledgements across multiple subscriptions
• All the actions made within one transaction either all succeed or all fail
• Consumers are *ONLY* allowed to read committed messages
Transaction Semantics
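These semantics can be illustrated with a small in-memory model. It is not the Pulsar transaction API: `ToyCluster` and `ToyTxn` are hypothetical names, topics are plain lists, and acknowledgements are reduced to a per-subscription counter. Writes and acks are buffered in the transaction and applied together on commit, or discarded together on abort.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not the Pulsar client API: topics are in-memory lists of committed messages.
class ToyCluster {
    final Map<String, List<String>> committed = new HashMap<>();
    final Map<String, Integer> ackedCount = new HashMap<>(); // subscription -> acked messages

    ToyTxn newTransaction() {
        return new ToyTxn(this);
    }

    // What a consumer can read: committed messages only.
    List<String> read(String topic) {
        return committed.getOrDefault(topic, new ArrayList<>());
    }
}

class ToyTxn {
    private final ToyCluster cluster;
    private final List<String[]> pendingWrites = new ArrayList<>(); // {topic, payload}
    private final List<String> pendingAcks = new ArrayList<>();     // subscription names
    private boolean finished;

    ToyTxn(ToyCluster cluster) {
        this.cluster = cluster;
    }

    void send(String topic, String payload) {
        check();
        pendingWrites.add(new String[] {topic, payload});
    }

    void ack(String subscription) {
        check();
        pendingAcks.add(subscription);
    }

    // Applies every buffered write and ack in one step.
    void commit() {
        check();
        for (String[] w : pendingWrites) {
            cluster.committed.computeIfAbsent(w[0], t -> new ArrayList<>()).add(w[1]);
        }
        for (String s : pendingAcks) {
            cluster.ackedCount.merge(s, 1, Integer::sum);
        }
        finished = true;
    }

    // Discards every buffered write and ack.
    void abort() {
        check();
        pendingWrites.clear();
        pendingAcks.clear();
        finished = true;
    }

    private void check() {
        if (finished) throw new IllegalStateException("transaction already finished");
    }
}
```

A real system cannot apply changes on several brokers in a single step, which is why the talk introduces a coordinator, a transaction buffer and a pending-ack state.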
43. Message<String> message = inputConsumer.receive();
CompletableFuture<MessageId> sendFuture1 =
    producer1.newMessage().value("output-message-1").sendAsync();
CompletableFuture<MessageId> sendFuture2 =
    producer2.newMessage().value("output-message-2").sendAsync();
inputConsumer.acknowledgeAsync(message.getMessageId());
Without Transaction API
44. [Diagram: a Pulsar Client with an Input Consumer, Producer 1 and Producer 2, talking to Broker-0 and Broker-1. Topics shown: InputTopic, OutputTopic-1, OutputTopic-2. Storage shown: Cursor, Data Log, Data Log. Steps: 0) Receive Message, 1) Produce Messages, 2) Ack Messages]
48. • TC: the transaction manager, which coordinates committing and aborting transactions
• In-Memory + Transaction Log
• Transaction Log is powered by a partitioned Pulsar topic
• `pulsar/system/__transaction_coordinator_log`
• Locating a TC is locating a partition of the transaction log topic
Transaction Coordinator (TC)
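The "In-Memory + Transaction Log" design can be sketched as follows. This is a minimal model under assumptions: `ToyCoordinator`, `TxnStatus` and the `id:STATUS` entry format are hypothetical, and a `List<String>` stands in for one partition of the transaction log topic. Every status change is appended to the log before the in-memory view is updated, so a restarted coordinator can rebuild its view by replaying the log.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical status names; the slide does not list them.
enum TxnStatus { OPEN, COMMITTED, ABORTED }

// Hypothetical coordinator: one instance per partition of the transaction log topic.
class ToyCoordinator {
    private final List<String> log;                                 // stands in for one log partition
    private final Map<Long, TxnStatus> statuses = new HashMap<>();  // in-memory view
    private long nextTxnId = 0;

    // Rebuilds the in-memory view by replaying the log from the beginning.
    ToyCoordinator(List<String> log) {
        this.log = log;
        for (String entry : log) {
            apply(entry);
        }
    }

    long begin() {
        long id = nextTxnId;
        append(id, TxnStatus.OPEN);
        return id;
    }

    void commit(long txnId) {
        end(txnId, TxnStatus.COMMITTED);
    }

    void abort(long txnId) {
        end(txnId, TxnStatus.ABORTED);
    }

    TxnStatus status(long txnId) {
        return statuses.get(txnId);
    }

    private void end(long txnId, TxnStatus target) {
        if (statuses.get(txnId) != TxnStatus.OPEN) {
            throw new IllegalStateException("txn " + txnId + " is not open");
        }
        append(txnId, target);
    }

    // Log first, then update memory, so a restart can recover every recorded change.
    private void append(long txnId, TxnStatus status) {
        String entry = txnId + ":" + status;
        log.add(entry);
        apply(entry);
    }

    private void apply(String entry) {
        String[] parts = entry.split(":");
        long id = Long.parseLong(parts[0]);
        statuses.put(id, TxnStatus.valueOf(parts[1]));
        nextTxnId = Math.max(nextTxnId, id + 1);
    }
}
```

Constructing a second `ToyCoordinator` over the same list plays the role of a coordinator restarting on another broker: it ends up with the same statuses and continues issuing ids where the first one stopped.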
50. • TB: stores and indexes transaction data per topic partition
• TB is implemented using another ML (managed-ledger) as the TB log
• Messages are appended to the TB log
• Transaction Index is maintained in memory and snapshotted to ledgers
• Transaction Index can be replayed from the TB log
Transaction Buffer (TB)
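A rough sketch of the log-plus-index idea, assuming a deliberately simplified model: `ToyTransactionBuffer` is a hypothetical name, the TB log is an in-memory list rather than a managed ledger, and snapshots and abort handling are left out. The index maps each transaction to the positions of its messages in the log and can be rebuilt by scanning the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical transaction buffer for one topic partition.
class ToyTransactionBuffer {
    private final List<String[]> tbLog = new ArrayList<>();            // {txnId, payload}, append-only
    private final Map<String, List<Integer>> index = new HashMap<>();  // txnId -> positions in tbLog
    private final Set<String> committedTxns = new HashSet<>();

    // Appends the message to the log and records its position in the in-memory index.
    void append(String txnId, String payload) {
        tbLog.add(new String[] {txnId, payload});
        index.computeIfAbsent(txnId, id -> new ArrayList<>()).add(tbLog.size() - 1);
    }

    void commit(String txnId) {
        committedTxns.add(txnId);
    }

    // Messages of a transaction are readable only once it has committed.
    List<String> read(String txnId) {
        List<String> out = new ArrayList<>();
        if (!committedTxns.contains(txnId)) {
            return out;
        }
        for (int pos : index.getOrDefault(txnId, new ArrayList<>())) {
            out.add(tbLog.get(pos)[1]);
        }
        return out;
    }

    // Rebuilds the index by scanning the log, as a restart without a snapshot would.
    Map<String, List<Integer>> replayIndex() {
        Map<String, List<Integer>> rebuilt = new HashMap<>();
        for (int pos = 0; pos < tbLog.size(); pos++) {
            rebuilt.computeIfAbsent(tbLog.get(pos)[0], id -> new ArrayList<>()).add(pos);
        }
        return rebuilt;
    }

    Map<String, List<Integer>> index() {
        return index;
    }
}
```

Messages from different transactions interleave in the log, so the index is what lets the buffer return exactly the messages of one committed transaction. A snapshot, as mentioned on the slide, would only shorten the scan in `replayIndex`.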
52. • Introduce ACK_PENDING state
• Add response for acknowledgement, aka ack-on-ack
• Ack state is updated to cursor ledger
• Ack state can be replayed from cursor ledger
Transactional Subscription State
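The state handling on this slide can be sketched as a small state machine. Only `ACK_PENDING` comes from the slide; the other state names, the class `ToySubscription` and its methods are assumptions made for illustration, and persistence to the cursor ledger is omitted. The boolean returned by `ackInTxn` plays the role of the ack-on-ack response.

```java
import java.util.HashMap;
import java.util.Map;

// ACK_PENDING is named on the slide; PENDING and ACKED are assumed names.
enum AckState { PENDING, ACK_PENDING, ACKED }

// Hypothetical per-subscription ack state.
class ToySubscription {
    private final Map<Long, AckState> states = new HashMap<>();      // message id -> state
    private final Map<Long, String> pendingOwner = new HashMap<>();  // message id -> txn holding it

    void deliver(long messageId) {
        states.put(messageId, AckState.PENDING);
    }

    // Returns true if the ack was recorded, false if it was rejected
    // (for example because another transaction already holds the message).
    boolean ackInTxn(long messageId, String txnId) {
        if (states.get(messageId) != AckState.PENDING) {
            return false;
        }
        states.put(messageId, AckState.ACK_PENDING);
        pendingOwner.put(messageId, txnId);
        return true;
    }

    // Commit turns the transaction's pending acks into real acks.
    void commit(String txnId) {
        finish(txnId, AckState.ACKED);
    }

    // Abort releases the messages so they can be acknowledged again.
    void abort(String txnId) {
        finish(txnId, AckState.PENDING);
    }

    AckState state(long messageId) {
        return states.get(messageId);
    }

    private void finish(String txnId, AckState target) {
        pendingOwner.entrySet().removeIf(e -> {
            if (!e.getValue().equals(txnId)) {
                return false;
            }
            states.put(e.getKey(), target);
            return true;
        });
    }
}
```

The response matters because a client must know whether its acknowledgement was accepted into the transaction before it asks the coordinator to commit.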
69. • Transaction support in other languages (e.g. C++, Go)
• Transaction in Pulsar Functions & Pulsar IO
• Transaction in Kafka-on-Pulsar (KOP)
• Transaction for Flink / Spark jobs
• Transaction for State storage in Pulsar Functions
• …
Roadmap
70. • Ivan Kelly
• Matteo Merli
• Jia Zhai
• Penghui Li
• Marvin Cai
• Yong Zhang
• … and many other Pulsar users & contributors
Credits