United Airlines uses Apache Pulsar and Apache BookKeeper to improve operational reliability by connecting to the FAA's real-time SWIM message feed. Pulsar acts as a messaging platform to publish and subscribe to the SWIM feed, while BookKeeper provides durable storage. This replaces point-to-point integrations and allows United to perform both real-time and historical analytics on the feed. Connectors are needed to interface Pulsar with United's existing operational systems until more sophisticated streaming interfaces are available.
1. Apache Pulsar as a Dual Streaming / Batch Processor
Joe Olson
Senior Manager, Big Data Analytics
Apache Road Show Chicago - May 2019
2. Agenda
United and the Airline Industry
How Publish – Subscribe Compute Model Presents Opportunity
Apache Pulsar & Apache BookKeeper
Use Case: FAA’s Real Time SWIM Feed
3. About United Airlines
1,348 aircraft (779 mainline, 569 regional) with 250+ on order (supply chain)
158M passengers in 2018
(public facing web site, mobile app, time / geospatial based inventory, loyalty program, surveys, ancillary sales)
4900 daily departures (scheduling, operations, weather, route planning)
355 airports served, in 48 countries (baggage claim, check-ins)
88,000 employees worldwide (scheduling, pay)
Constantly in motion! Future (and past) always changing.
A data scientist / data engineer dream.
Source: https://hub.united.com/corporate-fact-sheet/
4. Business Goals
Improve Customer Experience
- How can we reduce friction when booking a reservation? Maneuvering through an airport?
- How can we deliver a consistent message across all channels? (mobile app, web site, social media etc)
Improve Employee Experience
- How can we keep employees better informed of the current situation so they can relay it to the customers?
- What are we learning from our surveys about what the customer base says is or isn’t working?
Revenue Generation
- What personalized offers can we make to our customers?
- Are our offers competitive with the rest of the industry?
Improve Operational Reliability
- How can we better prepare for weather or other operational interruptions?
- How can we manage the fleet better and ensure spare parts are where they need to be?
6. Apache Pulsar – Key Points
“Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo and now part of the Apache Software Foundation”
- Designed for low publish latency (< 5ms) at scale with strong durability guarantees.
- Persistent message storage based on Apache BookKeeper.
- Tiered storage provides an opportunity for batch and stream processing in the same platform.
- Built from the ground up as a multi-tenant system: isolation, quotas, etc.
- Geo-replication designed in, across data centers or geographic regions.
- Pulsar has run in production at Yahoo scale for over 3 years, with millions of messages per second across millions of topics. Can scale to hundreds of nodes.
- Easily deploy lightweight compute logic without a separate stream processing engine.
- REST Admin API for provisioning, administration, tools and monitoring. Deploy on bare metal or Kubernetes.
7. Apache Pulsar – Multi Tenancy
Pulsar was designed from the ground up to be a multi-tenant system. In Pulsar, tenants are the highest administrative unit within a Pulsar instance. Capacity is allocated to a tenant.
A namespace is the administrative unit within a tenant. The configuration policies set on a namespace apply to all the topics created in that namespace.
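The tenant / namespace hierarchy is visible in Pulsar topic names, which take the form persistent://tenant/namespace/topic. The sketch below is plain Python with no Pulsar client involved; it only splits a name into its parts. The tenant, namespace, and topic names in the usage note are hypothetical, not United's.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str     # "persistent" or "non-persistent"
    tenant: str     # highest administrative unit
    namespace: str  # administrative unit within the tenant
    topic: str      # local topic name


def parse_topic(name: str) -> TopicName:
    """Split a topic name of the form domain://tenant/namespace/topic."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)
```

For example, parse_topic("persistent://ops/swim/flight-plans") yields tenant "ops" and namespace "swim", so any policy set on the "ops/swim" namespace would apply to the "flight-plans" topic.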
8. Apache Pulsar – Subscription Models
In exclusive mode, only a single consumer is allowed to attach to the subscription.
In shared or round robin mode, multiple consumers can attach to the same subscription. Messages are delivered in a round robin distribution across consumers, and any given message is delivered to only one consumer. Ordering is not guaranteed.
In failover mode, multiple consumers can attach to the same subscription. The first consumer will initially be the only one receiving messages. This consumer is called the master consumer. When the master consumer disconnects, all (non-acked and subsequent) messages will be delivered to the next consumer in line.
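The three modes can be illustrated with a toy dispatcher. This is a simulation in plain Python, not the Pulsar client API; the function names and the master_acked parameter are assumptions made for the illustration.

```python
from itertools import cycle


def dispatch_exclusive(messages, consumers):
    """Exclusive: exactly one consumer may attach; it receives everything."""
    if len(consumers) != 1:
        raise ValueError("exclusive mode allows exactly one consumer")
    return {consumers[0]: list(messages)}


def dispatch_shared(messages, consumers):
    """Shared: round robin across consumers; each message goes to only one."""
    delivered = {c: [] for c in consumers}
    for msg, consumer in zip(messages, cycle(consumers)):
        delivered[consumer].append(msg)
    return delivered


def dispatch_failover(messages, consumers, master_acked):
    """Failover: the master acknowledges the first `master_acked` messages and
    then disconnects; the non-acked and subsequent messages go to the next
    consumer in line. The master's list holds what it acknowledged."""
    if len(consumers) < 2:
        raise ValueError("this sketch needs a master and at least one standby")
    master, standby = consumers[0], consumers[1]
    return {
        master: list(messages[:master_acked]),
        standby: list(messages[master_acked:]),
    }
```

With six messages and three consumers, the shared dispatcher gives each consumer every third message, which is why per-subscription ordering is lost in that mode.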
9. Apache Pulsar – Reference Architecture
One or more brokers handle and load balance incoming messages from producers and dispatch messages to consumers:
- Topic lookup + data transfer
- Messages are dispatched out of a managed ledger cache or, if under load, from persistent storage (BookKeeper)
- Coordination with the local and global metadata stores (ZooKeeper)
A BookKeeper cluster consisting of one or more bookies handles persistent storage of messages.
A local ZooKeeper cluster handles coordination tasks within a cluster, and a global ZooKeeper cluster handles coordination instance wide (geo-replication).
10. Apache BookKeeper – Key Points
Apache BookKeeper is a scalable, fault tolerant, low latency log storage service that delivers durability and consistency guarantees and can provide access to both historic and real time data.
- The atomic unit is an entry.
- A ledger is a bounded sequence of entries; a stream is an unbounded sequence of ledgers.
- Individual servers storing ledgers are called bookies.
- Entries are written to ledgers sequentially and at most once (append-only).
- Each bookie handles fragments of ledgers as part of an ensemble (striping).
[Figure: a stream of ledgers, each made up of entries]
11. Apache BookKeeper – Reference Architecture
Two APIs:
- Ledger API – allows direct interaction with ledgers, giving you the most flexibility in working with bookies.
- Log stream API – allows you to interact with streams without dealing with lower level ledgers.
Bookies advertise themselves to the ZooKeeper metadata cluster.
12. Apache BookKeeper – Storage Requirements
Clients should be able to write and read streams of entries with very low latency (under 5 milliseconds), even when providing strong durability.
Data storage should be durable, consistent, and fault tolerant.
The system should enable clients to stream or tail ledgers to propagate data as they’re written.
The system should be able to store and provide access to both historic and real-time data.
13. Apache BookKeeper – Durability
Example: bookies 1-5 are the ensemble for the ledger. Entries are striped across the bookies.
Write quorum in this case is 3 (all entries are written to 3 bookies).
A write is considered successful when the ack quorum (in this case 2) has successfully acknowledged the write (fsync).
A wide variety of options for writing to bookies in the case of system degradation.
Maximize bandwidth by scaling out bookies.
Improve latency by tuning the ack quorum.
Replication supports durability.
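The slide's numbers (ensemble of 5, write quorum 3, ack quorum 2) can be worked through in a small model. This is a sketch under the assumption of round robin striping, where entry e goes to the write-quorum-many bookies starting at position e mod the ensemble size; it ignores what a real cluster does when a bookie fails (such as changing the ensemble). The function names and the failed parameter are hypothetical.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that receive a copy of this entry under round robin striping."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def write_succeeds(entry_id, ensemble, write_quorum, ack_quorum, failed=()):
    """A write succeeds once at least `ack_quorum` target bookies acknowledge.
    Bookies listed in `failed` never acknowledge in this toy model."""
    targets = write_set(entry_id, ensemble, write_quorum)
    acks = [bookie for bookie in targets if bookie not in failed]
    return len(acks) >= ack_quorum


ENSEMBLE = ["bookie1", "bookie2", "bookie3", "bookie4", "bookie5"]
```

Entry 0 lands on bookies 1-3 and entry 4 wraps around to bookies 5, 1 and 2. Losing one of an entry's three target bookies still leaves two acknowledgements, which satisfies the ack quorum; losing two does not.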
14. Apache BookKeeper – Consistency & Availability
Consistency for log reads:
- An entry successfully written is immediately readable.
- An entry read once is always readable.
- All entries written previously are also readable.
- The order of records is identical across all readers.
- Consistency is accomplished via LastAddConfirmed (LAC), a spin on a two phase commit.
Availability:
- A write can be performed as long as there are enough bookies to satisfy the ack quorum.
- A read can be performed by any bookie in the cluster.
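One way to picture LastAddConfirmed is as a pointer that only advances over a gap-free prefix of acknowledged entries, with readers never reading past it. The class below is a toy model of that idea in plain Python, not BookKeeper's implementation; the class and method names are hypothetical.

```python
class ToyLedger:
    """Toy model: readers see only entries at or below the LAC pointer."""

    def __init__(self):
        self.entries = []   # payloads in append order
        self.acked = set()  # entry ids whose write has been acknowledged
        self.lac = -1       # LastAddConfirmed; -1 means nothing confirmed yet

    def append(self, payload):
        """Append an entry and return its id; it is not readable yet."""
        self.entries.append(payload)
        return len(self.entries) - 1

    def ack(self, entry_id):
        """Record an acknowledgement, then advance the LAC over any
        contiguous run of acknowledged entries."""
        if not 0 <= entry_id < len(self.entries):
            raise IndexError(f"unknown entry id {entry_id}")
        self.acked.add(entry_id)
        while self.lac + 1 in self.acked:
            self.lac += 1

    def read(self):
        """Every reader sees the same ordered prefix, up to the LAC."""
        return self.entries[: self.lac + 1]
```

Because the pointer never moves backwards and never skips a gap, an entry that has been read once stays readable, everything before it is readable too, and all readers observe the same order.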
16. Apache BookKeeper – Data Distribution
Storage capacity for a single log stream is constrained by the capacity of the cluster, never by a single host.
No stream rebalancing when capacity is added. New bookies will be discovered and become available for writing.
Replica repair when a failure is detected is efficient because it can be done concurrently from multiple hosts.
All of this is due to segmenting the streams.
17. Apache Pulsar – Tiered Storage
[Figure labels: Broker, Bookies, Infinite Stream]
Infinite stream: most recent data stored on the broker, rest stored in bookies, as capacity of cluster allows
- Write
- Tailing Read
- Catchup Read
18. Apache Pulsar – Tiered Storage
Infinite stream
- Offloader: move segments off the Pulsar cluster and onto commodity storage.
- Can be triggered on time, size, or demand.
Access
- Broker knows how to read data back, or bypass bookies and read segments directly.
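The three triggers (time, size, demand) can be sketched as a selection policy over segments. This is a hypothetical policy written for illustration, not Pulsar's offloader logic: the Segment fields, the thresholds, and the oldest-first rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    size_bytes: int
    age_seconds: float


def select_for_offload(segments, max_bytes, max_age_seconds, on_demand=False):
    """Return indexes of segments to move to commodity storage.

    `segments` is ordered oldest first. A segment is selected if it is older
    than `max_age_seconds` (time trigger) or if the data still held on the
    cluster exceeds `max_bytes` (size trigger). `on_demand` selects all.
    """
    if on_demand:
        return list(range(len(segments)))
    chosen = []
    remaining = sum(s.size_bytes for s in segments)
    for index, segment in enumerate(segments):
        too_old = segment.age_seconds > max_age_seconds
        too_big = remaining > max_bytes
        if too_old or too_big:
            chosen.append(index)
            remaining -= segment.size_bytes
    return chosen
```

Selecting oldest first keeps the most recent data on the cluster, which matches the slide's picture of recent data near the broker and older segments pushed outward.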
19. Apache Pulsar – Bringing It All Together
[Figure labels: Producer, Subscriber, Segment Reader, Unbounded stream, Bounded stream]
20. Apache Pulsar – Bringing It All Together
[Figure labels: Producer, Subscriber, Segment Reader, Unbounded stream, Batch Processing, Stream Processing]
21. Use Case – Improve Operational Reliability
SWIM (System Wide Information Management)
- Real time FAA message feed describing the current and future state of the nation’s managed airspace: traffic, weather, airport operations, etc.
- Publishers (such as airlines) push their operational information to an endpoint.
- Allows subscribers (such as airlines) to consume from a common published message interface.
Airline needs:
- Connect the information in this feed with their existing operational systems.
• Maintain current state on assets.
- Real time and historical analytics on this feed, both traditional and predictive (ML / AI).
26. Architecture – Target State Considerations
[Figure labels: Airline Systems; Scheduling; Flight Plans; Weather; Airport Operations; Airspace Operations; Producer; Subscriber; File Connector; JDBC Connector; API Connector]
Connectivity to the operational systems is mostly through file, JDBC, and API interfaces.
Most of these are not designed for streaming interfaces (yet).
How do we connect a topic to systems that are not designed to work with streams?
27. Architecture – Target State Considerations
[Figure labels: Airline Systems; Scheduling; Flight Plans; Weather; Airport Operations; Airspace Operations; Producer; Subscriber; File Connector; JDBC Connector; API Connector; Segment Reader API]
What if there were both batch and streaming interfaces?
Use the batch interface until more sophisticated streaming interfaces come online.
An API written around the segment reader can help to close the last mile.
Treat as batch when needed, treat as stream when needed.
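The "batch when needed, stream when needed" idea can be sketched with a toy segmented topic that exposes both views over the same data. The talk does not specify the segment reader API, so everything here (class name, methods, the fixed segment size) is a hypothetical illustration in plain Python rather than a Pulsar interface.

```python
class SegmentedTopic:
    """Toy topic: messages accumulate in an open segment, which is sealed
    (made immutable) once it reaches `segment_size` messages."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.sealed = []  # list of immutable segments (tuples)
        self.open = []    # messages in the segment still being written

    def publish(self, message):
        self.open.append(message)
        if len(self.open) == self.segment_size:
            self.sealed.append(tuple(self.open))
            self.open = []

    def read_segments(self):
        """Batch view: a bounded snapshot of the sealed segments, suitable
        for a file- or JDBC-style connector that loads data in chunks."""
        return list(self.sealed)

    def read_stream(self, cursor):
        """Stream view: every message at or after `cursor`, in publish
        order, including the open segment not yet visible to batch reads."""
        everything = [m for segment in self.sealed for m in segment]
        everything.extend(self.open)
        return everything[cursor:]
```

A batch-oriented operational system could poll read_segments() on a schedule, while a streaming consumer keeps a cursor and calls read_stream(cursor); both read the same underlying log.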