Building data products requires a lambda architecture to bridge batch and stream processing. AirStream is a framework built on top of HBase that lets users easily build data products at Airbnb. It has shown that HBase is effective and useful in production for mission-critical data products.
In this talk, we present applications that leverage HBase to compute moving averages, distinct counts, window-based joins, and similar aggregates in streaming computation.
We also discuss how to leverage HBase to bridge the gap between batch and streaming queries, including building a Presto-HBase connector to serve near-real-time ad hoc queries.
by Liyin Tang of Airbnb
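As a rough illustration of the streaming aggregates named in the abstract, the sketch below keeps per-minute partial aggregates in a keyed store and derives a moving average and a distinct count from them. It is not AirStream code: `StateStore` is a hypothetical in-memory stand-in for an HBase table, and the row-key layout, column names, and bucket size are all assumptions made for the example.

```python
from collections import defaultdict


class StateStore:
    """Hypothetical in-memory stand-in for an HBase table: row key -> {column: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def increment(self, row_key, column, amount):
        self.rows[row_key][column] = self.rows[row_key].get(column, 0) + amount

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key):
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self.rows.get(row_key, {}))


def bucket_key(metric, ts, bucket_seconds=60):
    """Assumed row-key layout: metric name plus the zero-padded bucket start time."""
    bucket = ts - ts % bucket_seconds
    return f"{metric}:{bucket:010d}"


def record(store, metric, ts, value, user_id, bucket_seconds=60):
    """Fold one event into the partial aggregates of its time bucket."""
    key = bucket_key(metric, ts, bucket_seconds)
    store.increment(key, "sum", value)
    store.increment(key, "count", 1)
    store.put(key, f"user:{user_id}", 1)  # one column per distinct user


def window_rows(store, metric, now, window_buckets, bucket_seconds):
    """Fetch the rows of the most recent buckets, newest first."""
    current = now - now % bucket_seconds
    return [
        store.get(bucket_key(metric, current - i * bucket_seconds, bucket_seconds))
        for i in range(window_buckets)
    ]


def moving_average(store, metric, now, window_buckets=3, bucket_seconds=60):
    rows = window_rows(store, metric, now, window_buckets, bucket_seconds)
    total = sum(row.get("sum", 0) for row in rows)
    count = sum(row.get("count", 0) for row in rows)
    return total / count if count else None


def distinct_users(store, metric, now, window_buckets=3, bucket_seconds=60):
    rows = window_rows(store, metric, now, window_buckets, bucket_seconds)
    users = set()
    for row in rows:
        users.update(col for col in row if col.startswith("user:"))
    return len(users)
```

Because the state lives outside the job, a restarted streaming job or a batch backfill could read and update the same buckets, which is the property the talk attributes to a shared store.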
7. Our Foundations
Liyin Tang and Jingwei Lu
• Combine stream with batch
• Shared global state store
8. Unified API through AirStream
• Declarative job configuration
• A computation operator or sink can be shared by stream and batch jobs
• Stream source vs. static source
• A single driver executes jobs in stream or batch mode
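One way to picture the points on this slide is a job described as plain data and run by a single driver in either mode. The sketch below is only an illustration under assumptions: the config keys (`mode`, `operators`, `type`), the operator names, and `run_job` are hypothetical and are not AirStream's actual configuration format or API.

```python
def min_amount(records, threshold):
    """Operator: keep records whose amount is at least the threshold."""
    return [r for r in records if r["amount"] >= threshold]


def add_field(records, name, value):
    """Operator: attach a constant field to every record."""
    return [{**r, name: value} for r in records]


# Registry of operators that both stream and batch jobs can reference by name.
OPERATORS = {"min_amount": min_amount, "add_field": add_field}


def build_pipeline(operator_specs):
    """Turn the declarative operator list into one function over a list of records."""

    def pipeline(records):
        for spec in operator_specs:
            params = {k: v for k, v in spec.items() if k != "type"}
            records = OPERATORS[spec["type"]](records, **params)
        return records

    return pipeline


def run_job(config, source, sink):
    """Single driver: the same pipeline runs once over a static source,
    or once per micro-batch over a stream source."""
    pipeline = build_pipeline(config["operators"])
    if config["mode"] == "batch":
        sink(pipeline(source))
    elif config["mode"] == "stream":
        for micro_batch in source:
            sink(pipeline(micro_batch))
    else:
        raise ValueError(f"unknown mode: {config['mode']!r}")


# Hypothetical job definition; switching "mode" is the only change between runs.
JOB = {
    "mode": "batch",
    "operators": [
        {"type": "min_amount", "threshold": 100},
        {"type": "add_field", "name": "flag", "value": "large"},
    ],
}
```

In this sketch a static source is a list of records and a stream source is an iterator of micro-batches, so the operators and the sink never need to know which mode they run in.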
9. AirStream
[Diagram: Spark Streaming jobs and Spark Batch jobs, both connected to a Shared Global State Store backed by HBase tables]
10. Shared Global State Store
[Diagram: a DataFrame is re-partitioned by HBase region (Region 1, Region 2, … Region N) into <Region, [RowKey, Value]> groups, which are written to HBase either as Puts or as HFiles via BulkLoad]
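The re-partition step in this write path can be sketched as assigning each row key to the region whose key range contains it, then sorting rows within each region, since HFiles store keys in sorted order. This is a minimal sketch under assumptions: `split_keys` stands for the table's region boundaries, which a real job would have to obtain from HBase, and plain Python strings stand in for byte-array row keys (the ordering agrees for ASCII keys).

```python
from bisect import bisect_right
from collections import defaultdict


def region_index(row_key, split_keys):
    """Region i covers [split_keys[i-1], split_keys[i]); region 0 starts at the
    empty key and the last region is unbounded above. The start key is inclusive,
    so a key equal to a split point belongs to the region that starts there."""
    return bisect_right(split_keys, row_key)


def repartition_by_region(records, split_keys):
    """Group (row_key, value) pairs by target region and sort each group by key."""
    partitions = defaultdict(list)
    for row_key, value in records:
        partitions[region_index(row_key, split_keys)].append((row_key, value))
    return {region: sorted(rows) for region, rows in sorted(partitions.items())}


# Hypothetical table with three regions: ["", "g"), ["g", "p"), ["p", end).
split_keys = ["g", "p"]
records = [("kiwi", 1), ("apple", 2), ("zebra", 3), ("g", 4), ("banana", 5)]
by_region = repartition_by_region(records, split_keys)
```

Each resulting group could then be sent as a batch of Puts or written out as one sorted file for bulk loading; either path benefits from all rows of a partition landing in a single region.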
11. Shared Global State Store
[Diagram: Spark Streaming/Batch jobs read HBase tables through Multi-Gets, Prefix Scan, and Time Range Scan]
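The three read patterns named on this slide can be illustrated with a small sorted key-value table. `SortedTable` below is a hypothetical in-memory model, not the HBase client API; it keeps a single version per row, and its time range is half-open (`min_ts` inclusive, `max_ts` exclusive).

```python
from bisect import bisect_left, insort


class SortedTable:
    """Hypothetical in-memory model of a table with lexicographically sorted row keys."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row_key -> (timestamp, value)

    def put(self, row_key, value, timestamp):
        if row_key not in self._rows:
            insort(self._keys, row_key)
        self._rows[row_key] = (timestamp, value)

    def multi_get(self, row_keys):
        """Point lookups for several keys at once; missing keys are skipped."""
        return {k: self._rows[k][1] for k in row_keys if k in self._rows}

    def prefix_scan(self, prefix):
        """Keys sharing a prefix are contiguous in sorted order, so seek to the
        first candidate and stop at the first key that no longer matches."""
        start = bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key][1]))
        return result

    def time_range_scan(self, min_ts, max_ts):
        """Rows whose write timestamp falls in [min_ts, max_ts), in key order."""
        return [
            (key, self._rows[key][1])
            for key in self._keys
            if min_ts <= self._rows[key][0] < max_ts
        ]


table = SortedTable()
table.put("user1:a", 1, timestamp=100)
table.put("user2:a", 2, timestamp=200)
table.put("user1:b", 3, timestamp=300)
table.put("user10:x", 4, timestamp=400)
```

Note that `prefix_scan("user1:")` does not return `user10:x`: including the delimiter in the prefix is what keeps neighbouring entities apart, which is a common reason to design row keys with explicit separators.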
13. Why HBase
• Rich API for point lookups and sequential scans (TimeRange, TTL, Prefix Scan …)
• Merged view based on version
• Unified API for streaming writes and bulk uploads
• Unified API for reading from live tables and snapshot tables
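The "merged view based on version" idea can be sketched as follows: batch and streaming writers both emit versioned cells, and a reader resolves each row to the value with the highest version, regardless of which writer produced it. The function name and the tuple layout are assumptions made for the example, not an HBase or AirStream API.

```python
def merged_view(*sources):
    """Each source yields (row_key, version, value). For every row key the value
    with the highest version wins; on a tie, the value seen first is kept."""
    latest = {}
    for source in sources:
        for row_key, version, value in source:
            if row_key not in latest or version > latest[row_key][0]:
                latest[row_key] = (version, value)
    return {row_key: value for row_key, (_, value) in latest.items()}


# Hypothetical data: a nightly batch load at version 100 and later streaming updates.
batch_cells = [("listing:1", 100, "batch-a"), ("listing:2", 100, "batch-b")]
stream_cells = [("listing:1", 150, "stream-a"), ("listing:3", 120, "stream-c")]
view = merged_view(batch_cells, stream_cells)
```

Because resolution depends only on the version, the result is the same whichever source is read first, as long as versions are distinct per row. That is what would let a late batch backfill coexist with fresher streaming writes.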