2. About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model retraining: from every 4 months to every 10 minutes
○ Real-time traffic updates
○ Flyte - Large Scale Orchestration and Batch Compute
○ Lyftlearn - Custom Machine Learning Library
○ Dryft - Real-time Feature Generation for Machine Learning
3. Streaming Use Cases
● Firehose -> S3 for Hive and Presto
● Real-time Traffic
● Primetime + Heatmaps
● Fraud Detection
● Ride Receipts
● Driver incentives
● Passenger coupons
● Anomaly Detection
● Map Correction
● Location Smoothing
● Bad Experience Detection
● Accident Detection
● ...and so much more
4. A Brief History of Lyft
● <2015 - monolithic PHP app, Redshift
● 2015 - Python services, “workers”, Fanner, Spark, Job Scheduler
● 2016 - Go services, Hive
● 2017 - Presto
● 2018 - Druid, Kafka, Flink, Beam
6. Fanner
● Pub-sub layer on Kinesis
● Curates output streams based on event type and simple value filters
● Pros - easy to use, integrated with Job Scheduler
● Cons - no ordering guarantees, limited scaling
Diagram: Kinesis (All Events) -> Fanner -> Kinesis (Event A), Kinesis (Event B), Kinesis (Event A + B)
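The fan-out described above can be sketched in a few lines of Python. This is only an illustration of routing by event type plus simple value filters, not Fanner's actual code: `OutputStream`, `fan_out`, the `type` and `region` fields, and the stream names are all hypothetical, and plain lists stand in for Kinesis streams.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Set

Event = Dict[str, Any]


@dataclass
class OutputStream:
    """Hypothetical curated stream: listed event types that pass every value filter."""
    name: str
    event_types: Set[str]
    filters: List[Callable[[Event], bool]] = field(default_factory=list)

    def accepts(self, event: Event) -> bool:
        return event["type"] in self.event_types and all(f(event) for f in self.filters)


def fan_out(events: Iterable[Event], streams: List[OutputStream]) -> Dict[str, List[Event]]:
    """Copy each event from the all-events stream to every output stream that accepts it."""
    routed: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        for stream in streams:
            if stream.accepts(event):
                routed[stream.name].append(event)
    return dict(routed)


streams = [
    OutputStream("event_a", {"A"}),
    OutputStream("event_b", {"B"}),
    OutputStream("event_a_and_b", {"A", "B"}),
    # A simple value filter on top of the event type (assumed field name).
    OutputStream("event_a_sfo", {"A"}, [lambda e: e.get("region") == "SFO"]),
]
all_events = [
    {"type": "A", "region": "SFO"},
    {"type": "B", "region": "NYC"},
    {"type": "A", "region": "NYC"},
    {"type": "C", "region": "SFO"},
]
routed = fan_out(all_events, streams)
```

Events of type C match no stream and are dropped in this sketch. Each output list preserves input order only because everything runs in one process, so the sketch does not reproduce the missing ordering guarantees noted above.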
7. Job Scheduler
● Register data to be delivered to a webhook
● Fire and forget
● Immediate or scheduled
● Pros - easy asynchronous programming, scales well
● Cons - no ordering guarantees, suboptimal outage handling, no replay
Diagram: Service -> SQS -> Workers -> Target Service
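A minimal sketch of the fire-and-forget pattern on this slide, under loud assumptions: the `JobScheduler` class, its method names, and the example URL are hypothetical, an in-memory heap stands in for SQS, and a plain callable stands in for the HTTP call to the target service's webhook.

```python
import heapq
import itertools
from typing import Any, Callable, Dict, List, Tuple

Payload = Dict[str, Any]


class JobScheduler:
    """Hypothetical in-memory stand-in: callers enqueue and never wait for a result."""

    def __init__(self, deliver: Callable[[str, Payload], None]) -> None:
        self._deliver = deliver  # stand-in for the webhook call
        self._queue: List[Tuple[float, int, str, Payload]] = []
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def schedule(self, webhook: str, payload: Payload, run_at: float = 0.0) -> None:
        """Register data for a webhook. run_at=0.0 means 'immediately'."""
        heapq.heappush(self._queue, (run_at, next(self._seq), webhook, payload))

    def run_due(self, now: float) -> int:
        """Worker loop body: deliver every job whose time has come."""
        delivered = 0
        while self._queue and self._queue[0][0] <= now:
            _, _, webhook, payload = heapq.heappop(self._queue)
            self._deliver(webhook, payload)
            delivered += 1
        return delivered


calls: List[Tuple[str, Payload]] = []
scheduler = JobScheduler(lambda url, payload: calls.append((url, payload)))
scheduler.schedule("https://target.example/receipts", {"ride_id": 1})
scheduler.schedule("https://target.example/receipts", {"ride_id": 2}, run_at=60.0)
delivered_now = scheduler.run_due(now=0.0)
delivered_later = scheduler.run_due(now=60.0)
```

The heap delivers jobs in due-time order, which is tidier than the real system: the slide lists no ordering guarantees and no replay as drawbacks, and a delivered job is simply gone here too.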
8. “Workers”
● Simple asynchronous/stream programming
● KCL worker to grab data from Kinesis, transform, store in Redis
● Additional workers read from Redis, transform, store in Redis
● Pros - familiar programming paradigm
● Cons - failure handling, checkpointing, joins, replay, etc. are roll-your-own (i.e., error-prone)
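To make the "roll-your-own" point concrete, here is a small sketch of a worker that has to manage its own checkpoint. It is an assumption-laden illustration, not Lyft's worker code: `run_worker` and the key names are hypothetical, a list stands in for a Kinesis shard, and a dict stands in for Redis.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def run_worker(
    records: List[Record],
    store: Dict[str, Any],
    transform: Callable[[Record], Tuple[str, Any]],
    checkpoint_key: str,
) -> int:
    """Read from the last checkpoint, transform, store, and advance the checkpoint by hand."""
    position = store.get(checkpoint_key, 0)
    for record in records[position:]:
        key, value = transform(record)
        store[key] = value
        position += 1
        # The result and the checkpoint are two separate writes. A crash between
        # them replays the record on restart, which the author must handle.
        store[checkpoint_key] = position
    return position


def to_fare(record: Record) -> Tuple[str, float]:
    """Hypothetical transform: key a ride's fare in dollars by ride id."""
    return f"fare:{record['ride_id']}", record["fare_cents"] / 100


stream = [{"ride_id": 1, "fare_cents": 1250}, {"ride_id": 2, "fare_cents": 800}]
redis_stand_in: Dict[str, Any] = {}
first_position = run_worker(stream, redis_stand_in, to_fare, "checkpoint:fares")

# New data arrives; a restarted worker resumes from the stored checkpoint.
stream.append({"ride_id": 3, "fare_cents": 2000})
second_position = run_worker(stream, redis_stand_in, to_fare, "checkpoint:fares")
```

Even this toy version has to decide when to write the checkpoint relative to the result, which is the kind of detail a stream processing framework would otherwise own.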
9. Next Generation Goals
● Build Community
● Lower support burden
● Enhance developer ergonomics
● Scale with the business
● Support additional use cases
10. Next Generation Overview
● Pub-Sub -> Kafka
○ Large community, mature technology
● Stream Processing -> Flink
○ Growing community, event time processing
● Apache Beam for cross language support
● SPaaS
○ Kubernetes
○ Dryft
11. Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine using streaming or bulk data
○ Add automation to make it super simple to launch and maintain feature generation programs at scale
12. Dryft Program
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {
    "stream": "declridecompleted"
  },
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
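One way to get a feel for what this program computes is to run the same query over a few sample rows. The sketch below uses SQLite as a stand-in for Flink SQL, which is purely an assumption for illustration. The table schema and the sample rows are invented, and the check that each declared feature appears as an output column is a guess at a sensible validation, not a documented Dryft behaviour.

```python
import json
import sqlite3

PROGRAM = json.loads("""
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {"stream": "declridecompleted"},
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}
""")

QUERY = """
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
"""

conn = sqlite3.connect(":memory:")
# Assumed schema: only the columns the query touches.
conn.execute(
    "CREATE TABLE event_ride_completed ("
    "ride_id INTEGER, user_lyft_id INTEGER, "
    "passenger_lyft_id INTEGER, passenger_id INTEGER)"
)
conn.executemany(
    "INSERT INTO event_ride_completed VALUES (?, ?, ?, ?)",
    [
        (1, 10, None, None),    # identified by user_lyft_id
        (2, None, 10, None),    # same user, identified by passenger_lyft_id
        (3, None, None, 20),    # identified by passenger_id only
        (4, None, None, None),  # no id at all, falls back to -1
    ],
)

cursor = conn.execute(QUERY)
output_columns = [column[0] for column in cursor.description]
n_total_rides = dict(cursor.fetchall())

# Every feature declared in the config should be produced by the query.
missing_features = [name for name in PROGRAM["features"] if name not in output_columns]
```

The `COALESCE` chain is what lets rides 1 and 2 land on the same user even though they carry the id in different columns.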
13. Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB
Diagram: Backfill: S3 Source -> SQL -> Sink; Real-time: Kinesis/Kafka Source -> SQL -> Sink
14. Bootstrapping
● Read historic data from S3
● Transition to reading real-time data
● https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
Diagram: S3 Source (< Target Time) and Kinesis/Kafka Source (>= Target Time) -> Business Logic -> Sink
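The hand-off between the two sources can be sketched as a filter on each side of a target time. This is a simplified, single-process illustration under stated assumptions: `bootstrap`, the `ts` field, and the sample events are hypothetical, lists stand in for S3 and Kinesis/Kafka, and the ride count stands in for the business logic.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def bootstrap(historic: Iterable[Event], realtime: Iterable[Event], target_time: int) -> Iterator[Event]:
    """Yield historic events before target_time, then real-time events from target_time on."""
    for event in historic:
        if event["ts"] < target_time:
            yield event
    for event in realtime:
        if event["ts"] >= target_time:
            yield event


# The two sources overlap around the target time, as they usually would in practice.
s3_events = [
    {"ts": 100, "user_id": 1, "ride_id": "a"},
    {"ts": 200, "user_id": 1, "ride_id": "b"},
    {"ts": 300, "user_id": 2, "ride_id": "c"},
]
stream_events = [
    {"ts": 200, "user_id": 1, "ride_id": "b"},  # also present in S3
    {"ts": 300, "user_id": 2, "ride_id": "c"},  # also present in S3
    {"ts": 400, "user_id": 1, "ride_id": "d"},
]

# Business logic stand-in: all-time ride count per user.
ride_counts = Counter(e["user_id"] for e in bootstrap(s3_events, stream_events, target_time=300))
```

Because each side only admits events on its own side of the target time, the overlapping rides b and c are counted once each.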
15. Continuous Window Semantics
● Continuous window counts go up and down
● SELECT user_id, COUNT(ride_id) OVER (PARTITION BY user_id ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) FROM event_ride_completed
● Example: rides 1p, 1:30p, 2:15p, 3:45p
● Flink default is one message in, one message out
○ Output = 1@1p, 2@1:30p, 2@2:15p, 1@3:45p
● Dryft is one message in, two messages out
○ Output = 1@1p, 2@1:30p, 1@2p, 2@2:15p, 1@2:30p, 0@3:15p, 1@3:45p, 0@4:45p
● Other fancy SQL analysis tricks too
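The two output sequences on this slide can be reproduced with a short simulation for a single user. It is a sketch of the semantics only, not of how Flink or Dryft implement them: the function names are hypothetical, times are minutes since midnight, and the second message for each ride is assumed to fire exactly one hour after the ride, which is what the slide's example implies.

```python
from typing import List, Tuple

WINDOW_MINUTES = 60
rides = [13 * 60, 13 * 60 + 30, 14 * 60 + 15, 15 * 60 + 45]  # 1p, 1:30p, 2:15p, 3:45p


def label(minutes: int) -> str:
    """Format minutes since midnight in the slide's style, e.g. 810 -> '1:30p'."""
    hour, minute = divmod(minutes, 60)
    suffix = "a" if hour < 12 else "p"
    hour12 = hour % 12 or 12
    return f"{hour12}{suffix}" if minute == 0 else f"{hour12}:{minute:02d}{suffix}"


def one_in_one_out(ride_times: List[int]) -> List[Tuple[int, int]]:
    """Emit one count per ride: rides within the preceding hour, including itself."""
    return [
        (t, sum(1 for other in ride_times if t - WINDOW_MINUTES <= other <= t))
        for t in ride_times
    ]


def one_in_two_out(ride_times: List[int]) -> List[Tuple[int, int]]:
    """Emit a count when a ride arrives and again when it leaves the window."""
    changes = [(t, +1) for t in ride_times] + [(t + WINDOW_MINUTES, -1) for t in ride_times]
    output, count = [], 0
    for t, delta in sorted(changes):
        count += delta
        output.append((t, count))
    return output


def render(output: List[Tuple[int, int]]) -> str:
    return ", ".join(f"{count}@{label(t)}" for t, count in output)


flink_default = render(one_in_one_out(rides))
dryft_style = render(one_in_two_out(rides))
```

With the second message, a consumer reading the latest value at 3:30p sees 0 rather than the stale 2 left behind by the 2:15p ride.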
16. Beyond Feature Generation
● Reactive Programming
○ When an event occurs, execute some logic
● Asynchronous Programming
○ Perform more processing asynchronously
● Geotriggering
○ Emit an event when someone enters or leaves an area
● Change Data Capture
○ Mirror production data in analytical data stores
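As an illustration of the geotriggering bullet above, the sketch below keeps one bit of state per user and emits an event whenever that state flips. Everything here is assumed for the example: `BoundingBox`, `geotrigger`, the field names, and the coordinates are hypothetical, and a rectangular area is used only to keep the containment test trivial.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator

Location = Dict[str, Any]


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical rectangular area in latitude/longitude."""
    min_lat: float
    max_lat: float
    min_lng: float
    max_lng: float

    def contains(self, lat: float, lng: float) -> bool:
        return self.min_lat <= lat <= self.max_lat and self.min_lng <= lng <= self.max_lng


def geotrigger(locations: Iterable[Location], area: BoundingBox) -> Iterator[Dict[str, Any]]:
    """Emit 'enter' or 'leave' whenever a user's inside/outside state changes."""
    inside: Dict[int, bool] = {}  # per-user state; users start outside
    for loc in locations:
        now_inside = area.contains(loc["lat"], loc["lng"])
        if now_inside != inside.get(loc["user_id"], False):
            yield {
                "user_id": loc["user_id"],
                "event": "enter" if now_inside else "leave",
                "ts": loc["ts"],
            }
        inside[loc["user_id"]] = now_inside


airport = BoundingBox(min_lat=37.60, max_lat=37.63, min_lng=-122.40, max_lng=-122.36)
updates = [
    {"user_id": 1, "ts": 1, "lat": 37.50, "lng": -122.38},  # outside
    {"user_id": 1, "ts": 2, "lat": 37.61, "lng": -122.38},  # enters
    {"user_id": 1, "ts": 3, "lat": 37.62, "lng": -122.37},  # still inside, no event
    {"user_id": 2, "ts": 4, "lat": 37.61, "lng": -122.39},  # first update is inside
    {"user_id": 1, "ts": 5, "lat": 37.70, "lng": -122.38},  # leaves
]
triggers = list(geotrigger(updates, airport))
```

In a stream processor the `inside` dictionary would be keyed state partitioned by user, so the same logic can run in parallel across users.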
17. The End of the Stream
● Real-time Visualization w/ Druid and Superset
● Presto for queries under 10s over large data sets
● Hive and Spark for extreme data sets
● Flyte - extreme data set workflows