Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018

Streaming@Lyft
Flink Meetup June 2018
Gregory Fee

About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model training from 4 months to every 10 minutes
○ Real-time traffic updates
○ Flyte - Large Scale Orchestration and Batch Compute
○ Lyftlearn - Custom Machine Learning Library
○ Dryft - Real-time Feature Generation for Machine Learning

Streaming Use Cases
● Firehose -> S3 for Hive and Presto
● Real-time Traffic
● Primetime + Heatmaps
● Fraud Detection
● Ride Receipts
● Driver incentives
● Passenger coupons
● Anomaly Detection
● Map Correction
● Location Smoothing
● Bad Experience Detection
● Accident Detection
● ...and so much more

A Brief History of Lyft
● <2015 - monolithic PHP app, Redshift
● 2015 - Python services, “workers”, Fanner, Spark, Job Scheduler
● 2016 - Go services, Hive
● 2017 - Presto
● 2018 - Druid, Kafka, Flink, Beam

Streaming Architecture Overview
[Diagram components: Mobile, Services, Ingest/Enrich, Fanner, KCL, Job Scheduler, S3]

Fanner
● Pub-sub layer on Kinesis
● Curates output streams based on event type and simple value filters
● Pros - easy to use, integrated with JobScheduler
● Cons - no ordering guarantees, limited scaling
[Diagram: Kinesis (All Events) -> Fanner -> Kinesis (Event A), Kinesis (Event B), Kinesis (Event A + B)]
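
The routing described on this slide can be sketched as follows. This is a minimal illustration, not Fanner's actual implementation: the `Route` class, the `fan_out` function, the stream names, and the event fields are all hypothetical, and in-memory lists stand in for Kinesis streams.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set

Event = Dict[str, Any]


@dataclass
class Route:
    """Hypothetical output-stream definition: event types plus an optional value filter."""
    stream: str
    event_types: Set[str]
    value_filter: Callable[[Event], bool] = field(default=lambda event: True)


def fan_out(events: List[Event], routes: List[Route]) -> Dict[str, List[Event]]:
    """Copy each input event to every output stream whose route matches it."""
    outputs: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        for route in routes:
            if event["type"] in route.event_types and route.value_filter(event):
                outputs[route.stream].append(event)
    return dict(outputs)


# Invented sample data: one input stream carrying every event type.
all_events = [
    {"type": "A", "region": "SEA"},
    {"type": "B", "region": "SFO"},
    {"type": "C", "region": "SEA"},
]
routes = [
    Route("event_a", {"A"}),
    Route("event_b", {"B"}),
    Route("event_a_and_b", {"A", "B"}),
    # A "simple value filter": only type-A events from one region.
    Route("event_a_seattle", {"A"}, lambda e: e["region"] == "SEA"),
]
streams = fan_out(all_events, routes)
```

Events of type C match no route here, so they appear in no output stream.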

Job Scheduler
● Register data to call webhook
● Fire and forget
● Immediate or scheduled
● Pros - easy asynchronous programming, scales well
● Cons - no ordering guarantees, suboptimal outage handling, no replay
[Diagram components: Service, SQS, Workers, Target Service]

“Workers”
● Simple asynchronous/stream programming
● KCL worker to grab data from Kinesis, transform, store in Redis
● Additional workers read from Redis, transform, store in Redis
● Pros - familiar programming paradigm
● Cons - failure handling, checkpointing, joins, replay, etc. are roll your own (aka error prone)

Next Generation Goals
● Build Community
● Lower support burden
● Enhance developer ergonomics
● Scale with the business
● Support additional use cases

Next Generation Overview
● Pub-Sub -> Kafka
○ Large community, mature technology
● Stream Processing -> Flink
○ Growing community, event time processing
● Apache Beam for cross-language support
● SPaaS
○ Kubernetes
○ Dryft

Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine using streaming or bulk data
○ Add automation to make it super simple to launch and maintain feature generation programs at scale

Dryft Program
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {
    "stream": "declridecompleted"
  },
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}

SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
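
To make the query's behavior concrete, the sketch below runs it against a handful of invented rows. It uses Python's built-in SQLite as a stand-in for Flink SQL, which is an assumption made only so the example is runnable; the table contents are hypothetical, and an `ORDER BY` is added purely to make the output deterministic.

```python
import sqlite3

# The slide's query, plus ORDER BY so the result order is stable for the example.
QUERY = """
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
ORDER BY user_id
"""

# Invented sample rows: (ride_id, user_lyft_id, passenger_lyft_id, passenger_id).
SAMPLE_ROWS = [
    (1, 10, None, None),      # identified by user_lyft_id
    (2, None, 10, None),      # falls back to passenger_lyft_id
    (3, None, None, 20),      # falls back to passenger_id
    (4, None, None, None),    # no identifier at all, grouped under -1
    (5, 10, 99, 98),          # first non-null identifier wins
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_ride_completed ("
    "ride_id INTEGER, user_lyft_id INTEGER, "
    "passenger_lyft_id INTEGER, passenger_id INTEGER)"
)
conn.executemany("INSERT INTO event_ride_completed VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
result = conn.execute(QUERY).fetchall()
conn.close()
```

With these rows, `COALESCE` folds rides 1, 2, and 5 into the same user, and the ride with no identifier lands in the `-1` bucket.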

Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB
[Diagram: S3 Source -> SQL -> Sink]
[Diagram: Kinesis/Kafka Source -> SQL -> Sink]

Bootstrapping
● Read historic data from S3
● Transition to reading real-time data
● https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
[Diagram: S3 Source (< Target Time) and Kinesis/Kafka Source (>= Target Time) -> Business Logic -> Sink]
Continuous Window Semantics
● Continuous window counts go up and down
● SELECT user_id, COUNT(ride_id) OVER (PARTITION BY user_id ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) FROM event_ride_completed
● Example: rides 1p, 1:30p, 2:15p, 3:45p
● Flink default is one message in, one message out
○ Output = 1@1p, 2@1:30p, 2@2:15p, 1@3:45p
● Dryft is one message in, two messages out
○ Output = 1@1p, 2@1:30p, 1@2p, 2@2:15p, 1@2:30p, 0@3:15p, 1@3:45p, 0@4:45p
● Other fancy SQL analysis tricks too
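
The two output sequences on this slide can be reproduced with a small simulation for a single user. It is a sketch of the semantics only, not Flink or Dryft code: the function names are hypothetical, times are minutes since midnight, and the window is assumed to be half-open, `(t - 1 hour, t]`, which is what the slide's 1@2p value implies.

```python
from typing import List

WINDOW_MINUTES = 60


def fmt(minutes: int) -> str:
    """Render minutes since midnight in the slide's style, e.g. 810 -> '1:30p'."""
    hour, minute = divmod(minutes, 60)
    suffix = "a" if hour < 12 else "p"
    hour12 = hour % 12 or 12
    return f"{hour12}{suffix}" if minute == 0 else f"{hour12}:{minute:02d}{suffix}"


def count_at(rides: List[int], t: int) -> int:
    """Count rides in the window (t - 60 min, t]; the open lower bound is an assumption."""
    return sum(1 for r in rides if t - WINDOW_MINUTES < r <= t)


def one_in_one_out(rides: List[int]) -> List[str]:
    """Emit a count only when a ride arrives."""
    return [f"{count_at(rides, t)}@{fmt(t)}" for t in sorted(rides)]


def one_in_two_out(rides: List[int]) -> List[str]:
    """Emit a count when a ride arrives and again when it leaves the window."""
    times = sorted({t for r in rides for t in (r, r + WINDOW_MINUTES)})
    return [f"{count_at(rides, t)}@{fmt(t)}" for t in times]


# Rides at 1p, 1:30p, 2:15p, and 3:45p, as in the slide's example.
rides = [13 * 60, 13 * 60 + 30, 14 * 60 + 15, 15 * 60 + 45]
default_output = one_in_one_out(rides)
dryft_output = one_in_two_out(rides)
```

The second variant is what makes the count "go down": each ride produces a second message an hour later, when it falls out of the window.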

Beyond Feature Generation
● Reactive Programming
○ When an event occurs, execute some logic
● Asynchronous Programming
○ Perform more processing asynchronously
● Geotriggering
○ Emit an event when someone enters or leaves an area
● Change Data Capture
○ Mirror production data in analytical data stores
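
As an illustration of the geotriggering item above, the sketch below emits an event whenever a tracked position crosses the boundary of a circular area. Everything here is hypothetical: the `geotrigger` function, the planar coordinates, and the circular area are simplifications chosen for the example, not a description of how Lyft implements it.

```python
from math import hypot
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def geotrigger(
    positions: Iterable[Point], center: Point, radius: float
) -> List[Tuple[int, str]]:
    """Return (position_index, 'enter' | 'leave') for each boundary crossing."""
    events: List[Tuple[int, str]] = []
    inside = False  # assumption: the tracked party starts outside the area
    for index, (x, y) in enumerate(positions):
        now_inside = hypot(x - center[0], y - center[1]) <= radius
        if now_inside != inside:
            events.append((index, "enter" if now_inside else "leave"))
            inside = now_inside
    return events


# Invented path that enters the area, leaves it, and enters again.
path = [(0.0, 0.0), (4.0, 0.0), (5.0, 1.0), (9.0, 0.0), (5.0, 0.0)]
events = geotrigger(path, center=(5.0, 0.0), radius=2.0)
```

Only state changes produce output, so consecutive positions inside the area yield a single "enter" event.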

The End of the Stream
● Real-time Visualization w/ Druid and Superset
● Presto for queries under 10 s over large data sets
● Hive and Spark for extreme data sets
● Flyte - extreme data set workflows
