Virtual Flink Forward 2020: High quality performant and cost efficient schema-aware data streams on Flink at Netflix scale - Jagannathrao Mudda, Ramayan Tiwari
Over 4 million events per second, spanning hundreds of event types and a petabyte of data per day received from a variety of client devices of 150+ million members, add up to an enormous scale of user behavior and application performance data. This data powers machine learning, recommendations, personalization, and many other services that enhance the user experience. The quality of these billions of events, and the performance and cost efficiency of processing such a large dataset, are critical: they drive every decision we make in creating and surfacing the right content for our users. At Netflix, we developed multiple services and tools on our Flink infrastructure that enable us to schematize event data consisting of a growing set of event types and produce schema-compliant data streams. Schema definitions, updates, registrations, update notifications, backward-compatibility checks, and real-time schema-compliance validation at Netflix scale have been challenging tasks, and we have used various techniques and open-source technologies to overcome these challenges. With this solution, we are able to deliver high-quality data streams, improve efficiency, and achieve up to 40% cost savings. Join us to learn how we schematize data streams, roll out schema upgrades, seamlessly migrate producers and consumers with schema enforcement, and handle non-conforming events to deliver the high-quality Flink data streams that enable us to delight users.
4. Netflix Scale
Millions of client devices with different versions of desktop, mobile and TV OS
5. The 3Vs at Netflix*
Data Variety:
● 450+ event types from millions of devices
● Structured and semi-structured event data
Data Velocity:
● 350K+ requests per second in real time
● 7+ million events processed per second
Data Volume:
● 400+ billion events collected every day
● A petabyte of data per day at rest
* These data points are limited to user behaviour data coming from client devices
6. Data Variety at Netflix
● Netflix consumer app
○ Events capturing user interaction, intent and behavior
○ Events capturing app and system performance
● Netflix production studio apps
● Netflix partner apps (resellers bundling Netflix with their services)
● Sales, marketing, advertising, and promotions events
7. Data Variety Impact on Data Quality
● Misinterpretation of data leads to:
○ Inconsistent metrics and data insights
○ Poor recommendations and personalization
○ Inconclusive A/B testing results
○ Decrease in Member Joy, leading to churn
● Data producer changes could break data consumer apps
● Hard to deprecate any event type
8. How to Handle Data Variety?
● Limit unstructured data unless absolutely required
● Curate or transform unstructured data during processing
● Schematize structured and semi-structured data
● Build schema-aware data streams
10. Phases for Schema-aware Data Stream Design
● Schematization (defining/updating schemas)
● Schemafication (generating schema-compliant events)
● Schema Validation
● Integration with the streaming application
● Schema definition of data at rest
11. Design Challenges
● Schematization
○ New event types are added frequently
○ Existing event types are being updated
○ How to define schemas for event types?
○ How to seamlessly notify client-side/server-side apps?
○ How to handle schema evolution of event types? (a compatibility-check sketch follows this slide)
● Schemafication
■ Client side
■ Server side
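For the schema-evolution question above, here is a minimal sketch of how backward compatibility between two schema versions can be checked with Apache Avro's SchemaCompatibility API, the kind of check a schema registry would run before accepting an update. The PlaybackEvent schema and its fields are hypothetical, not from the talk:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityCheck {
    public static void main(String[] args) {
        // v1: the schema existing producers write with (hypothetical)
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PlaybackEvent\",\"fields\":["
          + "{\"name\":\"title_id\",\"type\":\"long\"}]}");

        // v2: adds an optional field with a default, so a v2 reader can
        // still decode data written with v1 (backward compatible)
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PlaybackEvent\",\"fields\":["
          + "{\"name\":\"title_id\",\"type\":\"long\"},"
          + "{\"name\":\"device_type\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        System.out.println(result.getType()); // COMPATIBLE
    }
}
```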
12. Design Challenges
● Schema Validation
○ Compile time or runtime?
○ How to handle schema non-compliant events?
● Schema-aware data streams
○ How to define schemas for data streams generated by stateless/stateful applications?
○ How to handle schema evolution of data streams?
○ How do consumers get access to the schema of a data stream?
● Data at Rest
○ How to make it cost effective and still highly performant?
14. Design Approaches - Schemafication
● Client-side schemafication
○ Send schema update notifications to every client/device
○ Access to the schema registry from the client/device (outside the VIP)
○ Package the updated schema with the image and deploy a new version on each device
● Server-side schemafication (sketched below)
○ Generate schema-compliant records in the Flink streaming app
○ Use the latest Avro schema from the schema registry to generate Avro records
○ A schema client in the app receives schema update notifications
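A minimal sketch of the server-side path: a raw JSON payload is converted into an Avro GenericRecord against the latest schema for its event type. The latestSchemaFor stub stands in for a schema-registry client (the actual registry API is not shown in the talk), and the UiClickEvent schema and field names are hypothetical:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class Schemafication {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Stand-in for a schema-registry lookup; a real client would resolve
    // the latest registered schema for the given event type.
    static Schema latestSchemaFor(String eventType) {
        return new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UiClickEvent\",\"fields\":["
          + "{\"name\":\"title_id\",\"type\":\"long\"},"
          + "{\"name\":\"action\",\"type\":\"string\"}]}");
    }

    static GenericRecord toAvro(String eventType, String json) throws Exception {
        Schema schema = latestSchemaFor(eventType);
        JsonNode node = MAPPER.readTree(json);
        GenericRecord record = new GenericData.Record(schema);
        for (Schema.Field field : schema.getFields()) {
            JsonNode value = node.get(field.name());
            if (value == null) continue; // missing fields surface at serialization time
            switch (field.schema().getType()) {
                case LONG:   record.put(field.name(), value.asLong()); break;
                case STRING: record.put(field.name(), value.asText()); break;
                default:     record.put(field.name(), value.asText()); // simplified; real code covers all Avro types
            }
        }
        return record;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toAvro("UiClickEvent",
            "{\"title_id\": 42, \"action\": \"play\"}"));
    }
}
```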
16. Design Approaches - Schema Validation
● Compile-time validation
○ Data type and mandatory field validation while creating an instance of a specific Avro record
○ Requires building and pushing a new image for every schema change
● Runtime validation (sketched below)
○ Data type validation while creating an instance of an Avro generic record
○ Mandatory field validation when Avro generic records are serialized
○ Send schema non-compliant records to a different channel along with the schema errors
○ Schema non-compliant records can continue to be in JSON format
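A minimal sketch of runtime validation in a Flink job, assuming a JSON-to-GenericRecord converter like the Schemafication.toAvro sketch above. Serialization is the step that enforces mandatory fields, and non-compliant records are routed to a Flink side output in their original JSON form; the class and tag names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ValidateAndEncode extends ProcessFunction<String, byte[]> {
    // Side output for schema non-compliant events, kept as JSON.
    public static final OutputTag<String> NON_COMPLIANT =
        new OutputTag<String>("schema-non-compliant") {};

    @Override
    public void processElement(String json, Context ctx, Collector<byte[]> out) {
        try {
            // Type checks happen while the GenericRecord is populated...
            GenericRecord record = Schemafication.toAvro("UiClickEvent", json);
            // ...and mandatory-field checks happen here: Avro throws if a
            // required field was never set.
            out.collect(serialize(record));
        } catch (Exception schemaError) {
            ctx.output(NON_COMPLIANT, json); // route to the error channel
        }
    }

    private static byte[] serialize(GenericRecord record) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();
        return bytes.toByteArray();
    }
}
```

Downstream, compliant binary-encoded records flow out of process(new ValidateAndEncode()), while getSideOutput(ValidateAndEncode.NON_COMPLIANT) yields the JSON rejects for the separate error channel.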
18. Schema-aware Data Streams Design Requirements
● Data streams can contain event, context, and other enriched attributes
● Data streams can be enriched and transformed by streaming apps
● Data stream schemas can be evolved
● Stateless and stateful applications can perform generic transformations and aggregations
20. Design Approaches - Data At Rest
● Data at rest in Avro format
○ Full schema evolution support
○ Row-oriented, so not a good fit for wide, high-volume tables
● Embedded Avro binary column in Parquet format (deserialization sketched below)
○ Serialize the large column using the Avro binary format
○ Table is columnar in Parquet format with an embedded Avro binary column
○ Highly performant
○ A UDF to deserialize the Avro binary column
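A minimal sketch of what the core of such a UDF does, assuming standard Avro binary decoding; the Hive/Spark UDF wrapper and the registry lookup for the schema are omitted:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AvroColumnReader {
    // Decode the Avro-binary payload stored in a single Parquet column.
    // This assumes the payload was written with readerSchema; to support
    // schema evolution between versions, Avro takes both schemas:
    // new GenericDatumReader<>(writerSchema, readerSchema).
    static GenericRecord deserialize(byte[] avroBinary, Schema readerSchema) throws Exception {
        return new GenericDatumReader<GenericRecord>(readerSchema)
            .read(null, DecoderFactory.get().binaryDecoder(avroBinary, null));
    }
}
```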
22. Schema-aware Data Streams Benefits
● Schema for data in motion
○ No misinterpretation of data
○ High data quality
■ Real-time data quality checks
■ Segregation of schema-compliant and non-compliant data
● Compute efficiency
○ Binary-encoded data in motion
○ Data processing up to 30% more efficient
● Storage efficiency
○ Binary-encoded column in the data store
○ Up to 40% less storage
● Cost efficiency
○ Up to 40% cost savings
● Enables us to deliver high-quality, performant, and cost-efficient schema-aware data streams
23. An Example of Compute Benefit
● JSON processing versus Avro generic record processing (illustrated below)
● Enabled us to do more compute/processing at the ingestion layer
● Moved decompaction to an app that is doing Avro processing
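As a hypothetical illustration of the JSON-versus-Avro difference (not a benchmark from the talk): reading a field from a JSON event re-parses text on every record, while a binary-decoded GenericRecord exposes schema-typed fields directly. The title_id field name is made up:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.avro.generic.GenericRecord;

public class FieldAccess {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // JSON path: every record pays for a full text parse before any field access.
    static long titleIdFromJson(String json) throws Exception {
        return MAPPER.readTree(json).get("title_id").asLong();
    }

    // Avro path: the record was decoded once from compact binary; field
    // access is a lookup, with types already enforced by the schema.
    static long titleIdFromAvro(GenericRecord record) {
        return (Long) record.get("title_id");
    }
}
```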
24. Greater Data Quality Translates to
● Consumers in sync with the schema of data streams
● Consistent metrics and data insights
● Great recommendations and personalization
● Conclusive A/B testing results
● Decreased turnaround time for feature/app performance improvements
● …
● Increase in Member Joy