Virtual Flink Forward 2020: High quality performant and cost efficient schema-aware data streams on Flink at Netflix scale - Jagannathrao Mudda, Ramayan Tiwari
Over 4 million events per second, spanning hundreds of event types and a petabyte of data per day received from a variety of client devices of 150+ million members, add up to an enormous scale of user behavior and application performance data. This data powers machine learning, recommendations, personalization, and many other services that enhance the user experience. The quality of these billions of events, and the performance and cost efficiency of processing such a large dataset, are critical: they drive every decision we make in creating and surfacing the right content for our users. At Netflix, we developed multiple services and tools on our Flink infrastructure that enable us to schematize event data consisting of a growing set of event types and produce schema-compliant data streams. Schema definitions, updates, registrations, update notifications, backward-compatibility checks, and real-time schema-compliance validation at Netflix scale have been challenging tasks, and we have used various techniques and open-source technologies to overcome these challenges. With this solution, we are able to deliver high-quality data streams, improve efficiency, and achieve up to 40% cost savings. Join us to learn how we schematize data streams, roll out schema upgrades, seamlessly migrate producers and consumers with schema enforcement, and handle non-conforming events to deliver the high-quality Flink data streams that enable us to delight users.
4. Netflix Scale
Millions of client devices with different versions of desktop, mobile and TV OS
5. The 3Vs at Netflix*
Data Variety:
● 450+ event types from millions of devices
● Structured and semi-structured event data
Data Velocity:
● 350K+ requests per second in real time
● 7+ million events processed per second
Data Volume:
● 400+ billion events collected every day
● A petabyte of data per day at rest
* These data points are limited to user behaviour data coming from client devices
6. Data Variety at Netflix
● Netflix consumer app
○ Events capturing user interaction, intent and behavior
○ Events capturing app and system performance
● Netflix production studio apps
● Netflix partner apps (resellers bundling Netflix with their services)
● Sales, marketing, advertising, and promotions events
7. Data Variety Impact on Data Quality
● Misinterpretation of data leads to:
○ Inconsistent metrics and data insights
○ Poor recommendations and personalization
○ Inconclusive A/B testing results
○ Decrease in Member Joy, leading to churn
● Data producer changes could break data consumer apps
● Hard to deprecate any event type
8. How to Handle Data Variety?
● Limit unstructured data unless absolutely required
● Curate or transform unstructured data during processing
● Schematize structured and semi-structured data
● Build schema-aware data streams
10. Phases for Schema-aware Data Stream Design
● Schematization (defining/updating schemas)
● Schemafication (generating schema-compliant events)
● Schema Validation
● Integration with the streaming application
● Schema definition of data at rest
11. Design Challenges
● Schematization
○ New event types are added frequently
○ Existing event types are being updated
○ How to define schemas for event types?
○ How to seamlessly notify client-side/server-side apps?
○ How to handle schema evolution of event types? (a compatibility-check sketch follows this slide)
● Schemafication
■ Client side
■ Server side
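For the schema-evolution question above, here is a minimal sketch of how backward compatibility between two schema versions can be checked with Apache Avro's SchemaCompatibility API, the kind of check a schema registry would run before accepting an update. The PlaybackEvent schema and its fields are hypothetical, not from the talk:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityCheck {
    public static void main(String[] args) {
        // v1: the schema existing producers write with (hypothetical)
        Schema writer = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PlaybackEvent\",\"fields\":["
          + "{\"name\":\"title_id\",\"type\":\"long\"}]}");

        // v2: adds an optional field with a default, so a v2 reader can
        // still decode data written with v1 (backward compatible)
        Schema reader = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PlaybackEvent\",\"fields\":["
          + "{\"name\":\"title_id\",\"type\":\"long\"},"
          + "{\"name\":\"device_type\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        System.out.println(result.getType()); // COMPATIBLE
    }
}
```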
12. Design Challenges
● Schema Validation
○ Compile time or runtime?
○ How to handle schema non-compliant events?
● Schema-aware data streams
○ How to define schemas for data streams generated by stateless/stateful applications?
○ How to handle schema evolution of data streams?
○ How do consumers get access to the schema of a data stream?
● Data at Rest
○ How to make it cost effective and still highly performant?
14. Design Approaches - Schemafication
● Client-side schemafication
○ Send schema update notifications to every client/device
○ Access to the schema registry from the client/device (outside the VIP)
○ Package the updated schema with the image and deploy a new version on each device
● Server-side schemafication (sketched below)
○ Generate schema-compliant records in the Flink streaming app
○ Use the latest Avro schema from the schema registry to generate Avro records
○ A schema client in the app receives schema update notifications
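A minimal sketch of the server-side path: a raw JSON payload is converted into an Avro GenericRecord against the latest schema for its event type. The latestSchemaFor stub stands in for a schema-registry client (the actual registry API is not shown in the talk), and the UiClickEvent schema and field names are hypothetical:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class Schemafication {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Stand-in for a schema-registry lookup; a real client would resolve
    // the latest registered schema for the given event type.
    static Schema latestSchemaFor(String eventType) {
        return new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UiClickEvent\",\"fields\":["
          + "{\"name\":\"title_id\",\"type\":\"long\"},"
          + "{\"name\":\"action\",\"type\":\"string\"}]}");
    }

    static GenericRecord toAvro(String eventType, String json) throws Exception {
        Schema schema = latestSchemaFor(eventType);
        JsonNode node = MAPPER.readTree(json);
        GenericRecord record = new GenericData.Record(schema);
        for (Schema.Field field : schema.getFields()) {
            JsonNode value = node.get(field.name());
            if (value == null) continue; // missing fields surface at serialization time
            switch (field.schema().getType()) {
                case LONG:   record.put(field.name(), value.asLong()); break;
                case STRING: record.put(field.name(), value.asText()); break;
                default:     record.put(field.name(), value.asText()); // simplified; real code covers all Avro types
            }
        }
        return record;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toAvro("UiClickEvent",
            "{\"title_id\": 42, \"action\": \"play\"}"));
    }
}
```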
16. Design Approaches - Schema Validation
● Compile-time validation
○ Data type and mandatory field validation while creating an instance of a specific Avro record
○ Requires building and pushing a new image for every schema change
● Runtime validation (sketched below)
○ Data type validation while creating an instance of an Avro generic record
○ Mandatory field validation when Avro generic records are serialized
○ Send schema non-compliant records to a different channel along with the schema errors
○ Schema non-compliant records can continue to be in JSON format
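A minimal sketch of runtime validation in a Flink job, assuming a JSON-to-GenericRecord converter like the Schemafication.toAvro sketch above. Serialization is the step that enforces mandatory fields, and non-compliant records are routed to a Flink side output in their original JSON form; the class and tag names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ValidateAndEncode extends ProcessFunction<String, byte[]> {
    // Side output for schema non-compliant events, kept as JSON.
    public static final OutputTag<String> NON_COMPLIANT =
        new OutputTag<String>("schema-non-compliant") {};

    @Override
    public void processElement(String json, Context ctx, Collector<byte[]> out) {
        try {
            // Type checks happen while the GenericRecord is populated...
            GenericRecord record = Schemafication.toAvro("UiClickEvent", json);
            // ...and mandatory-field checks happen here: Avro throws if a
            // required field was never set.
            out.collect(serialize(record));
        } catch (Exception schemaError) {
            ctx.output(NON_COMPLIANT, json); // route to the error channel
        }
    }

    private static byte[] serialize(GenericRecord record) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();
        return bytes.toByteArray();
    }
}
```

Downstream, compliant binary-encoded records flow out of process(new ValidateAndEncode()), while getSideOutput(ValidateAndEncode.NON_COMPLIANT) yields the JSON rejects for the separate error channel.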
18. Schema-aware Data Streams Design Requirements
● Data streams can contain event, context, and other enriched attributes
● Data streams can be enriched and transformed by streaming apps
● Data stream schemas can be evolved
● Stateless and stateful applications can perform generic transformations and aggregations
20. Design Approaches - Data At Rest
● Data at rest in Avro format
○ Full schema evolution support
○ Row-oriented, so not a good fit for wide, high-volume tables
● Embedded Avro binary column in Parquet format (deserialization sketched below)
○ Serialize the large column using the Avro binary format
○ Table is columnar in Parquet format with an embedded Avro binary column
○ Highly performant
○ A UDF to deserialize the Avro binary column
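A minimal sketch of what the core of such a UDF does, assuming standard Avro binary decoding; the Hive/Spark UDF wrapper and the registry lookup for the schema are omitted:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AvroColumnReader {
    // Decode the Avro-binary payload stored in a single Parquet column.
    // This assumes the payload was written with readerSchema; to support
    // schema evolution between versions, Avro takes both schemas:
    // new GenericDatumReader<>(writerSchema, readerSchema).
    static GenericRecord deserialize(byte[] avroBinary, Schema readerSchema) throws Exception {
        return new GenericDatumReader<GenericRecord>(readerSchema)
            .read(null, DecoderFactory.get().binaryDecoder(avroBinary, null));
    }
}
```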
22. Schema-aware Data Streams Benefits
● Schema for data in motion
○ No misinterpretation of data
○ High data quality
■ Real-time data quality checks
■ Segregation of schema-compliant and non-compliant data
● Compute efficiency
○ Binary-encoded data in motion
○ Data processing up to 30% more efficient
● Storage efficiency
○ Binary-encoded column in the data store
○ Up to 40% less storage
● Cost efficiency
○ Up to 40% cost savings
● Enables us to deliver high-quality, performant, and cost-efficient schema-aware data streams
23. An Example of Compute Benefit
● JSON processing versus Avro generic record processing (illustrated below)
● Enabled us to do more compute/processing at the ingestion layer
● Moved decompaction to an app that is doing Avro processing
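As a hypothetical illustration of the JSON-versus-Avro difference (not a benchmark from the talk): reading a field from a JSON event re-parses text on every record, while a binary-decoded GenericRecord exposes schema-typed fields directly. The title_id field name is made up:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.avro.generic.GenericRecord;

public class FieldAccess {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // JSON path: every record pays for a full text parse before any field access.
    static long titleIdFromJson(String json) throws Exception {
        return MAPPER.readTree(json).get("title_id").asLong();
    }

    // Avro path: the record was decoded once from compact binary; field
    // access is a lookup, with types already enforced by the schema.
    static long titleIdFromAvro(GenericRecord record) {
        return (Long) record.get("title_id");
    }
}
```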
24. Greater Data Quality Translates to
● Consumers in sync with the schema of data streams
● Consistent metrics and data insights
● Great recommendations and personalization
● Conclusive A/B testing results
● Decreased turnaround time for feature/app performance improvements
● …
● Increase in Member Joy