Schema-aware Data Streams at Netflix Scale
Jagannathrao Mudda
Ramayan Tiwari
The 3Vs of Big Data
Handling DATA VARIETY is critical for DATA QUALITY
167+ million members spanning 190+ countries
Netflix Scale
Millions of client devices running different versions of desktop, mobile and TV operating systems
Netflix Scale
Data Variety:
● 450+ event types from millions of devices
● Structured and Semi-structured event data
* These data points are limited to user behaviour data coming from client devices
The 3Vs at Netflix*
Data Velocity:
● 350K+ requests per second in real time ✅
● 7+ million events processed per second ✅
Data Volume:
● 400+ billion events being collected every day ✅
● Petabyte of data per day at rest ✅
● Netflix consumer app
○ Events capturing user interaction, intent and behavior
○ Events capturing app and system performance
● Netflix production studio apps
● Netflix partner apps (resellers bundling Netflix with their services)
● Sales, marketing, advertising, promotions events
Data Variety at Netflix
● Misinterpretation of data leads to:
○ Inconsistent metrics, data insights
○ Poor recommendations and personalization
○ Inconclusive A/B testing results
○ Decrease in Member Joy leading to Churn
● Data producer changes could break data consumer apps
● Hard to deprecate any event types
Data Variety Impact on Data Quality
● Limit unstructured data unless absolutely required
● Curate or transform unstructured data during processing
● Schematize structured/semi-structured data
● Build Schema-aware Data Streams
How to Handle Data Variety?
Use Case:
Event Processing Pipeline
● Schematization (Defining/Updating Schema)
● Schemafication (Generating schema compliant events)
● Schema Validation
● Integration with Streaming Application
● Schema Definition of Data at Rest
Phases for Schema-aware Data Stream Design
● Schematization
○ New event types are added frequently
○ Existing event types are being updated
○ How to define schemas for event types?
○ How to seamlessly notify client apps and server-side apps?
○ How to handle schema evolution of event types?
Design Challenges
● Schemafication
■ Client side
■ Server Side
● Schema Validation
○ Compile time or runtime?
○ How to handle schema non-compliant events?
Design Challenges
● Schema-aware data streams
○ How to define schemas for data streams generated by stateless/stateful applications?
○ How to handle schema evolution of data streams?
○ How do consumers get access to the schema of a data stream?
● Data at Rest
○ How to make it cost-effective and still highly performant?
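One way to reason about the schema evolution questions above is a compatibility check between a new (reader) schema and an old (writer) schema. The following is a minimal sketch, not the pipeline described in the talk. It assumes simplified dict-based schemas rather than the Avro library, and the schema contents and the helper names `can_read` and `resolve` are hypothetical. It mirrors one Avro resolution rule: a field that exists only in the reader schema must carry a default. Type promotions that Avro permits are deliberately ignored here.

```python
# Hypothetical, simplified schemas (illustrative only, not real Netflix event types).
SCHEMA_V1 = {
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "device_id", "type": "string"},
    ]
}
SCHEMA_V2 = {
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "device_id", "type": "string"},
        # New field: the default lets V2 readers consume V1 data.
        {"name": "app_version", "type": "string", "default": "unknown"},
    ]
}


def can_read(reader, writer):
    """Return a list of problems that stop `reader` from reading `writer` data."""
    writer_fields = {f["name"]: f for f in writer["fields"]}
    problems = []
    for field in reader["fields"]:
        old = writer_fields.get(field["name"])
        if old is None:
            if "default" not in field:
                problems.append(f"{field['name']}: added without a default")
        elif old["type"] != field["type"]:
            # Simplification: real Avro allows some promotions (e.g. int to long).
            problems.append(f"{field['name']}: type changed")
    return problems


def resolve(record, reader):
    """Project a record written with an older schema onto the reader schema."""
    out = {}
    for field in reader["fields"]:
        name = field["name"]
        # Missing fields take the reader default; extra writer fields are dropped.
        out[name] = record[name] if name in record else field["default"]
    return out
```

A consumer on `SCHEMA_V2` could call `resolve(old_record, SCHEMA_V2)` only after `can_read(SCHEMA_V2, SCHEMA_V1)` returns an empty list.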
Schematization Design
● Client side schemafication
○ Send schema update notification to every client/device
○ Access to schema registry from client/device (outside vip)
○ Package the updated schema with the image and deploy a new version to each device
● Server side schemafication
○ Generate schema-compliant records in the Flink streaming app
○ Use the latest Avro schema from the schema registry to generate Avro records
○ A schema client in the app receives schema update notifications
Design Approaches - Schemafication
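The server-side approach above (latest schema from a registry, plus a schema client that receives update notifications) can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the Netflix schema registry or a Flink API. The names `SchemaRegistry`, `SchemaClient`, and `schemafy` are hypothetical, and schemas are plain dicts rather than Avro schema objects.

```python
class SchemaRegistry:
    """Hypothetical in-memory registry keeping every version per subject."""

    def __init__(self):
        self._versions = {}   # subject -> list of schemas, oldest first
        self._listeners = {}  # subject -> list of callbacks

    def subscribe(self, subject, callback):
        self._listeners.setdefault(subject, []).append(callback)

    def register(self, subject, schema):
        versions = self._versions.setdefault(subject, [])
        versions.append(schema)
        version = len(versions)
        # Push the update to every subscribed schema client.
        for callback in self._listeners.get(subject, []):
            callback(version, schema)
        return version

    def latest(self, subject):
        versions = self._versions[subject]
        return len(versions), versions[-1]


class SchemaClient:
    """Lives inside the streaming app and always holds the latest schema."""

    def __init__(self, registry, subject):
        self.version, self.schema = registry.latest(subject)
        registry.subscribe(subject, self._on_update)

    def _on_update(self, version, schema):
        self.version, self.schema = version, schema


def schemafy(raw_event, client):
    """Build a schema-compliant record from a raw, JSON-like event."""
    record = {}
    for field in client.schema["fields"]:
        name = field["name"]
        if name in raw_event:
            record[name] = raw_event[name]
        elif "default" in field:
            record[name] = field["default"]
        else:
            raise KeyError(f"missing mandatory field: {name}")
    return {"schema_version": client.version, "record": record}
```

Because the client is notified on every `register` call, the app picks up a new schema without being rebuilt or redeployed, which is the contrast the slide draws with packaging schemas into client images.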
Schemafication Design
● Compile-time validation
○ Data type and mandatory field validation when creating an instance of an Avro specific record
○ Requires building and pushing a new image for every schema change
● Runtime validation
○ Data type validation when creating an instance of an Avro generic record
○ Mandatory field validation when Avro generic records are serialized
○ Send schema non-compliant records to a separate channel, along with their schema errors
○ Schema non-compliant records can remain in JSON format
Design Approaches - Schema Validation
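The runtime validation path above can be illustrated with a small routing function. This is a sketch under assumptions: it uses a hand-written validator over a dict-based schema instead of Avro generic records, and the schema, the field names, and the `validate` and `route` helpers are all hypothetical. Compliant events go to one output, while non-compliant events stay as raw JSON and are sent to a second output together with their schema errors.

```python
import json

# Hypothetical event schema, for illustration only.
SCHEMA = {
    "name": "PlaybackEvent",
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "position_ms", "type": "long"},
        {"name": "locale", "type": "string", "default": "en-US"},
    ],
}

# Simplified mapping from schema types to Python types.
PYTHON_TYPES = {"string": str, "long": int, "boolean": bool, "double": float}


def validate(event, schema):
    """Return a list of schema errors; an empty list means compliant."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in event:
            if "default" not in field:
                errors.append(f"missing mandatory field '{name}'")
            continue
        expected = PYTHON_TYPES[field["type"]]
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly for "long".
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"field '{name}' expected {field['type']}")
    return errors


def route(raw_json_lines, schema):
    """Split events into a compliant channel and a non-compliant channel."""
    compliant, non_compliant = [], []
    for line in raw_json_lines:
        event = json.loads(line)
        errors = validate(event, schema)
        if errors:
            # Keep the original JSON untouched and attach the errors.
            non_compliant.append({"raw_json": line, "schema_errors": errors})
        else:
            compliant.append(event)
    return compliant, non_compliant
```

Keeping the rejected events in their original JSON form means nothing is lost: producers can inspect the attached errors and the events can be replayed once the producer or the schema is fixed.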
Schema Validation Sequence Diagram
● Data streams can contain event, context, and other enriched attributes
● Data streams can be enriched and transformed by streaming apps
● Data stream schemas can evolve
● Stateless and stateful applications can perform generic transformation and aggregation
Schema-aware Data Streams Design Requirements
Schema Aware Data Streams Design
● Data at rest in Avro format
○ Full schema evolution support
○ Row-oriented, so a poor fit for wide, high-volume tables
● Embedded Avro binary column in Parquet format
○ Serialize the large column using Avro binary format
○ Table is columnar in Parquet format, with an embedded Avro binary column
○ Highly performant
○ A UDF deserializes the Avro binary column
Design Approaches - Data At Rest
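To make the embedded-binary-column idea more concrete, here is a small sketch. It hand-implements the Avro binary encoding for just two types (`long` as a zigzag variable-length integer, `string` as a length-prefixed UTF-8 sequence) instead of using the Avro library, and it models the columnar table as a plain dict of lists rather than a Parquet file. The payload schema, the table contents, and the helper names are hypothetical. `decode_record` plays the role of the UDF.

```python
def encode_long(n):
    """Avro long: zigzag-encode, then write 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    shift, z = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos  # undo zigzag


def encode_string(s):
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def decode_string(buf, pos):
    length, pos = decode_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


ENCODERS = {"long": encode_long, "string": encode_string}
DECODERS = {"long": decode_long, "string": decode_string}

# Hypothetical schema of the wide payload column: (field name, type) in order.
PAYLOAD_SCHEMA = [("position_ms", "long"), ("title", "string")]


def encode_record(record, schema):
    """An Avro record is the concatenation of its fields in schema order."""
    return b"".join(ENCODERS[kind](record[name]) for name, kind in schema)


def decode_record(buf, schema):
    """Stand-in for the UDF that deserializes the binary column."""
    pos, out = 0, {}
    for name, kind in schema:
        out[name], pos = DECODERS[kind](buf, pos)
    return out


# Columnar table: a queryable top-level column plus one embedded binary column.
table = {
    "event_type": ["play", "pause", "play"],
    "payload": [
        encode_record({"position_ms": 0, "title": "A"}, PAYLOAD_SCHEMA),
        encode_record({"position_ms": 900, "title": "A"}, PAYLOAD_SCHEMA),
        encode_record({"position_ms": 15, "title": "B"}, PAYLOAD_SCHEMA),
    ],
}


def payloads_for(table, event_type):
    """Filter on the plain column first; decode only the matching rows."""
    return [
        decode_record(blob, PAYLOAD_SCHEMA)
        for kind, blob in zip(table["event_type"], table["payload"])
        if kind == event_type
    ]
```

The design point this tries to show is that filters can run on ordinary columns without touching the binary payload, and the payload is deserialized only for rows that survive the filter.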
Data At Rest Design
● Schema for Data In Motion
○ No misinterpretation of data
○ High Data Quality
■ Realtime data quality checks
■ Segregation of schema-compliant and non-compliant data
● Compute Efficiency
○ Binary-encoded data in motion
○ Data processing up to 30% more efficient
● Storage Efficiency
○ Binary-encoded column in the data store
○ Up to 40% less storage
● Cost Efficiency
○ Up to 40% cost savings
● Enables delivery of high-quality, performant, and cost-efficient schema-aware data streams
Schema-aware Data Streams Benefits
● JSON Processing versus Avro Generic Record Processing
● Enabled us to do more compute/processing at ingestion layer
● Moved decompaction to an app that does Avro processing
An example of Compute Benefit
● Consumers are in sync with the schema of data streams
● Consistent metrics, data insights
● Great recommendations and personalization
● Conclusive A/B testing results
● Decrease in turnaround time for feature/app performance improvement
● …
● Increase in Member Joy
Greater Data Quality Translates to
Questions

Virtual Flink Forward 2020: High quality performant and cost efficient schema-aware data streams on Flink at Netflix scale - Jagannathrao Mudda, Ramayan Tiwari
