Over 137 million members worldwide enjoy TV series and feature films across a wide variety of genres and languages on Netflix, generating user behavior data at petabyte scale. At Netflix, our client logging platform collects and processes this data to power recommendations, personalization, and many other services that enhance the user experience. Built with Apache Flink, this platform processes hundreds of billions of events and a petabyte of data per day, at 2.5 million events per second, with sub-millisecond latency. The processing involves a series of data transformations, such as decryption, and data enrichment with customer, geo, and device information via microservice-based lookups.
The transformed and enriched data is further used by multiple data consumers for a variety of applications, such as improving the user experience with A/B tests, tracking application performance metrics, and tuning algorithms. This causes redundant reads of the dataset by multiple batch jobs and incurs heavy processing costs. To avoid this, we have developed a config-driven, centralized, managed platform on top of Apache Flink that reads this data once and routes it to multiple streams based on dynamic configuration. This has resulted in improved computation efficiency, reduced costs, and reduced operational overhead.
Stream processing at this scale, while keeping the production systems scalable and cost-efficient, brings interesting challenges. In this talk, we share how we leverage Apache Flink to achieve this, the challenges we faced, and our learnings from running one of the largest Flink applications at Netflix.
2. Agenda
● Consolidated Logging (CL) Overview
● High-Level Architecture of the CL Platform
● Log Processing at Scale
● Event Extractor Use Case
● Monitoring and Alerting
● Impact of the Flink-Based Platform
4. Consolidated Logging
Build an integrated solution to provide insights into user behavior and application performance metrics through client-side logging.
5. Use Cases Powered By CL
● Personalization
● Recommendations
● A/B Experimentation
● Application Performance
10. CL App Features
● Generic log processing application - supports different logging specifications (sketched below)
● Real-time processing
○ Data transformations
○ Data enrichment - membership information, geo, device type
■ Joins
● Single source of truth with unified output schema
● Supports different data sinks: Kafka/Hive
● SLA
○ RPS: 3.5 million events/sec at peak; latency: < 3 ms
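To make the shape of such a job concrete, here is a minimal DataStream sketch of a CL-style processor: consume from Kafka, transform and enrich, and write to a sink. The broker address, topic names, and the transform/enrich stubs are hypothetical stand-ins; the real application plugs decryption and microservice-based lookups in at those points.

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    public class ClJobSketch {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.setProperty("group.id", "cl-app");              // hypothetical group id

        // Raw client log events from the source Kafka topic (topic name is illustrative).
        DataStream<String> raw = env.addSource(
            new FlinkKafkaConsumer<>("cl_raw_events", new SimpleStringSchema(), props));

        // Stand-ins for the decryption/transformation step and the membership,
        // geo, and device enrichment lookups described above.
        DataStream<String> processed = raw
            .map(ClJobSketch::transform)
            .map(ClJobSketch::enrich);

        // Unified output schema written back to Kafka; a Hive sink would hang off
        // the same processed stream in a separate job.
        processed.addSink(
            new FlinkKafkaProducer<>("cl_processed_events", new SimpleStringSchema(), props));

        env.execute("cl-log-processor");
      }

      private static String transform(String event) { return event; } // placeholder
      private static String enrich(String event) { return event; }    // placeholder
    }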
11. CL App Design
● Stateless Flink application (Flink 1.4, Kafka 1.1)
○ At-least-once processing (see the checkpointing sketch below)
● Isolation of concerns through separate Flink jobs for different use cases/sink types
● Different job DAGs with a common framework library: fan-in / fan-out
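A sketch of what "stateless with at-least-once" means in Flink terms, inside the job's main method; the interval is illustrative, not the production value.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // At-least-once checkpointing skips barrier alignment: duplicates are
    // possible on recovery, but checkpoints stay cheap for a stateless job.
    env.enableCheckpointing(30_000, CheckpointingMode.AT_LEAST_ONCE);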
12. Common Log Processing Framework
[Architecture diagram: raw events are consumed from Kafka by the Log Consumer; the Config Reader (FP) and Spec Parser select the CL schema / app schema by request type and version; Data Transformations and Data Enrichment then produce processed events, and the Data Sink fans segregated sources out to multiple sinks (Kafka, Hive/Iceberg). Raw events are also partitioned and backed up to Hive.]
13. Learnings & Best Practices
● Embarrassingly parallel job (parallelism over 2,000)
○ Uniform CPU utilization with a high number of partitions on the source Kafka topic
● High memory pressure and GC pauses on the JobManager caused a recovery-failure/restart loop
○ Memory leak in archiving execution history (FLINK-10066)
○ Scaling bottleneck in the Kafka source's union state (FLINK-10122)
● Coordinator overwhelmed by a thundering-herd problem at high parallelism (KIP-266)
15. Learnings & Best Practices
● Data compression ratio was ~4x worse for Parquet and Kafka
○ Differences in upstream Kafka producer batching increased data entropy
● A backlog in Kafka can lead to sudden load on external microservices
● Kafka backpressure leads to task failures
○ Duplicate events
● Guice dependency injection conflicts with Flink (see the config sketch below)
○ classloader.resolve-order=parent-first
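The workaround named in the last bullet is a standard Flink setting; a minimal sketch of the relevant flink-conf.yaml line:

    # flink-conf.yaml
    # Resolve classes from the parent (Flink) classloader first, avoiding
    # duplicate-binding conflicts when Guice is on both classpaths.
    classloader.resolve-order: parent-first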
19. Problems with the CL Legacy Pipeline
● Growth/scale
○ 3.5 million events/sec
○ Reading the same data multiple times
■ Compute redundancy
■ Scaling Kafka infrastructure for outgoing bytes
■ Operational overhead
● High compute and operational cost
21. What is Event Extractor?
● Stateless single Flink application
● Reads data once, applies processing, and routes it to multiple streams (see the routing sketch below)
● Configuration-driven processing, without code changes
● SQL support on streams
● Filter, transformation, and projection support on streams
● Out-of-the-box metrics for users
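A minimal sketch of the read-once/route-many pattern using Flink side outputs; the route tags, the predicate and projection stubs, and the sink variables are hypothetical stand-ins for what Event Extractor derives from user configs.

    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    // One tag per configured route; each tag later gets its own sink.
    final OutputTag<String> routeA = new OutputTag<String>("route_a") {};
    final OutputTag<String> routeB = new OutputTag<String>("route_b") {};

    SingleOutputStreamOperator<String> routed = enrichedStream.process(
        new ProcessFunction<String, String>() {
          @Override
          public void processElement(String event, Context ctx, Collector<String> out) {
            // Evaluate each route's filterExpression, then emit the projected event.
            if (matchesRouteA(event)) ctx.output(routeA, project(event));
            if (matchesRouteB(event)) ctx.output(routeB, project(event));
          }
        });

    // The enriched stream is read once; each route fans out to its own sink.
    routed.getSideOutput(routeA).addSink(kafkaSinkForRouteA);
    routed.getSideOutput(routeB).addSink(kafkaSinkForRouteB);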
22. Event Extractor User Interface
● User configuration in YAML
● Configs are managed in version control and updated in S3
● Example config:

filterExpression: field1 = 'Presented' and field2 like '%impressionToken%' and field3 not like '%storyArt%'
projectionExpression: field_name1, field_name2, field_name3, field_name5
transformations: {OutputFieldName: inner_field, fieldName: top_level_field, nestedFieldName: inner_field, type: type}
sinkDetails: {sinkType: kafka, name: topic_name}
ownerName: email-address
routeName: unique_name
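One way the stream SQL support can map onto a config like the one above (a sketch: cl_enriched_stream is a hypothetical registered table, tableEnv a Flink StreamTableEnvironment, and Table is org.apache.flink.table.api.Table):

    // filterExpression + projectionExpression combined into one Flink SQL query.
    String sql =
        "SELECT field_name1, field_name2, field_name3, field_name5 " +
        "FROM cl_enriched_stream " +
        "WHERE field1 = 'Presented' AND field2 LIKE '%impressionToken%' " +
        "AND field3 NOT LIKE '%storyArt%'";
    Table route = tableEnv.sqlQuery(sql);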
23. Event Extractor Design
[Architecture diagram: user configs arrive via S3 through a config management pipeline (Config Reader → Config Parser → SQL Parser → Schema Builder); the CL enriched stream passes through the Filter Function, Transformation, and Projection for each route, then fans out to multiple Kafka sinks, a Hive sink, and an Elasticsearch sink.]
24. Challenges with Event Extractor
● Scaling a single Flink application
● Lack of isolation
○ Isolated by the type of sink the application writes to
○ One deployment per sink type (Kafka, Hive, Elasticsearch)
● Backpressure is shared between multiple consumers
○ Consumer Kafka topics are created in the same cluster
○ Canaries and testing before onboarding a new config
25. Learnings and Best Practices
● Buildup of network pressure caused S3 checkpoint failures due to socket timeouts
○ The job went into a restart loop due to the high frequency of checkpoint failures
○ Mitigated by tuning G1GC and increasing S3 timeouts
● Tuning parallelism to avoid unbalanced CPU utilization (see the sketch below)
○ Extensive CPU flame graphs and system metrics to identify bottlenecks
○ Setting parallelism in multiples of Kafka partitions and task slots achieves better CPU utilization
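A sketch of the parallelism rule of thumb from the last bullet; the partition count is illustrative.

    // With 96 source partitions (illustrative), pick a parallelism that divides
    // the partition count evenly, so every subtask reads the same number of
    // partitions and no subtask becomes a CPU hot spot.
    int sourcePartitions = 96;
    env.setParallelism(sourcePartitions);          // 1 partition per subtask
    // env.setParallelism(sourcePartitions / 2);   // or 2 partitions per subtask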
26. Learnings and Best Practices
● The Flink Kafka consumer needs a continuous stream to advance the high watermark (FLINK-5479)
○ The StickyPartitioner producer skips producing data to out-of-sync partitions
○ Setting stickyPartitioner.minQualifiedIsrRatio=1.0 helps produce data to out-of-sync partitions
● Outlier container/broker (due to bad hardware)
○ The consumer sees a non-linear traffic pattern (stuck-consumer alert)
○ The producer throws a BatchExpiredTimeout exception and checkpoint failures increase
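FLINK-5479 describes watermarks stalling when some partitions stop receiving data. The talk's mitigation is the StickyPartitioner setting above; in newer Flink releases (1.11+), marking stalled partitions idle achieves a similar effect. A sketch, assuming a hypothetical Event type:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;

    WatermarkStrategy<Event> strategy =
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            // Partitions with no data for 1 minute are marked idle, so they no
            // longer hold back the operator-wide watermark.
            .withIdleness(Duration.ofMinutes(1));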
27. Deployment
● Keystone (self-serve UI) for deployment of streaming apps
○ Out-of-the-box ELK stack support for application logs
○ Automated alerts integration with Atlas
● Deployment strategy
○ Minimize duplicates; checkpoints are stored in S3
● Restart strategy
○ Fine-grained recovery (see the config sketch below)
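Fine-grained recovery is enabled through cluster configuration; a minimal sketch (the "region" strategy restarts only the failed pipelined region instead of the whole job):

    # flink-conf.yaml
    jobmanager.execution.failover-strategy: region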
30. CL Platform Benefits
● Improved Data Processing
○ Can handle large payloads compared to the legacy pipeline
○ Improved error handling
● Reduced Data Loss
○ Reduced points of failure
○ Ability to backfill or reprocess historic raw events
● Reduced Cost & Operational Overhead
○ Legacy tables decommissioned and reduced storage redundancy
○ Read once and route to different sinks through Event Extractor
● Single Source of Truth
○ Single source of truth (SSOT) for CL data in the data warehouse
○ Schema consistency across CL components and tools