Netflix Recommendations using Spark + Cassandra

Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. To enable this, we built a time machine that computes features for any arbitrary time in the recent past for offline experimentation. We also built a real-time stream processing system to capture the interests of members during different times of the day and to quickly adapt to changes in the collective interests of members as they happen during real-world events. Building the time machine for offline experimentation and the real-time infrastructure for online recommendations with Apache Spark (Streaming) and Apache Cassandra empowered us both to scale up the data size by an order of magnitude and to train and validate the models in less time. We will delve into the architecture, use case details and data models used for Cassandra, and share our learnings.

About the Speakers

Prasanna Padmanabhan, Engineering Manager, Netflix
Prasanna leads the Data Systems for Personalization team at Netflix. His primary focus is on building various big data infrastructure components that help algorithmic engineers innovate faster and improve personalization for Netflix members. In the past, he has built distributed data systems that leverage both batch and stream processing.

Roopa Tangirala, Engineering Manager, Netflix
Roopa Tangirala is an experienced engineering leader with an extensive background in databases, be they distributed or relational. She manages the database engineering team at Netflix, which is responsible for operating the cloud persistent and semi-persistent runtime stores for Netflix, including Cassandra, Elasticsearch, Dynomite and MySQL databases, and for ensuring data availability, durability and scalability to meet growing business needs.

Netflix Recommendations using Spark + Cassandra

1. Netflix Recommendations using Spark + Cassandra Prasanna Padmanabhan Roopa Tangirala

2. Turn on Netflix and the absolute best content for you would automatically start playing

3. Netflix Recommendations

4. Netflix Recommendations

5. Ranking Everything is a Recommendation Rows Over 80% of what members watch comes from our recommendations Recommendations are driven by Machine Learning Algorithms

6. Data Driven Offline Experiment using Historical Data Online A/B Testing Rollout Feature to ALL members Success Success Fail Algorithmic Page Generation Trending Now

7. Offline Experimentation

8. Algorithmic Page Generation Personalizing the ordering of rows on the homepage

9. Algorithmic Page Generation Without Algorithmic Page Generation With Algorithmic Page Generation Diversity of the Page Affinity for specific rows Drawbacks

10. Algorithmic Page Generation Production

11. Algorithmic Page Generation Production Variant 1

12. Algorithmic Page Generation Production Variant 1 Variant 2 Row Distribution TV/Movie Ratio

13. Algorithmic Page Generation Production Variant 1 Variant 2 Evaluate best variant based on the plays Actual Plays:

14. Algorithmic Page Generation Production Variant 1 Variant 2 Evaluate best variant based on the plays Actual Plays:

15. Algorithmic Page Generation Production Variant 1 Variant 2 Evaluate best variant based on the plays Actual Plays:

16. Variant 2 Algorithmic Page Generation Production Variant 1 Evaluate best variant based on the plays Actual Plays:

17. Offline Experiment Architecture Member Selection Runs once a day Ratings Service S3 Snapshot Snapshot Store Snapshot Forklift Viewing History Service MyList Service Data Snapshots Evaluate Metrics Generate Pages … … A/B Test

18. Data Model - Requirements • Need for historical service data • Optimize for Batch Writes and Point Reads

19. Data Model
    COLUMN FAMILY: MYLIST
    Row key (DATE_MEMBER_ID)      Column
    20161009_1001                 MyList BLOB
    20161009_1002                 MyList BLOB

20. Data Model
    COLUMN FAMILY: VIEWING-HISTORY
    Row key (DATE_MEMBER_ID)      Column
    20161009_1001                 ViewingData BLOB
    20161009_1002                 ViewingData BLOB

21. Data Model
    COLUMN FAMILY: VIEWING-HISTORY
    Row key (DATE_MEMBERID_IDX)   Column
    20161009_1001_0               ViewingData BLOB
    20161009_1001_1               ViewingData BLOB
    20161009_1001_2               ViewingData BLOB

22. Online A/B Testing

23. Trending Now Videos that are Trending and Personalized for you

24. Trending Now It’s 7 PM on a Monday

25. Trending Now It’s 10 PM on a Saturday

26. Trending Now Pokemon

27. Fast Feedback Loop

28. Trending Now - Data Infrastructure Impression Service Viewing History Service UI Online Services Trends Store Compute Trends Model Training Captures videos shown in view port Captures videos played by members Publish Models Viewing History Service Ratings. .. .

29. State Management in Cassandra Video Number of Plays Stranger Things 100 Narcos 200 Orange is the new Black 300

30. State Management in Cassandra Trends Store State Present ? Compute Trends Yes No Init State from Cassandra Load State Update State Read Events

31. Data Model - Requirements • Trending data is for a specific interval of time • Optimize for Batch Writes and Batch Reads

32. Data Model
    COLUMN FAMILY: Interval 1, Interval 2 … Interval N
    Row key (VIDEOID_METADATA)    Columns
    101_METADATA                  Plays BLOB, Impressions BLOB
    102_METADATA                  Plays BLOB, Impressions BLOB
    103_METADATA                  Plays BLOB, Impressions BLOB

33. Roopa Tangirala Engineering Manager @ Netflix Twitter - @roopatangirala

34. FORKLIFTER

35. ARCHITECTURE SOURCE TARGET

36. USE CASES

37.

38. APACHE THRIFT CQL

39.

40. DEMO

41.

42. WHY NOT DSE SPARK?

43.

44. SCALABILITY

45. COST EFFECTIVENESS

46. LESSONS LEARNT

47. TTL HANDLING • TTL Reading And Writing is Asymmetric (CASSANDRA-12216) • Thrift Column TTL vs CQL Row TTL

48. PARTITION DIFFERENCES [figure: nodes numbered 1 to 6, annotated with values from 125k to 500k]

49. TUNING • spark.cassandra.connection.keep_alive_ms • spark.cassandra.connection.timeout_ms • spark.driver.maxResultSize

50. OOM EXCEPTIONS • spark.executor.memory • spark.cassandra.input.split.size_in_mb

51. WRITES SPEED SPARK • spark.cassandra.output.batch.size.bytes • spark.cassandra.output.batch.size.rows • spark.cassandra.output.concurrent.writes • spark.cassandra.output.throughput_mb_per_sec

52. Write Timeouts • spark.cassandra.output.throughput_mb_per_sec

53.

54. QUESTIONS?

Editor's Notes

Good afternoon everyone. My name is Prasanna. I lead the Data Systems for Personalization team at Netflix. Our team builds the machine learning infrastructure that powers Netflix recommendations. With me is Roopa, who leads the Cloud Database Engineering team at Netflix. Today, we are going to talk about a few use cases where we use Spark + Cassandra in our data pipelines and share some of our learnings from them.
At Netflix, we aspire to a day when our members can turn on Netflix and the absolute best content for them has already started playing. While we know we are far away from realizing this dream, it sets a vision for us to improve the recommendations that span our service. So, where do we use recommendations in our service?
Our journey of building recommendation systems started with predicting the rating that a member would give a video and recommending appropriate videos based on that prediction.
That later evolved into creating meaningful groupings of videos and personalizing the videos within each group.
Today, we have a multitude of algorithms for doing recommendations. Not only are the videos within a row personalized for you, but the rows themselves are personalized for you. Over 80% of what our members watch comes from the videos that are recommended to them, and those recommendations are driven by machine learning algorithms. So how do we improve these algorithms to realize our grand vision?
Just like everything else at Netflix, we follow a data-driven approach to improve our recommendations. Once we have an idea, we run an offline experiment using historical data to see if this new idea would have made better recommendations. If it did, we deploy it to an online A/B test to see if it performs well in production too. We look at various metrics such as viewing hours, member retention and member satisfaction to evaluate the success of an A/B test. If the A/B test is a success, we roll out that feature to ALL members. If not, we go back to the whiteboard, come up with a better idea and start over with the offline experiment. For the rest of my talk, I'm going to take one use case for Offline Experimentation and one use case for an Online A/B test, and walk through how we use Spark and Cassandra to help improve recommendations.
As we saw earlier, Offline Experiment is a step prior to doing an A/B test. It helps us decide if an idea is even worth doing an A/B test.
Let's take the use case of Algorithmic Page Generation for Offline Experimentation. How can we personalize the ordering of rows on the homepage for each member?
We initially used a rule-based approach to page generation. For example, the rules could specify that the 1st row be Continue Watching, the 2nd row be Top Picks, and so on. The drawback of this approach is that it takes into account neither the diversity of the page nor the affinity of our members for specific rows. Algorithmic Page Generation addresses these issues by personalizing the rows and the ordering of the rows on the homepage based on our members' viewing patterns, the diversity of the page and many more attributes.
Let’s take an example to see how we evaluate different pages algorithmically. Say this is a page that a member sees based on the current Production algorithm.
Variant 1 is a new page that was generated with a new algorithm.
And Variant 2 is another page that was generated with another new algorithm. We first look at some of the basic things, like the row distribution (for example, how many members see a Continue Watching row) and the TV/Movie ratio (does one variant over-index on, say, TV shows?).
More importantly, we look at the actual videos that were played by the member and find the page that would have made those videos most easily discoverable. In this case, say the member played Hot Rod, The Short Game and Family Guy.
We can see that Hot Rod was recommended in all 3 versions of the page, except that Production and Variant 2 recommended that video much higher in the page.
Similarly, The Short Game was also recommended in all 3 versions of the page, except that Variant 2 again recommended that video much higher in the page.
We also look at negative samples. Family Guy was a video that was played but not recommended in any version of the page (the member probably searched for it). We typically consider this a failure of our recommendations. Given this data, we would choose Variant 2 as the winning page algorithm, as it would likely surface the videos that our members end up playing much higher in the page.
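The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each page is flattened into a single ordered list of titles, the score is mean reciprocal rank (a common ranking metric swapped in here, not necessarily the one used at Netflix), and the titles "Video A" and "Video B" and the page layouts are made up.

```python
from typing import Dict, List, Optional


def rank_of(video: str, page: List[str]) -> Optional[int]:
    """Return the 1-based position of `video` on `page`, or None if it is absent."""
    try:
        return page.index(video) + 1
    except ValueError:
        return None


def mean_reciprocal_rank(played: List[str], page: List[str]) -> float:
    """Average of 1/rank over the played videos; a video missing from the page scores 0."""
    if not played:
        return 0.0
    total = 0.0
    for video in played:
        rank = rank_of(video, page)
        if rank is not None:
            total += 1.0 / rank
    return total / len(played)


def best_variant(played: List[str], pages: Dict[str, List[str]]) -> str:
    """Pick the page whose ordering surfaces the played videos highest."""
    return max(pages, key=lambda name: mean_reciprocal_rank(played, pages[name]))


# Hypothetical flattened pages (assumption: one ordered list per page).
pages = {
    "production": ["Hot Rod", "Video A", "Video B", "The Short Game"],
    "variant_1": ["Video A", "Video B", "Hot Rod", "The Short Game"],
    "variant_2": ["Hot Rod", "The Short Game", "Video A", "Video B"],
}
played = ["Hot Rod", "The Short Game", "Family Guy"]  # Family Guy is the negative sample
winner = best_variant(played, pages)
```

With these made-up pages, Variant 2 wins because it places both recommended titles near the top, while Family Guy contributes nothing to any page's score.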
Now let's look at the offline experimentation architecture that made this possible. The most critical requirement for building an offline experiment is the ability to travel back in time and generate the page our members would have seen if they had used our service at a given time in the past. We built this ability to time travel by snapshotting the data of our various online services and using that snapshot data to generate the experimental page. The first step in building the snapshot infrastructure is to select the set of members for whom we need to snapshot data. Snapshotting data for all our members would be an expensive operation. Instead, we select a stratified set of members based on their tenure, their viewing patterns, the devices they use and so on. Once we have the set of members, the next step is to snapshot the data of various online services that help improve personalization, such as Ratings, Viewing History and MyList. As you folks might be aware, Netflix embraces a fine-grained service-oriented architecture for our cloud-based deployment model. The snapshot data is then stored in S3 in nested Parquet format for both space and time efficiency. Many of our offline experiments run inside Spark and can consume the snapshot data directly from S3. However, for Algorithmic Page Generation, we need to consume this snapshot data one member at a time. This is because we reuse our existing online systems, which generate the page for a live user request, to also generate the experimental page from the snapshot data. S3 is not suited to random seeks over the data stored in it, whereas Cassandra is well suited to this use case. To that end, we used Spark to read the snapshot data from S3 and write it into Cassandra.
We used the Spark Cassandra connector, which took care of the nitty-gritty details of connecting to all the Cassandra nodes in the ring, maintaining the connection pool, doing retries and optimizing the reads and writes to Cassandra. Once the data is available in Cassandra, we are able to get the state of the Netflix data services for any given member at a given timestamp in the past. We can then generate the experimental pages for this member based on the new algorithms and evaluate the metrics needed to see which of those page algorithms would have made better recommendations. If there is a clear winner, we deploy it to an A/B test.
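The member-selection step mentioned above (a stratified set of members rather than everyone) might look roughly like the following. It is a minimal sketch, not the actual selection logic: the member attributes `tenure_bucket` and `device`, the sampling fraction and the seed are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Member = Dict[str, str]


def stratum_of(member: Member) -> Tuple[str, str]:
    """Hypothetical stratum: tenure bucket and primary device."""
    return (member["tenure_bucket"], member["device"])


def stratified_sample(members: Iterable[Member], fraction: float, seed: int = 0) -> List[Member]:
    """Sample the same fraction from every stratum, keeping at least one member per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)
    sample: List[Member] = []
    for key in sorted(strata):  # sorted for a reproducible order
        group = strata[key]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling per stratum keeps small groups (for example, long-tenured members on a rarely used device) represented in the snapshot set, which a plain random sample could miss.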
Before we look into the data models that we used in Cassandra, there were 2 requirements that we needed to address when building them. First, we need to store historical data from various data services such as Ratings and Viewing History. This is the core of building the time machine. Second, we need to optimize the data model for the batch writes that happen from Spark and for the point reads from the online systems during page generation.
Here is the data model that we used for storing our members' MyList data for offline experiments. The obvious thing is to have different column families for different data services. Date and MemberId concatenated together form the row key. The column name is a static string and its value is a blob of the MyList data for that member. With this data model, a query to get a member's MyList data for a given timestamp in the past translates to a point read, which is very efficient in Cassandra.
However, a similar data model for storing Viewing History would not work. This is because the viewing history data for a member could be very big and would become a wide row, causing heap pressure, which in turn would affect latencies.
To avoid the issues of having a wide row, we divided the rows into a predefined set of shards. In this case, Date, MemberId and Shard Index together form the row key, and the viewing data blob is the column value.
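The two row-key schemes described above can be illustrated with plain Python. This is a sketch under assumptions: the key layout follows the examples on the slides (`20161009_1001` and `20161009_1001_0`), a dict stands in for the Cassandra column family, and splitting the history into contiguous chunks is just one possible sharding rule, chosen here because it preserves order on read.

```python
from typing import Dict, List


def mylist_row_key(date: str, member_id: int) -> str:
    """DATE_MEMBER_ID row key, e.g. 20161009_1001."""
    return f"{date}_{member_id}"


def sharded_row_key(date: str, member_id: int, shard_index: int) -> str:
    """DATE_MEMBERID_IDX row key, e.g. 20161009_1001_0."""
    return f"{date}_{member_id}_{shard_index}"


def shard_viewing_history(date: str, member_id: int,
                          records: List[bytes], num_shards: int) -> Dict[str, List[bytes]]:
    """Split one member's history into a fixed number of contiguous shards."""
    chunk = max(1, -(-len(records) // num_shards))  # ceiling division
    return {
        sharded_row_key(date, member_id, i): records[i * chunk:(i + 1) * chunk]
        for i in range(num_shards)
    }


def read_viewing_history(store: Dict[str, List[bytes]], date: str,
                         member_id: int, num_shards: int) -> List[bytes]:
    """Issue one point read per shard key and concatenate the results in shard order."""
    history: List[bytes] = []
    for i in range(num_shards):
        history.extend(store.get(sharded_row_key(date, member_id, i), []))
    return history
```

Because the number of shards is predefined, a reader can compute every key up front and never needs a range scan.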
Now let's focus on the next use case: how we used Spark (Spark Streaming, to be precise) + Cassandra for an Online A/B test.
The Trending Now row captures the videos that are trending, but personalized for you.
Here is a screenshot of my Trending Now row when it's 7 PM on a Monday, when my daughter takes control of our remote.
Here is a screenshot of my Trending Now row when it's 10 PM on a Saturday. It's ME time and the ACTION finally begins.
Oh yeah, Pokemon's impact was seen on Netflix too.
The key to building a Trending Now row is to have a fast feedback loop. Netflix is supported on thousands of devices, each sending various types of data that help improve personalization, such as a play event or the fact that a video was recommended to a member and not played, and so on. We built various data systems that capture this data in real time. Once we have the data required for personalization, several Spark Streaming applications read it and compute the trends data, all in real time. The trends data is fed as input to our recommender systems, which then look at a member's taste and personalize the trends for that member.
Let's dive a little deeper into the architecture. We capture all user interactions within our service. For example, the videos that are recommended to our members and shown in their view port are captured by the Impression Service. Similarly, the videos that are played by our members are captured by the Viewing History Service. Both these data services send those events into Kafka. Spark Streaming jobs consume these events from Kafka and compute the trends data required for building the Trending Now row. This trends data is persisted into Cassandra. Again, we use the Spark Cassandra connector in the Spark Streaming job to batch write all the trends data into Cassandra. The one thing that we had to configure in the connector was the connection timeout, which is different for a streaming job. The trends data is then combined with data from services such as Viewing History and Ratings and fed as input to the model training job. Its output is a model that is consumed by online services to make the personalized Trending Now recommendations for the next time interval.
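As a rough illustration of how connector settings like the connection timeout are supplied, the sketch below turns a settings dict into `spark-submit` arguments. The property names are the ones listed on the tuning slides (with the `spark.` prefix); newer connector releases spell some of them differently, so check the reference for your version. Every value is a placeholder, not a recommendation or a Netflix setting.

```python
from typing import Dict, List

# Property names as listed on the tuning slides; all values are placeholders.
CONNECTOR_SETTINGS: Dict[str, str] = {
    "spark.cassandra.connection.keep_alive_ms": "60000",
    "spark.cassandra.connection.timeout_ms": "10000",
    "spark.cassandra.output.concurrent.writes": "5",
    "spark.cassandra.output.throughput_mb_per_sec": "50",
}


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as repeated `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(settings):
        args.extend(["--conf", f"{key}={settings[key]}"])
    return args


submit_args = to_submit_args(CONNECTOR_SETTINGS)
```

Keeping the settings in one place makes it easier to use different values for a long-running streaming job and a one-off batch job.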
We also use Cassandra for managing the state of our Spark Streaming jobs. So what is state in Spark Streaming? Think of it as a simple key-value pair that gets updated continuously as events happen in real time. Let's say we simply want to count the number of times a video is played. In that case, the videoId and the count form the state. Spark provides a way to bootstrap the state when your streaming job is restarted or started for the first time.
For Trending Now, as we read events from Kafka, we first check whether the state is present. If it is, we use the data from the existing state and the new event read from Kafka, perform the computations and write the updated state back to Cassandra. If not, we load the state data from Cassandra into Spark as part of bootstrapping.
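The check-then-bootstrap flow described above can be mimicked without Spark. In this sketch a plain dict stands in for the Cassandra table, the class name `PlayCountState` is hypothetical, and the state is the play count per video from the example; a real job would use Spark Streaming's own state APIs instead.

```python
from typing import Dict, Iterable, Optional


class PlayCountState:
    """Keeps per-video play counts in memory and mirrors them to a backing store."""

    def __init__(self, store: Dict[str, int]) -> None:
        self._store = store                            # stand-in for the Cassandra table
        self._state: Optional[Dict[str, int]] = None   # None until bootstrapped

    def process_batch(self, played_video_ids: Iterable[str]) -> Dict[str, int]:
        if self._state is None:
            # State not present (first start or restart): initialize it from the store.
            self._state = dict(self._store)
        for video_id in played_video_ids:
            self._state[video_id] = self._state.get(video_id, 0) + 1
        # Write the updated state back so that a restarted job can pick it up.
        self._store.update(self._state)
        return dict(self._state)
```

For example, starting from `{"Narcos": 200}` and processing two Narcos play events leaves 202 in the store, and a new `PlayCountState` created over the same store continues from that count.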
Let’s look at the data model that we used for Trending Now. The two main requirements to address are that 1) trends data is applicable only to a specific interval of time, and 2) we should optimize for both batch writes and batch reads, as model training happens inside Spark.
It was fairly obvious that we had to create separate column families for separate time intervals, primarily because, given a time interval, we need only the data from that interval. The videoId, along with some metadata such as country and timezone, formed the row key. We had two columns that contained the plays and impressions data for each video.
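As a rough sketch of that layout, assuming a hypothetical naming scheme and a six-hour interval (neither is given in the talk), the row key and the per-interval column family could be modeled like this. `TrendsRowKey`, `TrendsRow`, and `column_family_for` are illustrative names only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TrendsRowKey:
    """Row key: the video plus the metadata mentioned in the notes."""
    video_id: int
    country: str
    timezone: str


@dataclass
class TrendsRow:
    """One row: the key and the two data columns."""
    key: TrendsRowKey
    plays: int
    impressions: int


def column_family_for(ts: datetime, interval_hours: int = 6) -> str:
    """Name of the column family holding the interval that contains ts.

    The 'trends_<date>_<start hour>' pattern and the 6-hour default are
    assumptions made for this example.
    """
    bucket_start = ts.hour - ts.hour % interval_hours
    return f"trends_{ts:%Y%m%d}_{bucket_start:02d}"


row = TrendsRow(TrendsRowKey(42, "US", "America/Los_Angeles"), plays=7, impressions=120)
cf = column_family_for(datetime(2016, 3, 1, 14, 30, tzinfo=timezone.utc))
```

With one column family per interval, a model training job that needs a given interval reads exactly one column family and nothing else.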
With that, I would like to introduce Roopa, who will walk us through a few more use cases for Spark + Cassandra and our learnings from them. You just saw two main data-driven Spark and Cassandra use cases. Once Prasanna’s A/B test is a success, it needs to be rolled out to all members, and the resulting growth in the dataset calls for a bigger Cassandra cluster.
Say you want to move a bulk dataset from one cluster to another. How would you go about doing it quickly? Meet Forklift. Why is it called Forklift? Because we are moving data across clusters! Let’s look at the architecture in detail now.
Meson is a general-purpose workflow orchestration and scheduling framework built at Netflix. It manages the lifecycle of several ML pipelines that execute workloads across heterogeneous systems. The same framework was used for Forklift too. The Spark Cassandra Connector is a library that lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. Mesos provides task isolation and an excellent abstraction of CPU, memory, storage, and other compute resources. Meson leverages these features to achieve scale and fault tolerance for its tasks. Spark jobs submitted from Meson share the same Mesos slaves to run their tasks.
What makes this extension different? The number of nodes is not being doubled, which Cassandra handles well; instead it is being increased by a few percent. We don’t use vnodes, so the only options for adding capacity to a cluster are doubling it or creating a new cluster and populating the data. We are talking about clusters with hundreds of nodes, and doubling does not always work. Forklift comes in very handy for this type of use case.
We were very early adopters of Cassandra and started using it in production at version 0.5. So of course we were using Thrift, and all our streaming microservices built over the years access Cassandra through Thrift’s schemaless design. With the advent of CQL, there are apps that want to use CQL’s richer data model and migrate to CQL for better performance. Forklift does a great job with this migration, since you can map the data model and transform the data from source to destination in the relevant format.
This is another use case for Forklift: for certain big clusters, instead of replacing nodes one at a time, which would take weeks, we create a new cluster on Trusty and forklift the data over after dual writes are enabled.
Performance
We get good support from DataStax, and one of the options was to use DSE Spark instead of running the DataStax Spark connector against Cassandra. But the performance of our Cassandra clusters was a concern: these clusters are in the streaming path, serving all the members watching great streaming content, and have very strict SLAs. Running Spark alongside Cassandra would strain the limited resources we have in AWS, and that was a big concern.
Cassandra is stateful, and if we had Spark and Cassandra running together, it would not be easy to scale up the cluster when running into resource constraints. With Spark running separately, we can scale the Spark cluster up and down without affecting Cassandra.
Running the Spark and Cassandra clusters separately is cost effective too, since we can use instances from the shared pool and release them when the job is complete.
This can lead to an NPE. A `TTL` of 0, when written, becomes a `null` in C*. When read, this `TTL` comes back as a `null`, and the `null` cannot be written back to C* as a `TTL`. This was fixed in 3.1; we used a workaround of translating the data when reading from the source and writing to the destination. -------- In Thrift you could define a column-level TTL, and different columns could have different TTLs. In CQL the TTL applies to the whole write, and there is no way to define different column TTLs in a single mutation. So when you are copying data from Thrift to CQL, you need to split the writes into multiple mutations by batching.
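Both workarounds can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not Forklift’s actual code: `normalize_ttl` and `split_by_ttl` are hypothetical helpers, and a row is modeled as a dict of column name to `(value, ttl)`.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


def normalize_ttl(ttl: Optional[int]) -> int:
    """Translate the null TTL read from the source back to 0 (no expiry)."""
    return 0 if ttl is None else ttl


def split_by_ttl(
    columns: Dict[str, Tuple[str, Optional[int]]]
) -> Dict[int, Dict[str, str]]:
    """Group a row's columns by TTL so each group can be one mutation."""
    groups: Dict[int, Dict[str, str]] = defaultdict(dict)
    for name, (value, ttl) in columns.items():
        groups[normalize_ttl(ttl)][name] = value
    return dict(groups)


# A source row whose columns carry different TTLs (None = read back as null).
source_row = {
    "title": ("some title", None),
    "plays": ("42", 3600),
    "impressions": ("100", 3600),
}
mutations = split_by_ttl(source_row)
```

Here `mutations` has two entries, one for TTL 0 and one for TTL 3600, so the destination write would issue two mutations for this row instead of one.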
spark.cassandra.input.split.size_in_mb uses an internal system table in C* (>= 2.1.5) to determine the size of the data in C*. The table, system.size_estimates, is not meant to be absolutely accurate, so there will be some inaccuracy with smaller tables and split sizes. When you use the Spark Cassandra connector’s cassandraTable() function to load data from Cassandra into Spark, it automatically creates Spark partitions aligned to the Cassandra partition key. It tries to create an appropriate number of partitions by estimating the size of the table and dividing this by the parameter spark.cassandra.input.split.size_in_mb (64 MB by default). (One instance where the default will need changing is if you have a small source table; in that case use withReadConf() to override the parameter.)
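The division described above works out as follows. This is a simplified sketch of the arithmetic only; the connector’s real logic also accounts for token ranges and the inaccuracy of the size estimates, and `estimated_spark_partitions` is a hypothetical helper name.

```python
import math


def estimated_spark_partitions(table_size_mb: float,
                               split_size_mb: float = 64.0) -> int:
    """Approximate partition count: estimated table size / split size."""
    return max(1, math.ceil(table_size_mb / split_size_mb))


large_table = estimated_spark_partitions(10 * 1024)          # 10 GB table
small_table = estimated_spark_partitions(10)                  # 10 MB table
small_table_tuned = estimated_spark_partitions(10, split_size_mb=1)
```

A 10 GB table yields 160 partitions at the default split size, while a 10 MB table collapses into a single partition unless the split size is lowered, which is why small source tables need the override.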
spark.cassandra.connection.keep_alive_ms (default 5000; we set 900000): period of time to keep unused connections open. -- spark.cassandra.connection.timeout_ms (default 5000; we put 50): maximum period of time to attempt connecting to a node. -- spark.driver.maxResultSize (default 1g; we had it at 4g): limit on the total size of serialized results of all partitions for each Spark action (e.g. collect). It should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM). Setting a proper limit can protect the driver from out-of-memory errors.
This usually means that the partitions you are attempting to create are larger than the executor’s heap can handle. Remember that all of an executor’s task slots run in the same JVM, so the size of the data is multiplied by the number of executor slots. Either increase the heap size of the executors with spark.executor.memory, or shrink the partitions by decreasing spark.cassandra.input.split.size_in_mb.
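A back-of-the-envelope version of that sizing rule is sketched below. The helper names are hypothetical, and `usable_fraction` is an assumed rule of thumb, not a Spark setting; in-memory data is also typically larger than its on-disk size, so treat the results as upper bounds on what is safe.

```python
def concurrent_partition_mb(split_size_mb: float, executor_slots: int) -> float:
    """Data held at once when every slot of one executor loads a partition."""
    return split_size_mb * executor_slots


def max_split_size_mb(executor_heap_mb: float,
                      executor_slots: int,
                      usable_fraction: float = 0.5) -> float:
    """Largest split size whose concurrent partitions fit in the usable heap.

    usable_fraction is an assumption: the share of the heap left for
    partition data after Spark's own overhead.
    """
    return executor_heap_mb * usable_fraction / executor_slots


in_flight = concurrent_partition_mb(64, 8)    # 64 MB splits, 8 slots
ceiling = max_split_size_mb(4096, 8)          # 4 GB heap, 8 slots
```

With 8 slots, 64 MB splits put roughly 512 MB of source data in flight per executor, and a 4 GB heap under these assumptions supports splits of up to about 256 MB.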
spark.cassandra.output.batch.size.bytes (default: 1024): maximum total size of the batch in bytes. Overridden by spark.cassandra.output.batch.size.rows. -- spark.cassandra.output.batch.size.rows (default: auto, batch size determined by size.bytes): number of rows per single batch. The default of 'auto' means the connector adjusts the number of rows based on the amount of data in each row. -- spark.cassandra.output.concurrent.writes (default: 5): maximum number of batches executed in parallel by a single Spark task. -- spark.cassandra.output.throughput_mb_per_sec (default: unlimited): maximum write throughput allowed per single core in MB/s. On long runs (8+ hours), limit this to 70% of the max throughput seen on a smaller job, for stability.
Spark is able to issue write requests much more quickly than Cassandra can handle them. This can lead to GC issues and a build-up of hints. In connector version 1.2 and higher, spark.cassandra.output.throughput_mb_per_sec allows you to control the amount of data written to C* per Spark core per second. In older versions, if this is the case with your application, try lowering the number of concurrent writes and the current batch size using the following options: spark.cassandra.output.batch.size.rows and spark.cassandra.output.concurrent.writes.
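The 70% guideline for long runs can be turned into a setting like this. It is a small sketch: `throttled_throughput_mb_per_sec` is a hypothetical helper, and the 10 MB/s figure is an invented measurement used only for the example.

```python
def throttled_throughput_mb_per_sec(observed_max: float,
                                    safety_factor: float = 0.7) -> float:
    """Per-core write limit: a fraction of the max seen on a smaller job."""
    return round(observed_max * safety_factor, 2)


# Suppose a short trial job peaked at 10 MB/s per core (made-up number).
limit = throttled_throughput_mb_per_sec(10.0)
conf_line = f"spark.cassandra.output.throughput_mb_per_sec={limit}"
```

The resulting `conf_line` can be passed to the job as a `--conf` argument, leaving Cassandra headroom for compaction and GC during a multi-hour write.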
I would like to leave you all with this thought: the Spark Cassandra connector library makes it very easy to create Spark applications that need access to Cassandra! We have used it and seen this ourselves, and you can too. If you are excited about ML algorithms and how we can get back to fulfilling that very first desire, a customer clicking Netflix and their favourite movie or TV show starting to play, or if you are super excited about the scale and challenges of providing persistence as a service, do talk to Prasanna or me, as we are always looking for great talent to join our teams!

Editor's Notes