SlideShare a Scribd company logo
1 of 54
Netflix Recommendations using
Spark + Cassandra
Prasanna Padmanabhan
Roopa Tangirala
Turn on Netflix and the absolute best
content for you would automatically start
playing
Netflix Recommendations
Netflix Recommendations
Ranking
Everything is a Recommendation
Rows
Over 80% of what
members watch
comes from our
recommendations
Recommendations
are driven by
Machine Learning
Algorithms
Data Driven
Offline Experiment
using Historical
Data
Online
A/B Testing
Rollout Feature to
ALL members
Success Success
Fail
Algorithmic Page
Generation
Trending Now
Offline Experimentation
Algorithmic Page Generation
Personalizing the ordering of rows
on the homepage
Algorithmic Page Generation
Without Algorithmic Page Generation With Algorithmic Page Generation
Diversity of the Page
Affinity for specific rows
Drawbacks
Algorithmic Page Generation
Production
Algorithmic Page Generation
Production Variant 1
Algorithmic Page Generation
Production Variant 1 Variant 2
Row Distribution
TV/Movie Ratio
Algorithmic Page Generation
Production Variant 1 Variant 2
Evaluate best variant
based on the plays
Actual
Plays:
Algorithmic Page Generation
Production Variant 1 Variant 2
Evaluate best variant
based on the plays
Actual
Plays:
Algorithmic Page Generation
Production Variant 1 Variant 2
Evaluate best variant
based on the plays
Actual
Plays:
Variant 2
Algorithmic Page Generation
Production Variant 1
Evaluate best variant
based on the plays
Actual
Plays:
Offline Experiment Architecture
Member
Selection
Runs once a day
Ratings
Service
S3
Snapshot Snapshot
Store
Snapshot
Forklift
Viewing
History
Service
MyList
Service
Data
Snapshots
Evaluate
Metrics
Generate
Pages
… …
A/B Test
Data Model - Requirements
• Need for historical service data
• Optimize for Batch Writes and Point Reads
Data Model
20161009_1001
20161009_1002
DATE_MEMBER_ID
MyList
BLOB
MyList
BLOB
R
O
W
S
COLUMN
COLUMN FAMILY:
MYLIST
Data Model
20161009_1001
20161009_1002
DATE_MEMBER_ID
ViewingData
BLOB
ViewingData
BLOB
R
O
W
S
COLUMN
COLUMN FAMILY:
VIEWING-HISTORY
Data Model
20161009_1001_0
20161009_1001_1
DATE_MEMBERID_IDX
ViewingData
BLOB
ViewingData
BLOB
R
O
W
S
COLUMN
20161009_1001_2
ViewingData
BLOB
COLUMN FAMILY:
VIEWING-HISTORY
Online A/B Testing
Trending Now
Videos that are Trending and
Personalized for you
Trending Now
It’s 7 PM on a Monday
Trending Now
It’s 10 PM on a Saturday
Trending Now
Pokeman
Fast Feedback Loop
Trending Now - Data Infrastructure
Impression
Service
Viewing
History
Service
UI
Online
Services
Trends
Store
Compute
Trends
Model
Training
Captures
videos shown
in view port
Captures
videos
played by
members
Publish
Models
Viewing
History
Service
Ratings. .. .
State Management in Cassandra
Video Number of Plays
Stranger Things 100
Narcos 200
Orange is the new Black 300
State Management in Cassandra
Trends
Store
State
Present
?
Compute Trends
Yes
No
Init State from
Cassandra
Load State
Update
State
Read
Events
Data Model - Requirements
• Trending data is for a specific interval of time
• Optimize for Batch Writes and Batch Reads
Data Model
101_METADATA
102_METADATA
VIDEOID_METADATA
Plays
BLOB
Plays
BLOB
R
O
W
S
COLUMNS
103_METADATA
Plays
BLOB
COLUMN FAMILY:
Interval 1,
Interval 2
…
Interval N
Impressions
BLOB
Impressions
BLOB
Impressions
BLOB
Roopa Tangirala
Engineering Manager @ Netflix
Twitter - @roopatangirala
FORKLIFTER
ARCHITECTURE
SOURCE TARGET
USE CASES
APACHE THRIFT CQL
DEMO
WHY NOT DSE SPARK?
SCALABILITY
COST EFFECTIVENESS
LESSONS LEARNT
TTL HANDLING
• TTL Reading And Writing is Asymmetric -
CASSANDRA 12216
• Thrift Column TTL vs CQL Row TTL
1
6
5
4
3
2
PARTITION DIFFERENCES
500000
200000
425k450k475k
200k175k150k125k
500k
TUNING
• spark.cassandra.connection.keep_alive_m
s
• spark.cassandra.connection.timeout_ms
• spark.driver.maxResultSize
OOM EXCEPTIONS
Spark.executor.memory
spark.cassandra.input.split.size_in_mb
WRITES SPEED SPARK
• cassandra.output.batch.size.bytes
• cassandra.output.batch.size.rows
• cassandra.output.concurrent.writes
• cassandra.output.throughput_mb_per_sec
Write Timeouts
cassandra.output.throughput_mb_per_sec
QUESTIONS?

More Related Content

What's hot

Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Kafka for Live Commerce to Transform the Retail and Shopping Metaverse
Kafka for Live Commerce to Transform the Retail and Shopping MetaverseKafka for Live Commerce to Transform the Retail and Shopping Metaverse
Kafka for Live Commerce to Transform the Retail and Shopping MetaverseKai Wähner
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Spark Summit
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentDatabricks
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AIJames Serra
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaGuido Schmutz
 
Automated Testing with Logic Apps and Specflow
Automated Testing with Logic Apps and SpecflowAutomated Testing with Logic Apps and Specflow
Automated Testing with Logic Apps and SpecflowBizTalk360
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Apache Kafka in the Insurance Industry
Apache Kafka in the Insurance IndustryApache Kafka in the Insurance Industry
Apache Kafka in the Insurance IndustryKai Wähner
 
Benefits of Stream Processing and Apache Kafka Use Cases
Benefits of Stream Processing and Apache Kafka Use CasesBenefits of Stream Processing and Apache Kafka Use Cases
Benefits of Stream Processing and Apache Kafka Use Casesconfluent
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningDavid Stein
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...
PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...
PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...VMware Tanzu
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 

What's hot (20)

Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Kafka for Live Commerce to Transform the Retail and Shopping Metaverse
Kafka for Live Commerce to Transform the Retail and Shopping MetaverseKafka for Live Commerce to Transform the Retail and Shopping Metaverse
Kafka for Live Commerce to Transform the Retail and Shopping Metaverse
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
 
Automated Testing with Logic Apps and Specflow
Automated Testing with Logic Apps and SpecflowAutomated Testing with Logic Apps and Specflow
Automated Testing with Logic Apps and Specflow
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Apache Kafka in the Insurance Industry
Apache Kafka in the Insurance IndustryApache Kafka in the Insurance Industry
Apache Kafka in the Insurance Industry
 
Benefits of Stream Processing and Apache Kafka Use Cases
Benefits of Stream Processing and Apache Kafka Use CasesBenefits of Stream Processing and Apache Kafka Use Cases
Benefits of Stream Processing and Apache Kafka Use Cases
 
Apache flink
Apache flinkApache flink
Apache flink
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine Learning
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...
PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...
PKS Automation Station...All Aboard: Enabling Team Access to PKS with a Conco...
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 

Viewers also liked

Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...DataStax
 
Multi-Region Cassandra Clusters
Multi-Region Cassandra ClustersMulti-Region Cassandra Clusters
Multi-Region Cassandra ClustersInstaclustr
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSDataStax Academy
 
Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016
Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016
Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016DataStax
 
Optimizing Cassandra in AWS
Optimizing Cassandra in AWSOptimizing Cassandra in AWS
Optimizing Cassandra in AWSgreggulrich
 
OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...
OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...
OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...vasuballa
 
Lessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scaleLessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scaleDomonkos Tikk
 
C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...
C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...
C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...DataStax
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...DataStax
 
OSCON TALK: Becoming Friends with Cassandra and Spark
OSCON TALK: Becoming Friends with Cassandra and SparkOSCON TALK: Becoming Friends with Cassandra and Spark
OSCON TALK: Becoming Friends with Cassandra and SparkDani Traphagen
 
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016DataStax
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 
Microservices with Apache Camel
Microservices with Apache CamelMicroservices with Apache Camel
Microservices with Apache CamelClaus Ibsen
 
Securing Cassandra
Securing CassandraSecuring Cassandra
Securing CassandraInstaclustr
 
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Redis Labs
 
Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)
Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)
Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)Ontico
 

Viewers also liked (20)

Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
 
Multi-Region Cassandra Clusters
Multi-Region Cassandra ClustersMulti-Region Cassandra Clusters
Multi-Region Cassandra Clusters
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWS
 
Cassandra @ Yahoo Japan | Cassandra Summit 2016
Cassandra @ Yahoo Japan | Cassandra Summit 2016Cassandra @ Yahoo Japan | Cassandra Summit 2016
Cassandra @ Yahoo Japan | Cassandra Summit 2016
 
Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016
Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016
Cassandra @ Yahoo Japan (Satoshi Konno, Yahoo) | Cassandra Summit 2016
 
Optimizing Cassandra in AWS
Optimizing Cassandra in AWSOptimizing Cassandra in AWS
Optimizing Cassandra in AWS
 
OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...
OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...
OOW16 - Ready or Not: Applying Secure Configuration to Oracle E-Business Suit...
 
Lessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scaleLessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scale
 
C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...
C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...
C* Capacity Forecasting (Ajay Upadhyay, Jyoti Shandil, Arun Agrawal, Netflix)...
 
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul K...
 
Dynomite @ Redis Conference 2016
Dynomite @ Redis Conference 2016Dynomite @ Redis Conference 2016
Dynomite @ Redis Conference 2016
 
OSCON TALK: Becoming Friends with Cassandra and Spark
OSCON TALK: Becoming Friends with Cassandra and SparkOSCON TALK: Becoming Friends with Cassandra and Spark
OSCON TALK: Becoming Friends with Cassandra and Spark
 
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Practical Data Mining with RapidMiner Studio 7 : A Basic and Intermediate
Practical Data Mining with RapidMiner Studio 7 : A Basic and IntermediatePractical Data Mining with RapidMiner Studio 7 : A Basic and Intermediate
Practical Data Mining with RapidMiner Studio 7 : A Basic and Intermediate
 
Microservices with Apache Camel
Microservices with Apache CamelMicroservices with Apache Camel
Microservices with Apache Camel
 
Securing Cassandra
Securing CassandraSecuring Cassandra
Securing Cassandra
 
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
 
Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)
Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)
Переезжаем на Yandex ClickHouse / Александр Зайцев (LifeStreet)
 

Similar to Netflix Recommendations using Spark + Cassandra

Maintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix APIMaintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix APIDaniel Jacobson
 
Machine Learning at Netflix Scale
Machine Learning at Netflix ScaleMachine Learning at Netflix Scale
Machine Learning at Netflix ScaleAish Fenton
 
Scaling the Netflix API - From Atlassian Dev Den
Scaling the Netflix API - From Atlassian Dev DenScaling the Netflix API - From Atlassian Dev Den
Scaling the Netflix API - From Atlassian Dev DenDaniel Jacobson
 
Scaling the Netflix API - OSCON
Scaling the Netflix API - OSCONScaling the Netflix API - OSCON
Scaling the Netflix API - OSCONDaniel Jacobson
 
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender SystemsPhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender SystemsSimon Dooms
 
(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014
(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014
(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014Amazon Web Services
 
Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...
Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...
Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...Amazon Web Services Korea
 
Maintaining the Front Door to Netflix
Maintaining the Front Door to NetflixMaintaining the Front Door to Netflix
Maintaining the Front Door to NetflixBenjamin Schmaus
 
Maintaining the Netflix Front Door - Presentation at Intuit Meetup
Maintaining the Netflix Front Door - Presentation at Intuit MeetupMaintaining the Netflix Front Door - Presentation at Intuit Meetup
Maintaining the Netflix Front Door - Presentation at Intuit MeetupDaniel Jacobson
 
Microservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaMicroservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaVMware Tanzu
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...Amazon Web Services
 
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Spark Summit
 
Wireframes for UX Revamp of the Product
Wireframes for UX Revamp of the ProductWireframes for UX Revamp of the Product
Wireframes for UX Revamp of the ProductGaurav Bhatia
 
DevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWS
DevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWSDevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWS
DevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWSAmazon Web Services
 
Data Collection and Analysis for Better Requirements: Just the Facts, Ma'am
Data Collection and Analysis for Better Requirements: Just the Facts, Ma'amData Collection and Analysis for Better Requirements: Just the Facts, Ma'am
Data Collection and Analysis for Better Requirements: Just the Facts, Ma'amTechWell
 
Modern Data Architectures for Real Time Analytics & Engagement
Modern Data Architectures for Real Time Analytics & EngagementModern Data Architectures for Real Time Analytics & Engagement
Modern Data Architectures for Real Time Analytics & EngagementAmazon Web Services
 
How to increase engagement and conversions
How to increase engagement and conversionsHow to increase engagement and conversions
How to increase engagement and conversionsJan Petr
 
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...Amazon Web Services
 
Netflix Recommendations Feature Engineering with Time Travel
Netflix Recommendations Feature Engineering with Time TravelNetflix Recommendations Feature Engineering with Time Travel
Netflix Recommendations Feature Engineering with Time TravelFaisal Siddiqi
 

Similar to Netflix Recommendations using Spark + Cassandra (20)

Scaling the Netflix API
Scaling the Netflix APIScaling the Netflix API
Scaling the Netflix API
 
Maintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix APIMaintaining the Front Door to Netflix : The Netflix API
Maintaining the Front Door to Netflix : The Netflix API
 
Machine Learning at Netflix Scale
Machine Learning at Netflix ScaleMachine Learning at Netflix Scale
Machine Learning at Netflix Scale
 
Scaling the Netflix API - From Atlassian Dev Den
Scaling the Netflix API - From Atlassian Dev DenScaling the Netflix API - From Atlassian Dev Den
Scaling the Netflix API - From Atlassian Dev Den
 
Scaling the Netflix API - OSCON
Scaling the Netflix API - OSCONScaling the Netflix API - OSCON
Scaling the Netflix API - OSCON
 
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender SystemsPhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
PhD Defense: Dynamic Generation of Personalized Hybrid Recommender Systems
 
(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014
(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014
(ARC303) Panning for Gold: Analyzing Unstructured Data | AWS re:Invent 2014
 
Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...
Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...
Gam301 Real-Time Game Analytics with Amazon Redshift, Amazon Kinesis, and Ama...
 
Maintaining the Front Door to Netflix
Maintaining the Front Door to NetflixMaintaining the Front Door to Netflix
Maintaining the Front Door to Netflix
 
Maintaining the Netflix Front Door - Presentation at Intuit Meetup
Maintaining the Netflix Front Door - Presentation at Intuit MeetupMaintaining the Netflix Front Door - Presentation at Intuit Meetup
Maintaining the Netflix Front Door - Presentation at Intuit Meetup
 
Microservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaMicroservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with Kafka
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
 
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
 
Wireframes for UX Revamp of the Product
Wireframes for UX Revamp of the ProductWireframes for UX Revamp of the Product
Wireframes for UX Revamp of the Product
 
DevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWS
DevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWSDevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWS
DevOps on AWS: Advanced Techniques for Amazon EC2 Deployments on AWS
 
Data Collection and Analysis for Better Requirements: Just the Facts, Ma'am
Data Collection and Analysis for Better Requirements: Just the Facts, Ma'amData Collection and Analysis for Better Requirements: Just the Facts, Ma'am
Data Collection and Analysis for Better Requirements: Just the Facts, Ma'am
 
Modern Data Architectures for Real Time Analytics & Engagement
Modern Data Architectures for Real Time Analytics & EngagementModern Data Architectures for Real Time Analytics & Engagement
Modern Data Architectures for Real Time Analytics & Engagement
 
How to increase engagement and conversions
How to increase engagement and conversionsHow to increase engagement and conversions
How to increase engagement and conversions
 
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data ...
 
Netflix Recommendations Feature Engineering with Time Travel
Netflix Recommendations Feature Engineering with Time TravelNetflix Recommendations Feature Engineering with Time Travel
Netflix Recommendations Feature Engineering with Time Travel
 

More from DataStax

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?DataStax
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...DataStax
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsDataStax
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphDataStax
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyDataStax
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...DataStax
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache KafkaDataStax
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseDataStax
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0DataStax
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...DataStax
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDataStax
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudDataStax
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceDataStax
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...DataStax
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...DataStax
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...DataStax
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)DataStax
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsDataStax
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingDataStax
 

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 

Recently uploaded (20)

Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 

Netflix Recommendations using Spark + Cassandra

Editor's Notes

  1. Good afternoon everyone. My name is Prasanna. I lead the Data Systems for Personalization team at Netflix. Our team builds the Machine Learning infrastructure that powers Netflix recommendations and I have Roopa with me who leads the Cloud Database Engineering team at Netflix. Today, we are going to talk about a few use cases where we use Spark + Cassandra in our data pipelines and share some of the learnings from it.
  2. At Netflix, we aspire to a day when our members can turn on Netflix and the absolute best content for them has already started playing for them. While we know we are far away from realizing this dream, it sets a vision for us to improve the recommendations that span our service. So, where do we use recommendations in our service.
  3. Our journey of building the recommendations systems started with predicting the rating that our members will give for a video and based on that recommend appropriate videos
  4. That later evolved into creating meaningful grouping of videos and being able to personalize the videos within each group.
  5. Today, we have multitude of algorithms for doing recommendations. Not only are the videos within a row personalized for you, but the rows themselves are personalized for you. Our 80% of what our members watch come from the videos that are recommended to them, which are driven by machine learning algorithms. So how do we improve these algorithms to realize our grand vision
  6. Just like everything else at Netflix, we follow a data driven approach to improve our recommendations. Once we have an idea, we run an offline experiment using historical data to see if this new idea would have made better recommendations. If it did, we would deploy it to an online A/B test to see it performs well in Production too. We look at various metrics such as Viewing hours, Member Retention and member satisfaction to evaluate the success of an A/B test. If the A/B test is a success, we would rollout that feature to ALL members. And If not, go back to the whiteboard, come up with a better idea and start over the offline experiment. For the rest of my talk, I’m going to take one use case of Offline Experimentation and one use case for an Online A/B and walk through how we use Spark and Cassandra to help improve recommendations
  7. As we saw earlier, Offline Experiment is a step prior to doing an A/B test. It helps us decide if an idea is even worth doing an A/B test.
  8. Let’s take the use case of Algorithmic Page generation for Offline Experimentation. How can we personalize the ordering of rows on the homepage for each member.
  9. We initially used to have a rule based approach of page generation. For example, the rules could specify that the 1st row be Continue Watching, the 2nd row be Top Picks and so on and so forth. The drawback for this approach is that it does not take into account the diversity of the page nor the affinity of our members to specific rows. Algorithmic Page generation addresses these issues by personalize the row and the ordering of the rows on the home page based on our member’s viewing patterns, diversity of the page and many more attributes.
  10. Let’s take an example to see how we evaluate different pages algorithmically. Say this is a page that a member sees based on the current Production algorithm.
  11. Variant 1 is a new page that was generated with a new algorithm
  12. And Variant 2 is another page that was generated with another new algorithm. We first look for some of the basic things like how is the Row distribution (for ex: how many members see a Continue Watching row) and how is the TV/Movie Ratio (Does one variant over index on say TV shows)
  13. More importantly, we look at the actual videos that were played by the member and find the best page that could have made those videos easily discoverable. In this case, say the member played Hot Rod, The Short Game and Family Guy.
  14. We can see that HotRod was recommended in all the 3 versions of the page, except that Production and Variant 2 recommended that video much higher in the page.
  15. Similarly, Short Game too was recommended in all the 3 versions of the page, except that Variant 2 again recommended that video much higher in the page.
  16. We also look at negative samples. Family Guy was a video that was played, but not recommended in any version of the page (probably our members searched for it). We typically consider this as a fail in our recommendations. Given this data, we would choose Variant 2 as the winning page algorithm as it would likely surfaces the videos that would be played by our members much higher in the page.
  17. Now lets look at the offline experimentation architecture that made this possible. The most critical requirement for building an offline experiment is to provide an ability to travel back in time and be able to generate the page our members would have seen, if they used our service at a given time in the past. We built this ability to time travel by snapshotting data of our various online services and use that snapshot data to generate the experimental page. The first step in building the snapshot infrastructure is to select the set of members for whom we need to Snapshot data. Snapshotting data for all our members would be an expensive operation. Rather, we select a stratified set of our members based on member’s tenure, their viewing patterns, the devices they use etc. Once we have the set of users, the next step is to snapshot data of various online services such as Ratings, Viewing History, MyList that help improve personalization. As you folks might be aware, Netflix embraces a fine grained Service Oriented architecture for our cloud based deployment model. These snapshot data are then stored in S3 in nested parquet format for both space and time efficiency. Many of our offline experiments run inside Spark and they can directly consume the snapshot data from S3. However, for Algorithmic Page generation, we need to consume this snapshot data for one member at a time. This is because we are reusing our existing online systems, which generates the page for a live user request to also generate the experimental page given the snapshot data. S3 is not suited to do random seeks of the data stored in it. Alternatively, We know Cassandra is well suited for this use case. To that end, we used Spark to read the snapshot data from S3 and write that data into Cassandra. We used the Spark Cassandra connector, which took care of the nitty gritty details of connecting to all the cassandra nodes in the ring, maintaining the connection pool, doing retries and optimizing the reads/writes to Cassandra. Once the data is available in Cassandra, we will now be able to get the state of netflix data services for any given member and a timestamp in the past. We can then generate the experimental pages for this member based on the new algorithm and evaluate the metrics needed to see which of those page algorithms could have done better recommendations and if there is a clear winner, we would deploy it to a A/B test.
  18. Before we look into the data models that we used in Cassandra. There were 2 requirements that we needed to address when building the data model: The need for storing historical data from various data services such as Ratings, Viewing History. This is the core for building the time machine Optimize the data model for batch writes that happen from Spark and for Point reads from the online systems during Page generation
  19. Here is the data model that we used for storing our member’s MyList data for Offline Experiment. So yes, the obvious thing is to have different column families for different data services. Date and MemberId concatenated together formed the Row Key. Column name was a static string and its value being a blog of the MyList data for that member. With this data model, a query to get a member’s MyList data for a given timestamp in the past would translate to a point query read, which is very efficient in Cassandra.
  20. However, a similar data model for storing Viewing History would not work. This is because the viewing history data for a member could be very big and would become a wide row, causing heap pressure which inturn would affect latencies
  21. To avoid the issues of having a wide row, we divided the rows into a predefined set of shards. In this case, Date MemberId and Shard Index becomes the row key and the Viewing data blob was the column value.
  22. Now lets focus on the next use case of how did we use Spark (Spark Streaming to be precise) + Cassandra for an Online A/B test.
  23. The Trending Now row captures the video that are Trending, but personalized for you.
  24. Here is a screenshot of my Trending Now row when its 7 pm on a Monday, when my daughter takes control of our remote.
  25. Here is a screenshot of my Trending Now row when its 10 pm on a Saturday. Its ME time and the ACTION finally begins 
  26. Oh yeah, Pokemon’s impact was seen on Netflix too
  27. The key to building a Trending Now row is to have a Fast Feedback loop. Netflix is supported in 1000s of devices with each sending various types of data that help improve personalization such as a Play Event or the fact that a video is recommended to a member and not played and so on so forth. We built various data systems that captures these data In real time. Once we have the required data for personalization, we built several Sparl Streaming applications that can read these data in real time and compute the trends data, all in real-time. The Trends data is fed as i/p to our recommender systems which then looks at a member’s taste and personalized
  28. Lets do a little more deeper dive into the architecture. We capture all user interactions within our service. For ex: the videos that are recommended to our members and shown in their view port is captured by the Impression Service. Similarly, the videos that are played by our members are captured by the Viewing History Service. Both these data services sends those events into Kafka. Spark Streaming jobs consume these events from Kafka and compute the Trends data required for building the Trending Now row. This trends data is persisted into Cassandra. Again, we use the Spark Cassandra connector in the Spark Streaming job to batch write all the Trends data into Cassandra. The one thing that we had to configure in the connector was the connection timeout, which is different for a Streaming job. The Trends data is then combined with data from services such as Viewing History and Ratings and fed as input to the Model Training job. The output of which is a model that is consumed by online services to do the Personalized Trending Now recommendations for the next time interval.
  29. We also use Cassandra for managing the state of our spark streaming jobs. So what is a State in Spark Streaming? Think of it as a simple Key Value pair that gets updated continuously as events happen in real time. Let’s say we simply want to the count the number of times a video is played. In that case, videoId and the count forms the state. Spark provides a way to bootstrap the state when your streaming job is restarted or started for the first time.
  30. For Trending now, as we read events from Kafka, we first checks if the State is Present. If It did, we use the data from the existing state and the new event that was read from Kafka, perform computations and update the new state back into Cassandra. If not, it would load the State data from Cassandra and load into Spark as part of bootstrapping.
  31. Lets look into the data model that we used for Trending Now. The 2 main requirements to address is 1) Trends data is applicable only a specific interval of time and that 2) we should optimize for both batch writes and batch reads as model training happens inside Spark
  32. It’s kind’a obvious that we had to create separate column families for separate interval of times primarily because given the time interval, we would need only the data from that interval. VideoId along with some metadata such as Country, Timezone formed the row key. We had 2 columns that contained the Plays and Impressions data for each video.
  33. With that, I would like to now introduce Roopa who will walk us through few more use cases for using Spark + Cassandra and our learning's from it. You just saw two main data driven spark cassandra use cases. Once Prassan’s ab testing is a success , it needs to be rolled out to ALL members and there is a huge growth in dataset which needs a bigger Cassandra cluster.
  34. You want to move bulk dataset from one cluster to another. How would you go about doing it in a fast and quick way? Meet forklift- why is it called forklift? We are moving data across the clusters! Lets look into the architecture in detail now
  35. Meson is Netflix built - general purpose workflow orchestration and scheduling framework. It manages the lifecycle of several ML pipelines that execute workloads across heterogeneous systems. This same framework was used for the forklift too. Spark Cassandra Connector This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. Mesos provides task isolation and excellent abstraction of CPU, memory, storage, and other compute resources. Meson leverages these features to achieve scale and fault tolerance for its tasks. Spark jobs submitted from Meson share the same Mesos slaves to run the tasks.
  36. What makes this extension different? The nodes are not being doubled which cassandra can do well for you, instead they are being increased by few percentage. We don’t use vnodes, so the only option for adding capacity to the cluster is either doubling or creating a new cluster and populating the data. We are taking about clusters having hunderds of nodes and doubling always does not work. Forklift comes in very handy for this type of use case.
  37. We were very early adaptors of Cassandra and started using it in 0.5 version in production. So of course we were using thrift and all our streaming microservices which were built over the years were based on thrift’s schemaless design and used to access cassandra. With the advent of CQL there are apps which want to use the richer datamodel of CQL and migrate to cql for better performance. Forklifter plays a great job with the migration since you can map the datamodel and transform from source to destination in the relevant format.
  38. This is another use case we use forklift for, where for certain big clusters, instead of replacing nodes one at a time that would take weeks, we create a new cluster in trusty and forklift the data after the dual writes are enabled.
  39. Performance
  40. We get good support from datastax and one of the options was using DSE spark instead of running datastax spark connector talking to cassandra. But performance of our cassandra clusters was a concern since these clusters are used in the path of streaming serving all the members watch great streaming content, have very strict SLA’s. Running spark along with cassandra would constraint the limited resources we have in AWS and was a big concern.
  41. Cassandra is statefull, and if we had spark and cassandra running together, its not easy to scale up the cluster when you are running into resource constrains. With spark running seperately we can scale up and down the spark cluster with out affecting cassandra
  42. Running the spark and cassandra clusters are cost effective too, since we can use the instances from the shared pool and release them when the job is complete.
  43. Can lead to NPE - An `TTL` of 0 when written becomes a `null` in C*
When read, this `TTL` becomes a `null` 
The `null` cannot be written back to C* as `TTL` Fixed in 3.1- we used a workaround of translating the data when reading from source and writing to destination. -------- In thrift you could define the column level TTL and different columns could have different ttl’s. In cql there is a row TTL and there is no way to define column TTL in a single mutation. SO when you are copying data from thrift to CQL you would need to split the writes into multiple mutations by batching.
  44. Input.split.size_in_mb uses a internal system table in C* ( >= 2.1.5) to determine the size of the data in C*. The table is called system.size_estimates is not meant to be absolutely accurate so there will be some inaccuracy with smaller tables and split sizes. When you use spark cassandra connector cassandraTable() function to load data from Cassandra to Spark it will automatically create Spark partitions aligned to the Cassandra partition key. It will try to create an appropriate number of partitions by estimating the size of the table and dividing this by the parameter spark.cassandra.input.split.size_in_mb (64mb by default). (One instance where the default will need changing is if you have a small source table – in that case use withReadConf() to override the parameter.)
  45. conf spark.cassandra.connection.keep_alive_ms, 5000 =900000 , Period of time to keep unused connections open -- conf spark.cassandra.connection.timeout_ms default 5000, - we put 50 Maximum period of time to attempt connecting to a node -- conf spark.driver.maxResultSize=default 1G but we had it 4g Limit of total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
  46. This usually means that the size of the partitions you are attempting to create are larger than the executor's heap can handle. Remember that all of the executors run in the same JVM so the size of the data is multiplied by the number of executor slots. increase the heap size of the executors spark.executor.memory or shrink the size of the partitions by decreasing spark.cassandra.input.split.size_in_mb
  47. cassandra.output.batch.size.bytes Default = 1024. Maximum total size of the batch in bytes. Overridden by spark.cassandra.output.batch.size.rows cassandra.output.batch.size.rows (default: auto – batch size determined by size.byts): Number of rows per single batch. The default is 'auto' which means the connector will adjust the number of rows based on the amount of data in each row cassandra.output.concurrent.writes (default: 5) Maximum number of batches executed in parallel by a single Spark task cassandra.output.throughput_mb_per_sec (default: unlimited): Maximum write throughput allowed per single core in MB/s. Limit this on long (+8 hour) runs to 70% of your max throughput as seen on a smaller job for stability
  48. Spark is able to issue write requests much more quickly than Cassandra can handle them. This can lead to GC issues and build up of hints. - Version 1.2 higher -cassandra.output.throughput_mb_per_sec - Allows you to control the amount of data written to C* per Spark core per second. If this is the case with your application, older versions- try lowering the number of concurrent writes and the current batch size using the following options. spark.cassandra.output.batch.size.rows spark.cassandra.output.concurrent.writes
  49. I would like you to leave you all with this thought - Spark Cassandra connector library makes it very easy to create spark applications that need access to Cassandra!! We have used in and seen, you all do too. If you are excited about the ML algorithms and how we can go back to making the very first desire, of customer clicking Netflix and their favourite movie or TV show starts playing or you are super excited about the scale and challenges in providing persistence as a service, do talk to Pransanna or Me as we are always looking for great talent to join our teams!!