Dhaval Shah
R&D Software Engineer, Bloomberg L. P.
Recommender Systems at Scale Using HBase and Hadoop
1Bloomberg
Agenda
• Introduction to Recommender Systems
• Types of Recommender Systems
• Building a Recommender System
• Summary
• (Hopefully) Lots of Q&A
Introduction to Recommender Systems
• What is a Recommender System?
  • Wikipedia: Recommender systems are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item or social element they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user’s social environment (collaborative filtering approaches).
• Where are Recommender Systems used?
  • Everywhere! (Well almost!)
  • E-Commerce
  • Web Portals
  • Online Radio
  • Streaming Movies
  • Media/News
• Why do you need a Recommender System?
  • Too much useful information
  • Bloomberg.com statistics
    o 500-1000 stories and 100-200 videos published per day
    o Average user consumption << articles published
    o Satisfied user = content quality + user preference
    o Double-digit increases in CTR (click-through rate)
Types of Recommender Systems
• Content-based
• Collaborative filter based
  • User-based
  • Item-based
• Hybrid
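A minimal sketch of the item-based variant, assuming binary read/not-read data and cosine similarity. The slides do not say which similarity measure or data layout was used; `clicks`, `item_similarities`, and `recommend_item_based` are hypothetical names. A user-based variant would compare users' item sets instead of items' reader sets.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical click history: user -> set of story ids the user read.
clicks = {
    "u1": {"s1", "s2"},
    "u2": {"s1", "s2", "s3"},
    "u3": {"s2", "s3"},
    "u4": {"s4"},
}


def item_similarities(clicks):
    """Cosine similarity between items over binary user-item data."""
    readers = defaultdict(set)  # item -> users who read it
    for user, items in clicks.items():
        for item in items:
            readers[item].add(user)
    sims = defaultdict(dict)
    for a in readers:
        for b in readers:
            if a != b:
                overlap = len(readers[a] & readers[b])
                if overlap:
                    sims[a][b] = overlap / sqrt(len(readers[a]) * len(readers[b]))
    return sims


def recommend_item_based(user, clicks, k=3):
    """Score unseen items by their similarity to items the user already read."""
    sims = item_similarities(clicks)
    scores = defaultdict(float)
    for seen in clicks[user]:
        for other, sim in sims[seen].items():
            if other not in clicks[user]:
                scores[other] += sim
    # Highest score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


print(recommend_item_based("u1", clicks))
```

A user with no overlap with anyone else (here `u4`) gets an empty list, which is the usual cold-start weakness of collaborative filtering and one reason to consider a hybrid.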
Building a Recommender System
• Collect/Generate metadata about stories/videos
• Identify and track users
• Track user activity
• Store user activity
• Generate user models
• Serve recommendations
• Collect metadata about stories/videos
  • URLs, Headlines, etc.
  • Sqoop, Custom Scripts
• Generate features for stories
  • LDA from Mahout
  • Custom extensions
• Identify and track users
  • Registered
  • Anonymous
    o Cookie-based tracking
    o IP-based tracking
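One way the registered/anonymous split could be turned into a single tracking key. The precedence (registered id, then cookie, then IP) and the hashing of the IP address are assumptions for illustration; `resolve_user_key` is a hypothetical helper, not something named in the talk.

```python
import hashlib


def resolve_user_key(registered_id=None, cookie_id=None, ip_address=None):
    """Return a (kind, key) pair identifying the visitor, or None if nothing is known."""
    if registered_id:
        return ("registered", registered_id)
    if cookie_id:
        return ("cookie", cookie_id)
    if ip_address:
        # Assumption: hash the address so the raw IP is not stored as a key.
        digest = hashlib.sha256(ip_address.encode("utf-8")).hexdigest()[:16]
        return ("ip", digest)
    return None


print(resolve_user_key(cookie_id="c-123"))
```

IP-based keys are the weakest of the three, since many visitors can share one address, so a sketch like this falls back to them only when nothing better is available.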
• Types of user activity
  • Explicit interactions
  • Implicit interactions
Tracking user activity
[Figure: tracking pipeline; labels: Browser (JavaScript), HTTP Server, Flume, HBase]
• Tracking: Key Features
  • 1000s of ppm
  • Asynchronous: instantaneous responses to client
  • Reliability
  • Multiple HTTP Servers → Multiple Clusters
  • Client to HBase in milliseconds
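The asynchronous property can be sketched with a queue and a background writer: the request handler enqueues the event and returns at once, and a worker hands events to the storage path. This is an illustrative stand-in using only the Python standard library; `AsyncTracker` and its `sink` callable are hypothetical, and the pipeline in the talk used Flume and HBase for this role.

```python
import queue
import threading


class AsyncTracker:
    """Accept events immediately and persist them on a background thread."""

    def __init__(self, sink):
        self._events = queue.Queue()
        self._sink = sink  # stand-in for the Flume/HBase write path
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def track(self, user_key, item_id, action):
        """Called from the request handler; returns without waiting for the write."""
        self._events.put((user_key, item_id, action))
        return "accepted"

    def _drain(self):
        # Single worker, so events reach the sink in arrival order.
        while True:
            event = self._events.get()
            try:
                self._sink(event)
            finally:
                self._events.task_done()

    def flush(self):
        """Block until every queued event has been handed to the sink."""
        self._events.join()


stored = []
tracker = AsyncTracker(stored.append)
tracker.track("u1", "story-42", "click")
tracker.track("u1", "video-7", "play")
tracker.flush()
print(stored)
```

The client-facing call never touches storage, which is what keeps responses fast even when the write path is slow; durability of queued events is not addressed in this sketch.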
• Why HBase?
  • Scalable
  • Fault-tolerant
  • Auto-sharding
  • Schema-less and sparse
  • Real-time queries
  • MapReduce (MR) integration
• Store user activity
  • 100s of millions of users
  • Millions of stories/videos
  • TBs of data
  • Wide tables: 1 row per user
  • High load
  • Sub-second response times
  • Multiple MR jobs every few minutes
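A rough picture of the "wide table, 1 row per user" layout, using nested dictionaries as an in-memory stand-in for a row key mapped to sparse columns. The qualifier format `action:item_id`, the timestamp values, and the class `WideActivityTable` are assumptions for illustration; the slides do not describe the actual schema.

```python
import time
from collections import defaultdict


class WideActivityTable:
    """In-memory stand-in for a wide table: row key -> {column qualifier -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, user_key, item_id, action, timestamp=None):
        # One column per (action, item). Rows are sparse: a user only
        # carries columns for items they actually touched.
        qualifier = f"{action}:{item_id}"
        self._rows[user_key][qualifier] = (
            timestamp if timestamp is not None else time.time()
        )

    def get_row(self, user_key):
        """Fetch a user's whole history with a single row lookup."""
        return dict(self._rows.get(user_key, {}))

    def modified_since(self, cutoff):
        """Yield rows with any activity at or after cutoff, as a batch job's scan might."""
        for user_key, columns in self._rows.items():
            if any(ts >= cutoff for ts in columns.values()):
                yield user_key, dict(columns)


table = WideActivityTable()
table.put("u1", "s1", "click", timestamp=100)
table.put("u1", "s2", "click", timestamp=200)
table.put("u2", "s1", "click", timestamp=50)
print(table.get_row("u1"))
```

Keeping everything about a user in one row means the serving path needs one lookup per request, while periodic batch jobs can scan only the rows that changed.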
• Generate user models using ML
  • 100s of millions of users
  • High I/O and processing power
  • Train multiple times an hour
• Content-based Recommender Models
  • User model independent of other users
  • Train only when the user has a new interaction
  • Easily parallelizable
  • No Reducer
  • Incremental training
  • Train 1000 user models a minute
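Why this needs no reducer can be seen in a small sketch: each user's model is updated from that user's own new interactions only, so every row can be handled by a mapper in isolation. The model representation (a sparse feature-to-weight dictionary updated by simple addition) is an assumption, as are the names `update_user_model` and `map_user`; the talk does not specify the learning algorithm.

```python
def update_user_model(model, item_features, weight=1.0):
    """Fold one interaction into a user's model (a sparse feature -> weight dict)."""
    updated = dict(model)
    for feature, value in item_features.items():
        updated[feature] = updated.get(feature, 0.0) + weight * value
    return updated


def map_user(user_row, item_features_by_id):
    """Map-only step: a user is processed alone, using only their new interactions."""
    model = user_row["model"]
    for item_id in user_row["new_items"]:
        model = update_user_model(model, item_features_by_id[item_id])
    # The pending list is cleared so the next run skips this user
    # unless new interactions arrive.
    return {"model": model, "new_items": []}


# Hypothetical story features, e.g. topic weights.
item_features = {
    "s1": {"markets": 0.5, "tech": 0.5},
    "s2": {"tech": 1.0},
}
row = {"model": {"markets": 1.0}, "new_items": ["s1", "s2"]}
print(map_user(row, item_features))
```

Because the update starts from the existing model rather than from the full history, the cost per run scales with new interactions, not with everything the user has ever read.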
• Collaborative filter based Recommender Models
  • User model dependent on other users
  • Train all models frequently
  • Map-side self join
  • No Reducer
  • Batch training
  • Train 10s of millions of user models on each batch
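One possible reading of "map-side self join", sketched under assumptions: every mapper holds an in-memory copy of the activity table as side data and joins each streamed user row against it, so similar users are found without a shuffle or reducer. Jaccard similarity, the neighbour-weighted scoring, and the function names are illustrative choices, not details from the talk. In practice the side data would have to fit in mapper memory or be partitioned.

```python
def jaccard(a, b):
    """Overlap between two sets of item ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def map_user_against_table(user, items, side_table, k=2):
    """Map-only step: join one streamed row against the in-memory copy of the same table."""
    scores = {}
    for other, other_items in side_table.items():
        if other == user:
            continue
        sim = jaccard(items, other_items)
        if sim == 0.0:
            continue
        # Items the neighbour read that this user has not, weighted by similarity.
        for item in other_items - items:
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return user, ranked[:k]


activity = {
    "u1": {"s1", "s2"},
    "u2": {"s1", "s2", "s3"},
    "u3": {"s2", "s4"},
}
# Each "mapper" gets the full table as side data plus its own slice of rows.
batch = dict(
    map_user_against_table(user, items, activity) for user, items in activity.items()
)
print(batch)
```

Unlike the content-based case, any user's new activity can change other users' results, which is why the whole batch is recomputed rather than updated per user.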
• Serve recommendations
  • Query HBase
  • Evaluate articles against user models
  • In-memory cache
  • 1000s of requests per minute
  • 50 ms responses
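The serving steps can be sketched as: look up the user's model (cached in memory for a short time), score each candidate article against it, and return the top few. The dot-product scoring, the 60-second TTL, and the `RecommendationServer` class with its `load_model` callable are assumptions for illustration; `load_model` stands in for the HBase query.

```python
import heapq
import time


class RecommendationServer:
    """Score candidate articles against a user's model, caching models in memory."""

    def __init__(self, load_model, ttl_seconds=60.0, clock=time.monotonic):
        self._load_model = load_model  # stand-in for the HBase lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # user_key -> (expiry time, model)

    def _model_for(self, user_key):
        now = self._clock()
        cached = self._cache.get(user_key)
        if cached and cached[0] > now:
            return cached[1]
        model = self._load_model(user_key)
        self._cache[user_key] = (now + self._ttl, model)
        return model

    def recommend(self, user_key, candidates, k=3):
        """Return the k candidate ids with the highest dot product against the model."""
        model = self._model_for(user_key)

        def score(item_id):
            features = candidates[item_id]
            return sum(model.get(name, 0.0) * value for name, value in features.items())

        # Candidates are sorted first so that ties resolve by item id.
        return heapq.nlargest(k, sorted(candidates), key=score)


lookups = []


def load_model(user_key):
    lookups.append(user_key)
    return {"tech": 1.0, "markets": 0.5}


server = RecommendationServer(load_model)
candidates = {"a": {"tech": 1.0}, "b": {"markets": 1.0}, "c": {"sports": 1.0}}
print(server.recommend("u1", candidates, k=2))
```

A repeated request for the same user within the TTL is answered from the cache without another store lookup, which is one way to stay inside a tight latency budget.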
Summary
• Recommender systems are important
• Content-based and collaborative filter based
• Cross-domain expertise: Big Data, Machine Learning
• Hadoop/MapReduce for offline components
• HBase as a hybrid data store
Hiring
Email: dshah100@bloomberg.net
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
 
Sparking Science up with Research Recommendations
Sparking Science up with Research RecommendationsSparking Science up with Research Recommendations
Sparking Science up with Research RecommendationsMaya Hristakeva
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache MahoutAman Adhikari
 
Apache Mahout
Apache MahoutApache Mahout
Apache MahoutAjit Koti
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyKris Jack
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Cataldo Musto
 
Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionVarad Meru
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation Systems[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation SystemsAxel de Romblay
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
It's Just Search: Presented by Erik Hatcher, Lucidworks
It's Just Search: Presented by Erik Hatcher, LucidworksIt's Just Search: Presented by Erik Hatcher, Lucidworks
It's Just Search: Presented by Erik Hatcher, LucidworksLucidworks
 
Modern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesModern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesWill Gardella
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 

Was ist angesagt? (20)

Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Sparking Science up with Research Recommendations
Sparking Science up with Research RecommendationsSparking Science up with Research Recommendations
Sparking Science up with Research Recommendations
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in Mendeley
 
Mahout
MahoutMahout
Mahout
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
 
Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An Introduction
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation Systems[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation Systems
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
It's Just Search: Presented by Erik Hatcher, Lucidworks
It's Just Search: Presented by Erik Hatcher, LucidworksIt's Just Search: Presented by Erik Hatcher, Lucidworks
It's Just Search: Presented by Erik Hatcher, Lucidworks
 
Modern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesModern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and Practices
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 

Andere mochten auch

How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkCaserta
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architectureLiang Xiang
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineNYC Predictive Analytics
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemMilind Gokhale
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain
 
Content Marketing In The Era of Infobesity
Content Marketing In The Era of InfobesityContent Marketing In The Era of Infobesity
Content Marketing In The Era of InfobesityLinkedIn
 
Hadoop Final Documentation
Hadoop Final DocumentationHadoop Final Documentation
Hadoop Final DocumentationPoumita Das
 
Введение в рекомендательные системы
Введение в рекомендательные системыВведение в рекомендательные системы
Введение в рекомендательные системыAndrey Danilchenko
 
ИТМО Machine Learning 2016. Рекомендательные системы
ИТМО Machine Learning 2016. Рекомендательные системыИТМО Machine Learning 2016. Рекомендательные системы
ИТМО Machine Learning 2016. Рекомендательные системыAndrey Danilchenko
 
Semantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsSemantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsPasquale Lops
 
Методики оценки рекомендательных систем
Методики оценки рекомендательных системМетодики оценки рекомендательных систем
Методики оценки рекомендательных системWitology
 
Comparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering SystemsComparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering Systemsnextlib
 
Analysing data analytics use cases to understand big data platform
Analysing data analytics use cases  to understand big data platformAnalysing data analytics use cases  to understand big data platform
Analysing data analytics use cases to understand big data platformdataeaze systems
 
Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...
Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...
Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...Asoka Korale
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRoku
 
[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...
[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...
[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...Jihyung Yoo
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherMongoDB
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsMário Almeida
 

Andere mochten auch (20)

How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on Spark
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Building a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engineBuilding a Recommendation Engine - An example of a product recommendation engine
Building a Recommendation Engine - An example of a product recommendation engine
 
Collaborative Filtering Recommendation System
Collaborative Filtering Recommendation SystemCollaborative Filtering Recommendation System
Collaborative Filtering Recommendation System
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
 
Content Marketing In The Era of Infobesity
Content Marketing In The Era of InfobesityContent Marketing In The Era of Infobesity
Content Marketing In The Era of Infobesity
 
Hadoop Final Documentation
Hadoop Final DocumentationHadoop Final Documentation
Hadoop Final Documentation
 
Введение в рекомендательные системы
Введение в рекомендательные системыВведение в рекомендательные системы
Введение в рекомендательные системы
 
ИТМО Machine Learning 2016. Рекомендательные системы
ИТМО Machine Learning 2016. Рекомендательные системыИТМО Machine Learning 2016. Рекомендательные системы
ИТМО Machine Learning 2016. Рекомендательные системы
 
Semantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsSemantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender Systems
 
Методики оценки рекомендательных систем
Методики оценки рекомендательных системМетодики оценки рекомендательных систем
Методики оценки рекомендательных систем
 
Comparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering SystemsComparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering Systems
 
Analysing data analytics use cases to understand big data platform
Analysing data analytics use cases  to understand big data platformAnalysing data analytics use cases  to understand big data platform
Analysing data analytics use cases to understand big data platform
 
Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...
Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...
Recommender Algorithm for PRBT BiPartite Networks - IESL 18 Oct 2016_final_us...
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data era
 
курышев рекомендательные системы
курышев рекомендательные системыкурышев рекомендательные системы
курышев рекомендательные системы
 
[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...
[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...
[SNU UX Lab] Introducing the Space Recommender System: How crowd-sourced voti...
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File Systems
 

Ähnlich wie Recommender System at Scale Using HBase and Hadoop

Recommendation engines : Matching items to users
Recommendation engines : Matching items to usersRecommendation engines : Matching items to users
Recommendation engines : Matching items to usersjobinwilson
 
Recommendation engines matching items to users
Recommendation engines matching items to usersRecommendation engines matching items to users
Recommendation engines matching items to usersFlytxt
 
Practical Tips for Ops: End User Monitoring
Practical Tips for Ops: End User MonitoringPractical Tips for Ops: End User Monitoring
Practical Tips for Ops: End User MonitoringDynatrace
 
Recommender system and big data (design a smartphone recommender system based...
Recommender system and big data (design a smartphone recommender system based...Recommender system and big data (design a smartphone recommender system based...
Recommender system and big data (design a smartphone recommender system based...Siwar Abidi
 
Productionalize content recommendation engine
Productionalize content recommendation engine Productionalize content recommendation engine
Productionalize content recommendation engine Kim Ming Teh
 
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...Yahoo Developer Network
 
E voting(online voting system)
E voting(online voting system)E voting(online voting system)
E voting(online voting system)Saurabh Kheni
 
Doing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to ProductivityDoing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to Productivityguest3c5c731bc
 
Doing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to ProductivityDoing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to Productivitykevinreiss
 
Applying a Methodical Approach to Website Performance
Applying a Methodical Approach to Website PerformanceApplying a Methodical Approach to Website Performance
Applying a Methodical Approach to Website PerformancePostSharp Technologies
 
blood bank management system project report
blood bank management system project reportblood bank management system project report
blood bank management system project reportNARMADAPETROLEUMGAS
 
SRS2014: Towards a Scalable Recommender Engine for Online Marketplaces
SRS2014: Towards a Scalable Recommender Engine for Online MarketplacesSRS2014: Towards a Scalable Recommender Engine for Online Marketplaces
SRS2014: Towards a Scalable Recommender Engine for Online MarketplacesDominik Kowald
 
Real Time Bidding on AWS - Pop-up Loft Tel Aviv
Real Time Bidding on AWS - Pop-up Loft Tel AvivReal Time Bidding on AWS - Pop-up Loft Tel Aviv
Real Time Bidding on AWS - Pop-up Loft Tel AvivAmazon Web Services
 
Big ideas in small packages - How microservices helped us to scale our vision
Big ideas in small packages  - How microservices helped us to scale our visionBig ideas in small packages  - How microservices helped us to scale our vision
Big ideas in small packages - How microservices helped us to scale our visionSebastian Schleicher
 
Web Performance Bootcamp 2014
Web Performance Bootcamp 2014Web Performance Bootcamp 2014
Web Performance Bootcamp 2014Daniel Austin
 
Paper6745 presentation tianjian
Paper6745 presentation tianjianPaper6745 presentation tianjian
Paper6745 presentation tianjianTianjian Chen
 
Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Gabriel Moreira
 

Ähnlich wie Recommender System at Scale Using HBase and Hadoop (20)

Recommendation engines : Matching items to users
Recommendation engines : Matching items to usersRecommendation engines : Matching items to users
Recommendation engines : Matching items to users
 
Recommendation engines matching items to users
Recommendation engines matching items to usersRecommendation engines matching items to users
Recommendation engines matching items to users
 
29.4 mb
29.4 mb29.4 mb
29.4 mb
 
29.4 Mb
29.4 Mb29.4 Mb
29.4 Mb
 
Practical Tips for Ops: End User Monitoring
Practical Tips for Ops: End User MonitoringPractical Tips for Ops: End User Monitoring
Practical Tips for Ops: End User Monitoring
 
Recommender system and big data (design a smartphone recommender system based...
Recommender system and big data (design a smartphone recommender system based...Recommender system and big data (design a smartphone recommender system based...
Recommender system and big data (design a smartphone recommender system based...
 
Productionalize content recommendation engine
Productionalize content recommendation engine Productionalize content recommendation engine
Productionalize content recommendation engine
 
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
 
E voting(online voting system)
E voting(online voting system)E voting(online voting system)
E voting(online voting system)
 
Doing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to ProductivityDoing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to Productivity
 
Doing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to ProductivityDoing More with Less: Mash Your Way to Productivity
Doing More with Less: Mash Your Way to Productivity
 
Applying a Methodical Approach to Website Performance
Applying a Methodical Approach to Website PerformanceApplying a Methodical Approach to Website Performance
Applying a Methodical Approach to Website Performance
 
blood bank management system project report
blood bank management system project reportblood bank management system project report
blood bank management system project report
 
Microservices why?
Microservices   why?Microservices   why?
Microservices why?
 
SRS2014: Towards a Scalable Recommender Engine for Online Marketplaces
SRS2014: Towards a Scalable Recommender Engine for Online MarketplacesSRS2014: Towards a Scalable Recommender Engine for Online Marketplaces
SRS2014: Towards a Scalable Recommender Engine for Online Marketplaces
 
Real Time Bidding on AWS - Pop-up Loft Tel Aviv
Real Time Bidding on AWS - Pop-up Loft Tel AvivReal Time Bidding on AWS - Pop-up Loft Tel Aviv
Real Time Bidding on AWS - Pop-up Loft Tel Aviv
 
Big ideas in small packages - How microservices helped us to scale our vision
Big ideas in small packages  - How microservices helped us to scale our visionBig ideas in small packages  - How microservices helped us to scale our vision
Big ideas in small packages - How microservices helped us to scale our vision
 
Web Performance Bootcamp 2014
Web Performance Bootcamp 2014Web Performance Bootcamp 2014
Web Performance Bootcamp 2014
 
Paper6745 presentation tianjian
Paper6745 presentation tianjianPaper6745 presentation tianjian
Paper6745 presentation tianjian
 
Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Recommender System at Scale Using HBase and Hadoop

  • 1. Dhaval Shah, R&D Software Engineer, Bloomberg L. P.: Recommender Systems at scale using HBase and Hadoop
  • 2. Agenda: Introduction to Recommender Systems; Types of Recommender Systems; Building a Recommender System; Summary; (Hopefully) Lots of Q&A
  • 3. Introduction to Recommender Systems: What is a Recommender System? Wikipedia [1] – Recommender systems are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that user would give to an item or social element they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user’s social environment (collaborative filtering approaches).
  • 4. Introduction to Recommender Systems: Where are Recommender Systems used? Everywhere! (Well almost!) E-Commerce; Web Portals; Online Radio; Streaming Movies; Media/News
  • 5. 5
  • 6. 6
  • 7. 7
  • 8. Introduction to Recommender Systems
  • 9. 9
  • 10. 10
  • 11. Introduction to Recommender Systems: Why do you need a Recommender System? Too much useful information. Bloomberg.com statistics: 500-1000 stories and 100-200 videos published per day; average user consumption << articles published; satisfied user = content quality + user preference; double-digit increases in CTR
  • 12. Types of Recommender Systems: Content-based; Collaborative filter based (user-based, item-based); Hybrid
  • 13. Building a Recommender System: Collect/generate metadata about stories/videos; Identify and track users; Track user activity; Store user activity; Generate user models; Serve recommendations
  • 14. Building a Recommender System: Collect metadata about stories/videos (URLs, headlines, etc.; Sqoop, custom scripts); Generate features for stories (LDA from Mahout; custom extensions)
  • 15. Building a Recommender System: Identify and track users: Registered; Anonymous (cookie-based tracking, IP-based tracking)
  • 16. Building a Recommender System: Types of user activity: Explicit interactions; Implicit interactions
  • 17. Building a Recommender System: Tracking user activity. Diagram labels: Browser (Javascript), HTTP Server, D, Flume, HBase
  • 18. Building a Recommender System: Tracking, key features: 1000s of ppm; Asynchronous, with instantaneous responses to the client; Reliability; Multiple HTTP servers → multiple clusters; Client to HBase in milliseconds
  • 19. Building a Recommender System: Why HBase? Scalable; Fault-tolerant; Auto-sharding; Schema-less and sparse; Real-time queries; MR integration
  • 20. Building a Recommender System: Store user activity: 100s of millions of users; Millions of stories/videos; TBs of data; Wide tables, 1 row per user; High load; Sub-second response times; Multiple MR jobs every few mins
  • 21. Building a Recommender System: Generate user models using ML: 100s of millions of users; High IO/processing power; Train multiple times an hour
  • 22. Building a Recommender System: Content-based Recommender Models: User model independent of other users; Train only when user has new interaction; Easily parallelizable; No Reducer; Incremental training; Train 1000 user models a minute
  • 23. Building a Recommender System: Collaborative filter based Recommender Models: User model dependent on other users; Train all models frequently; Map-side self join; No Reducer; Batch training; Train 10s of millions of user models on each batch
  • 24. Building a Recommender System: Serve recommendations: Query HBase; Evaluate articles against user models; In-memory cache; 1000s of requests per minute; 50ms responses
  • 25. Summary: Recommender Systems are important; Content based and Collaborative filter based; Cross-domain expertise (Big Data, Machine Learning); Hadoop/MapReduce for offline components; HBase as a hybrid data store

Editor's Notes

  1. Hi everyone. I am Dhaval Shah. I work in the R&D department at Bloomberg and spend the majority of my time building recommender and analytics systems for Bloomberg.com. For those who are not aware, Bloomberg, as a company, provides best-in-class financial data, analytics and news services. The Bloomberg Terminal is a paid product for financial professionals, providing them with world-class tools that help them stay ahead of the competition. Bloomberg.com provides some of that information free of cost to the general internet user. Bloomberg.com is one of the top 10 most visited news and data websites in the world and one of the most influential. For the next 30 minutes or so, I will be sharing my experience of building Recommender Systems for Bloomberg.com using the Hadoop ecosystem, mainly Hadoop, HBase and Flume.
  2. Just a peek into today’s agenda. We will start off with a short introduction on Recommender Systems, discuss the different types of Recommender Systems and then dive right into the different pieces that together make a Recommender System functional. During this technical deep dive, we will see how the different parts of the Hadoop ecosystem simplify building a Recommender System. And finally, I hope to have lots of questions from you guys.
  3. So what is a Recommender System? Anyone from the audience want to help me out on answering this one? Right. So here is a rather complex definition from Wikipedia. Here is how I like to put it. A user interacting with a web site is indirectly telling us something about his or her interests by virtue of what he reads, clicks or shares. We can use this data to understand the user’s interests and better serve the user. A Recommender System is just a fancy name for a system which does this.
  4. So where are Recommender Systems used? Can someone from the audience give me some examples? Yup. The short answer is almost everywhere, from E-Commerce to Online Radio to Media.
  5. For example, this is how Amazon uses it to entice users.
  6. Here is how IMDB uses Recommenders.
  7. Pandora’s business is mainly driven by intelligent use of such systems.
  8. And this is a rather incomplete collection of websites that effectively use Recommender Systems to derive business value.
  9. On Bloomberg.com, you can see these modules called “Recommended” pretty much everywhere, on news pages, video pages, homepage, etc.
  10. On Bloomberg.com, you can see these modules called “Recommended” pretty much everywhere, on news pages, video pages, homepage, etc.
  11. Some of you might be wondering: why do we need a Recommender System at all? From my own experience and from talking to practitioners in the field, the single most important reason is that there is too much useful data out there for any human to make reasonable use of without an automated system. Let’s look at some stats to answer this question in the context of Bloomberg.com. Bloomberg.com publishes 500-1000 stories and 100-200 videos per day on average. The numbers vary slightly based on the news cycle. It’s obvious that no user has the time to read everything published on Bloomberg.com and the millions of other news sites on the internet. In fact, the average consumption is in single digits, which is far less than the number of articles published. There are two main factors that influence whether a user will read an article: content quality and user preference. The editorial team at Bloomberg does an excellent job of producing high-quality content. However, when you have 100s of millions of unique visitors a year on your site, no matter how good the editorial staff is, it’s not humanly possible to manually curate the website and still be relevant to the entire user base. Humans are lazy by nature and would prefer doing as little work as possible in searching or browsing to find relevant content. That’s where modeling the user preference and serving relevant content becomes extremely important to the business. The effect is not immediate, but it helps in slowly gaining customer loyalty since the user does not need to spend valuable time searching for relevant content. And most importantly, based on A/B tests, we have seen 20-30% increases in click-through rates on certain modules when we put in recommendations.
  12. Recommender systems can be broadly classified into two types: content based and collaborative filter based. Content-based systems typically try to model a user’s preferences in terms of features or characteristics of the content, solely based on that user’s activities and independent of the activities of other users. A simplistic representation of my preferences, for example, would be that I like technology stories. These kinds of systems try to recommend articles similar to the ones the user has viewed in the past. They rely on the assumption that we can extract features that faithfully represent each article. For stories, there are many NLP-based algorithms which help us achieve this easily. However, for videos, this is still relatively difficult. So these kinds of systems tend to work well for stories, but the performance for videos is still an open question and requires substantial effort. On the other hand, collaborative filter based approaches rely on finding users with similar interests, and thus user models are dependent on the activities of other users. These kinds of systems try recommending articles which users similar to the user in question have viewed. They do not rely on features of the article itself and thus can be easily applied to non-text media types like videos and company quotes. However, since every user’s model is dependent on every other user’s activities, these kinds of algorithms are much more computationally expensive and require massive processing power. Collaborative filters can be further classified into user-based collaborative filters and item- or resource-based collaborative filters. For user-based collaborative filters, you find a set of similar users based on history and then use the activities of those similar users to serve recommendations. Item-based collaborative filters can also be described as “People who viewed X also viewed Y”, as you can prominently see on sites like Amazon.com. And then of course you can mix and match various flavors of content-based and collaborative filter based algorithms and create hybrid recommendation systems. We have various flavors of all of these kinds of Recommender Systems running live in production at the moment.
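As a rough illustration of the “People who viewed X also viewed Y” idea, here is a minimal in-memory sketch in Python. It simply counts how often two items appear in the same user's history. The function names and the toy data are hypothetical, and this is not the production algorithm described in the talk, which runs as MapReduce jobs over HBase.

```python
from collections import defaultdict
from itertools import combinations


def co_view_counts(histories):
    """Count, for every pair of items, how many users viewed both.

    `histories` maps a user id to the list of item ids that user viewed.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        # Use a set so repeated views by one user count only once per pair.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_viewed(counts, item, top_n=3):
    """Return up to `top_n` items most often co-viewed with `item`."""
    ranked = sorted(counts.get(item, {}).items(),
                    key=lambda kv: (-kv[1], kv[0]))  # ties broken by item id
    return [other for other, _ in ranked[:top_n]]


# Toy data: three users viewed both "a" and "b".
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["a", "b", "d"],
    "u4": ["b"],
}
counts = co_view_counts(histories)
top_for_a = also_viewed(counts, "a", top_n=1)
```

Raw co-occurrence counts favor popular items; a real system would normally normalize them (for example with a similarity measure), which is left out here for brevity.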
  13. Now that we have discussed the what, where and why of Recommender Systems, let’s get to the fun part: how to build a Recommender System. At a high level, you can break it up into 6 steps. First, collect and/or generate metadata about stories and videos; this step is entirely independent of user interaction data. The next step is to identify and track users. Once we can identify our users, we need to track their activity on the site. The next step is to organize and store this activity. Then we use some Machine Learning to generate user preference models. And finally, we use the models we just created to serve recommendations. For the rest of the presentation, we will be digging deeper into each of these pieces.
  14. Our recommender system lives on a separate infrastructure from the main Bloomberg.com infrastructure. So the first step is collecting relevant metadata about our articles from the main Bloomberg.com system. This includes details like URLs, headlines and so on. We use a combination of Sqoop and some custom scripts running from cron to gather this data. For those who haven’t heard of Sqoop, it is a tool that lets you transfer data between an RDBMS on one side and Hadoop or HBase on the other. For content-based recommendation models, we need to extract features from our stories. We use the LDA implementation from Mahout to help us achieve this. For those who haven’t heard of Mahout, it’s a Machine Learning library, and many of its algorithms run on top of Hadoop. However, Mahout’s LDA is designed to run on the entire batch of stories and takes a really long time to complete, whereas for a news website like Bloomberg.com, new stories are published every few minutes and the lifetime of a story is short, so batch LDA alone isn’t going to serve the purpose. So we built extensions on top of Mahout’s LDA which allow us to evaluate new documents without going through the entire training process. For new documents this process now completes in a couple of minutes instead of the hours or days required for full-fledged training.
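The talk does not show how the Mahout extensions work, so the following is only a sketch of the general idea of scoring a new document against already-trained topics without retraining. It uses a plain EM "fold-in" over fixed topic-word probabilities, which is a swapped-in simplification, not Mahout's API and not the author's implementation. The function name, the `topic_word` structure and the toy topics are all assumptions made for illustration.

```python
from collections import Counter


def infer_topics(tokens, topic_word, iterations=50):
    """Estimate a topic mixture for one new document.

    `topic_word` is a list with one dict per topic, mapping a word to its
    probability under that (already trained, fixed) topic. Only the
    document's mixture is updated; the topics themselves are never retrained.
    """
    k = len(topic_word)
    # Ignore words that no trained topic knows about.
    counts = Counter(t for t in tokens if any(t in tw for tw in topic_word))
    total = sum(counts.values())
    theta = [1.0 / k] * k  # start from a uniform mixture
    if total == 0:
        return theta
    for _ in range(iterations):
        new_theta = [0.0] * k
        for word, c in counts.items():
            # E-step: share of this word attributed to each topic.
            weights = [theta[j] * topic_word[j].get(word, 1e-12)
                       for j in range(k)]
            norm = sum(weights)
            for j in range(k):
                new_theta[j] += c * weights[j] / norm
        # M-step: renormalize into a probability distribution.
        theta = [v / total for v in new_theta]
    return theta


# Two hypothetical trained topics: finance and technology.
topics = [
    {"stocks": 0.5, "bonds": 0.5},
    {"iphone": 0.5, "android": 0.5},
]
theta = infer_topics(["iphone", "android", "iphone"], topics)
```

Because only one small vector per document is updated, a step like this costs a few passes over the document's words rather than a pass over the whole corpus, which is the property the notes above rely on.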
  15. On the user side, the fundamental requirement is to identify and track users. We have two types of users: registered and anonymous. For registered users who are logged in, this task is easy and accurate. However, these are a very small percentage of the actual audience. How small? For every registered user on Bloomberg.com, we have more than 1,000 anonymous users. This underscores the necessity of tracking anonymous users as well. To track anonymous users, we can use cookie-based or IP-based tracking. I will not go into details at this point since this is a fairly standard problem across the industry and the trade-offs between the various solutions are well known.
  16. Next up, we need to collect data about their actions. User interactions can be categorized into explicit and implicit. Explicit interactions include actions like Facebook Likes, LinkedIn shares, tweets on Twitter and so on. In general this is high-quality data, but it is difficult to collect because of the extra step required from the user. On the other hand, just viewing a story is an implicit interaction. The quality of this data can be slightly lower, but it is much easier to collect and you can get a lot more of it. From a Machine Learning standpoint, getting as much data as necessary is crucial, and thus implicit data plays a very important role. For Bloomberg.com, we use a combination of both implicit and explicit data, but it is the implicit data that gives us enough information for building such a system to make sense at all.
  17. Due to the use of CDNs and caching, tracking of user activity cannot be done at the application servers directly. We use JavaScript to get this data from the client browser. The browser tracking request hits an HTTP server which logs the data to a file and returns a dummy response. This ensures fast responses and does not hold up a client connection. More importantly, it allows us to handle high load and peak traffic periods gracefully, independent of the state of the backend. We use a multi-tier Flume architecture to transfer this data to its final resting place, which is HBase. There is a Flume process monitoring the file which the HTTP server writes to, and as soon as new data is written, it cleans it up and transfers it to HBase. Though this process happens asynchronously with respect to the client, the data reaches HBase in a matter of milliseconds. For DR and failover purposes we have multiple HBase clusters in multiple data centers. We use Flume to write this user activity data to all of our clusters at the same time, which, by the way, is really easy to set up with Flume. Flume provides a certain level of reliability guarantee which helps us avoid data loss. Flume also has plugins for Hadoop and HBase. However, the HBase plugin for Flume lacks certain features and does not handle failures gracefully, so we had to write our own, but that isn’t a terribly difficult thing to do. We have written some custom decorators to parse the HTTP server’s log and store it in a usable format in HBase. We have also built some bot filtering mechanisms into our decorators.
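To make the role of the parsing and bot-filtering decorators concrete, here is a small Python sketch of that step in isolation. The log format (tab-separated timestamp, user id, URL and user agent), the `views:` column prefix and the bot markers are all assumptions for illustration; the talk does not describe the real format, and actual Flume decorators are written against Flume's Java API rather than like this.

```python
# Hypothetical substrings that mark a user agent as a bot.
BOT_MARKERS = ("bot", "crawler", "spider")


def parse_tracking_line(line):
    """Turn one tracking log line into an HBase-style cell, or None.

    Assumed line format: "<epoch_seconds>\t<user_id>\t<url>\t<user_agent>".
    Malformed lines and obvious bots are dropped by returning None.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return None
    timestamp, user_id, url, user_agent = parts
    if not user_id or not timestamp.isdigit():
        return None
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None
    # One row per user, one column per viewed URL, value = time of the view.
    return {"row": user_id, "column": "views:" + url, "value": int(timestamp)}


good = parse_tracking_line("1370000000\tuser42\t/news/story-1\tMozilla/5.0\n")
bot = parse_tracking_line("1370000001\tuser43\t/news/story-1\tExampleBot/2.1\n")
bad = parse_tracking_line("not a tracking line\n")
```

Keeping this step a pure function of one log line means it can run anywhere in the pipeline without touching the HTTP server, which only has to append to a file and reply.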
  18. Here are some key features of our tracking infrastructure. Bloomberg.com gets 10s of 1000s of page views per minute. Though we don’t track visits to all pages yet, the amount of data we track is still substantial. We already discussed that the client gets back an instantaneous response since the HTTP server logs to a file and returns a response. Flume’s reliability guarantees enable us to ensure that we don’t lose data even if the backend goes down or is unresponsive for a short period of time. We have multiple tracking servers writing to multiple HBase clusters, all spread across multiple data centers. This sounds like a complex setup, but Flume’s capabilities and proper modularization make it really simple to set up and maintain. And most importantly, all of this happens in a matter of milliseconds, which makes the system look live, just as a synchronous mechanism would.
  19. As I mentioned earlier, we use HBase as the backend database for all the data in this system, including user activity data, user models and article metadata. HBase provides the right mix of features, scalability and reliability for our needs. Here are the main reasons we decided to go with HBase: HBase is horizontally scalable, which allows us to store and process terabytes of data at a reasonable cost. HBase is designed for fault tolerance and automatic recovery from failures. I think this is really important when you scale horizontally, because with more machines the probability of a server going down increases. HBase takes care of the headaches of sharding data and automatically manages the shards as the data keeps growing. HBase is schema-less and sparse. It doesn’t require you to define the entire schema beforehand and allows you to add columns on the fly, with different rows having different columns altogether. This greatly simplifies schema design. For example, we can have each user represented by a row and add a column every time that user views a story or a video, which gives a natural grouping of data. This is particularly suitable for running MR jobs efficiently on this data. HBase allows you to run real-time queries on this vast amount of data and still get millisecond responses. And it has a unique feature: MR jobs can run efficiently on the same data used for serving real-time responses. This is probably the most important feature for a Recommender System. It takes away the complexity of managing separate data stores for batch processing and online queries, greatly simplifying the entire app. By doing this, you do run the risk that resource-intensive batch processes affect your real-time responses, which is why many people in the community recommend not going down this route. However, if properly configured, this does not turn out to be a problem in real life. You just need the right level of resource isolation and the right config parameters. The fact that Bloomberg.com can serve recommendations within 50ms even when multiple MR jobs are running in the background is testimony that HBase does support these kinds of architectures.
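To make the "one row per user, one column per viewed item" layout concrete, here is a small Python sketch that models a sparse wide table as nested dictionaries. It does not use the HBase client API; `new_table`, `record_view`, `scan`, and the `view:` column prefix are hypothetical names chosen for the example.

```python
from collections import defaultdict


def new_table():
    # row key (user id) -> {column qualifier -> value}; rows are sparse,
    # so each row stores only the columns that were actually written.
    return defaultdict(dict)


def record_view(table, user_id, article_id, timestamp):
    # One column per viewed item, added on the fly; the value is the
    # time of the most recent view.
    table[user_id]["view:" + article_id] = timestamp


def scan(table):
    # Yield (row key, full row) in key order, similar in spirit to how a
    # batch job receives one complete user row per map call.
    for row_key in sorted(table):
        yield row_key, dict(table[row_key])
```

With this layout, a batch job that walks `scan(table)` sees everything known about one user in a single step, which is the property the note relies on for efficient MR jobs.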
  20. Here are some stats on our current usage of HBase: We currently have data for hundreds of millions of users in our HBase tables. Each of those users could have interacted with any of the million stories or videos published by Bloomberg.com. This is where the sparse nature of HBase comes in handy. This adds up to terabytes of data across multiple HBase tables. We use wide tables with one row per user, which greatly simplifies our app. Given the way MR jobs scan HBase tables, one row per user fits the paradigm naturally: each call to map gets all the details about a single user, which wouldn’t be possible with a tall, narrow table. Our Recommender System serves a high volume of traffic and is capable of handling a lot more. We run many HBase queries and a good deal of processing per request and still manage to serve recommendations within 50ms on average. And all of this happens while multiple MR jobs run on the same HBase tables in the background, reading and writing massive amounts of data. We will get into the details in the coming slides.
  21. So, now we have all our raw user interaction data and article metadata in our HBase tables. The next step is to train user models on this data and store the results back into HBase. We are talking about running Machine Learning algorithms on terabytes of data for hundreds of millions of users. This involves a massive amount of IO and processing power, and it is where technologies like Hadoop and HBase shine. I wouldn’t even try doing this on a traditional RDBMS-based system. For a news website like Bloomberg.com, timeliness is really important. A news article that is very important at this moment may not hold any importance at all a few hours later. For this reason, we have to train our models multiple times every hour, which entails an even greater need for IO and processing power. How critical this requirement is depends on the algorithm, which we will discuss in a minute. A few slides back, we categorized recommendation algorithms into content based and collaborative filter based. Let’s talk about user model training for these separately, since they have slightly different requirements.
  22. As discussed previously, content based systems typically try to model a user’s preferences in terms of features of the content. Moreover, these models are generally based solely on that user’s activities and are independent of the activities of other users. This means that each user’s model can be trained independently and only changes when that user has a new interaction, assuming the article features remain constant. This problem is very easy and natural to parallelize. We just split the users into some number of buckets and assign each bucket to a mapper. We can write the trained models back to HBase directly from the mapper, completely eliminating the need for a reducer and the sort and shuffle phase that comes with it. Moreover, since this training happens incrementally when a user reads a new article, we can run it every few minutes: the total amount of work stays almost the same, and running it more frequently gives us fresher models. For returning users with a substantial history, the models remain fairly constant and frequent retraining might not gain us much. For relatively new users, however, it is a huge deal, since we now have a model for the user rather than none. On Bloomberg.com, we run this training every 5 minutes and train about 5000 user models on every run, which comes out to about 1000 user models a minute on average.
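The per-user, incremental nature of this training can be sketched in a few lines of Python. The talk does not say which model is used, so this example assumes the simplest possible one: a user model that is the running mean of the feature vectors (for example, topic weights) of the articles the user has read. `bucket_for` and `update_model` are hypothetical names.

```python
import zlib


def bucket_for(user_id: str, num_buckets: int) -> int:
    # Stable hash so a user always lands in the same bucket; each bucket
    # would be handled by one mapper, with no reducer involved.
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def update_model(model, count, article_features):
    """Fold one newly read article into a running-mean user model.

    model: current mean feature vector, count: articles seen so far.
    Returns the updated (model, count) pair.
    """
    new_count = count + 1
    new_model = [m + (f - m) / new_count
                 for m, f in zip(model, article_features)]
    return new_model, new_count


# Usage: a brand-new user reads two articles with two features each.
model, count = [0.0, 0.0], 0
model, count = update_model(model, count, [1.0, 0.0])
model, count = update_model(model, count, [0.0, 1.0])
```

Because each update touches only one user's state, the work can be split across buckets freely and the updated model written straight back to the store from the map step.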
  23. In contrast to content based approaches, collaborative filter based approaches rely on finding users with similar interests, so a user’s model depends on the activities of other users. This means that every time any user views an article, all user models are potentially outdated. We therefore need to train user models for potentially all users every few minutes, which, as you can imagine, is computationally very expensive. At a high level, this algorithm requires you to compare each user with every other user to find similar users, which would probably take forever to complete. Thankfully, for a news website like Bloomberg.com, even though old history is useful for building user models, only the latest data is necessary to serve recommendations, which simplifies the problem a little. We use a map-side join to realize this self join. Again there is no reducer, which saves the time of the shuffle and sort phase. The training has to happen in a batch for all users, and we train tens of millions of user models multiple times an hour.
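The idea of a map-side self join over recent activity can be illustrated with a small Python sketch. It assumes the recent-views table is small enough to hold in memory on every mapper and uses Jaccard similarity between sets of recently viewed articles; the talk names neither the similarity measure nor the data layout, so `jaccard`, `map_split`, and `recent_views` are illustrative assumptions.

```python
def jaccard(a, b):
    # Overlap between two sets of article ids, in the range 0.0 to 1.0.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_split(split_users, recent_views, top_k=2):
    """Map-only step: for each user in this split, find similar users.

    recent_views (user id -> set of recently viewed article ids) plays
    the role of the in-memory side of the join; split_users is the slice
    of users assigned to this mapper.
    """
    result = {}
    for user in split_users:
        mine = recent_views[user]
        scored = [(jaccard(mine, theirs), other)
                  for other, theirs in recent_views.items()
                  if other != user]
        # Keep only users with some overlap, best match first;
        # ties are broken by user id so the output is deterministic.
        scored = [(s, o) for s, o in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[user] = [other for _, other in scored[:top_k]]
    return result
```

Restricting the join to recent activity keeps the in-memory side small, which is what makes the all-pairs comparison tractable without a reduce phase in this sketch.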
  24. At this point, all the data required for recommendations is available in HBase. The final piece of the puzzle is the real-time piece: a client requests a recommendation and the response needs to be served within milliseconds. When the application server receives the request, it runs multiple queries against HBase to get all the data it needs, runs some Machine Learning evaluation steps on the models created in the background, and serves the top-ranking articles based on that evaluation. Compared to the training, this is computationally very cheap and hence can be done in real time. However, a decent amount of processing is still needed to complete the evaluation and rank the articles. We leverage in-memory caching on the application servers for speed. Our current production system is able to serve at least tens of thousands of requests per minute without the in-memory cache layer and potentially a lot more with it. Our average response times are less than 50ms for all of our recommendation algorithms.
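The serving path (fetch, evaluate, rank, with an in-memory cache in front of the store) can be sketched as follows. This is an assumption-laden illustration: articles are scored by a dot product between the user model and article features, and `FeatureCache` and `recommend` are hypothetical names; the talk does not describe the actual evaluation step.

```python
import heapq


class FeatureCache:
    """Tiny in-memory cache in front of a slower feature lookup."""

    def __init__(self, fetch):
        self._fetch = fetch      # stand-in for a query against the store
        self._cache = {}
        self.misses = 0

    def get(self, article_id):
        if article_id not in self._cache:
            self.misses += 1
            self._cache[article_id] = self._fetch(article_id)
        return self._cache[article_id]


def recommend(user_model, candidate_ids, cache, k):
    # Score each candidate against the user model and keep the top k.
    def score(article_id):
        features = cache.get(article_id)
        return sum(u * f for u, f in zip(user_model, features))

    return heapq.nlargest(k, candidate_ids, key=score)


# Usage with a dictionary standing in for the backing store.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
cache = FeatureCache(store.__getitem__)
top = recommend([0.9, 0.1], ["a", "b", "c"], cache, k=2)
```

Article features change far less often than requests arrive, so caching them on the application server removes most store lookups from the request path in this sketch.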
  25. To summarize the talk: Recommender Systems are really important in today’s world, where the user base for most online businesses is huge and at the same time there is a need to serve each user personally. Recommender systems come in two flavors, content based and collaborative filter based, and the collaborative filter based algorithms can be further classified into user based and item based. Building a recommender system requires cross-domain expertise, especially in Machine Learning and Big Data. Hadoop’s MapReduce framework provides solid ground for parallel processing and helps simplify building the offline components of a Recommender System. And most importantly, HBase can be used as a massively scalable, distributed hybrid data store. I call it hybrid because it can be used for online queries as well as a source and sink for offline MapReduce jobs.
  26. If you found any of this interesting and would like to get your hands dirty on some of these systems, please get in touch with me after the presentation or email me at dshah100@bloomberg.net.
  27. Questions?