Recommender System at Scale Using HBase and Hadoop


Recommender Systems play a crucial role in a variety of businesses in today's world. From E-Commerce web sites to News Portals, companies are leveraging data about their users to create a personalized user experience, gain competitive advantage and eventually drive revenue. Dealing with the sheer quantity of data readily available can be a daunting task by itself. Applying machine learning algorithms on top of it makes the problem considerably more complex. Fortunately, tools like Hadoop and HBase make this task a little more manageable by taking out some of the complexities of dealing with a large amount of data. In this talk, we will share our success story of building a recommender system for Bloomberg.com leveraging the Hadoop ecosystem. We will describe the high level architecture of the system and discuss the pros and cons of our design choices. Bloomberg.com operates at a scale of 100s of millions of users. Building a recommendation engine for Bloomberg.com entails applying Machine Learning algorithms on terabytes of data while still serving sub-second responses. We will discuss techniques for efficiently and reliably collecting data in near real-time, the notion of offline vs. online processing and, most importantly, how HBase fits the bill by serving as a real-time database as well as input/output for running MapReduce.

Technology

Dhaval Shah
R&D Software Engineer, Bloomberg L. P.
Recommender Systems at scale
using HBase and Hadoop
Bloomberg

Agenda
• Introduction to Recommender Systems
• Types of Recommender Systems
• Building a Recommender System
• Summary
• (Hopefully) Lots of Q&A

Introduction to Recommender Systems
• What is a Recommender System?
  • Wikipedia: Recommender systems are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that user would give to an item or social element they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user’s social environment (collaborative filtering approaches).

Introduction to Recommender Systems
• Where are Recommender Systems used?
  • Everywhere! (Well almost!)
  • E-Commerce
  • Web Portals
  • Online Radio
  • Streaming Movies
  • Media/News

Introduction to Recommender Systems
• Why do you need a Recommender System?
  • Too much useful information
  • Bloomberg.com statistics
    o 500-1000 stories, 100-200 videos published per day
    o Average user consumption << Articles published
    o Satisfied user = Content Quality + User preference
    o Double digit increases in CTR

Types of Recommender Systems
• Content-Based
• Collaborative filter based
  • User-based
  • Item-based
• Hybrid

Building a Recommender System
• Collect/Generate metadata about stories/videos
• Identify and track users
• Track user activity
• Store user activity
• Generate user models
• Serve recommendations

Building a Recommender System
• Collect metadata about stories/videos
  • URLs, Headlines, etc.
  • Sqoop, Custom Scripts
• Generate features for stories
  • LDA from Mahout
  • Custom extensions

Building a Recommender System
• Identify and track users
  • Registered
  • Anonymous
    o Cookie based tracking
    o IP based tracking

Building a Recommender System
• Types of user activity
  • Explicit interactions
  • Implicit interactions

Building a Recommender System
Tracking user activity
[Diagram: Browser (Javascript) → HTTP Server → Flume → HBase]

Building a Recommender System
• Tracking: Key Features
  • 1000s of ppm
  • Asynchronous - Instantaneous responses to client
  • Reliability
  • Multiple HTTP Servers → Multiple Clusters
  • Client to HBase in milliseconds

Building a Recommender System
• Why HBase?
  • Scalable
  • Fault-tolerant
  • Auto-sharding
  • Schema-less and sparse
  • Real-time queries
  • MR integration

Building a Recommender System
• Store user activity
  • 100s of millions of users
  • Millions of stories/videos
  • TBs of data
  • Wide Tables: 1 row per user
  • High load
  • Sub-second response times
  • Multiple MR jobs every few mins
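
One way to picture the "wide table, 1 row per user" layout is the sketch below. It models an HBase table as a nested dictionary in plain Python: the row key is the user id, and every interaction becomes its own column. The column family `a`, the reversed-timestamp qualifier layout, and the class and function names are assumptions made for illustration, not the schema used in the talk.

```python
from collections import defaultdict

MAX_TS = 10**13  # assumed upper bound on millisecond timestamps


def qualifier(ts_ms, story_id):
    """Build a qualifier that sorts newest-first as a plain string (assumed layout)."""
    return f"{MAX_TS - ts_ms:013d}:{story_id}"


class WideTable:
    """In-memory stand-in for a table with one row per user."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, user_id, ts_ms, story_id, action):
        # One column per interaction, all stored in the same user row.
        self.rows[user_id][f"a:{qualifier(ts_ms, story_id)}"] = action

    def recent(self, user_id, limit):
        """Return the newest interactions as (timestamp, story_id, action)."""
        row = self.rows.get(user_id, {})
        out = []
        for col in sorted(row)[:limit]:
            _, rest = col.split(":", 1)
            rev_ts, story_id = rest.split(":", 1)
            out.append((MAX_TS - int(rev_ts), story_id, row[col]))
        return out
```

Because all of a user's activity sits in one row, both the serving path and a per-user training job can fetch it with a single row read.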

Building a Recommender System
• Generate user models using ML
  • 100s of millions of users
  • High IO/Processing power
  • Train multiple times an hour

Building a Recommender System
• Content-based Recommender Models
  • User model independent of other users
  • Train only when user has new interaction
  • Easily parallelizable
  • No Reducer
  • Incremental training
  • Train 1000 user models a minute
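
Since a content-based model depends only on one user's own activity, it can be updated row by row with no reduce step. The sketch below shows that shape under stated assumptions: the user profile is a running mean of story feature vectors, and `map_user_row`, the row layout, and `story_index` are hypothetical names, not the models or job code from the talk.

```python
def update_profile(profile, count, story_features):
    """Fold one newly viewed story into a running-mean user profile."""
    new_count = count + 1
    updated = dict(profile)
    for topic in set(profile) | set(story_features):
        old = profile.get(topic, 0.0)
        x = story_features.get(topic, 0.0)
        # Incremental mean: no need to revisit older interactions.
        updated[topic] = old + (x - old) / new_count
    return updated, new_count


def map_user_row(user_row, story_index):
    """Map-only step: takes one user's row and returns that user's new model."""
    profile, count = user_row["profile"], user_row["count"]
    for story_id in user_row["new_views"]:
        profile, count = update_profile(profile, count, story_index[story_id])
    return {"profile": profile, "count": count, "new_views": []}
```

Rows with an empty `new_views` list come back unchanged, which matches the idea of training only when a user has a new interaction.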

Building a Recommender System
• Collaborative filter based Recommender Models
  • User model dependent on other users
  • Train all models frequently
  • Map side self join
  • No Reducer
  • Batch training
  • Train 10s of millions of user models on each batch
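
The slide does not spell out how the map side self join works, so the following is only a sketch of the general idea: each mapper holds a copy of the activity table in memory (`side_table`) and joins every input user against it, so no reducer is needed. Jaccard similarity, the neighbourhood size `k`, and the assumption that the side table fits in memory are all choices made for this example.

```python
def jaccard(a, b):
    """Overlap between two sets of viewed items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_user(user_id, viewed, side_table, k=2):
    """Mapper for one user: join against the in-memory copy of the same table."""
    neighbours = sorted(
        ((jaccard(viewed, other), other_id)
         for other_id, other in side_table.items() if other_id != user_id),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other_id in neighbours:
        # Credit items the neighbour viewed that this user has not seen yet.
        for item in side_table[other_id] - viewed:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Each call compares one user with every other user, which illustrates why this kind of training is run as a batch over all users rather than incrementally.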

Building a Recommender System
• Serve recommendations
  • Query HBase
  • Evaluate articles against user models
  • In-memory cache
  • 1000s of requests per minute
  • 50ms responses
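
The serving path can be sketched as a cached model lookup followed by scoring candidate articles against the user model. Everything named here is an assumption for illustration: `USER_MODELS` and `ARTICLES` stand in for HBase reads, the score is a plain dot product, and `functools.lru_cache` plays the role of the in-memory cache.

```python
from functools import lru_cache

# Stand-ins for data that would be read from HBase in a real deployment.
USER_MODELS = {"u1": {"tech": 0.8, "markets": 0.2}}
ARTICLES = {
    "a1": {"tech": 1.0},
    "a2": {"markets": 1.0},
    "a3": {"tech": 0.5, "markets": 0.5},
}


@lru_cache(maxsize=100_000)
def load_model(user_id):
    """Fetch a user model; returned as a tuple so the cached value is immutable."""
    return tuple(sorted(USER_MODELS.get(user_id, {}).items()))


def score(model, features):
    """Dot product between a user model and an article's feature vector."""
    return sum(weight * features.get(topic, 0.0) for topic, weight in model.items())


def recommend(user_id, n=2):
    model = dict(load_model(user_id))
    ranked = sorted(ARTICLES, key=lambda a: (-score(model, ARTICLES[a]), a))
    return ranked[:n]
```

Repeated requests for the same user skip the lookup and only pay for scoring, which is the kind of saving an in-memory cache is meant to provide.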

Summary
• Recommender Systems are important
• Content based and Collaborative filter based
• Cross domain expertise: Big Data, Machine Learning
• Hadoop/MapReduce for offline components
• HBase as a hybrid data store

Hiring
Email: dshah100@bloomberg.net

Questions?

Editor's Notes

Hi everyone. I am Dhaval Shah. I work in the R&D Department at Bloomberg and spend the majority of my time building Recommender and Analytic systems for Bloomberg.com. For those who are not aware, Bloomberg, as a company, provides best in class financial data, analytics and news services. The Bloomberg Terminal is a paid product for financial professionals, providing them with world class tools that help them stay ahead of the competition. Bloomberg.com provides some of that information free of cost to the general user on the internet. Bloomberg.com is one of the top 10 most visited news and data websites in the world and one of the most influential. For the next 30 mins or so, I will be sharing my experience of building Recommender Systems for Bloomberg.com using the Hadoop ecosystem, mainly Hadoop, HBase and Flume.
Just a peek into today’s agenda. We will start off with a short introduction on Recommender Systems, discuss the different types of Recommender Systems and then dive right into the different pieces that together make a Recommender System functional. During this technical deep dive, we will see how the different parts of the Hadoop ecosystem simplify building a Recommender System. And finally, I hope to have lots of questions from you guys.
So what is a Recommender System? Anyone from the audience want to help me out on answering this one? Right. So here is a rather complex definition from Wikipedia. Here is how I like to put it. A user interacting with a web site is indirectly telling us something about his or her interests by virtue of what he reads, clicks or shares. We can use this data to understand the user’s interests and better serve the user. A Recommender System is just a fancy name for a system which does this.
So where are Recommender Systems used? Can someone from the audience give me some examples? Yup. The short answer is almost everywhere, from E-Commerce to Online Radio to Media.
For example, this is how Amazon uses it to entice users.
Here is how IMDB uses Recommenders.
Pandora’s business is mainly driven by intelligent use of such systems.
And this is a rather incomplete collection of websites that effectively use Recommender Systems to derive business value.
On Bloomberg.com, you can see these modules called “Recommended” pretty much everywhere, on news pages, video pages, homepage, etc.
Some of you might be wondering: why do we need a Recommender System at all? From my own experience and from talking to practitioners in the field, the single most important reason is “There is too much useful data out there for any human to make reasonable use of without an automated system”. Let’s look at some stats to answer this question in the context of Bloomberg.com. Bloomberg.com publishes 500-1000 stories and 100-200 videos per day on average. The numbers vary slightly based on the news cycle. It’s obvious that no user has the time to read everything published on Bloomberg.com and the millions of other news sites on the internet. In fact, the average consumption is in single digits, which is far lower than the number of articles published. There are two main factors that influence whether a user will read an article or not: content quality and user preference. The editorial team at Bloomberg does an excellent job of producing high quality content. However, when you have 100s of millions of unique visitors a year on your site, no matter how good the editorial staff is, it’s not humanly possible to manually curate the website and still be relevant to the entire user base. Humans are lazy by nature and would prefer doing as little work as possible in searching or browsing to find relevant content. That’s where modeling the user preference and serving relevant content becomes extremely important to the business. The effect is not immediate, but it helps in slowly gaining customer loyalty since the user does not need to spend his valuable time trying to search for relevant content. And most importantly, based on A/B tests, we have seen 20-30% increases in click through rates on certain modules when we put in Recommendations.
Recommender systems can be broadly classified into two types: Content based and Collaborative filter based. Content based systems typically try to model a user’s preferences in terms of features or characteristics of the content, solely based on that user’s activities and independent of the activities of other users. A simplistic representation of my preferences, for example, would be that I like technology stories. These kinds of systems try to recommend articles similar to the ones the user has viewed in the past. They rely on the assumption that we can extract features for articles which faithfully represent the article. For stories, there are many NLP based algorithms which help us achieve this easily. However, for videos, this is still relatively difficult. So these kinds of systems tend to work well for stories, but the performance for videos is still an open question and requires substantial effort. On the other hand, Collaborative filter based approaches rely on finding users with similar interests and thus, user models are dependent on the activities of other users. These kinds of systems try recommending articles which users similar to the user in question have viewed. They do not rely on features for the article itself and thus can be easily applied to non-text media types like videos and company quotes. However, since every user’s model is dependent on every other user’s activities, these kinds of algorithms are much more computationally expensive and require massive processing power. Collaborative filters can be further classified into user based collaborative filters and item or resource based collaborative filters. For user based collaborative filters you find a set of similar users based on history and then use the activities of similar users to serve recommendations.
Item based collaborative filters can also be described as “People who viewed X also viewed Y”, as you can prominently see on sites like Amazon.com. And then, of course, you can mix and match various flavors of content based and collaborative filter based algorithms and create hybrid recommendation systems. We have various flavors of all of these kinds of Recommender Systems running live in production at the moment.
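
The “People who viewed X also viewed Y” idea can be sketched with simple co-occurrence counts. This is a minimal illustration under assumptions: raw co-view counts are used as the similarity, and the function names and the `histories` input format are made up for this example rather than taken from the production system.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """Count, for each pair of items, how many users viewed both."""
    co = defaultdict(Counter)
    for items in histories.values():
        for x, y in combinations(sorted(set(items)), 2):
            co[x][y] += 1
            co[y][x] += 1
    return co


def also_viewed(co, item, n=3):
    """Items most often viewed together with `item`, ties broken by name."""
    ranked = sorted(co.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]
```

Raw counts favour items that are popular overall; a normalised similarity is a common refinement but is left out to keep the sketch short.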
Now that we have discussed the what, where and why of Recommender Systems, let’s get to the fun part: how to build a Recommender System. At a high level, you can break it up into 6 steps:
1. Collect and/or generate metadata about stories/videos. This step is entirely independent of user interaction data.
2. Identify and track users.
3. Once we can identify our users, track their activity on the site.
4. Organize and store this activity.
5. Use some Machine Learning to generate user preference models.
6. Finally, use the models we just created to serve recommendations.
For the rest of the presentation, we will be digging deeper into each of these pieces.
Our recommender system lives on separate infrastructure from the main Bloomberg.com infrastructure. So the first step is collecting relevant metadata about our articles from the main Bloomberg.com system. This includes details like URLs, Headlines and so on. We use a combination of Sqoop and some custom scripts run from cron to gather this data. For those who haven’t heard of Sqoop, it is a tool that lets you transfer data between an RDBMS on one side and Hadoop or HBase on the other. For content based recommendation models, we need to extract features from our stories. We use the LDA implementation from Mahout to help us achieve this. For those who haven’t heard of Mahout, it’s a Machine Learning library and many of its algorithms run on top of Hadoop. However, Mahout’s LDA is designed to run on the entire batch of stories and takes a really long time to complete, whereas a news website like Bloomberg.com has new stories published every few mins and the lifetime of a story is short, so the batch LDA isn’t going to serve the purpose. So, we built extensions on top of Mahout’s LDA which allow us to evaluate new documents without going through the entire training process. For new documents this process now completes in a couple of mins instead of the hours or days required for full training.
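
The notes do not describe how the Mahout extension evaluates new documents, so the sketch below only illustrates the general "fold-in" idea: keep the topic-word distributions from the last full training run fixed and estimate just the topic mixture of the new story. It is a simplified EM loop in plain Python, not Mahout's API and not the author's implementation; `topic_word`, the smoothing constant, and the iteration count are assumptions.

```python
def infer_topics(doc_words, topic_word, iterations=50):
    """Estimate a topic mixture for one new document, keeping topic_word fixed."""
    k = len(topic_word)
    theta = [1.0 / k] * k
    # Ignore words the trained model has never seen.
    words = [w for w in doc_words if any(w in t for t in topic_word)]
    if not words:
        return theta
    for _ in range(iterations):
        counts = [0.0] * k
        for w in words:
            # Responsibility of each topic for this word (tiny floor avoids zeros).
            weights = [theta[z] * topic_word[z].get(w, 1e-9) for z in range(k)]
            total = sum(weights)
            for z in range(k):
                counts[z] += weights[z] / total
        theta = [c / len(words) for c in counts]
    return theta
```

The cost depends only on the length of the new document and the number of topics, which is why this kind of inference can run in minutes rather than requiring a full retraining pass.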
On the user side, the fundamental requirement is to identify and track users. We have two types of users: Registered and Anonymous. For registered users who are logged in, this task is easy and accurate. However, these are a very small percentage of the actual audience. How small? For every registered user on Bloomberg.com, we have more than 1,000 anonymous users. This underscores the necessity of tracking anonymous users as well. To track anonymous users, we can use cookie-based or IP-based tracking. I will not go into details at this point since this is a fairly standard problem across the industry and the trade-offs between the various solutions are well known.
Next up, we need to collect data about their actions. User interactions can be categorized into Explicit and Implicit. Explicit interactions include actions like Facebook Likes, LinkedIn Shares, Tweets on Twitter and so on. In general this is high quality data, but it is difficult to collect because of the extra step required from the user. On the other hand, just viewing a story is an Implicit interaction. The quality of this data can be slightly lower, but it is much easier to collect and you can get a lot more of it. From a Machine Learning standpoint, getting as much data as possible is crucial and thus Implicit data plays a very important role. For Bloomberg.com, we use a combination of both Implicit and Explicit data, but the Implicit data is what gives us enough information for building such a system to make sense at all.
Due to the use of CDNs and caching, tracking of user activity cannot be done at the application servers directly. We use JavaScript to get this data from the client browser. The browser tracking request hits an HTTP server which logs the data to a file and returns a dummy response. This ensures fast responses and does not hold up a client connection. More importantly, it allows us to handle high load and peak traffic periods gracefully, independent of the state of the backend. We use a multi-tier Flume architecture to transfer this data to its final resting place, which is HBase. A Flume process monitors the file which the HTTP server writes to and, as soon as new data is written, cleans it up and transfers it to HBase. Though this process happens asynchronously with respect to the client, the data reaches HBase in a matter of milliseconds. For DR and failover purposes we have multiple HBase clusters in multiple data centers. We use Flume to write out this user activity data to all of our clusters at the same time which, by the way, is really easy to set up with Flume. Flume provides a certain level of reliability guarantee which helps us avoid data loss. Flume also has plugins for Hadoop and HBase. However, the HBase plugin for Flume lacks certain features and does not handle failures gracefully, so we had to write our own, which isn’t a terribly difficult thing to do. We have written some custom decorators to parse the HTTP server’s log and store it in a usable format in HBase. We have also built some bot filtering mechanisms into our decorators.
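To make the role of the decorators concrete, here is a small sketch of the parse-and-filter step, written in Python for brevity rather than as an actual Flume plugin. Everything specific in it is an assumption: the tab-separated log layout (timestamp, user id, URL, user agent), the naive bot marker list, and the `views:` column prefix are hypothetical and are not taken from the production system.

```python
# Hypothetical and deliberately naive list of user-agent markers for bots.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent):
    """Return True if the user agent contains any known bot marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


def parse_event(line):
    """Turn one tracking log line into an HBase-style cell, or None to drop it.

    Assumed line layout: timestamp <TAB> user_id <TAB> url <TAB> user_agent
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return None  # malformed line
    timestamp, user_id, url, user_agent = parts
    if not user_id or not url or is_bot(user_agent):
        return None  # unusable or bot traffic
    # One row per user, one column per viewed URL, timestamp as the value.
    return {"row": user_id, "column": "views:" + url, "value": timestamp}
```

Dropping bad lines at this stage keeps bot traffic and malformed requests out of the user activity tables, so later training jobs never see them.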
Here are some key features of our tracking infrastructure. Bloomberg.com gets tens of thousands of page views per minute. Though we don’t track visits to all pages yet, the amount of data we track is still substantial. We already discussed that the client gets back an instantaneous response since the HTTP server logs to a file and returns a response. Flume’s reliability guarantees enable us to ensure that we don’t lose data even if the backend goes down or is unresponsive for a short period of time. We have multiple tracking servers writing to multiple HBase clusters, all spread across multiple data centers. This sounds like a complex setup, but Flume’s capabilities and proper modularization make it really simple to set up and maintain. And most importantly, all of this happens in a matter of milliseconds, which makes the system look just as live as a synchronous mechanism would.
As I mentioned earlier, we use HBase as our backend database to store all our data for this system, including user activity data, user models and article metadata. HBase provides us with the right mix of features, scalability and reliability to suit our needs for this system. Here are some important reasons why we decided to go with HBase. HBase is horizontally scalable, which allows us to store and process terabytes of data at a reasonable cost. HBase is designed for fault tolerance and automatic recovery from failures. I think this is really important when you scale horizontally because, with more machines, the probability of a server going down increases. HBase takes care of all the headaches of sharding data and automatically manages the shards as data keeps growing. HBase is schema-less and sparse. It doesn’t require you to define the entire schema beforehand and allows you to add columns on the fly, with different rows having different columns altogether. This feature greatly simplifies schema design. For example, we can have each user represented by a row and add a column every time a user views a story or a video, which provides a nice natural grouping of data. This is particularly suitable for running MR jobs efficiently on this data. HBase allows you to perform real-time queries on a vast amount of data and still provides millisecond responses. And it has a unique feature: it allows MR jobs to run efficiently on the same data used for serving real-time responses. This is probably the most important feature for a Recommender System. It takes away all the complexities of managing separate data stores for batch processing and online queries, greatly simplifying the entire app. Now by doing this, you do run the risk that your resource-intensive batch processes might affect your real-time responses, which is why many people in the community would recommend not going down this route.
However, if properly configured, this does not turn out to be a problem in real life. You just need the right level of resource isolation and the right config parameters set. The fact that Bloomberg.com can serve recommendations within 50ms even when multiple MR jobs are running in the background is testimony that HBase does support these kinds of architectures.
Here are some stats on our current usage of HBase. We currently have data for hundreds of millions of users in our HBase tables. Each of those hundreds of millions of users could have interacted with any of the million stories or videos published by Bloomberg.com. This is where the sparse nature of HBase comes in handy. This adds up to terabytes of data across multiple HBase tables. We use wide tables with the notion of 1 row per user, which greatly simplifies our app. Especially with the way MR jobs scan HBase tables, the notion of 1 row per user fits naturally into the paradigm: each call to map gets all details about a single user, which wouldn’t be possible if we went with a tall narrow table. Our Recommender System serves a high amount of traffic and is capable of handling a lot more. We do many HBase queries and a good deal of processing per request and still manage to serve recommendations within 50ms on average. And all of this happens while multiple MR jobs are running on the same HBase tables in the background, reading and writing massive amounts of data to these tables. We will get into details of this in the coming slides.
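The wide, one-row-per-user layout can be pictured with plain Python dictionaries. This is only a model of the data shape, not HBase client code; the `views:` column prefix, the ids and the helper names are assumptions made for illustration.

```python
from collections import defaultdict


def record_view(table, user_id, item_id, timestamp):
    """Add one sparse column to the user's row; other rows are unaffected."""
    table[user_id]["views:" + item_id] = timestamp


def scan(table):
    """Yield (row_key, columns) in row-key order, the way a table scan
    hands one whole row (here: one whole user) to each map call."""
    for row_key in sorted(table):
        yield row_key, dict(table[row_key])


# Each user row only stores columns for the items that user actually viewed.
table = defaultdict(dict)
record_view(table, "u1", "story-9", 1370000000)
record_view(table, "u1", "video-3", 1370000060)
record_view(table, "u2", "story-9", 1370000120)

rows = list(scan(table))
```

With a tall narrow layout (one row per user and item pair) the same scan would deliver a user's history spread over several map calls, and the job would need an extra grouping step to reassemble it.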
So, now we have all our raw user interaction data and article metadata in our HBase tables. The next step is to train user models using this as training data and store the results back into HBase. We are talking about running Machine Learning algorithms on terabytes of data for hundreds of millions of users. This involves a massive amount of IO and processing power. This is where technologies like Hadoop and HBase shine. I wouldn’t even try doing this on any traditional RDBMS based system. For a news website like Bloomberg.com, timeliness is really important. A news article which is very important at this moment may not hold any importance at all a few hours later. For this reason, we have to train our models multiple times every hour, which entails an even greater need for IO and processing power. However, how critical this requirement is depends on the algorithm, which we will discuss in a minute. A few slides back, we categorized recommendation algorithms into content based and collaborative filter based. Let’s talk about the user model training for these separately since they have slightly different requirements.
As discussed previously, content based systems typically try to model a user’s preferences in terms of features of the content. Moreover, these models are generally based solely on that user’s activities and are independent of the activities of other users. This means that each user’s model can be trained independently and only changes when that user has a new interaction, assuming the article features remain constant. This problem is very easy and natural to parallelize. We just split the users into some number of buckets and assign each bucket to a mapper. We can write the trained models back to HBase from the mapper directly, completely eliminating the need for a reducer and the sort and shuffle phase involved. Moreover, since this training happens incrementally when a user reads a new article, we can run it every few minutes: the total amount of effort will be almost the same, and running it more frequently gives us fresher models. For returning users where we have a substantial history, the models remain fairly constant and a fresh run might not give us much. However, for relatively new users this is a huge deal, since we now have a model for the user rather than none. On Bloomberg.com, we run this training every 5 minutes and train about 5000 user models on every run, which comes out to about 1000 user models a minute on average.
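A minimal sketch of why this training needs no reducer. It assumes, purely for illustration, that a user's model is the running mean of the topic vectors of the articles that user has read; the actual model used in production is not described here. The field names `profile`, `seen` and `new_views` are hypothetical, and every article id in `new_views` is assumed to have an entry in `article_topics`.

```python
def update_profile(profile, seen, article_features):
    """Fold one newly read article into a user's profile, kept as the
    running mean of the topic vectors of everything the user has read."""
    if profile is None:
        return list(article_features), 1  # first article: profile is its vector
    n = seen + 1
    merged = [p + (a - p) / n for p, a in zip(profile, article_features)]
    return merged, n


def map_user(user_row, article_topics):
    """Map-only step: the output depends on this user's row alone, so the
    result can be written straight back without a shuffle or a reducer."""
    profile = user_row.get("profile")
    seen = user_row.get("seen", 0)
    for article_id in user_row["new_views"]:
        profile, seen = update_profile(profile, seen, article_topics[article_id])
    return {"profile": profile, "seen": seen, "new_views": []}
```

Because each update only touches the articles read since the last run, running the job more often spreads the same work over more runs instead of repeating it.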
In contrast to content based approaches, collaborative filter based approaches rely on finding users with similar interests, and thus a user’s model is dependent on the activities of other users. This means that every time any user views an article, all user models are potentially outdated. This necessitates training user models for potentially all users every few minutes which, as you can imagine, is computationally very expensive. At a high level, this algorithm requires you to compare each user with every other user to find similar users, which would probably take forever to complete. Thankfully, for a news website like Bloomberg.com, even though old history is useful for building user models, only the latest data is necessary to serve recommendations, which simplifies the problem a little bit. We use a map side join mechanism to realize this self join. Again there is no reducer, which saves time on the shuffle and sort phase. The training has to happen in a batch for all users, and we train tens of millions of user models multiple times an hour.
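The talk does not spell out the similarity measure or how the join is laid out, so the following is only one possible reading of a map side self join. It assumes that the recent views of all users are small enough to be held in memory by every mapper as an item-to-users index, and it uses Jaccard similarity on recent views; both choices, and all names, are assumptions.

```python
from collections import defaultdict


def build_index(recent_views):
    """Invert user -> items into item -> users. This is the side of the
    join that every mapper would hold in memory."""
    index = defaultdict(set)
    for user, items in recent_views.items():
        for item in items:
            index[item].add(user)
    return index


def similar_users(user, recent_views, index, top_n=3):
    """Map step for one user: Jaccard similarity against every user who
    shares at least one recent item, found through the in-memory index."""
    mine = recent_views[user]
    candidates = set()
    for item in mine:
        candidates |= index[item]
    candidates.discard(user)
    scored = []
    for other in candidates:
        theirs = recent_views[other]
        scored.append((len(mine & theirs) / len(mine | theirs), other))
    # Highest similarity first; ties broken by user id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:top_n]]
```

Restricting the comparison to users who share a recent item is what avoids the all-pairs comparison: users with no overlap in the latest data are never visited.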
At this point, all the data required for recommendations is available in HBase. The final piece of the puzzle is the real-time piece, where a client requests a recommendation and the response needs to be served within milliseconds. When the application server receives the request for recommendations, it runs multiple queries against HBase to get all the data it needs, runs some Machine Learning evaluation steps on the models that were created in the background, and serves the top ranking articles based on the evaluation. Compared to the training, this is computationally very cheap and hence can be done in real time. However, there is still a decent amount of processing needed to complete this evaluation and rank the articles. We leverage in-memory caching on the application servers for speed. Our current production system is able to serve at least tens of thousands of requests per minute without the in-memory cache layer and potentially a lot more with it. Our average response times are less than 50ms for all of our recommendation algorithms.
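The evaluate-and-rank step at request time can be sketched as follows. The scoring function (a dot product between the user's profile and each article's feature vector) and all names are assumptions for illustration; the text above only states that models are evaluated and the top ranking articles are served.

```python
import heapq


def recommend(user_profile, candidates, already_seen, top_n=5):
    """Score each candidate article against the user's profile and return
    the ids of the top_n highest scoring articles the user has not seen.

    candidates: {article_id: feature_vector}, e.g. recent articles that an
    application server might keep in its in-memory cache.
    """
    scored = (
        (sum(u * a for u, a in zip(user_profile, features)), article_id)
        for article_id, features in candidates.items()
        if article_id not in already_seen
    )
    # nlargest avoids sorting the full candidate list when top_n is small.
    best = heapq.nlargest(top_n, scored)
    return [article_id for _, article_id in best]
```

For example, `recommend([0.9, 0.1], {"a": [1, 0], "b": [0, 1], "c": [0.5, 0.5]}, {"a"})` skips the already viewed article and ranks the remaining two by score.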
To summarize the talk, Recommender Systems are really important in today’s world, where the user base for most online businesses is huge and at the same time there is a need to serve each user personally. Recommender systems come in 2 flavors, content based and collaborative filter based, and the collaborative filter based algorithms can be classified into user based and item based. Building a recommender system requires cross-domain expertise, especially in the fields of Machine Learning and Big Data. Hadoop’s MapReduce framework provides a solid ground for parallel processing and helps simplify building the offline components of a Recommender System. And most importantly, HBase can be used as a massively scalable distributed hybrid data store. I call it hybrid because it can be used for online queries as well as a source and sink for offline MapReduce jobs.
If you found any of this interesting and would like to get your hands dirty on some of these systems, please get in touch with me after the presentation or email me at dshah100@bloomberg.net.
Questions?