Recommender Systems play a crucial role in a variety of businesses in today’s world. From E-Commerce web sites to News Portals, companies are leveraging data about their users to create a personalized user experience, gain competitive advantage and eventually drive revenue. Dealing with the sheer quantity of data readily available can be a daunting task by itself. Apply machine learning algorithms on top of it and the problem becomes far more complex. Fortunately, tools like Hadoop and HBase make this task a little more manageable by taking out some of the complexities of dealing with a large amount of data. In this talk, we will share our success story of building a recommender system for Bloomberg.com leveraging the Hadoop ecosystem. We will describe the high level architecture of the system and discuss the pros and cons of our design choices. Bloomberg.com operates at a scale of 100s of millions of users. Building a recommendation engine for Bloomberg.com entails applying Machine Learning algorithms on terabytes of data and still being able to serve sub-second responses. We will discuss techniques for efficiently and reliably collecting data in near real-time, the notion of offline vs. online processing and, most importantly, how HBase fits the bill by serving as a real-time database as well as input/output for running MapReduce.
Recommender System at Scale Using HBase and Hadoop
1. Dhaval Shah
R&D Software Engineer, Bloomberg L. P.
Recommender Systems at scale
using HBase and Hadoop
1Bloomberg
2. Agenda
Introduction to Recommender Systems
Types of Recommender Systems
Building a Recommender System
Summary
(Hopefully) Lots of Q&A
3. What is a Recommender System?
Wikipedia – Recommender systems are a subclass of
information filtering system that seek to predict the
‘rating’ or ‘preference’ that user would give to an item or
social element they had not yet considered, using a model
built from the characteristics of an item (content-based
approaches) or the user’s social environment
(collaborative filtering approaches).
4. Introduction to Recommender Systems
Where are Recommender Systems used?
Everywhere! (Well almost!)
E-Commerce
Web Portals
Online Radio
Streaming Movies
Media/News
11. Why do you need a Recommender System?
Too much useful information
Bloomberg.com statistics
o 500-1000 stories, 100-200 videos published per day
o Average user consumption << Articles published
o Satisfied user = Content Quality + User preference
o Double digit increases in CTR
12. Types of Recommender Systems
Content-Based
Collaborative filter based
User-based
Item-based
Hybrid
13. Building a Recommender System
Collect/Generate metadata about stories/videos
Identify and track users
Track user activity
Store user activity
Generate user models
Serve recommendations
14. Collect metadata about stories/videos
URLs, Headlines, etc.
Sqoop, Custom Scripts
Generate features for stories
LDA from Mahout
Custom extensions
15. Identify and track users
Registered
Anonymous
o Cookie based tracking
o IP based tracking
16. Types of user activity
Explicit interactions
Implicit interactions
18. Tracking : Key Features
1000s of page views per minute
Asynchronous - Instantaneous responses to client
Reliability
Multiple HTTP Servers → Multiple Clusters
Client to HBase in milliseconds
19. Why HBase?
Scalable
Fault-tolerant
Auto-sharding
Schema-less and sparse
Real-time queries
MR integration
20. Store user activity
100s of millions of users
Millions of stories/videos
TBs of data
Wide Tables – 1 row per user
High load
Sub-second response times
Multiple MR jobs every few mins
21. Generate user models using ML
100s of millions of users
High IO/Processing power
Train multiple times an hour
22. Content-based Recommender Models
User model independent of other users
Train only when user has new interaction
Easily parallelizable
No Reducer
Incremental training
Train 1000 user models a minute
23. Collaborative filter based Recommender Models
User model dependent on other users
Train all models frequently
Map side self join
No Reducer
Batch training
Train 10s of millions of user models on each batch
24. Serve recommendations
Query HBase
Evaluate articles against user models
In-memory cache
1000s of requests per minute
50ms responses
25. Summary
Recommender Systems are important
Content based and Collaborative filter based
Cross domain expertise – Big Data, Machine Learning
Hadoop/MapReduce for offline components
HBase as a hybrid data store
Hi everyone. I am Dhaval Shah. I work in the R&D Department at Bloomberg and spend the majority of my time building Recommender and Analytic systems for Bloomberg.com. For those who are not aware, Bloomberg, as a company, provides best-in-class financial data, analytics and news services. The Bloomberg Terminal is a paid product for financial professionals, providing them with world-class tools that help them stay ahead of the competition. Bloomberg.com provides some of that information free of cost to the general user on the internet. Bloomberg.com is one of the top 10 most visited news and data websites in the world and one of the most influential. For the next 30 minutes or so, I will be sharing my experience of building Recommender Systems for Bloomberg.com using the Hadoop ecosystem, mainly Hadoop, HBase and Flume.
Just a peek into today’s agenda. We will start off with a short introduction on Recommender Systems, discuss the different types of Recommender Systems and then dive right into the different pieces that together make a Recommender System functional. During this technical deep dive, we will see how the different parts of the Hadoop ecosystem simplify building a Recommender System. And finally, I hope to have lots of questions from you guys.
So what is a Recommender System? Anyone from the audience want to help me out on answering this one? Right. So here is a rather complex definition from Wikipedia. Here is how I like to put it. A user interacting with a web site is indirectly telling us something about his or her interests by virtue of what he reads, clicks or shares. We can use this data to understand the user’s interests and better serve the user. A Recommender System is just a fancy name for a system which does this.
So where are Recommender Systems used? Can someone from the audience give me some examples? Yup. The short answer is almost everywhere, from E-Commerce to Online Radio to Media.
For example, this is how Amazon uses it to entice users.
Here is how IMDB uses Recommenders.
Pandora’s business is mainly driven by intelligent use of such systems.
And this is a rather incomplete collection of websites that effectively use Recommender Systems to derive business value.
On Bloomberg.com, you can see these modules called “Recommended” pretty much everywhere, on news pages, video pages, homepage, etc.
Some of you might be wondering: why do we need a Recommender System at all? From my own experience and from talking to practitioners in the field, the single most important reason is that there is too much useful data out there for any human to make reasonable use of without an automated system. Let’s look at some stats to answer this question in the context of Bloomberg.com. Bloomberg.com publishes 500-1000 stories and 100-200 videos per day on average. The numbers vary slightly based on the news cycle. It’s obvious that no user has the time to read everything published on Bloomberg.com and the millions of other news sites on the internet. In fact, the average consumption is in single digits, which is far lower than the number of articles published. There are two main factors that influence whether a user will read an article: content quality and user preference. The editorial team at Bloomberg does an excellent job of producing high quality content. However, when you have 100s of millions of unique visitors a year on your site, no matter how good the editorial staff is, it’s not humanly possible to manually curate the website and still be relevant to the entire user base. Humans are lazy by nature and would prefer doing as little work as possible in searching or browsing to find relevant content. That’s where modeling the user preference and serving relevant content becomes extremely important to the business. The effect is not immediate, but it helps in slowly gaining customer loyalty since users do not need to spend their valuable time searching for relevant content. And most importantly, based on A/B tests, we have seen 20-30% increases in click-through rates on certain modules when we put in Recommendations.
Recommender systems can be broadly classified into two types: Content based and Collaborative filter based. Content based systems typically try to model a user’s preferences in terms of features or characteristics of the content, solely based on that user’s activities and independent of the activities of other users. A simplistic representation of my preferences, for example, would be that I like technology stories. These kinds of systems try to recommend articles similar to the ones the user has viewed in the past. They rely on the assumption that we can extract features that faithfully represent each article. For stories, there are many NLP based algorithms which help us achieve this easily. However, for videos, this is still relatively difficult. So these kinds of systems tend to work well for stories, but the performance for videos is still an open question and requires substantial effort. On the other hand, Collaborative filter based approaches rely on finding users with similar interests and thus, user models are dependent on the activities of other users. These kinds of systems try recommending articles which users similar to the user in question have viewed. They do not rely on features of the article itself and thus can be easily applied to non-text media types like videos and company quotes. However, since every user’s model is dependent on every other user’s activities, these kinds of algorithms are much more computationally expensive and require massive processing power. Collaborative filters can be further classified into user based collaborative filters and item or resource based collaborative filters. For user based collaborative filters, you find a set of similar users based on history and then use the activities of those similar users to serve recommendations.
Item based collaborative filters can also be described as “People who viewed X also viewed Y”, as you can prominently see on sites like Amazon.com. And then, of course, you can mix and match various flavors of content based and collaborative filter based algorithms and create hybrid recommendation systems. We have various flavors of all of these kinds of Recommender Systems running live in production at the moment.
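To make the “People who viewed X also viewed Y” idea concrete, here is a minimal sketch of item based co-occurrence counting. The story IDs, user IDs and function names are made up for illustration, and this is an in-memory toy rather than the production algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_view_counts(histories):
    """Count how often each pair of items appears in the same user's history."""
    counts = defaultdict(Counter)
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_viewed(counts, item, k=3):
    """Return up to k items most often co-viewed with `item`."""
    return [other for other, _ in counts[item].most_common(k)]

# Hypothetical viewing histories, one list of story IDs per user.
histories = {
    "u1": ["fed-rates", "tech-ipo", "oil"],
    "u2": ["fed-rates", "tech-ipo"],
    "u3": ["tech-ipo", "oil"],
    "u4": ["fed-rates", "tech-ipo"],
}
counts = co_view_counts(histories)
print(also_viewed(counts, "fed-rates", k=1))
```

A user based filter would instead compare whole histories between users; the item based form is usually easier to precompute because the item-to-item table can be built once and reused for every visitor.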
Now that we have discussed the what, where and why of Recommender Systems, let’s get to the fun part: how to build a Recommender System. At a high level, you can break it up into 6 steps. First, collect and/or generate metadata about stories and videos. This step is entirely independent of user interaction data. The next step is to identify and track users. Once we can identify our users, we need to track their activity on the site. The next step is to organize and store this activity. Then we use some Machine Learning to generate user preference models. And finally, we use the models we just created to serve recommendations. For the rest of the presentation, we will be digging deeper into each of these pieces.
Our recommender system lives on infrastructure separate from the main Bloomberg.com infrastructure. So the first step is collecting relevant metadata about our articles from the main Bloomberg.com system. This includes details like URLs, Headlines and so on. We use a combination of Sqoop and some custom scripts running from cron to gather this data. For those who haven’t heard of Sqoop, it is a tool that lets you transfer data between an RDBMS on one side and Hadoop or HBase on the other. For content based recommendation models, we need to extract features from our stories. We use the LDA implementation from Mahout to help us achieve this. For those who haven’t heard of Mahout, it’s a Machine Learning library, and many of its algorithms run on top of Hadoop. However, Mahout’s LDA is designed to run on the entire batch of stories and takes a really long time to complete, whereas a news website like Bloomberg.com has new stories published every few minutes and the lifetime of a story is short. Hence, the batch LDA alone isn’t going to serve the purpose. So, we built extensions on top of Mahout’s LDA which allow us to evaluate new documents without going through the entire training process. For new documents, this process now completes in a couple of minutes instead of the hours or days required for full-fledged training.
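One common way to score a new document against an already trained topic model is to hold the topic-word table fixed and estimate only the document’s topic mixture. The sketch below shows that idea with a few EM-style iterations. It is an illustration under assumptions, not Mahout’s API and not necessarily how our extension works; the two topics and their word probabilities are invented.

```python
def infer_topics(word_counts, topic_word, iterations=50):
    """Estimate a document's topic mixture with the topic-word table held fixed."""
    k = len(topic_word)
    total = sum(word_counts.values())
    theta = [1.0 / k] * k
    if total == 0:
        return theta  # empty document: fall back to a uniform mixture
    for _ in range(iterations):
        new_theta = [0.0] * k
        for word, n in word_counts.items():
            # Share of this word attributed to each topic under the current mixture.
            weights = [theta[t] * topic_word[t].get(word, 1e-9) for t in range(k)]
            norm = sum(weights)
            for t in range(k):
                new_theta[t] += n * weights[t] / norm
        theta = [v / total for v in new_theta]
    return theta

# Hypothetical trained topics: word -> probability within the topic.
topic_word = [
    {"bank": 0.5, "rates": 0.4, "chip": 0.1},  # assumed "finance" topic
    {"bank": 0.1, "rates": 0.1, "chip": 0.8},  # assumed "technology" topic
]
theta = infer_topics({"chip": 4, "rates": 1}, topic_word)
print(theta)
```

Because the expensive part (learning the topic-word table) is skipped, the cost per new story depends only on the story’s length and the number of topics.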
On the user side, the fundamental requirement is to identify and track users. We have two types of users: Registered and Anonymous. For registered users who are logged in, this task is easy and accurate. However, these are a very small percentage of the actual audience. How small? For every registered user on Bloomberg.com, we have more than 1,000 anonymous users. This underscores the necessity of tracking anonymous users as well. To track anonymous users, we can use cookie based or IP based tracking. I will not go into details at this point since this is a fairly standard problem across the industry and the trade-offs between the various solutions are well known.
Next up, we need to collect data about their actions. User interactions can be categorized into Explicit and Implicit. Explicit interactions include actions like Facebook Likes, LinkedIn Shares, Tweets on Twitter and so on. In general, this is high quality data but is difficult to collect because of that extra step required from the user. On the other hand, just viewing a story is an Implicit interaction. The quality of the data can be slightly lower, but it is much easier to collect and you can get a lot more data using this approach. From a Machine Learning standpoint, getting as much data as necessary is crucial and thus, Implicit data plays a very important role. For Bloomberg.com, we use a combination of both Implicit and Explicit data, but the Implicit data is what gives us enough information for building such a system to make sense at all.
Due to the use of CDNs and caching, tracking of user activity cannot be done at the application servers directly. We use JavaScript to get this data from the client browser. The browser tracking request hits an HTTP server which logs the data to a file and returns a dummy response. This ensures fast responses and does not hold up a client connection. More importantly, it allows us to handle high load and peak traffic periods gracefully, independent of the state of the backend. We use a multi-tier Flume architecture to transfer this data to its final resting place, which is HBase. There is a Flume process monitoring the file which the HTTP server writes to, and as soon as new data is written, it cleans it up and transfers it to HBase. Though this process happens asynchronously with respect to the client, the data reaches HBase in a matter of milliseconds. For DR and failover purposes, we have multiple HBase clusters in multiple data centers. We use Flume to write out this user activity data to all of our clusters at the same time which, by the way, is really easy to set up with Flume. Flume provides a certain level of reliability guarantee which helps us avoid data loss. Flume also has plugins for Hadoop and HBase. However, the HBase plugin for Flume lacks certain features and does not handle failures gracefully, so we had to write our own, but that isn’t a terribly difficult thing to do. We have written some custom decorators to parse the HTTP server’s log and store it in a usable format in HBase. We have also built some bot filtering mechanisms into our decorators.
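The parse-and-filter step that the decorators perform can be sketched roughly as follows. The tab-separated log layout, the field names and the user-agent heuristic are all assumptions made for this example; this is plain Python and does not use Flume’s actual decorator interface.

```python
BOT_MARKERS = ("bot", "crawler", "spider")  # assumed heuristic, not the production filter

def parse_event(line):
    """Turn one tab-separated log line into an event dict, or None if unusable."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return None  # malformed line
    timestamp, user_id, url, user_agent = parts
    if not timestamp.isdigit() or not user_id:
        return None  # missing or invalid fields
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None  # drop traffic that looks automated
    return {"ts": int(timestamp), "user": user_id, "url": url}

# Hypothetical log lines: timestamp, cookie ID, URL, user agent.
lines = [
    "1370000000\tcookie-abc\t/news/story-1\tMozilla/5.0",
    "1370000001\tcookie-xyz\t/news/story-2\tGooglebot/2.1",
    "malformed line",
]
events = [e for e in map(parse_event, lines) if e is not None]
print(events)
```

Doing the cleanup before the write keeps bad rows out of the store, so the training jobs downstream never have to filter them again.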
Here are some key features of our tracking infrastructure. Bloomberg.com gets 10s of 1000s of page views per minute. Though we don’t track visits to all pages yet, the amount of data we track is still substantial. We already discussed that the client gets back an instantaneous response since the HTTP server logs to a file and returns a response. Flume’s reliability guarantees enable us to ensure that we don’t lose data even if the backend goes down or is unresponsive for a short period of time. We have multiple tracking servers writing to multiple HBase clusters, all spread across multiple data centers. This sounds like a complex setup, but Flume’s capabilities and proper modularization make it really simple to set up and maintain. And most importantly, all of this happens in a matter of milliseconds, which makes the system look as live as a synchronous mechanism would.
As I mentioned earlier, we use HBase as our backend database to store all our data for this system, including user activity data, user models and article metadata. HBase provides us with the right mix of features, scalability and reliability to suit our needs for this system. Here are some important reasons why we decided to go with HBase. HBase is horizontally scalable, which allows us to store and process terabytes of data at a reasonable cost. HBase is designed for fault tolerance and automatic recovery from failures. I think this is really important when you scale horizontally because with more machines, the probability of a server going down increases. HBase manages all the headaches of sharding data and automatically managing the shards as data keeps growing. HBase is schema-less and sparse. It doesn’t require you to define the entire schema beforehand and allows you to add columns on the fly, with different rows having different columns altogether. This feature greatly simplifies schema design. For example, we can have each user represented by a row and add a column every time a user views a story or a video, which provides a nice natural grouping of data. This is particularly suitable for running MR jobs efficiently on this data. It allows you to perform real-time queries on the vast amount of data and still manage to provide millisecond responses. And it has a unique feature that allows MR jobs to run efficiently on the same data used for serving real-time responses. This is probably the most important feature for a Recommender System. It takes away all the complexities of managing separate data stores for batch processing and online queries, greatly simplifying the entire app. Now by doing this, you do run the risk that your resource intensive batch processes might affect your real-time responses, which is why many people in the community would recommend not going down this route.
However, if properly configured, this does not turn out to be a problem in real life. You just need the right level of resource isolation and the right config parameters set. The fact that Bloomberg.com can serve recommendations within 50ms even when multiple MR jobs are running in the background is testimony that HBase does support these kinds of architectures.
Here are some stats on our current usage of HBase. We currently have data for 100s of millions of users in our HBase tables. Each of those 100s of millions of users could have interacted with any of the millions of stories or videos published by Bloomberg.com. This is where the sparse nature of HBase comes in handy. This adds up to terabytes of data across multiple HBase tables. We use wide tables with the notion of 1 row per user, which greatly simplifies our app. Especially with the way MR jobs scan HBase tables, the notion of 1 row per user naturally fits into the paradigm, with each call to map getting all details about a single user, which wouldn’t be possible if we went with a tall narrow table. Our Recommender System serves a high amount of traffic and is capable of handling a lot more. We do many HBase queries and a good deal of processing per request and still manage to serve recommendations within 50ms on average. And all of this happens while multiple MR jobs are running on the same HBase tables in the background, reading and writing massive amounts of data to these tables. We will get into details of this in the coming slides.
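The 1 row per user layout can be pictured with a plain dictionary standing in for the table. This is only a sketch of the data shape: the `view:` column prefix, the IDs and the function names are assumptions for this example, and no HBase client calls are involved.

```python
from collections import defaultdict

# In-memory stand-in for a wide table: row key -> {column qualifier -> value}.
activity = defaultdict(dict)

def record_view(user_id, item_id, timestamp):
    """Add one column to the user's row; other rows are unaffected (sparse)."""
    activity[user_id]["view:" + item_id] = timestamp

def map_user(row_key, columns):
    """What a single map call would see: the full history of exactly one user."""
    viewed = sorted(c.split(":", 1)[1] for c in columns if c.startswith("view:"))
    return row_key, viewed

record_view("u1", "story-9", 1370000000)
record_view("u1", "video-3", 1370000060)
record_view("u2", "story-9", 1370000100)
results = dict(map_user(key, cols) for key, cols in activity.items())
print(results)
```

With a tall narrow table (one row per view), the same user’s history would be spread over many rows, and a map call would see only one interaction at a time.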
So, now we have all our raw user interaction data and article metadata in our HBase tables. The next step is to train user models using this as training data and store the results back into HBase. We are talking about running Machine Learning algorithms on terabytes of data for 100s of millions of users. This involves a massive amount of IO and processing power. This is where technologies like Hadoop and HBase shine. I wouldn’t even try doing this on any traditional RDBMS based system. For a news website like Bloomberg.com, timeliness is really important. A news article which is very important at this moment may not hold any importance at all a few hours later. For this reason, we have to train our models multiple times every hour, which entails an even greater need for IO and processing power. However, how critical this requirement is depends on the algorithm, which we will discuss in a minute. A few slides back, we categorized recommendation algorithms into content based and collaborative filter based. Let’s talk about user model training for these separately since they have slightly different requirements.
As discussed previously, content based systems typically try to model a user’s preferences in terms of features of the content. Moreover, these are generally based solely on that user’s activities and independent of the activities of other users. This means that each user’s model can be trained independently of the interactions of other users and only changes when that user has a new interaction, assuming the article features remain constant. This problem is very easy and natural to parallelize. We just split the users into some number of buckets and assign each bucket to a mapper. We can write the trained models back to HBase from the mapper directly, completely eliminating the need for a reducer and the sort and shuffle phase involved. Moreover, since this training happens incrementally when a user reads a new article, we can run this every few minutes: the total amount of effort will be almost the same, and running it more frequently will give us fresher models. For returning users where we have a substantial history, the models remain fairly constant and a fresh run might not give us much. However, for relatively new users, this is a huge deal since we now have a model for the user rather than having none. On Bloomberg.com, we run this training every 5 minutes and train about 5000 user models on every run, which comes out to about 1000 user models a minute on average.
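As a rough illustration of why this parallelizes so easily, the sketch below keeps each user’s model as a running mean of the feature vectors of the articles that user viewed. The running-mean model, the feature values and the function names are assumptions chosen for brevity; the talk does not specify the actual model.

```python
def update_model(model, count, article_features):
    """Fold one newly viewed article into a running-mean preference vector."""
    new_count = count + 1
    new_model = [(m * count + f) / new_count for m, f in zip(model, article_features)]
    return new_model, new_count

def train_bucket(users, new_views, features, dims):
    """Map-only step: update each user in the bucket from that user's own new views."""
    for user_id, item_ids in new_views.items():
        model, count = users.get(user_id, ([0.0] * dims, 0))
        for item_id in item_ids:
            model, count = update_model(model, count, features[item_id])
        users[user_id] = (model, count)  # in production this write would go to the data store
    return users

# Hypothetical two-dimensional article features (for example, topic weights).
features = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}
users = train_bucket({}, {"u1": ["s1", "s1", "s2"]}, features, dims=2)
print(users["u1"])
```

Nothing in `train_bucket` reads another user’s data, so buckets can run on different mappers with no shuffle, and users without new views are simply skipped.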
In contrast to content based approaches, collaborative filter based approaches rely on finding users with similar interests and thus, a user’s model is dependent on the activities of other users. This means that every time any user views an article, all user models are potentially outdated. This necessitates training user models for potentially all users every few minutes which, as you can imagine, is computationally very expensive. At a high level, this algorithm requires you to compare each user with every other user to find similar users, which would probably take forever to complete. Thankfully, for a news website like Bloomberg.com, even though old history is useful for building user models, only the latest data is necessary to serve recommendations, which simplifies the problem a little bit. We use a map side join mechanism to realize this self join. Again, there is no reducer, which saves time on the shuffle and sort phase. The training has to happen in a batch for all users, and we train 10s of millions of user models multiple times an hour.
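The map side self join can be sketched like this: every mapper holds the (small) table of recent activity in memory and joins its own split of users against it, so no reducer is needed. Jaccard similarity, the threshold and all the names below are assumptions for illustration, not the similarity measure used in production.

```python
def jaccard(a, b):
    """Overlap between two sets of viewed items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def map_split(split, recent, min_similarity=0.3):
    """Join each user in this split against the full in-memory `recent` table."""
    output = {}
    for user_id, viewed in split.items():
        scores = {}
        for other_id, other_viewed in recent.items():
            if other_id == user_id:
                continue
            sim = jaccard(viewed, other_viewed)
            if sim < min_similarity:
                continue
            # Items the similar user viewed that this user has not seen yet.
            for item in other_viewed - viewed:
                scores[item] = scores.get(item, 0.0) + sim
        output[user_id] = sorted(scores, key=lambda i: (-scores[i], i))
    return output

# Hypothetical recent activity: user -> set of recently viewed items.
recent = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"x", "y"},
}
model = map_split({"u1": recent["u1"]}, recent)
print(model)
```

This only works because the recent-activity table is small enough to fit in each mapper’s memory, which is exactly what restricting the join to the latest data buys.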
At this point, all the data required for recommendations is available in HBase. The final piece of the puzzle is the real-time piece, when a client requests a recommendation and the response needs to be served within milliseconds. When the application server receives the request for recommendations, it runs multiple queries against HBase to get all the data it needs, runs some Machine Learning evaluation steps on the models that were created in the background and serves the top ranking articles based on the evaluation. Compared to the training, this is computationally very cheap and hence can be done in real time. However, there is still a decent amount of processing that happens to complete this evaluation and rank the articles. We leverage in-memory caching on the application servers for speed. Our current production system is able to serve at least 10s of 1000s of requests per minute without the in-memory cache layer and potentially a lot more with it. Our average response times are less than 50ms for all of our recommendation algorithms.
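The serving step boils down to fetching a model, scoring candidates and returning the top few. Here is a minimal sketch, assuming a dot-product score and using Python’s `functools.lru_cache` as a stand-in for the in-memory cache; the feature values and names are hypothetical.

```python
from functools import lru_cache
import heapq

ARTICLE_FEATURES = {  # stand-in for a feature lookup in the data store
    "s1": (0.9, 0.1),
    "s2": (0.2, 0.8),
    "s3": (0.6, 0.4),
}

@lru_cache(maxsize=10_000)
def article_features(item_id):
    """Cached lookup; a real server would query the store on a cache miss."""
    return ARTICLE_FEATURES[item_id]

def recommend(user_model, candidates, seen, k=2):
    """Rank unseen candidates by dot product with the user's preference vector."""
    scored = (
        (sum(u * f for u, f in zip(user_model, article_features(item))), item)
        for item in candidates
        if item not in seen
    )
    return [item for _, item in heapq.nlargest(k, scored)]

top = recommend((1.0, 0.0), ["s1", "s2", "s3"], seen={"s1"})
print(top)
```

Article features change rarely compared with how often they are read, which is why caching them on the application server pays off more than caching per-user results.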
To summarize the talk, Recommender Systems are really important in today’s world, where the user base for most online businesses is huge and at the same time there is a need to serve each user personally. Recommender systems come in 2 flavors, content based and collaborative filter based, and the collaborative filter based algorithms can be classified into user based and item based. Building a recommender system requires cross-domain expertise, especially in the fields of Machine Learning and Big Data. Hadoop’s MapReduce framework provides a solid ground for parallel processing and helps simplify building the offline components of a Recommender system. And most importantly, HBase can be used as a massively scalable distributed hybrid data store. I call it hybrid because it can be used for online queries as well as a source and sink for offline MapReduce jobs.
If you found any of this interesting and would like to get your hands dirty on some of these systems, please get in touch with me after the presentation or email me at dshah100@bloomberg.net.