Recommender Systems: The Art and Science of Matching Items to Users Deepak Agarwal dagarwal@yahoo-inc.com LinkedIn, 7th July, 2011
 Recommender Systems Serve the “right” item to users in an automated fashion to optimize long-term business objectives
Content Optimization: Match articles to users
   Advertising: Recommend Ads on Pages Display/Graphical Ad Contextual  Advertising
Shopping: Recommend Related Items to buy
Recommend Movies
Recommend People
Problem Definition. Inputs: item inventory (articles, web pages, ads, …), context (page, previous item viewed, …), and the user. Example applications: content, movies, advertising, shopping, … Construct an automated algorithm to select item(s) to show. Get feedback (click, time spent, rating, buy, …). Refine the parameters of the algorithm. Repeat (a large number of times). Optimize metric(s) of interest (total clicks, total revenue, …). Low marginal cost per serve: efficient and intelligent systems can provide significant improvements.
Data Mining -> Clever Algorithms. So much data: can we process it all, and process it fast? Ideally, we want to learn every user-item interaction. The number of things to learn increases with data size. The dynamic nature of the data exacerbates the problem. We want to learn things quickly in order to react fast.
Simple Approach: Segment Users/Items. Estimate the CTR of each item (or item segment) j in each user segment i: $\mathrm{CTR}_{ij} = \mathrm{clicks}_{ij} / \mathrm{views}_{ij}$. Serve the most popular item in the segment.
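The segment-level estimate above can be sketched in a few lines. This is a minimal illustration only; the event tuples, segment names, and function names are hypothetical and not from the talk.

```python
from collections import defaultdict

def segment_ctr(events):
    """Estimate CTR per (user_segment, item) from (segment, item, clicked) events."""
    clicks = defaultdict(int)
    views = defaultdict(int)
    for segment, item, clicked in events:
        views[(segment, item)] += 1
        clicks[(segment, item)] += int(clicked)
    # CTR_ij = clicks_ij / views_ij for every cell that was viewed at least once
    return {cell: clicks[cell] / views[cell] for cell in views}

def most_popular(ctr, segment):
    """Return the item with the highest estimated CTR in one user segment."""
    candidates = {item: rate for (seg, item), rate in ctr.items() if seg == segment}
    return max(candidates, key=candidates.get)

# Hypothetical toy log: (user segment, item, clicked?)
events = [
    ("young", "sports", True), ("young", "sports", False),
    ("young", "finance", False), ("young", "finance", False),
    ("old", "finance", True), ("old", "sports", False),
]
ctr = segment_ctr(events)
best_for_young = most_popular(ctr, "young")
best_for_old = most_popular(ctr, "old")
```

With so few views per cell the estimates are extremely noisy, which is the weakness the later slides address.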
Example Application: Yahoo! front page. Recommend the most popular article on slot F1 (out of 30-40, editorially programmed). Can collect data every 5 minutes. Should be simple, just count clicks and views, right? Not quite! [Figure: Today module with slots F1, F2, F3, F4; NEWS]
Simple algorithm we began with. Initialize the CTR of every new article to some high number; this ensures a new article has a chance of being shown. Show the article with the highest CTR (breaking ties at random) for each user visit in the next 5 minutes. Re-compute the global article CTRs after 5 minutes. Show the new most popular article for the next 5 minutes. Keep updating article popularity over time. Quite intuitive. Did not work! Performance was bad. Why?
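As a rough sketch of the first algorithm described above (optimistic initial CTR, cumulative counts, batch updates every interval), assuming a hypothetical class name and an arbitrary initial value of 0.5:

```python
import random

class CumulativeCtrServer:
    """Greedy server with optimistic initialization and cumulative CTR counts."""

    def __init__(self, initial_ctr=0.5, seed=0):
        self.initial_ctr = initial_ctr  # "some high number" for unseen articles
        self.clicks = {}
        self.views = {}
        self.rng = random.Random(seed)

    def add_article(self, article):
        self.clicks[article] = 0
        self.views[article] = 0

    def ctr(self, article):
        if self.views[article] == 0:
            return self.initial_ctr
        return self.clicks[article] / self.views[article]

    def pick(self):
        """Article to show during the next interval (ties broken at random)."""
        best = max(self.ctr(a) for a in self.views)
        tied = [a for a in self.views if self.ctr(a) == best]
        return self.rng.choice(tied)

    def update(self, article, clicks, views):
        """Add the counts collected during the last interval."""
        self.clicks[article] += clicks
        self.views[article] += views

server = CumulativeCtrServer()
for name in ("a", "b"):
    server.add_article(name)
server.update("a", clicks=2, views=100)  # "a" drops to 0.02
first = server.pick()                    # "b" still holds the optimistic 0.5
server.update("b", clicks=1, views=100)  # "b" drops to 0.01
second = server.pick()
```

Because the counts are cumulative and only the shown article collects data, this sketch inherits the biases the following slides diagnose.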
Bias in the data: Article CTR decays over time. This is what an article CTR curve looked like [Figure: article CTR over time]. We were computing CTR by cumulating clicks and views. Missing decay dynamics? We tried a dynamic growth model using a Kalman filter. The new model tracked decay very well, but performance was still bad. And the plot thickens, my dear Watson!
Explanation of decay: repeat exposure. User fatigue -> CTR decay
Clues to solve the mystery. Users seeing an article for the first time have a higher CTR, those exposed repeatedly have a lower one, but we use the same CTR estimate for all? Other sources of bias? How to adjust for them? A simple idea to remove bias: display articles at random to a small, randomly chosen population. Call this the random bucket. Randomization removes bias in data (Charles Peirce, 1877; R.A. Fisher, 1935). Some other observations: sticking with an article for the complete 5 minutes was degrading performance, and many bad articles got displayed too many times. Reaction time to display good articles was slower.
CTR of same article with/without randomization [Figure: CTR over time in the serving bucket vs. the random bucket; annotations: decay, time-of-day]
CTR of articles in the random bucket. We can track unbiased CTR, but it is dynamic. Simply counting clicks and views still won’t work well.
New algorithm. Create a small random bucket which selects one out of K existing articles at random for each user visit. Learn unbiased article popularity from random bucket data by tracking it (through a non-linear Kalman filter). Serve the most popular article in the serving bucket. Override rules: don’t show an article to a user after a few previous exposures; other rules (diversity, voice), …
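A minimal sketch of the two-bucket scheme follows. It deliberately swaps the non-linear Kalman filter for a much simpler tracker, exponentially discounted click and view counts, so it illustrates the data flow rather than the model used in the talk. The class name, the 5% bucket size, and the discount of 0.9 are assumptions.

```python
import random

class RandomBucketServer:
    """Serve greedily, but learn CTRs only from a small randomized bucket."""

    def __init__(self, articles, random_fraction=0.05, discount=0.9, seed=0):
        self.articles = list(articles)
        self.random_fraction = random_fraction  # share of visits in the random bucket
        self.discount = discount                # forgetting factor per interval
        self.clicks = {a: 0.0 for a in self.articles}
        self.views = {a: 0.0 for a in self.articles}
        self.rng = random.Random(seed)

    def estimate(self, article):
        views = self.views[article]
        return self.clicks[article] / views if views > 0 else 0.0

    def serve(self):
        """Return (article, bucket) for one user visit."""
        if self.rng.random() < self.random_fraction:
            return self.rng.choice(self.articles), "random"
        return max(self.articles, key=self.estimate), "serving"

    def record(self, article, bucket, clicked):
        # Only randomized traffic updates the estimates, keeping them unbiased.
        if bucket == "random":
            self.views[article] += 1
            self.clicks[article] += int(clicked)

    def end_interval(self):
        # Down-weight old data so the estimate can follow a changing CTR.
        for a in self.articles:
            self.clicks[a] *= self.discount
            self.views[a] *= self.discount

server = RandomBucketServer(["a", "b"])
server.record("a", "random", clicked=True)
server.record("b", "random", clicked=False)
server.record("a", "serving", clicked=False)  # ignored: not randomized traffic
server.end_interval()
```

The override rules (exposure caps, diversity, voice) would sit on top of `serve` and are not modeled here.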
Other advantages. The random bucket ensures a continuous flow of data for all articles; we quickly discard bad articles and converge to the best one. This saved the day, the project was a success! Initial click lift was 40% (Agarwal et al., NIPS 08); after 3 years it is 200+% (fully deployed on the Yahoo! front page and elsewhere on Yahoo!), and we are still improving the system.
More Details Agarwal, Chen, Elango, Ramakrishnan, Motgi, Roy, Zachariah. Online models for Content Optimization, NIPS 2008 Agarwal, Chen, Elango. Spatio-Temporal Models for Estimating Click-through Rate, WWW 2009
Lessons learnt. It is OK to start with simple models that learn a few things, but beware of the biases inherent in your data. Example of things gone wrong: learning article popularity from data collected 5am-8am PST and serving it 10am-1pm PST is a bad idea if the article is popular on the East Coast but not on the West Coast. Randomization is a friend, use it when you can. Update the models fast, this may reduce the bias: user visit patterns close in time are similar. What if we can’t afford complete randomization? Learn how to gamble.
Why learn how to gamble? Consider a slot machine with two arms with unknown payoff probabilities p1 > p2. The gambler has 1000 plays; what is the best way to experiment (to maximize total expected reward)? This is called the “bandit” problem and has been studied for a long time. Optimal solution: play the arm that has the maximum potential of being good.
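To make "play the arm with the maximum potential of being good" concrete, here is a small simulation using UCB1, a standard index heuristic. It is a stand-in chosen for brevity, not the Bayes optimal solution the talk refers to, and the payoff probabilities 0.7 and 0.3 are made up.

```python
import math
import random

def ucb1(payoffs, plays=1000, seed=0):
    """Simulate UCB1 on Bernoulli arms; return (plays per arm, total reward)."""
    rng = random.Random(seed)
    counts = [0] * len(payoffs)
    rewards = [0.0] * len(payoffs)
    for t in range(1, plays + 1):
        if t <= len(payoffs):
            arm = t - 1  # play every arm once to initialize its estimate
        else:
            # Index = empirical mean + bonus that shrinks as the arm is played more.
            arm = max(
                range(len(payoffs)),
                key=lambda a: rewards[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < payoffs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts, sum(rewards)

counts, total_reward = ucb1([0.7, 0.3])
```

The bonus term is what keeps a poorly explored arm in play: an arm with few plays has a wide upper bound and gets tried again until the data rules it out.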
Recommender Problems: Bandits? Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000. Greedy: show Item 2 to all; not a good idea. The Item 1 CTR estimate is noisy; the item could potentially be better. Invest in Item 1 for better overall performance on average. This is also referred to as the explore/exploit problem: exploit what is known to be good, explore what is potentially good. [Figure: probability density of CTR for Article 1 and Article 2]
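The difference in uncertainty between the two items can be quantified with a Beta posterior. The sketch below assumes a uniform Beta(1, 1) prior, which is an illustrative choice and not something stated in the talk.

```python
import math

def beta_posterior(clicks, views, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of a CTR under a Beta prior."""
    a = prior_a + clicks
    b = prior_b + (views - clicks)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)

mean1, sd1 = beta_posterior(2, 100)       # Item 1: 2 clicks in 100 views
mean2, sd2 = beta_posterior(250, 10000)   # Item 2: 250 clicks in 10000 views

# Optimistic ends of rough two-standard-deviation intervals
upper1 = mean1 + 2 * sd1
upper2 = mean2 + 2 * sd2
```

Item 1's posterior is roughly ten times wider than Item 2's, so its plausible range extends well above Item 2's CTR; that is the sense in which it "could potentially be better".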
Bayes optimal solution in next 5 mins: 2 articles, 1 uncertain. [Figure: optimal allocation to the uncertain article vs. uncertainty in CTR (pseudo #views)]
More Details on the Bayes Optimal Solution Agarwal, Chen, Elango. Explore-Exploit Schemes for Web Content Optimization, ICDM 2009  (Best Research Paper Award)
Recommender Problems: bandits in a casino Items are arms of bandits, ratings/CTRs are unknown payoffs Goal is to converge to the best CTR item quickly But this assumes one size fits all (no personalization) Personalization Each user is a separate bandit Hundreds of millions of bandits (huge casino) Rich literature (several tutorials on the topic) Broadly : Clever/adaptive randomization Our random bucket is a solution, often a good one in practice.
Back to the number of things to learn (curse of dimensionality). Pros of learning things at granular resolutions: better estimates of affinities at the event level (ad 77 has high CTR on publisher 88, instead of ad 77 has good CTR on sports publishers). Bias becomes less problematic: the more we chop, the less prone we are to aggregating dissimilar things and the less biased our estimates. Challenges: too much sparsity to learn everything at granular resolutions. We don’t have that much traffic, e.g. many ads are not even shown on many publishers. Explore/exploit helps, but we cannot do that much experimentation. In advertising, response rates (conversion, click) are too low, which further exacerbates the problem.
Solution: Go granular but with back-off. Too little data at the granular level; need to borrow from coarse resolutions with abundant data (smoothing, shrinkage). Hierarchy (node: clicks/views): 1. Pub-id=88, ad-id=77, zip=Palo Alto: 0/5; 11. Palo Alto: 2/200; 12. Pub-id=88, adv-id=9: 40/1000; 111. Bay Area: 400/10000; 121. Adv-id=9: 200/5000. $\mathrm{CTR}(1) = w_{1}\,(0/5) + w_{11}\,(2/200) + w_{12}\,(40/1000) + w_{121}\,(200/5000) + w_{111}\,(400/10000)$
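The back-off formula is a weighted combination of the empirical CTRs along the hierarchy. The slide does not give the weights, so the values below are purely hypothetical placeholders (chosen to sum to 1) used to show the arithmetic.

```python
# (clicks, views) for the granular cell and its ancestors, as on the slide
cells = {
    "1":   (0, 5),        # Pub-id=88, ad-id=77, zip=Palo Alto
    "11":  (2, 200),      # Palo Alto
    "12":  (40, 1000),    # Pub-id=88, adv-id=9
    "121": (200, 5000),   # Adv-id=9
    "111": (400, 10000),  # Bay Area
}

# Hypothetical weights, NOT from the talk; in practice they are learned.
weights = {"1": 0.05, "11": 0.25, "12": 0.30, "121": 0.20, "111": 0.20}

def backoff_ctr(cells, weights):
    """CTR(1) = sum over nodes of weight * (clicks / views)."""
    return sum(weights[node] * clicks / views for node, (clicks, views) in cells.items())

raw_ctr = cells["1"][0] / cells["1"][1]   # 0/5: the granular estimate alone
smoothed_ctr = backoff_ctr(cells, weights)
```

With these placeholder weights the raw estimate of 0 is pulled up to about 0.03, close to the rates of the better-populated ancestors.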
Sometimes there is plenty of data at the granular level: no need to back off. Hierarchy (node: clicks/views): 1. Pub-id=88, ad-id=80, zip=Arizona: 100/50000; 11. Arizona: …; 12. Pub-id=88, adv-id=8: … CTR(1) = 100/50000
How much to borrow from ancestors? Learning the weights when there is little data: depends on the heterogeneity in CTRs of the small cells. Ancestors with similar-CTR child nodes are more credible. E.g. if all zip codes in the Bay Area have similar CTRs, more weight is given to the Bay Area node. Pool similar cells, separate dissimilar ones. [Figure: Bay Area node with children Palo Alto, Mtn View, Los Gatos]
Crucial issue Obtain grouping structures to perform effective back-off BUT How do we detect such groupings when dealing with high dimensional data? Billions/trillions of possible attribute combinations Statistical modeling to the rescue Art and science, requires experience.  Important to understand the business, the problem, the data.
How do we estimate heterogeneity for a group? Simple example: CTR of an ad in different zip codes, $(s_i, t_i)$, $i = 1, \dots, K$; $\mathrm{emCTR}_i = s_i / t_i$. Is $\mathrm{Var}(\mathrm{emCTR}_i)$ a good measure of heterogeneity? Not quite: empirical estimates are not good for small $t_i$ and/or $s_i$. Use a model. The variance among true CTRs can be estimated in a better way using MLE/MOM (Agarwal & Chen, Latent OLAP, SIGMOD 2011)
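A crude moment-matching sketch illustrates why the raw variance of empirical CTRs overstates heterogeneity. It assumes $s_i$ are clicks and $t_i$ views with binomial sampling, and subtracts an approximate sampling-noise term from the raw variance. This is an illustrative simplification, not the estimator of the cited paper.

```python
import statistics

def heterogeneity_mom(successes, trials):
    """Rough method-of-moments estimate of the variance among true CTRs."""
    rates = [s / t for s, t in zip(successes, trials)]
    pooled = sum(successes) / sum(trials)
    # Approximate binomial sampling variance of each empirical rate, averaged.
    sampling_noise = statistics.mean([pooled * (1 - pooled) / t for t in trials])
    # Raw variance = true heterogeneity + sampling noise, so subtract the noise.
    return max(0.0, statistics.variance(rates) - sampling_noise)

# Small cells: the spread in empirical CTRs is explained by noise alone.
noisy = heterogeneity_mom([0, 1, 0, 2], [5, 20, 8, 40])

# Large cells: the spread (1% vs 3%) is real heterogeneity.
real = heterogeneity_mom([100, 300], [10000, 10000])
```

In the first case the estimate is truncated at zero even though the empirical rates differ, which is the situation where an ancestor node deserves a lot of weight.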
Two Examples of learning granular MODELS with back-off
Online Advertising: Matching ads to opportunities. [Diagram: a user visits a publisher's page (the opportunity); the ad network picks the best ads from those supplied by advertisers.] Ad network examples: Yahoo, Google, MSN, ad exchanges (network of “networks”), …
How to Select “Best” Ads. [Diagram: a user visits a publisher's page; advertisers supply ads and bids to the ad network; a statistical model estimates response rates (click, conversion, ad-view); an auction picks the best ads.] Select argmax f(bid, rate)
The Ad Exchange: Unified Marketplace. [Diagram: a publisher has an ad impression to sell and auctions it. Bids: $0.50; $0.60; $0.75 via a network, which becomes a $0.45 bid; $0.65, which wins. Networks shown: Ad.com, AdSense.] Transparency and value
Advertising example. f(bid, rate): rate is unknown and needs to be estimated. Goal: maximize revenue, advertiser ROI. High-dimensional rate estimation: the response is obtained through interaction among a few heavy-tailed categorical variables (pub, user, and ad). #levels: could be millions, and changes over time. [Diagram label: ad (pub, user)]
Data Features available for both opportunity and ad Publisher: Publisher content type  User: demographics, geo,… Ad: Industry, text/video, text (if any) Hierarchically organized Publisher hierarchy: URL -> Domain -> Publisher type Geo hierarchy for users Ad hierarchy: Ad -> Campaign -> Advertiser Past empirical analysis (Agarwal et al, KDD 2007) Hierarchical grouping provides homogeneity in rates Here, groupings available through domain knowledge
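Model Setup. $p_{iuj} = B(x_i, x_u, x_j)\,\lambda_{ij}$, where $B(\cdot)$ is the baseline and $\lambda_{ij}$ is the residual for cell $(i, j)$. $E_{ij} = \sum_u B(x_i, x_u, x_j)$ (expected successes). $S_{ij} \sim \mathrm{Poisson}(E_{ij}\,\lambda_{ij})$. The MLE ($S_{ij} / E_{ij}$) does not work well.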
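To see why the MLE of the residual struggles on sparse cells, compare it with a simple shrinkage estimate. The sketch assumes an independent Gamma(a, a) prior with mean 1 on each $\lambda_{ij}$, whose posterior mean under the Poisson model is $(S_{ij} + a)/(E_{ij} + a)$. This single-cell prior is a simplification for illustration; the talk smooths residuals through hierarchies instead. The prior strength of 5 is arbitrary.

```python
def mle_rate(successes, expected):
    """Maximum-likelihood residual: S / E."""
    return successes / expected

def shrunk_rate(successes, expected, prior_strength=5.0):
    """Posterior mean of lambda under a Gamma(a, a) prior and Poisson(E * lambda)."""
    return (successes + prior_strength) / (expected + prior_strength)

# Sparse cell: half an expected success, none observed.
sparse_mle = mle_rate(0, 0.5)          # collapses to 0
sparse_shrunk = shrunk_rate(0, 0.5)    # stays near the prior mean of 1

# Dense cell: plenty of data, so both estimates roughly agree.
dense_mle = mle_rate(300, 200.0)
dense_shrunk = shrunk_rate(300, 200.0)
```

A residual of exactly 0 would mean the ad can never get a response on that publisher, which is an unreasonable conclusion from half an expected event.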
Hierarchical Smoothing of Residuals. Assuming two hierarchies (publisher and advertiser). [Diagram: publisher hierarchy Pub-class -> Pub-id; advertiser hierarchy Advertiser -> Conv-id -> campaign -> Ad-id; cell (i, j) carries $(S_{ij}, E_{ij}, \lambda_{ij})$.]
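Back-off Model. [Diagram: cell (i, j) with $(S_{ij}, E_{ij}, \lambda_{ij})$ in the publisher hierarchy (Pub-class, Pub-id) and advertiser hierarchy (Advertiser, Conv-id, campaign, Ad-id); 7 neighbors: 3 blues, 4 greens.] Back-off is through parameter sharing. Blues and greens are neighbors of several reds.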
Ad exchange (RightMedia). Advertisers participate in different ways: CPM (pay per ad-view), CPC (pay per click), CPA (pay per conversion). To conduct an auction, normalize across pricing types: compute eCPM (expected CPM). Click-based eCPM = click-rate × CPC. Conversion-based eCPM = conv-rate × CPA.
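The normalization step can be sketched as follows. The bids and rates are invented, and values are kept per ad-view exactly as in the slide's formulas; multiplying every value by 1000 gives the conventional per-thousand figure without changing the ranking.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    ad: str
    pricing: str        # "CPM", "CPC" or "CPA"
    amount: float       # price per ad-view, per click, or per conversion
    rate: float = 1.0   # estimated click-rate (CPC) or conv-rate (CPA)

def expected_value(bid):
    """Expected payment per ad-view, comparable across pricing types."""
    if bid.pricing == "CPM":
        return bid.amount              # already priced per ad-view
    if bid.pricing in ("CPC", "CPA"):
        return bid.rate * bid.amount   # response rate times price per response
    raise ValueError(f"unknown pricing type: {bid.pricing}")

# Hypothetical bids; the rates would come from the response-rate models.
bids = [
    Bid("cpm_ad", "CPM", 0.002),
    Bid("cpc_ad", "CPC", 0.50, rate=0.005),
    Bid("cpa_ad", "CPA", 20.0, rate=0.00005),
]
winner = max(bids, key=expected_value)
```

Since the rates are multiplied directly into the ranking score, errors in rate estimation translate directly into mis-ranked ads, which is why the accuracy of the rate models matters.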
Data  Two kinds of conversion rates Post-Click conv-rate = click-rate*conv/click Post-View conv-rate = conv/ad-view Three response rate models Click-rate (CLICK), conv/click (PCC),  post-view conv/view (PVC)
Datasets: RightMedia. CLICK (~90B training events, ~100M parameters). Post-Click Conversion, PCC (~0.5B training events, ~81M parameters). Post-View Conversion, PVC (~7B events, ~6M parameters). The cookie gets augmented with a pixel, which triggers a conversion when the user visits the landing page. Features: age, gender, ad-size, pub-class, user fatigue. 2 hierarchies (publisher and advertiser). Two baselines: Pubid x adid [FINE] (no hierarchical information); Pubid x advertiser [COARSE] (collapse cells)
Accuracy: Average test log-likelihood
More Details Agarwal, Kota, Agrawal, Khanna: Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models, KDD 2010
Back to Yahoo! front page. Recommend articles: image; title, summary; links to other pages. For each user visit, pick 4 out of a pool of K. Routes traffic to other pages. [Figure: article slots 1, 2, 3, 4]
DATA. User i, with user features $x_i$ (demographics, browse history, search history, …), visits. The algorithm selects article j, with item features $x_j$ (keywords, content categories, ...). For the pair (i, j) we observe the response $y_{ij}$ (rating or click/no-click).
Bipartite Graph completion problem. [Figure: observed users-articles graph with click and no-click edges; predicted CTR graph between users and articles]
Factor Model to estimate CTR at granular levels. [Figure: model for user i and item j with user factors $u_i$, item factors $v_j$, a user popularity term and an item popularity term]
Estimating granular latent factors via back-off. If a user/item has high degree, good estimates of the factors are available; otherwise we need back-off. Back-off: we use user/item features through regressions. Example with three binary user features (Age=old, Geo=Mtn-View, Int=Ski): $U_{ik} = G_{1k}\,\mathbf{1}(\mathrm{Age}_i = \text{old}) + G_{2k}\,\mathbf{1}(\mathrm{Geo}_i = \text{Mtn-View}) + G_{3k}\,\mathbf{1}(\mathrm{Int}_i = \text{Ski})$. Weights of 8 different fallbacks using 3 parameters
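The "8 fallbacks from 3 parameters" remark can be checked by enumerating every combination of the three binary features. The coefficient values below are hypothetical; only the structure follows the regression on the slide.

```python
from itertools import product

# Hypothetical regression coefficients G_1k, G_2k, G_3k for one factor k
coefficients = {"age_old": 0.4, "geo_mtn_view": -0.1, "int_ski": 0.7}

def prior_factor(features, coefficients):
    """Feature-based fallback for U_ik: sum of coefficients of active indicators."""
    return sum(coefficients[name] for name, active in features.items() if active)

# Every user profile over three binary features maps to one fallback value.
fallbacks = {}
for bits in product([0, 1], repeat=len(coefficients)):
    features = dict(zip(coefficients, bits))
    fallbacks[bits] = prior_factor(features, coefficients)
```

Three coefficients therefore cover all eight feature profiles, so a brand-new user still gets a sensible starting value for the factor.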
Estimates with back-off. For a new user/article, factor estimates are based on features. For an old user/article, factor estimates are a linear combination of the regression and user “ratings”.
Estimating the back-off regression function. Maximize [equation]. The integral cannot be computed in closed form; it is approximated by Monte Carlo using Gibbs sampling.
Data Example. 2M binary observations by 30K heavy users on 4K articles. Heavy user: at least 30 visits to the portal in the last 5 months. Article features: editorially labeled category information (~50 binary features). User features: demographics, browse behavior (~1K features). Training/test split by timestamp of events (75/25). Methods: factor model with regression, no online updates; factor model with regression + online updates; online model based on user-user similarity (Online-UU); online probabilistic latent semantic index (Online-PLSI)
ROC curve. [Figure: ROC curves for the factor model with regression + online updates and the factor model with regression only]
More Details Agarwal and Chen: Regression Based Latent Factor Models, KDD 2009
Computation Both models run on Hadoop, scalable to large datasets For the factor models, also working on online EM  Collaboration with Andrew Cron, Duke University
Multi-Objectives: Beyond Clicks
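Post-click utilities. [Diagram: the recommender serves editorial content on the front page (FP); clicks lead to downstream properties (SPORTS, NEWS, OMG, FINANCE); the ad server fills premium display (guaranteed) and Network Plus (non-guaranteed) inventory.] Clicks on FP links influence the downstream supply distribution. Downstream engagement (time spent).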
Serving Content on Front Page: Click Shaping. What do we want to optimize? Usual: maximize clicks (maximize downstream supply from FP). But consider the following. Article 1: CTR = 5%, utility per click = 5. Article 2: CTR = 4.9%, utility per click = 10. By promoting Article 2, we lose 0.1 clicks per 100 visits but gain 24 utils (49 vs. 25). If we do this for a large number of visits, we lose some clicks but obtain significant gains in utility? E.g. lose 5% relative CTR, gain 20% in utility (revenue, engagement, etc.)
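Working the two-article example through per 100 visits makes the trade-off explicit. The helper name is hypothetical; the CTRs and utilities are the ones stated on the slide.

```python
def per_100_visits(ctr, utility_per_click):
    """Expected clicks and expected utility from showing one article 100 times."""
    clicks = 100 * ctr
    return clicks, clicks * utility_per_click

clicks1, utility1 = per_100_visits(0.05, 5)     # Article 1
clicks2, utility2 = per_100_visits(0.049, 10)   # Article 2

clicks_lost = clicks1 - clicks2      # cost of promoting Article 2
utility_gained = utility2 - utility1 # benefit of promoting Article 2
```

Promoting Article 2 costs about 0.1 expected clicks per 100 visits while expected utility rises from 25 to 49, so a tiny loss in clicks buys a large gain in downstream value.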
How are Clicks being Shaped? [Figure: supply distribution BEFORE and AFTER shaping] The supply distribution changes. SHAPING can happen with respect to multiple downstream metrics (like engagement, revenue, …)
Multi-Objective Optimization. n articles (A1, A2, …, An); K properties (news, finance, …, omg); m user segments (S1, S2, …, Sm). $x_{ij}$: variables; $p_{ij}$, $d_{ij}$: known.
Time duration of i on j: $d_{ij}$
Multi-Objective Program
Goal Programming
More Details Agarwal, Chen, Elango, Wang: Click Shaping to Optimize Multiple Objectives, KDD 2011 (forthcoming)
Can we do it with Advertising Revenue? Yes, but need to be careful. Interventions can cause undesirable long-term impact Communication between two complex distributed systems  Display advertising at Y! also sold as long-term guaranteed contracts We intervene to change supply when contract is at risk of under-delivering Research to be shared in the future
Summary. Simple models that learn a few parameters are fine to begin with, BUT beware of bias in data. Small amounts of randomization + fast model updates. Clever randomization using explore/exploit techniques. Granular models are more effective, but we need good statistical algorithms to provide back-off estimates. Considering multi-objective optimization is often important.
A modeling strategy. Feature engineering: content (IR, clustering, taxonomy, entity, …); user profiles (clicks, views, social, community, …). Offline (logistic, GBDT, …): coarse and slow-changing components; initializes the online component. Online: fine-resolution corrections (item, user level), quick updates. Explore/exploit (adaptive sampling).
Indexing for fast retrieval at runtime. Retrieving the top-k in a few milliseconds when the item inventory is large can be challenging for complex models. Current work (joint with Maxim Guverich): approximate the model by an index-friendly synthetic model. The index-friendly model retrieves the top-K very fast; a second-stage evaluation on the top-K retrieves the top-k (K > k). Research to be shared in a forthcoming paper.
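The two-stage idea can be sketched generically: a cheap score produces a shortlist of K items, and the expensive model re-ranks only that shortlist to return k. Both scoring functions below are toy stand-ins, and a real first stage would query an index rather than scan every item.

```python
import heapq

def two_stage_top_k(items, cheap_score, full_score, K, k):
    """Shortlist K items with a cheap score, then re-rank them with the full model."""
    if K < k:
        raise ValueError("shortlist size K must be at least k")
    shortlist = heapq.nlargest(K, items, key=cheap_score)   # stage 1: fast, approximate
    return heapq.nlargest(k, shortlist, key=full_score)     # stage 2: exact, on K items only

# Toy example: items are integers, the "full model" prefers values near 93.4,
# and the cheap score only knows which block of ten an item falls in.
items = list(range(100))
top = two_stage_top_k(
    items,
    cheap_score=lambda x: x // 10,
    full_score=lambda x: -abs(x - 93.4),
    K=20,
    k=3,
)
```

The result is only as good as the shortlist: if the cheap score misses an item the full model would have ranked highly, the second stage cannot recover it, which is why K is chosen larger than k.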

Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Recommender Systems: The Art and Science of Matching Items to Users - A LinkedIn open data talk by Deepak Agarwal from Yahoo Research!

  • 1. Recommender Systems: The Art and Science of Matching Items to Users Deepak Agarwal dagarwal@yahoo-inc.com LinkedIn, 7th July, 2011
  • 2. Recommender Systems Serve the “right” item to users in an automated fashion to optimize long-term business objectives
  • 3. Content Optimization: Match articles to users
  • 4. Advertising: Recommend Ads on Pages Display/Graphical Ad Contextual Advertising
  • 8. Problem Definition Item Inventory Articles, web page, ads, … Example applications Content, Movie, Advertising, Shopping, … Construct an automated algorithm to select item(s) to show Get feedback (click, time-spent, rating, buy, …) Refine parameters of the algorithm Repeat (large number of times) Optimize metric(s) of interest (Total clicks, Total revenue, …) Low Marginal cost per serve, Efficient and intelligent systems can provide significant improvements Context page, previous item viewed, … USER
  • 9. Data Mining -> Clever Algorithms So much data, enough to process it all and process it fast? Ideally, we want to learn every user-item interaction Number of things to learn increases with data size Dynamic nature exacerbates the problem We want to learn things quickly in order to react fast
  • 10. Simple Approach: Segment Users/Items Estimate CTR of items in each user segment Serve most popular item in segment Items/item segments i, user segments j $\mathrm{CTR}_{ij} = \mathrm{clicks}_{ij} / \mathrm{views}_{ij}$
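The segment approach on this slide can be sketched in a few lines. This is a minimal illustration, not the production system: the segment names, item names, and the `events` log format are hypothetical, and the code simply counts clicks and views per (segment, item) cell and serves the item with the highest ratio.

```python
from collections import defaultdict

def segment_ctr(events):
    """events: iterable of (segment, item, clicked) tuples (hypothetical log format).
    Returns {(segment, item): clicks / views}."""
    clicks = defaultdict(int)
    views = defaultdict(int)
    for segment, item, clicked in events:
        views[(segment, item)] += 1
        clicks[(segment, item)] += int(clicked)
    return {key: clicks[key] / views[key] for key in views}

def most_popular(ctr, segment):
    """Return the item with the highest estimated CTR in the given segment."""
    candidates = {item: rate for (seg, item), rate in ctr.items() if seg == segment}
    return max(candidates, key=candidates.get)

# Hypothetical toy log: two user segments, two items.
events = [
    ("young", "sports", True), ("young", "sports", False),
    ("young", "finance", False), ("young", "finance", False),
    ("old", "finance", True), ("old", "sports", False),
]
ctr = segment_ctr(events)
print(most_popular(ctr, "young"), most_popular(ctr, "old"))
```

With so few views per cell the ratios are very noisy, which is the weakness the later slides on bias and back-off address.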
  • 11. Example Application: Yahoo! front page Recommend most popular article on slot F1 (out of 30-40, editorially programmed) Can collect data every 5 minutes Should be simple, just count clicks and views, right? Not quite! Today module F1 F2 F3 F4 NEWS
  • 12. Simple algorithm we began with Initialize CTR of every new article to some high number This ensures a new article has a chance of being shown Show the most popular CTR article (randomly breaking ties) for each user visit in the next 5 minutes Re-compute the global article CTRs after 5 minutes Show the new most popular for next 5 minutes Keep updating article popularity over time Quite intuitive. Did not work! Performance was bad. Why?
  • 13. Bias in the data: Article CTR decays over time This is what an article CTR curve looked like We were computing CTR by cumulating clicks and views. Missing decay dynamics? Dynamic growth model using a Kalman filter. New model tracked decay very well, performance still bad And the plot thickens, my dear Watson!
  • 14. Explanation of decay: Repeat exposure User Fatigue-> CTR Decay
  • 15. Clues to solve the mystery Users seeing an article for the first time have higher CTR, those exposed repeatedly have lower CTR, but we use the same CTR estimate for all? Other sources of bias? How to adjust for them? A simple idea to remove bias Display articles at random to a small randomly chosen population Call this the Random bucket Randomization removes bias in data (Charles Peirce, 1877; R.A. Fisher, 1935) Some other observations Sticking with an article for a complete 5 minutes was degrading performance, many bad articles got displayed too many times Reaction time to display good articles was slower
  • 16. CTR of same article with/without randomization Serving bucket Random bucket Decay Time-of-Day
  • 17. CTR of articles in Random bucket Track unbiased CTR, but it is dynamic. Simply counting clicks and views still won’t work well.
  • 18. New algorithm Create a small random bucket which selects one out of K existing articles at random for each user visit Learn unbiased article popularity using random bucket data by tracking (through a non-linear Kalman filter) Serve the most popular article in the serving bucket Override rules: Don’t show an article to a user after few previous exposures, other rules (diversity, voice),….
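The random-bucket scheme on this slide can be sketched as follows. This is an illustration under stated assumptions: the talk tracks popularity with a non-linear Kalman filter, whereas this sketch swaps in a much simpler discounted click/view count; the class name, the 5% random fraction, the discount factor, and the pseudo-count prior are all hypothetical choices.

```python
import random

class RandomBucketServer:
    """Serve the most popular article, learning popularity only from a small random bucket."""

    def __init__(self, articles, random_fraction=0.05, discount=0.95, seed=0):
        self.articles = list(articles)
        self.random_fraction = random_fraction  # share of visits sent to the random bucket
        self.discount = discount                # down-weights old data each interval
        self.rng = random.Random(seed)
        self.clicks = {a: 0.0 for a in self.articles}
        self.views = {a: 0.0 for a in self.articles}

    def ctr(self, article):
        # Hypothetical weak prior: 1 pseudo-click in 20 pseudo-views.
        return (self.clicks[article] + 1.0) / (self.views[article] + 20.0)

    def choose(self):
        """Return (article, from_random_bucket) for one user visit."""
        if self.rng.random() < self.random_fraction:
            return self.rng.choice(self.articles), True
        return max(self.articles, key=self.ctr), False

    def record(self, article, clicked, from_random_bucket):
        # Only random-bucket feedback updates the estimate; serving-bucket data is biased.
        if from_random_bucket:
            self.clicks[article] += float(clicked)
            self.views[article] += 1.0

    def end_interval(self):
        # Discount old counts so the estimate can follow a CTR that changes over time.
        for a in self.articles:
            self.clicks[a] *= self.discount
            self.views[a] *= self.discount
```

A caller would invoke `choose()` on each visit, pass the outcome to `record()`, and call `end_interval()` on whatever update cadence it uses. Override rules such as exposure caps or diversity constraints would sit on top of `choose()` and are not shown.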
  • 19. Other advantages The random bucket ensures continuous flow of data for all articles, we quickly discard bad articles and converge to the best one This saved the day, the project was a success! Initial click-lift 40% (Agarwal et al., NIPS 08); after 3 years it is 200+% (fully deployed on Yahoo! front page and elsewhere on Yahoo!), and we are still improving the system
  • 20. More Details Agarwal, Chen, Elango, Ramakrishnan, Motgi, Roy, Zachariah. Online models for Content Optimization, NIPS 2008 Agarwal, Chen, Elango. Spatio-Temporal Models for Estimating Click-through Rate, WWW 2009
  • 21. Lessons learnt It is ok to start with simple models that learn a few things, but beware of the biases inherent in your data Example of things gone wrong: Learning article popularity Data used from 5am-8am PST, served from 10am-1pm PST Bad idea if article is popular on the east coast, not on the west Randomization is a friend, use it when you can. Update the models fast, this may reduce the bias User visit patterns close in time are similar What if we can’t afford complete randomization? Learn how to gamble
  • 22. Why learn how to gamble? Consider a slot machine with two arms (unknown payoff probabilities p1 > p2) The gambler has 1000 plays, what is the best way to experiment? (to maximize total expected reward) This is called the “bandit” problem, it has been studied for a long time. Optimal solution: Play the arm that has maximum potential of being good
  • 23. Recommender Problems: Bandits? Two Items: Item 1 CTR= 2/100 ; Item 2 CTR= 250/10000 Greedy: Show Item 2 to all; not a good idea Item 1 CTR estimate noisy; item could be potentially better Invest in Item 1 for better overall performance on average This is also referred to as Explore/exploit problem Exploit what is known to be good, explore what is potentially good Article 2 Article 1 Probability density CTR
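To see why Item 1 deserves some traffic, one can put a Beta posterior on each CTR and ask how often Item 1 comes out ahead. The sketch below uses the counts from this slide (2/100 and 250/10000) and assumes uniform Beta(1, 1) priors. The allocation rule it suggests is Thompson sampling, a simple heuristic swapped in here for illustration; it is not the Bayes optimal scheme discussed in the talk.

```python
import random

def prob_item1_better(clicks1, views1, clicks2, views2, draws=20000, seed=0):
    """Monte Carlo estimate of P(CTR1 > CTR2) under independent Beta posteriors.
    Assumes a uniform Beta(1, 1) prior on each CTR."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p1 = rng.betavariate(1 + clicks1, 1 + views1 - clicks1)
        p2 = rng.betavariate(1 + clicks2, 1 + views2 - clicks2)
        wins += p1 > p2
    return wins / draws

# Counts from the slide: Item 1 = 2/100, Item 2 = 250/10000.
share_for_item1 = prob_item1_better(2, 100, 250, 10000)
print(f"P(Item 1 has the higher CTR) is roughly {share_for_item1:.2f}")
```

Under Thompson sampling, this probability is also the share of visits that would go to Item 1, so the noisy item keeps receiving traffic until its posterior narrows.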
  • 24. Bayes optimal solution in next 5 mins 2 articles, 1 uncertain Optimal allocation to uncertain article Uncertainty in CTR: pseudo #views
  • 25. More Details on the Bayes Optimal Solution Agarwal, Chen, Elango. Explore-Exploit Schemes for Web Content Optimization, ICDM 2009 (Best Research Paper Award)
  • 26. Recommender Problems: bandits in a casino Items are arms of bandits, ratings/CTRs are unknown payoffs Goal is to converge to the best CTR item quickly But this assumes one size fits all (no personalization) Personalization Each user is a separate bandit Hundreds of millions of bandits (huge casino) Rich literature (several tutorials on the topic) Broadly : Clever/adaptive randomization Our random bucket is a solution, often a good one in practice.
  • 27. Back to the number of things to learn (curse of dimensionality) Pros of learning things at granular resolutions Better estimates of affinities at event level (ad 77 has high CTR on publisher 88, instead of ad 77 has good CTR on sports publisher) Bias becomes less problematic The more we chop, less prone we are to aggregating dissimilar things, less biased our estimates. Challenges Too much sparsity to learn everything at granular resolutions We don’t have that much traffic E.g. many ads are not even shown on many publishers Explore/exploit helps but cannot do so much experimentation In advertising, response rates (conversion, click) are too low, further exacerbates the problem
  • 28. Solution: Go granular but with back-off Too little data at granular level, need to borrow from coarse resolutions with abundant data (smoothing, shrinkage) Cells (clicks/views): 1. Pub-id=88, ad-id=77, zip=Palo Alto: 0/5; 11. Palo Alto: 2/200; 12. Pub-id=88, adv-id=9: 40/1000; 111. Bay Area: 400/10000; 121. Adv-id=9: 200/5000 $\mathrm{CTR}(1) = w_{1}(0/5) + w_{11}(2/200) + w_{12}(40/1000) + w_{121}(200/5000) + w_{111}(400/10000)$
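The back-off estimate on this slide is a weighted sum of the empirical CTRs of the fine cell and its ancestors. The sketch below plugs in the click/view counts from the slide; the weights are purely hypothetical placeholders (the slide does not give them, and in practice they are learned), chosen only so that they sum to one.

```python
# (clicks, views) per cell, taken from the slide.
cells = {
    "1": (0, 5),          # Pub-id=88, ad-id=77, zip=Palo Alto
    "11": (2, 200),       # Palo Alto
    "12": (40, 1000),     # Pub-id=88, adv-id=9
    "121": (200, 5000),   # Adv-id=9
    "111": (400, 10000),  # Bay Area
}

# Hypothetical weights for illustration only; they sum to 1.
weights = {"1": 0.05, "11": 0.15, "12": 0.30, "121": 0.25, "111": 0.25}

def backoff_ctr(cells, weights):
    """Weighted combination of empirical CTRs of a cell and its ancestors."""
    return sum(weights[name] * clicks / views for name, (clicks, views) in cells.items())

estimate = backoff_ctr(cells, weights)
print(f"smoothed CTR for cell 1: {estimate:.4f}")
```

The raw estimate for cell 1 is 0/5 = 0; the smoothed value borrows most of its mass from the coarser cells with far more views.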
  • 29. Sometimes too much data at granular level No need to back-off Cell 1. Pub-id=88, ad-id=80, zip=Arizona: 100/50000; ancestors 11. Arizona; 12. Pub-id=88, adv-id=8; … $\mathrm{CTR}(1) = 100/50000$
  • 30. How much to borrow from ancestors? Learning the weights when there is little data Depends on heterogeneity in CTRs of small cells Ancestors with similar CTR child nodes are more credible E.g. if all zip-codes in Bay Area have similar CTRs, more weight is given to the Bay Area node Pool similar cells, separate dissimilar ones Bay Area: Palo Alto, Mtn View, Los Gatos
  • 31. Crucial issue Obtain grouping structures to perform effective back-off BUT How do we detect such groupings when dealing with high dimensional data? Billions/trillions of possible attribute combinations Statistical modeling to the rescue Art and science, requires experience. Important to understand the business, the problem, the data.
  • 32. How do we estimate heterogeneity for a group? Simple example: CTR of an ad in different zip-codes $(s_i, t_i),\ i = 1, \dots, K$; $\mathrm{emCTR}_i = s_i / t_i$ Is $\mathrm{Var}(\mathrm{emCTR}_i)$ a good measure of heterogeneity? Not quite, empirical estimates are not good for small $t_i$ and/or $s_i$ Use a model Variance among true CTRs can be estimated in a better way using MLE/MOM (Agarwal & Chen, Latent OLAP, SIGMOD 2011)
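A rough way to see the problem with the raw variance: with few views per cell, much of the spread in empirical CTRs is binomial sampling noise rather than true heterogeneity. The sketch below subtracts an estimate of that noise, in the spirit of a method-of-moments correction. It is an assumption-laden illustration (pooled CTR used for the noise term, truncation at zero) and not the estimator from the cited paper; the zip-code counts are hypothetical.

```python
def heterogeneity(cells):
    """cells: list of (s_i, t_i) = (clicks, views) per zip-code.
    Returns (naive variance of empirical CTRs, noise-corrected variance)."""
    rates = [s / t for s, t in cells]
    k = len(cells)
    mean_rate = sum(rates) / k
    naive = sum((r - mean_rate) ** 2 for r in rates) / (k - 1)
    # Approximate binomial sampling variance of each s_i / t_i using the pooled CTR.
    pooled = sum(s for s, _ in cells) / sum(t for _, t in cells)
    sampling = sum(pooled * (1 - pooled) / t for _, t in cells) / k
    return naive, max(0.0, naive - sampling)

# Hypothetical zip-codes with only 10 views each.
naive_var, corrected_var = heterogeneity([(0, 10), (1, 10), (0, 10), (2, 10)])
print(naive_var, corrected_var)
```

Here most of the apparent variance is attributed to sampling noise, so the corrected figure is much smaller than the naive one.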
  • 33. Two Examples of learning granular MODELS with back-off
  • 34. Online Advertising: Matching ads to opportunities Pick best ads Ads Advertisers Ad Network Page User Examples: Yahoo, Google, MSN, Ad exchanges (network of “networks”) … Opportunity Publisher
  • 35. How to Select “Best” ads Pick best ads Ads Ad Network Page User Publisher Response rates (click, conversion, ad-view) Bids conversion Auction Statistical model Select argmax f(bid,rate) Click Advertisers
  • 36. The Ad Exchange: Unified Marketplace Bids $0.75 via Network… Bids $0.50 Bids $0.60 Ad.com AdSense Bids $0.65—WINS! Has ad impression to sell -- AUCTIONS … which becomes $0.45 bid Transparency and value
  • 37. Advertising example f(bid, rate) ---- rate is unknown, needs to be estimated Goal: maximize revenue, advertiser ROI High dimensional rate estimation Response obtained through interaction among few heavy-tailed categorical variables (pub, user, and ad) #levels : could be millions and changes over time ad ( pub, user)
  • 38. Data Features available for both opportunity and ad Publisher: Publisher content type User: demographics, geo,… Ad: Industry, text/video, text (if any) Hierarchically organized Publisher hierarchy: URL -> Domain -> Publisher type Geo hierarchy for users Ad hierarchy: Ad -> Campaign -> Advertiser Past empirical analysis (Agarwal et al, KDD 2007) Hierarchical grouping provides homogeneity in rates Here, groupings available through domain knowledge
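  • 39. Model Setup $p_{iuj} = B(x_i, x_u, x_j)\,\lambda_{ij}$ (baseline $B$ times residual $\lambda_{ij}$, for publisher i, user u, ad j) $E_{ij} = \sum_u B(x_i, x_u, x_j)$ (Expected Success) $S_{ij} \sim \mathrm{Poisson}(E_{ij}\lambda_{ij})$ MLE ($S_{ij}/E_{ij}$) does not work well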
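A small numeric sketch shows why the MLE of the residual, observed successes divided by expected successes, is unstable for sparse cells. For contrast it uses a plain Gamma-Poisson posterior mean with a prior centred at 1. That shrinkage rule is a stand-in chosen for illustration, not the hierarchical model from the talk, and `prior_strength` is a hypothetical tuning value.

```python
def mle_residual(successes, expected):
    """Maximum-likelihood residual for S ~ Poisson(E * lambda): S / E."""
    return successes / expected

def smoothed_residual(successes, expected, prior_strength=10.0):
    """Posterior mean of lambda under a Gamma(prior_strength, prior_strength) prior
    (prior mean 1, i.e. 'trust the baseline' when data is scarce)."""
    return (successes + prior_strength) / (expected + prior_strength)

# Sparse cell: one lucky success swings the MLE from 0 to 5.
print(mle_residual(0, 0.2), mle_residual(1, 0.2))
print(smoothed_residual(0, 0.2), smoothed_residual(1, 0.2))
# Dense cell: both estimates agree closely.
print(mle_residual(500, 250), smoothed_residual(500, 250))
```

The smoothed estimate stays near the baseline for sparse cells and converges to the MLE as the expected count grows.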
  • 40. Hierarchical Smoothing of residuals Assuming two hierarchies (Publisher and advertiser) Advertiser Pub-class Conv-id campaign Pub-id cell (i,j) Ad-id (Sij, Eij, λij)
  • 41. Back-off Model 7 neighbors 3 blues, 4 greens Advertiser campaign Pub-class Conv-id Pub-id Ad-id i j (Sij, Eij, λij) Back-off is through parameter sharing Blues and greens are neighbors of several reds
  • 42. Ad- exchange (RightMedia) Advertisers participate in different ways CPM (pay by ad-view) CPC (pay per click) CPA (pay per conversion) To conduct an auction, normalize across pricing types Compute eCPM (expected CPM) Click-based eCPM= click-rate*CPC Conversion-based eCPM= conv-rate*CPA
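The normalization on this slide can be sketched as a single function that converts each bid into an expected value per ad-view, so that CPM, CPC, and CPA bids compete in one auction. The bids and rates below are hypothetical, and the sketch assumes CPM bids are quoted per thousand ad-views, so they are divided by 1000 to match the per-view scale of rate × bid.

```python
def expected_value_per_view(bid_type, bid, click_rate=0.0, conv_rate=0.0):
    """Expected payment for one ad-view, by pricing type."""
    if bid_type == "CPM":
        return bid / 1000.0      # assumption: CPM quoted per 1000 ad-views
    if bid_type == "CPC":
        return click_rate * bid  # click-based: click-rate * CPC
    if bid_type == "CPA":
        return conv_rate * bid   # conversion-based: conv-rate * CPA
    raise ValueError(f"unknown bid type: {bid_type}")

# Hypothetical bids and estimated response rates.
bids = {
    "A": expected_value_per_view("CPM", 2.00),
    "B": expected_value_per_view("CPC", 0.50, click_rate=0.005),
    "C": expected_value_per_view("CPA", 20.00, conv_rate=0.0001),
}
winner = max(bids, key=bids.get)
print(winner, bids[winner])
```

The ranking is only as good as the estimated click and conversion rates, which is why rate estimation is the central modeling problem here.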
  • 43. Data Two kinds of conversion rates Post-Click conv-rate = click-rate*conv/click Post-View conv-rate = conv/ad-view Three response rate models Click-rate (CLICK), conv/click (PCC), post-view conv/view (PVC)
  • 44. Datasets: Right-Media CLICK (~90B training events, ~100M parameters) PCC, Post-Click Conversion (~0.5B training events, ~81M parameters) PVC, Post-View Conversion (~7B events, ~6M parameters) Cookie gets augmented with pixel, trigger conversion when user visits the landing page Features Age, gender, ad-size, pub-class, user fatigue 2 hierarchies (publisher and advertiser) Two baselines Pubid x adid [FINE] (no hierarchical information) Pubid x advertiser [COARSE] (collapse cells)
  • 45. Accuracy: Average test log-likelihood
  • 46. More Details Agarwal, Kota, Agrawal, Khanna: Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models, KDD 2010
  • 47. Back to Yahoo! front page Recommend articles: Image Title, summary Links to other pages For each user visit, Pick 4 out of a pool of K Routes traffic to other pages 2 3 4 1
  • 48. DATA User i with user features $x_i$ (demographics, browse history, search history, …) visits Article j with item features $x_j$ (keywords, content categories, ...) Algorithm selects (i, j): response $y_{ij}$ (rating or click/no-click)
  • 49. Bipartite Graph completion problem Observed Graph no-click Articles Articles Users Predicted CTR Graph Users click
  • 50. Factor Model to estimate CTR at granular levels User i with factors $u_i$, item j with factors $v_j$, plus user popularity and item popularity terms
  • 51. Estimating granular latent factors via back-off If user/item have high degree, good estimates of factors available else we need back-off Back-off: We use user/item features through regressions Example features: Age=old, Geo=Mtn-View, Int=Ski $u_{ik} = G_{1k}\,1(\mathrm{Age}_i = \mathrm{old}) + G_{2k}\,1(\mathrm{Geo}_i = \text{Mtn-View}) + G_{3k}\,1(\mathrm{Int}_i = \mathrm{Ski})$ Weights of 8 different fallbacks using 3 parameters
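The remark about 8 fallbacks from 3 parameters can be made concrete: three binary user features give 2^3 = 8 feature combinations, and the regression assigns each a prior factor value using only three weights per factor dimension. The weights in `G` below are hypothetical, as is the choice of two latent dimensions.

```python
from itertools import product

FEATURES = ("age_old", "geo_mtn_view", "int_ski")

# Hypothetical regression weights G[feature][k] for two latent dimensions k.
G = {
    "age_old": (0.4, -0.1),
    "geo_mtn_view": (0.2, 0.3),
    "int_ski": (-0.3, 0.5),
}

def prior_factor(active_features):
    """Feature-based fallback for a user's latent factor: sum of weights of active features."""
    dims = len(next(iter(G.values())))
    return tuple(sum(G[f][k] for f in active_features) for k in range(dims))

# Enumerate all 8 on/off combinations of the 3 features.
fallbacks = {}
for flags in product((0, 1), repeat=len(FEATURES)):
    active = tuple(f for f, on in zip(FEATURES, flags) if on)
    fallbacks[active] = prior_factor(active)

for active, factor in fallbacks.items():
    print(active, factor)
```

A new user with no history starts at the fallback for their feature combination; a heavy user's factor moves away from it as ratings accumulate.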
  • 52. Estimates with back-off For new user/article, factor estimates based on features For old user/article, factor estimates Linear combination of regression and user “ratings”
  • 53. Estimating the back-off Regression function Maximize Integral cannot be computed in closed form, approximated by Monte Carlo using Gibbs Sampling
  • 54. Data Example 2M binary observations by 30K heavy users on 4K articles Heavy user ---- at least 30 visits to the portal in last 5 months Article features Editorially labeled category information (~50 binary features) User features Demographics, browse behavior (~1K features) Training/test split by timestamp of events (75/25) Methods Factor model with regression, no online updates Factor model with regression + online updates Online model based on user-user similarity (Online-UU) Online probabilistic latent semantic index (Online-PLSI)
  • 55. ROC curve Factor model: regression + online updates Factor model: regression only
  • 56. More Details Agarwal and Chen: Regression Based Latent Factor Models, KDD 2009
  • 57. Computation Both models run on Hadoop, scalable to large datasets For the factor models, also working on online EM Collaboration with Andrew Cron, Duke University
  • 59. Post-click utilities Recommender EDITORIAL AD SERVER PREMIUM DISPLAY (GUARANTEED) NETWORK PLUS (Non-Guaranteed) Clicks on FP links influence downstream supply distribution content SPORTS NEWS Downstream engagement (Time spent) OMG FINANCE
  • 60. Serving Content on Front Page: Click Shaping What do we want to optimize? Usual: Maximize clicks (maximize downstream supply from FP) But consider the following Article 1: CTR=5%, utility per click = 5 Article 2: CTR=4.9%, utility per click=10 By promoting 2, we lose 1 click/100 visits, gain 5 utils If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? E.g. lose 5% relative CTR, gain 20% in utility (revenue, engagement, etc)
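The trade-off can be worked through numerically. The sketch below takes the CTRs and per-click utilities stated for the two articles and computes expected clicks and utility per 100 visits; the helper name is hypothetical and the per-100-visit figures are derived purely from those four inputs.

```python
def per_100_visits(ctr, utility_per_click):
    """Expected clicks and expected utility over 100 visits."""
    clicks = 100 * ctr
    return clicks, clicks * utility_per_click

clicks_1, utility_1 = per_100_visits(0.05, 5)    # Article 1
clicks_2, utility_2 = per_100_visits(0.049, 10)  # Article 2

clicks_lost = clicks_1 - clicks_2
utility_gained = utility_2 - utility_1
print(f"clicks lost per 100 visits: {clicks_lost:.2f}")
print(f"utility gained per 100 visits: {utility_gained:.2f}")
```

A small sacrifice in clicks can buy a much larger gain in downstream utility, which is the motivation for optimizing more than one objective.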
  • 61. How are Clicks being Shaped ? AFTER BEFORE Supply distribution Changes SHAPING can happen with respect to multiple downstream metrics (like engagement, revenue,…)
  • 63. Time duration of i on j: $d_{ij}$
  • 66. More Details Agarwal, Chen, Elango, Wang: Click Shaping to Optimize Multiple Objectives, KDD 2011 (forthcoming)
  • 67. Can we do it with Advertising Revenue? Yes, but need to be careful. Interventions can cause undesirable long-term impact Communication between two complex distributed systems Display advertising at Y! also sold as long-term guaranteed contracts We intervene to change supply when contract is at risk of under-delivering Research to be shared in the future
  • 68. Summary Simple models that learn a few parameters are fine to begin with BUT beware of bias in data Small amounts of randomization + fast model updates Clever Randomization using Explore/Exploit techniques Granular models are more effective but we need good statistical algorithms to provide back-off estimates Considering multi-objective optimization is often important
  • 69. A modeling strategy Feature Engineering Content: IR, clustering, taxonomy, entity,.. User profiles: clicks, views, social, community,.. Online (Fine resolution Corrections) (item, user level) (Quick updates) Initialize Offline(Logistic, GBDT,..) Coarse and slow changing components Explore/Exploit (Adaptive sampling)
  • 70. Indexing for fast retrieval at runtime Retrieving the top-k in a few milliseconds when the item inventory is large could be challenging for complex models Current work (joint with Maxim Gurevich) Approximate the model by an index-friendly synthetic model The index-friendly model retrieves the top-K very fast, a second-stage evaluation on the top-K retrieves the top-k (K > k) Research to be shared in a forthcoming paper
  • 71. Collaborators Bee-Chung Chen (Yahoo! Research, CA) Pradheep Elango (Yahoo! Labs, CA) Liang Zhang (Yahoo! Labs, CA) Nagaraj Kota (Yahoo! Labs, India) Xuanhui Wang (Yahoo! Labs, CA) Rajiv Khanna (Yahoo! Labs, India) Andrew Cron (Duke University) Engineering & Product Teams (CA, India) Special thanks to Yahoo! Labs senior leadership for the support Andrei Broder, Preston McAfee, Prabhakar Raghavan, Raghu Ramakrishnan
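The two-stage idea can be sketched generically: a cheap score shortlists K candidates, then the full model re-ranks only those K to return the top-k. The scoring functions and item features below are hypothetical, and a real system would answer the first stage from an index rather than by scanning every item as this toy does.

```python
import heapq

def two_stage_top_k(items, cheap_score, full_score, K, k):
    """Shortlist K items with a cheap score, then re-rank them with the full model."""
    shortlist = heapq.nlargest(K, items, key=cheap_score)
    return heapq.nlargest(k, shortlist, key=full_score)

# Hypothetical inventory: item -> (index-friendly feature, expensive feature).
inventory = {"a": (0.9, 0.1), "b": (0.8, 0.9), "c": (0.7, 0.8), "d": (0.1, 1.0)}

def cheap(item):
    return inventory[item][0]

def full(item):
    return inventory[item][0] * inventory[item][1]

top = two_stage_top_k(list(inventory), cheap, full, K=3, k=2)
print(top)
```

The result is only as good as the shortlist: an item the cheap score ranks outside the top-K can never be returned, so K has to be large enough for the approximation error of the synthetic model.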
  • 72. E-mail: dagarwal@yahoo-inc.com Thank you !

Editor's notes

  1. Focus on Today module. Publishes trendy, eclectic articles on a broad range of topics including sports, finance, entertainment etc. For each visit, select 4 to display from an inventory of K. Hundreds of millions of visits/day, ~600M visitors per month.