RESPONSE PREDICTION FOR
DISPLAY ADVERTISING
WSDM ’14

Olivier Chapelle

Response prediction for display advertising - WSDM 2014


Invited talk at WSDM 2014.
Abstract here: http://www.wsdm-conference.org/2014/practice-and-experience-talks/
Corresponding paper: http://olivier.chapelle.cc/pub/ngdstone.pdf


  1. RESPONSE PREDICTION FOR DISPLAY ADVERTISING. WSDM ’14. Olivier Chapelle
  2. OUTLINE
      1. Criteo & Display advertising
      2. Modeling
      3. Large scale learning
      4. Explore / exploit
  3. DISPLAY ADVERTISING
      [Figure: example of a display ad]
  4. DISPLAY ADVERTISING
      • Rapidly growing multi-billion dollar business (30% of internet advertising revenue in 2013).
      • Marketplace between:
        – Publishers: sell display opportunities
        – Advertisers: pay for showing their ad
      • Real Time Bidding:
        – An auction among advertisers is held at the moment a user generates a display opportunity by visiting a publisher’s web page.
  5. BRANDING VS PERFORMANCE
      PRICING TYPE
      • CPM (Cost Per Mille): advertiser pays per thousand impressions
      • CPC (Cost Per Click): advertiser pays only when the user clicks
      • CPA (Cost Per Action): advertiser pays only when the user performs a predefined action such as a purchase.
      CAMPAIGN TYPE
      • Branding: CPM
      • Performance based advertising (retargeting): CPC, CPA
      CONVERSIONS
      • eCPM = CPC * predicted clickthrough rate
      • eCPM = CPA * predicted conversion rate
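The two conversion formulas on this slide can be illustrated with a short sketch. The function names and the example prices are hypothetical. As written on the slide, price times predicted rate is the expected value of a single impression; since CPM is quoted per thousand impressions, the sketch also shows the value scaled by 1000, which is an added assumption about units rather than something the slide states.

```python
def expected_value_per_impression(price_per_event: float, predicted_rate: float) -> float:
    """Expected revenue of one impression: price of the event (CPC or CPA)
    times the predicted probability of that event (CTR or conversion rate)."""
    if not 0.0 <= predicted_rate <= 1.0:
        raise ValueError("predicted_rate must be a probability in [0, 1]")
    return price_per_event * predicted_rate


def ecpm_per_mille(price_per_event: float, predicted_rate: float) -> float:
    """Same quantity expressed per thousand impressions (assumed unit convention)."""
    return 1000.0 * expected_value_per_impression(price_per_event, predicted_rate)


if __name__ == "__main__":
    # Hypothetical numbers: the advertiser pays 0.50 per click and the
    # model predicts a clickthrough rate of 0.2%.
    print(expected_value_per_impression(0.50, 0.002))
    print(ecpm_per_mille(0.50, 0.002))
```

This is why the accuracy of the predicted rate matters: any error in it translates directly into an error in the value used to bid.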
  6. CRITEO
      • Advertisers: billed on CPC
      • Publishers: paid on CPM
      Criteo’s success depends highly on how precisely we can predict Click Through Rates (CTR) and Conversion Rates (CR)
  7. RECOMMENDATION
      • Collaborative filtering: propose related items to a user based on historical interactions from other users
      • Over 50% of Criteo driven sales come from recommended products the user had never viewed on advertiser websites
  8. CRITEO NUMBERS
      • 750 EMPLOYEES
      • 42 COUNTRIES
      • $600 MILLION 2013 REVENUE
      • 2.1 BILLION BANNERS PER DAY
      • 18 BILLION RT BIDS PER DAY
      • 250K AD CALLS PER SECOND
      • 7500 CORES CLUSTER
      • 10 PB
  9. OUTLINE
      1. Criteo & Display advertising
      2. Modeling
      3. Large scale learning
      4. Explore / exploit
  10. FEATURES
      • Three sources of features: user, ad, page
      • In this talk: categorical features on ad and page.
      • Advertiser hierarchy: Advertiser network > Advertiser > Campaign > Ad
      • Publisher hierarchy: Publisher network > Publisher > Site > Url
  11. HASHING TRICK
      • Standard representation of categorical features: “one-hot” encoding
        [Figure: one-hot vector for the site feature, with positions labeled by site (cnn.com, news.yahoo.com) and a single 1 at the observed site]
      • Dimensionality equal to the number of different values
        – can be very large
      • Hashing to reduce dimensionality (made popular by John Langford in VW)
      • Dimensionality now independent of number of values
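A minimal sketch of the hashing trick described on this slide, under stated assumptions: `hash_index` and `hashed_features` are hypothetical helper names, and CRC32 is a stand-in hash chosen only because it ships with the Python standard library (it is not the hash function used by Vowpal Wabbit). Each "name=value" string is mapped straight to an index in a fixed-size vector, so no dictionary of observed values is needed.

```python
import zlib


def hash_index(feature_name: str, value: str, num_bits: int = 24) -> int:
    """Map a categorical value to an index in [0, 2**num_bits)."""
    key = f"{feature_name}={value}".encode("utf-8")
    # zlib.crc32 returns an unsigned 32-bit integer; keep the low num_bits bits.
    return zlib.crc32(key) & ((1 << num_bits) - 1)


def hashed_features(example: dict, num_bits: int = 24) -> dict:
    """Sparse hashed representation {index: count} of one example."""
    vector = {}
    for name, value in example.items():
        index = hash_index(name, value, num_bits)
        # Two different values may collide on the same index; counts then add up.
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


if __name__ == "__main__":
    example = {"site": "cnn.com", "advertiser": "bank of america"}
    print(hashed_features(example, num_bits=10))
```

The vector size is set by `num_bits` alone, which is the sense in which the dimensionality no longer depends on the number of distinct values.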
  12. HASHING VS FEATURE SELECTION
      • “Small” problem with 35M different values.
      • Methods that require a dictionary have a larger model.
  13. QUADRATIC FEATURES
      • Outer product between two features.
      • Example: between site and advertiser, the feature is 1 when site=finance.yahoo.com & advertiser=bank of america
        [Figure: advertiser hierarchy (Advertiser network, Advertiser, Campaign, Ad) crossed with publisher hierarchy (Publisher network, Publisher, Site, Url)]
      • Similar to a polynomial kernel of degree 2
      • Large number of values: use the hashing trick
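The combination of quadratic features and hashing on this slide can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, CRC32 is a standard-library stand-in for the hash, and crossing every pair of features is a simplification (in practice one would likely cross only selected pairs such as site and advertiser).

```python
import zlib
from itertools import combinations


def quadratic_hashed_features(example: dict, num_bits: int = 24) -> dict:
    """Hashed sparse vector with one entry per feature and per pair of features."""
    mask = (1 << num_bits) - 1
    vector = {}

    def add(key: str) -> None:
        index = zlib.crc32(key.encode("utf-8")) & mask
        vector[index] = vector.get(index, 0.0) + 1.0

    items = sorted(example.items())  # fixed order so each pair has one canonical key
    # Linear terms, e.g. "site=finance.yahoo.com".
    for name, value in items:
        add(f"{name}={value}")
    # Quadratic terms: the conjunction of two values is hashed as a single new
    # feature, e.g. "advertiser=bank of america&site=finance.yahoo.com".
    for (name_a, value_a), (name_b, value_b) in combinations(items, 2):
        add(f"{name_a}={value_a}&{name_b}={value_b}")
    return vector


if __name__ == "__main__":
    example = {"site": "finance.yahoo.com", "advertiser": "bank of america"}
    print(quadratic_hashed_features(example, num_bits=12))
```

The number of possible conjunctions grows with the product of the two cardinalities, but the hashed vector keeps the same fixed size.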
  14. ADVANTAGES OF HASHING
      • Practical
        – Straightforward to implement; no need to maintain dictionaries
      • Statistical
        – Regularization (infrequent values are washed away by frequent ones)
      • Most powerful when combined with quadratic features
      Quote of John Langford about hashing: “At first it’s scary, then you love it”
  15. LEARNING
      • Regularized logistic regression
        – Vowpal Wabbit open source package
      • Regularization with hierarchical features acts as backoff smoothing: weights of frequent values are well estimated, while weights of rare values stay small
      • Negative data subsampled for computational reasons
  16. EVALUATION
      • Comparison with (Agarwal et al. ’10)
        – Probabilistic model for the same display advertising prediction problem
        – Leverages the hierarchical structures on the ad and publisher sides
        – Sparse prior for smoothing
      • Model trained on three weeks of data, tested on the 3 following days
        – auROC: +3.1%
        – auPRC: +10.0%
        – Log likelihood: +7.1%
      D. Agarwal et al., Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models, KDD, 2010
  17. BAYESIAN LOGISTIC REGRESSION
      • Regularized logistic regression = MAP solution (Gaussian prior, logistic likelihood)
      • Posterior is not Gaussian
      • Diagonal Laplace approximation [formulas not captured in the transcript]
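Because the formulas of this slide did not survive, the sketch below shows the textbook form of a diagonal Laplace approximation for regularized logistic regression, not necessarily the exact expressions on the slide. Assumptions: the posterior is approximated by a Gaussian centered at the MAP weights, and each weight's precision is the second derivative of the regularized loss with respect to that weight. The names `w_map`, `examples` and `prior_precision` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def diagonal_laplace_precision(w_map, examples, prior_precision):
    """Per-weight posterior precision q_i = prior_precision + sum_j x_ji^2 * p_j * (1 - p_j).

    w_map:     list of MAP weights
    examples:  list of sparse feature dicts {index: value}
    """
    q = [prior_precision] * len(w_map)
    for x in examples:
        p = sigmoid(sum(w_map[i] * v for i, v in x.items()))
        for i, v in x.items():
            # Second derivative of the logistic loss with respect to w_i.
            q[i] += v * v * p * (1.0 - p)
    return q


if __name__ == "__main__":
    q = diagonal_laplace_precision([0.0, 0.0], [{0: 1.0}, {0: 1.0, 1: 2.0}], 1.0)
    print(q)  # approximate posterior: w_i ~ N(w_map[i], 1 / q[i])
```

Weights that appear in many examples end up with a large precision (a narrow posterior), while rarely seen weights stay close to the prior.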
  18. MODEL UPDATE
      • Needed because ads / campaigns keep changing.
      • The posterior distribution of a previously trained model can be used as the prior for training a new model with a new batch of data
        [Figure: timeline over Day 1 to Day 5 with successive models M0, M1, M2]
      • Influence of the update frequency (auPRC):
        – 1 day: +3.7%
        – 6 hours: +5.1%
        – 2 hours: +5.8%
  19. OUTLINE
      1. Criteo & Display advertising
      2. Modeling
      3. Large scale learning
      4. Explore / exploit
  20. PARALLEL LEARNING
      • Large training set
        – 2B training samples; 16M parameters
        – 400GB (compressed)
      • Proposed method: less than one hour with 500 machines
      • Optimize: [objective formula not captured in the transcript]
      • SGD is fast on a single machine, but difficult to parallelize.
      • Batch (quasi-Newton) methods are straightforward to parallelize
        – L-BFGS with distributed gradient computation.
  21. ALLREDUCE
      • Aggregate and broadcast across nodes
        [Figure: values held by the nodes of a tree are summed up to the root, and the total (37) is broadcast back so that every node holds it]
      • Very few modifications to existing code: just insert a few AllReduce operations.
      • Compatible with Hadoop / MapReduce
        – Build a spanning tree on the gateway
        – Single MapReduce job
        – Leverage speculative execution to alleviate the slow node issue
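The aggregate-and-broadcast step can be simulated in a single process as a rough sketch. Assumptions: the `Node` class and function names are hypothetical, and the tree layout (root 9, inner nodes 1 and 8, leaves 7, 5, 3, 4) is one arrangement consistent with the numbers that survive from the slide's figure (partial sums 13 and 15, total 37). A real implementation exchanges messages between machines; here recursion stands in for the network.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    value: float
    children: list = field(default_factory=list)


def reduce_up(node: Node) -> float:
    """Aggregate phase: each node adds its own value to its children's partial sums."""
    return node.value + sum(reduce_up(child) for child in node.children)


def broadcast_down(node: Node, total: float) -> None:
    """Broadcast phase: the root's total is pushed back down to every node."""
    node.value = total
    for child in node.children:
        broadcast_down(child, total)


def allreduce(root: Node) -> float:
    total = reduce_up(root)
    broadcast_down(root, total)
    return total


def all_values(node: Node) -> list:
    """Collect the value held by every node (used here only for checking)."""
    values = [node.value]
    for child in node.children:
        values.extend(all_values(child))
    return values


if __name__ == "__main__":
    tree = Node(9, [Node(1, [Node(7), Node(5)]), Node(8, [Node(3), Node(4)])])
    print(allreduce(tree))
    print(all_values(tree))
```

In the learning setting each node's value would be its local gradient vector, so that after the call every machine holds the full gradient and can take the same L-BFGS step.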
  22. ONLINE INITIALIZATION
      • Hybrid approach:
        – One pass of online learning on each node
        – Average the weights from each node to get a warm start for batch optimization
      • Best of both (online / batch) worlds.
        [Figures: results on splice site prediction (Sonnenburg et al. ’10) and on display advertising]
      S. Sonnenburg and V. Franc, COFFIN: A Computational Framework for Linear SVMs, ICML 2010
  23. OUTLINE
      1. Criteo & Display advertising
      2. Modeling
      3. Large scale learning
      4. Explore / exploit
  24. THOMPSON SAMPLING
      • Heuristic to address the Explore / Exploit problem, dating back to Thompson (1933)
      • Simple to implement
      • Good performance in practice (Graepel et al. ’10, Chapelle and Li ’11)
      • Rarely used, maybe because of the lack of theoretical guarantees.
      • Loop:
        – Draw model parameter θ_t according to P(θ | D)
        – Select best action according to θ_t
        – Observe reward and update model
      T. Graepel et al., Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine, ICML 2010
      O. Chapelle and L. Li, An Empirical Evaluation of Thompson Sampling, NIPS 2011
  25. E/E SIMULATIONS
      • MAB with K arms. Best arm has mean reward = 0.5, others have 0.5 − ε.
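The simulation setup on this slide, combined with the Thompson sampling loop (draw a parameter from the posterior, act greedily on the draw, observe the reward, update), can be sketched for Bernoulli rewards. Assumptions: a uniform Beta(1, 1) prior per arm, a hypothetical function name, and illustrative values for K, ε and the horizon; the talk itself applies the idea to a logistic regression posterior rather than to independent arms.

```python
import random


def thompson_bernoulli(true_means, horizon, seed=0):
    """Run Thompson sampling on a Bernoulli multi-armed bandit.

    Returns the cumulative expected regret and the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # 1. Draw one plausible mean per arm from its Beta posterior.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)]
        # 2. Select the arm that is best under the sampled parameters.
        arm = max(range(k), key=lambda a: samples[a])
        # 3. Observe a Bernoulli reward and update that arm's posterior.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
        regret += best_mean - true_means[arm]
    return regret, pulls


if __name__ == "__main__":
    epsilon = 0.1                                # hypothetical gap
    means = [0.5] + [0.5 - epsilon] * 9          # K = 10 arms, as an example
    total_regret, pulls = thompson_bernoulli(means, horizon=10000, seed=0)
    print(total_regret, pulls)
```

Exploration comes for free from the randomness of the posterior draw: arms with few observations have wide posteriors and are occasionally sampled as best.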
  26. EVALUATION
      • Semi-simulated environment: real input features, but labels generated.
      • Set of eligible ads varies from 1 to 5,910. Total ads = 66,373
      • Comparison of E/E algorithms:
        – 4 days of data
        – Cold start
      • Algorithms:
        – UCB: mean + std. dev.
        – ε-greedy
        – Thompson sampling
        [Diagram: candidate inputs X are scored with the learned w; for the selected X, a label Y is generated from a random ground-truth w and used for the model update]
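The two baseline selection rules listed on this slide can be sketched as follows. Assumptions: `means` and `std_devs` stand for the model's predicted click probability and its uncertainty for each candidate ad, the function names are hypothetical, and the UCB score follows the slide's description literally (mean plus one standard deviation).

```python
import random


def select_ucb(means, std_devs):
    """UCB rule from the slide: pick the ad with the highest mean + std. dev."""
    scores = [m + s for m, s in zip(means, std_devs)]
    return max(range(len(scores)), key=lambda i: scores[i])


def select_epsilon_greedy(means, epsilon, rng):
    """With probability epsilon pick a random ad, otherwise the best predicted one."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.10, 0.05, 0.08]       # hypothetical predicted CTRs
    std_devs = [0.00, 0.20, 0.01]    # hypothetical posterior std. devs.
    print(select_ucb(means, std_devs))
    print(select_epsilon_greedy(means, epsilon=0.1, rng=rng))
```

UCB directs exploration toward ads whose prediction is uncertain, whereas ε-greedy explores uniformly at random regardless of uncertainty.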
  27. RESULTS
      • CTR regret (in percentage):
        – Thompson: 3.72
        – UCB: 4.14
        – ε-greedy: 4.98
        – Exploit-only: 5.00
        – Random: 31.95
      • Regret over time: [figure not captured in the transcript]
  28. OPEN QUESTIONS
      • Hashing
        – Theoretical performance guarantees
      • Low rank matrix factorization
        – Better prediction on unseen pairs (publisher, advertiser)
      • Sample selection bias
        – System is trained only on selected ads, but all ads are scored.
        – Possible solution: inverse propensity scoring
        – But we still need to bias the training data toward good ads.
      • Explore / exploit
        – Evaluation framework
        – Regret analysis of Thompson sampling
        – E/E with a budget; with multiple slots; with delayed feedback
  29. CONCLUSION
      • Simple yet efficient techniques for click prediction
      • Main difficulty in applied machine learning: avoid the bias (because of academic papers) toward complex systems
        – It’s easy to get lured into building a complex system
        – It’s difficult to keep it simple
      See paper for more details: Simple and scalable response prediction for display advertising, O. Chapelle, E. Manavoglu, R. Rosales, 2014
