Sequential Selection of Correlated Ads by
               POMDPs

           Shuai Yuan, Jun Wang

             University College London


              October 29, 2012
Motivations and contributions
Motivations,
  • help publishers gain more profit by displaying ads;
  • go further than offline, content-based matching of
       webpages and ads;
Contributions,
  • a framework of ad selection for revenue optimisation;
  • formulating the sequential selection problem as a partially
       observable Markov decision process (POMDP) and providing
       exact and approximate solutions;
  • a public keyword-bid-ad-webpage dataset for reproducible
       research (http://www.computational-advertising.org).
Related works
Contextual advertising,
   • A semantic approach to contextual advertising [Broder 2007]
   • Impedance coupling in content-targeted advertising [Ribeiro 2005]
   • Contextual advertising by combining relevance with click feedback [Chakrabarti
      2008]
Inventory management (contracts),
   • Targeted advertising on the Web with inventory management [Chickering 2003]
   • Revenue management for online advertising: Impatient advertisers
      [Fridgeirsdottir 2007]
   • Dynamic revenue management for online display advertising [Roels 2009]
Optimal pricing model,
   • Pricing of Online Advertising: Cost-Per-Click-Through Vs. Cost-Per-Action [Hu
      2010]
   • Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
   • Online advertising: Pay-per-view versus pay-per-click A comment [Fjell 2009]
   • Single period balancing of pay-per-click and pay-per-view online display
      advertisements [Kwon 2011]
Related works (cont.)
Ad scheduling,
   • Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
   • Scheduling of dynamic in-game advertising [Turner 2011]
Multi-armed bandits,
   • Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
   • Multi-armed bandit problems with dependent arms [Pandey 2007]
POMDPs,
   • A survey of POMDP applications [Cassandra 1998]
   • Monte Carlo POMDPs [Thrun 2000]
   • Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
Problem statement - setup
Figure: 1 webpage, 1 ad slot, M impressions at each time step.
The payoff of the ads follows $X \sim \mathcal{N}(\mu, \sigma_0^2 I)$, where $\mu$ is generated by $\mu \sim \mathcal{N}(\theta, \Sigma)$.
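The two-level payoff model above can be simulated with a short sketch. It uses only the Python standard library; the function names, the example values of θ, Σ and σ0², and the use of a Cholesky factor to draw correlated means are illustrative assumptions, not part of the original slides.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L L^T = cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_mean_payoffs(theta, cov, rng):
    """Draw mu ~ N(theta, cov) as mu = theta + L z, with z standard normal."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    return [theta[i] + sum(L[i][k] * z[k] for k in range(i + 1))
            for i in range(len(theta))]


def sample_payoffs(mu, noise_var, rng):
    """Draw X ~ N(mu, noise_var * I): independent noise around each ad's mean."""
    sd = math.sqrt(noise_var)
    return [rng.gauss(m, sd) for m in mu]


# Hypothetical example: two positively correlated ads.
rng = random.Random(0)
theta = [1.0, 2.0]
cov = [[1.0, 0.8], [0.8, 1.0]]
mu = sample_mean_payoffs(theta, cov, rng)
x = sample_payoffs(mu, noise_var=0.25, rng=rng)
```

The hidden means µ are drawn once, while a fresh payoff vector X would be drawn at every time step.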
Problem statement - graphical model

[Influence diagram: for t = 1, …, T, belief nodes (θ(t), Σ(t)) with remaining horizon T − t, selection actions s(t), hidden mean payoffs µ(t) generated from (θ, Σ), and observed payoffs x(t) with noise variance σ0².]

Figure: The payoff model illustrated by an influence diagram
representation with generative processes of a finite horizon POMDP.
s(t) is the selection action. (θ(t), Σ(t)) is the belief at some stage.
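Problem statement - objective function
To maximise the expected cumulative payoff over time,
\[
\begin{aligned}
\pi^* &= \arg\max_{\pi} \mathbb{E}\left[R^{\pi}(T)\right]
       = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=1}^{T} X_{s(t)}(t)\right]
       = \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}\left[X_{s(t)}(t)\right] \\
      &= \arg\max_{\pi} \sum_{t=1}^{T} \int_{x} x_{s(t)}(t)\, p\left(x_{s(t)}(t) \mid \Psi(t)\right) dx
       = \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t)
\end{aligned}
\tag{1}
\]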


where,
  • s(t) is the selection decision;
  • Ψ(t) is the available information;
  • π is a selection policy and π ∗ is the optimal one;
  • the factor “M impressions” is dropped from the objective function.
Belief update



Figure: Updating the belief on ads’ performance over time.
Belief update - the selected ad
We update the belief using Bayes’ theorem.
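\[
p\left(x_1 \mid x_1(t), \Psi(t)\right)
  = \int p\left(x_1 \mid x_1(t), \Psi(t), \mu_1\right)\, p\left(\mu_1 \mid x_1(t), \Psi(t)\right) d\mu_1
\tag{2}
\]
By “completing the square”,
\[
p\left(\mu_1 \mid x_1(t), \Psi(t)\right)
  \propto p\left(x_1(t) \mid \mu_1, \Psi(t)\right)\, p\left(\mu_1 \mid \Psi(t)\right)
  \propto \exp\left( -\frac{\left(x_1(t) - \mu_1\right)^2}{2\sigma_0^2} - \frac{\left(\mu_1 - \theta_1(t)\right)^2}{2\sigma_1^2(t)} \right)
\tag{3}
\]
we obtain the new belief,
\[
\mu_1 \mid x_1(t) \sim \mathcal{N}\left(\theta_1(t+1),\ \sigma_1^2(t+1)\right)
\tag{4}
\]
\[
\theta_1(t+1) = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2},
\qquad
\sigma_1^2(t+1) = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2}
\tag{5}
\]
We write $\theta_i(t)$ and $\sigma_i^2(t)$ as shorthand for $\theta_i \mid \Psi(t)$ and $\sigma_i^2 \mid \Psi(t)$.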
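As a minimal sketch of Equation (5), the update for the selected ad can be written as a small function. The function and argument names are illustrative assumptions; `var` stands for σ1²(t) and `noise_var` for σ0².

```python
def update_selected(theta, var, x, noise_var):
    """Posterior mean and variance of the selected ad's mean payoff, Equation (5).

    theta, var : prior belief N(theta, var) on the ad's mean payoff
    x          : payoff observed after showing the ad
    noise_var  : observation noise variance (sigma_0^2)
    """
    denom = var + noise_var
    new_theta = (var * x + noise_var * theta) / denom
    new_var = var * noise_var / denom
    return new_theta, new_var


# With equal prior and noise variance the new mean is halfway between
# the prior mean and the observation, and the variance is halved.
theta1, var1 = update_selected(theta=0.0, var=1.0, x=2.0, noise_var=1.0)
print(theta1, var1)  # 1.0 0.5
```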
Belief update - the correlated ad
We also update the belief of non-selected ads,

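\[
p\left(x_2 \mid x_1(t), \Psi(t)\right)
  = \int p\left(x_2 \mid \mu_2, x_1(t), \Psi(t)\right)\, p\left(\mu_2 \mid x_1(t), \Psi(t)\right) d\mu_2
\tag{6}
\]
With the linear Gaussian property,
\[
\mu_1 \mid \mu_2 \sim \mathcal{N}\left(\theta_{1 \mid \mu_2},\ \sigma^2_{1 \mid \mu_2}\right)
\tag{7}
\]
\[
\theta_{1 \mid \mu_2} = \theta_1 + \frac{\sigma_{1,2}}{\sigma_2^2}\left(\mu_2 - \theta_2\right),
\qquad
\sigma^2_{1 \mid \mu_2} = \sigma_1^2 - \frac{\sigma_{1,2}^2}{\sigma_2^2}
\tag{8}
\]
we obtain the new belief on a correlated ad,
\[
\mu_2 \mid x_1(t) \sim \mathcal{N}\left(\theta_2(t+1),\ \sigma_2^2(t+1)\right)
\tag{9}
\]
\[
\theta_2(t+1) = \theta_2(t) + \sigma_{1,2}\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2},
\qquad
\sigma_2^2(t+1) = \sigma_2^2(t) - \frac{\sigma_{1,2}^2}{\sigma_1^2(t) + \sigma_0^2}
\tag{10}
\]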
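The correlated update of Equation (10) can be sketched for any number of ads by keeping the full covariance matrix Σ as a list of lists. The function name is hypothetical, and the update of the off-diagonal covariances is an assumption taken from standard Gaussian conditioning; the slides only state the mean and variance updates.

```python
def update_belief(theta, cov, selected, x, noise_var):
    """Condition the belief N(theta, cov) on payoff x observed for ad `selected`.

    For j != selected the mean and the diagonal entry follow Equation (10).
    """
    n = len(theta)
    s = selected
    denom = cov[s][s] + noise_var
    # gain[j] = sigma_{s,j} / (sigma_s^2 + sigma_0^2)
    gain = [cov[j][s] / denom for j in range(n)]
    new_theta = [theta[j] + gain[j] * (x - theta[s]) for j in range(n)]
    new_cov = [[cov[j][k] - gain[j] * cov[s][k] for k in range(n)]
               for j in range(n)]
    return new_theta, new_cov


# Hypothetical example: ad 0 is shown and pays 2.0; ad 1 is correlated with it.
theta, cov = update_belief(
    theta=[0.0, 0.0],
    cov=[[1.0, 0.5], [0.5, 1.0]],
    selected=0,
    x=2.0,
    noise_var=1.0,
)
print(theta)      # [1.0, 0.5]
print(cov[1][1])  # 0.875
```

The non-selected ad moves half as far as the selected one here because its covariance with the selected ad is half the selected ad's variance.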
Belief update - expected payoff
We also obtain the expected payoff of the selected ad,
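\[
X_1 \mid x_1(t), \Psi(t) \sim \mathcal{N}\left(\theta_1(t+1),\ \sigma_0^2 + \sigma_1^2(t+1)\right)
\tag{11}
\]
and the expected payoff of the correlated ad,
\[
X_2 \mid x_1(t), \Psi(t) \sim \mathcal{N}\left(\theta_2(t+1),\ \sigma_0^2 + \sigma_2^2(t+1)\right)
\tag{12}
\]
The final objective function is,
\[
\pi^* = \arg\max_{\pi} \sum_{t=1}^{T} \theta_{s(t)}(t) \quad \text{subject to}
\tag{13}
\]
\[
\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma^2_{s(t)}(t) + \sigma_0^2}
\tag{14}
\]
\[
\sigma^2_{s(t+1)}(t+1) = \sigma^2_{s(t+1)}(t) - \frac{\sigma^2_{s(t),s(t+1)}}{\sigma^2_{s(t)}(t) + \sigma_0^2}
\tag{15}
\]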
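As a consistency check (a worked step added here, not taken from the slides), the constraints (14) and (15) reduce to the selected-ad update (5) when the same ad is selected again. Taking s(t+1) = s(t) = 1, the covariance of the ad with itself is its variance, σ_{1,1} = σ_1²(t):

\[
\theta_1(t+1) = \theta_1(t) + \sigma_1^2(t)\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}
             = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}
\]
\[
\sigma_1^2(t+1) = \sigma_1^2(t) - \frac{\sigma_1^4(t)}{\sigma_1^2(t) + \sigma_0^2}
               = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2}
\]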
POMDP formulation and solution
[Diagram with components labelled belief state, action, observation & reward, and hidden state.]

Figure: The POMDP model for the revenue optimisation problem.
(θ(t), Σ(t)) is the belief at some stage; x(t) is the observation and reward;
s(t) is the action; (θ, Σ) is the hidden state. There is no state transition.
Value iteration and MAB approximation
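The selection at each step can be expressed through the value function as,
\[
s(t) = \arg\max_{s(t) \in N} V_{s(t)}\left(\Psi(t)\right)
     = \arg\max_{i \in N} \Big( \underbrace{\mathbb{E}(x_i)}_{\text{the expected immediate reward}}
       + \underbrace{\xi\left(\Psi(t), i\right)}_{\text{the expected future reward}} \Big)
\tag{16}
\]
The exact solution uses value iteration [Bellman 1957]:
\[
V^*(\theta, \Sigma, T) = \max_{s(1) \in N} \mathbb{E}\left[ X_{s(1)}(1) + V^*\left(\theta \mid X_{s(1)}(1),\ \Sigma \mid X_{s(1)}(1),\ T - 1\right) \right]
\tag{17}
\]
The approximation is based on the multi-armed bandit index UCB1-NORMAL [Auer 2002]:
\[
\xi_{\text{UCB1-NORMAL}} = \sqrt{ 16 \cdot \frac{q_i - t_i\, \theta_i^2(t)}{t_i - 1} \cdot \frac{\ln(t - 1)}{t_i} }
\tag{18}
\]
where t_i is the number of times ad i has been selected and q_i is the sum of its squared observed payoffs.

   [Bellman 1957] R. E. Bellman (1957) “Dynamic Programming”
   [Auer 2002] Auer, P. et al. (2002) “Finite-time analysis of the multi-armed bandit problem”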
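A minimal sketch of the exploration term in Equation (18), following the UCB1-NORMAL index of Auer et al. (2002) with its square root and natural logarithm. The function and argument names are illustrative; clamping the variance estimate at zero is an added assumption to keep the square root defined when a belief mean, rather than a sample mean, is plugged in.

```python
import math


def ucb1_normal_bonus(sum_sq, count, mean, t):
    """Exploration bonus sqrt(16 * (q_i - t_i * mean^2) / (t_i - 1) * ln(t - 1) / t_i).

    sum_sq : q_i, sum of squared payoffs observed for ad i
    count  : t_i, number of times ad i has been selected (needs t_i >= 2)
    mean   : current estimate of the ad's mean payoff
    t      : current time step (needs t >= 2 so that ln(t - 1) >= 0)
    """
    if count < 2 or t < 2:
        raise ValueError("need count >= 2 and t >= 2")
    variance_estimate = max(0.0, (sum_sq - count * mean * mean) / (count - 1))
    return math.sqrt(16.0 * variance_estimate * math.log(t - 1) / count)


# Hypothetical numbers: the bonus shrinks as an ad is selected more often.
few_plays = ucb1_normal_bonus(sum_sq=10.0, count=4, mean=1.0, t=10)
many_plays = ucb1_normal_bonus(sum_sq=250.0, count=100, mean=1.0, t=200)
```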
Value iteration with Monte Carlo sampling [Thrun 2000]
We use sampling to reduce the computational complexity,
function VALUEFUNC(θ, Σ, t)
    array V ← 0                          ▷ Expected reward vector.
    loop i ← 1 to N
        V[i] ← θi(t)                     ▷ Expected immediate reward.
        if t < T then
            for all s in SAMPLE(θ, Σ) do
                [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, s, i)
                                         ▷ New belief after selecting i and observing s,
                                           Equations (14) and (15).
                V[i] ← V[i] + (1/M0) · VALUEFUNC(θ′, Σ′, t + 1)
            end for
        end if
    end loop
    return [MAX(V), MAXINDEX(V)]
end function

   [Thrun 2000] Thrun, S. (2000) “Monte Carlo POMDPs”
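The pseudocode above can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the authors' implementation: `n_samples` plays the role of the normalising constant in the averaging step, observations are sampled from the predictive distribution N(θi, σi² + σ0²), the recursive call contributes its maximum value, and the covariance update uses standard Gaussian conditioning. All function and parameter names are hypothetical.

```python
import math
import random


def update_belief(theta, cov, i, x, noise_var):
    """Belief after selecting ad i and observing payoff x."""
    n = len(theta)
    denom = cov[i][i] + noise_var
    gain = [cov[j][i] / denom for j in range(n)]
    new_theta = [theta[j] + gain[j] * (x - theta[i]) for j in range(n)]
    new_cov = [[cov[j][k] - gain[j] * cov[i][k] for k in range(n)]
               for j in range(n)]
    return new_theta, new_cov


def value_func(theta, cov, t, horizon, noise_var, n_samples, rng):
    """Return (best value, best ad index) at step t of a horizon-step problem."""
    values = []
    for i in range(len(theta)):
        value = theta[i]  # expected immediate reward
        if t < horizon:
            predictive_sd = math.sqrt(cov[i][i] + noise_var)
            for _ in range(n_samples):
                x = rng.gauss(theta[i], predictive_sd)  # sampled observation
                new_theta, new_cov = update_belief(theta, cov, i, x, noise_var)
                future, _ = value_func(new_theta, new_cov, t + 1, horizon,
                                       noise_var, n_samples, rng)
                value += future / n_samples  # Monte Carlo average of future value
        values.append(value)
    best = max(range(len(values)), key=lambda i: values[i])
    return values[best], best


# Hypothetical two-ad problem with a short horizon; the cost grows as
# (number of ads * n_samples) ** (horizon - t), so keep both small.
rng = random.Random(1)
value, ad = value_func(theta=[1.0, 2.0], cov=[[1.0, 0.5], [0.5, 1.0]],
                       t=1, horizon=3, noise_var=1.0, n_samples=4, rng=rng)
```

At the final step (t equal to the horizon) the function reduces to the myopic choice of the ad with the highest belief mean.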
Multi-armed bandit based approximation
(cont.)
The UCB1-NORMAL-COR algorithm:
function PLAN(θ, Σ, Ψ(t))
    array V ← 0
    loop i ← 1 to N
        if ti < 8 log t then             ▷ ti is the number of times ad i has been selected.
            return i
        end if
    end loop
    [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, Ψ(t))  ▷ New belief of all ads with all available
                                           information, Equations (14) and (15).
    loop i ← 1 to N
        V[i] ← θ′i + sqrt(16 · (qi − ti θ′i²)/(ti − 1) · ln(t − 1)/ti)
                                         ▷ Expected reward.
    end loop
    return [MAX(V), MAXINDEX(V)]
end function
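A hedged sketch of the selection step of UCB1-NORMAL-COR. It assumes the belief means passed in have already been updated with all available information (the UPDATEBELIEF step), returns only the chosen index, and adds three safeguards that are not on the slide: a minimum of two plays per ad so that ti − 1 is positive, a clamp of the variance estimate at zero, and a guard on the logarithm for very small t. All names are illustrative.

```python
import math


def plan_ucb1_normal_cor(theta, counts, sum_sq, t):
    """Pick the next ad.

    theta  : belief means after the correlated update
    counts : t_i, number of times each ad has been selected
    sum_sq : q_i, sum of squared payoffs observed for each ad
    t      : current time step (t >= 1)
    """
    n = len(theta)
    # Forced exploration: any ad selected fewer than 8 log t times goes first.
    threshold = max(2.0, 8.0 * math.log(t))
    for i in range(n):
        if counts[i] < threshold:
            return i
    scores = []
    for i in range(n):
        variance = max(0.0, (sum_sq[i] - counts[i] * theta[i] ** 2) / (counts[i] - 1))
        bonus = math.sqrt(16.0 * variance * math.log(max(t - 1, 1)) / counts[i])
        scores.append(theta[i] + bonus)
    return max(range(n), key=lambda i: scores[i])


# Hypothetical state: ad 0 has never been shown, so it is explored first.
first = plan_ucb1_normal_cor(theta=[1.0, 1.5], counts=[0, 5],
                             sum_sq=[0.0, 12.0], t=5)
# Both ads well explored with equal variance estimates: the higher mean wins.
later = plan_ucb1_normal_cor(theta=[1.0, 1.5], counts=[100, 100],
                             sum_sq=[200.0, 325.0], t=201)
```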
Experiment datasets
[Diagram: advertisers, an ad network/exchange (Google AdWords and its Traffic Estimator service) and publishers, with payments flowing between them.]

 • publishers gain 68% of advertisers’ spending (2003);
 • data was collected from 12/2011 to 05/2012;
 • 512 different keywords, 310 with non-zero mean payoff, 8
   categories;
 • 20% for training and 80% for testing;
 • we consider each keyword to be an ad.
Competing algorithms
We compare the following algorithms,
  • RANDOM policy, which selects candidates uniformly at random;
  • MYOPIC policy, which selects by the expected immediate reward;
  • UCB1 policy, which assumes independence between arms
    and is model-free with respect to the reward distribution;
  • UCB1-NORMAL policy, which assumes independence
    between arms and Gaussian-distributed rewards;
  • VI-COR policy, which solves value iteration using Monte
    Carlo sampling; and
  • UCB1-NORMAL-COR policy, which considers the
    dependencies between candidates.
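The two simplest baselines in the list can be sketched in a few lines. The function names and the example belief means are illustrative assumptions.

```python
import random


def random_policy(n_ads, rng):
    """RANDOM: pick an ad uniformly at random."""
    return rng.randrange(n_ads)


def myopic_policy(theta):
    """MYOPIC: pick the ad with the highest expected immediate reward."""
    return max(range(len(theta)), key=lambda i: theta[i])


rng = random.Random(0)
theta = [0.2, 0.9, 0.5]  # hypothetical belief means for three ads
random_choice = random_policy(len(theta), rng)
myopic_choice = myopic_policy(theta)  # index 1, the largest mean
```

MYOPIC ignores the value of information: it never selects an ad only to learn about it, which is the gap the value iteration and bandit policies are meant to close.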
Results
 Datasets       MYOPIC     RANDOM      UCB1     UCB1-N      VI-COR     UCB1-N-COR
 Education      21.9       23.0        30.9     30.9        41.2*      27.6
 Finance-1      38.5       27.8        40.9     26.4        44.5       27.4
 Finance-2      22.1       16.5        30.6     22.8        38.0*      22.9
 Information    14.1       12.9        27.8     15.9        29.4       15.9
 P&O            41.6       30.4        50.5     31.4        72.9*      63.3
 Shopping-1     17.4       10.6        42.3     16.1        40.2       16.4
 Shopping-2     29.9       14.5        34.3     75.3        52.9       79.2*
 Shopping-3     9.7        4.3         21.9     18.3        27.3       19.4
 P&S            24.7       26.0        47.2     57.1        67.9*      59.9
 Medical        30.5       19.6        52.7     32.2        58.0*      33.5

Table: The cumulative payoffs are averaged over 8 chunks and then normalized w.r.t. the
GOLDEN policy for a better representation. An asterisk (∗) marks the highest cumulative
payoff when its difference from the second best is significant under the Wilcoxon
signed-rank test. P&O is “People & organisations” and P&S is “Products & services”.
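One plausible reading of the normalisation described in the caption, as a sketch: average each policy's cumulative payoff over the chunks, then divide by the GOLDEN policy's average. Expressing the result as a percentage and averaging GOLDEN over the same chunks are assumptions, and the numbers below are made up for illustration.

```python
from statistics import mean


def normalized_payoff(policy_chunks, golden_chunks):
    """Average over chunks, then express as a percentage of the GOLDEN average."""
    return 100.0 * mean(policy_chunks) / mean(golden_chunks)


# Made-up cumulative payoffs on two chunks (the slides use 8 chunks).
score = normalized_payoff(policy_chunks=[10, 30], golden_chunks=[40, 60])
print(score)  # 40.0
```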
Results (cont.)

[Plot: cumulative payoff over time for VI-COR, UCB1-Normal-COR, UCB1-Normal, UCB1, Golden, Myopic and Random.]

Figure: Cumulative payoff on the “People & organisations” category, 5
candidates.
Results (cont.)
[Bar chart: normalized cumulative payoff (0 to 1) of Myopic, VI-Cor, UCB1-Normal and UCB1-Normal-Cor on Edu, F-1, F-2, Info, P&O, S-1, S-2, S-3, P&S and Med.]

Figure: Comparison of accumulated payoffs on the 10 datasets.
VI-COR always performed better than MYOPIC and UCB1-NORMAL-COR
always performed better than UCB1-NORMAL across all datasets.
Results (cont.)
[Plot: daily payoff (0 to 5000) against day (0 to 150) for the keywords “best phones” and “term insurance”.]

Figure: Special case: the daily payoff of two candidates with a
sudden change.
Results (cont.)
[Plot: cumulative payoff (×10^4) against the noise factor σ0² (log scale, 10^-2 to 10^4) for Golden, Myopic, VI-COR and UCB1-Normal-COR.]

Figure: The impact of the noise factor σ0² for the situation in the
previous figure.
\[
\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma^2_{s(t)}(t) + \sigma_0^2}
\]
Future works
 • correlated update: if ad a1 on webpage w1 was shown to
   user u1 and we observed its performance, what is the belief
   about the performance of ad a2 on webpage w2 when shown to
   user u2, given known correlations?
 • multiple ads with diversification (another exploration and
   exploitation dilemma);
 • better solution for our continuous POMDP problem.

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlabkrishna_093
 
Gentlest Introduction to Tensorflow
Gentlest Introduction to TensorflowGentlest Introduction to Tensorflow
Gentlest Introduction to TensorflowKhor SoonHin
 
Gentlest Introduction to Tensorflow - Part 3
Gentlest Introduction to Tensorflow - Part 3Gentlest Introduction to Tensorflow - Part 3
Gentlest Introduction to Tensorflow - Part 3Khor SoonHin
 
รายงานคอม
รายงานคอมรายงานคอม
รายงานคอมAreeya Onnom
 
Numerical solution of spatiotemporal models from ecology
Numerical solution of spatiotemporal models from ecologyNumerical solution of spatiotemporal models from ecology
Numerical solution of spatiotemporal models from ecologyKyrre Wahl Kongsgård
 
TensorFlow Tutorial
TensorFlow TutorialTensorFlow Tutorial
TensorFlow TutorialNamHyuk Ahn
 
TensorFlow in Practice
TensorFlow in PracticeTensorFlow in Practice
TensorFlow in Practiceindico data
 
รายงานคอม
รายงานคอมรายงานคอม
รายงานคอมAreeya Onnom
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithmsguestfee8698
 
Explanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expertExplanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expert홍배 김
 
Distilling Free-Form Natural Laws from Experimental Data
Distilling Free-Form Natural Laws from Experimental DataDistilling Free-Form Natural Laws from Experimental Data
Distilling Free-Form Natural Laws from Experimental Dataswissnex San Francisco
 
Amth250 octave matlab some solutions (1)
Amth250 octave matlab some solutions (1)Amth250 octave matlab some solutions (1)
Amth250 octave matlab some solutions (1)asghar123456
 
provenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorialprovenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorialPaolo Missier
 
Stochastic Differential Equations: Application to Pension Funds under Adverse...
Stochastic Differential Equations: Application to Pension Funds under Adverse...Stochastic Differential Equations: Application to Pension Funds under Adverse...
Stochastic Differential Equations: Application to Pension Funds under Adverse...Marius García Meza
 
Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputszukun
 

Was ist angesagt? (19)

Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
Gentlest Introduction to Tensorflow
Gentlest Introduction to TensorflowGentlest Introduction to Tensorflow
Gentlest Introduction to Tensorflow
 
Gentlest Introduction to Tensorflow - Part 3
Gentlest Introduction to Tensorflow - Part 3Gentlest Introduction to Tensorflow - Part 3
Gentlest Introduction to Tensorflow - Part 3
 
รายงานคอม
รายงานคอมรายงานคอม
รายงานคอม
 
Numerical solution of spatiotemporal models from ecology
Numerical solution of spatiotemporal models from ecologyNumerical solution of spatiotemporal models from ecology
Numerical solution of spatiotemporal models from ecology
 
TensorFlow Tutorial
TensorFlow TutorialTensorFlow Tutorial
TensorFlow Tutorial
 
TensorFlow in Practice
TensorFlow in PracticeTensorFlow in Practice
TensorFlow in Practice
 
My sql cheat sheet
My sql cheat sheetMy sql cheat sheet
My sql cheat sheet
 
รายงานคอม
รายงานคอมรายงานคอม
รายงานคอม
 
Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
 
Explanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expertExplanation on Tensorflow example -Deep mnist for expert
Explanation on Tensorflow example -Deep mnist for expert
 
TensorFlow
TensorFlowTensorFlow
TensorFlow
 
Distilling Free-Form Natural Laws from Experimental Data
Distilling Free-Form Natural Laws from Experimental DataDistilling Free-Form Natural Laws from Experimental Data
Distilling Free-Form Natural Laws from Experimental Data
 
Amth250 octave matlab some solutions (1)
Amth250 octave matlab some solutions (1)Amth250 octave matlab some solutions (1)
Amth250 octave matlab some solutions (1)
 
provenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorialprovenance of lists - TAPP'11 Mini-tutorial
provenance of lists - TAPP'11 Mini-tutorial
 
Stochastic Differential Equations: Application to Pension Funds under Adverse...
Stochastic Differential Equations: Application to Pension Funds under Adverse...Stochastic Differential Equations: Application to Pension Funds under Adverse...
Stochastic Differential Equations: Application to Pension Funds under Adverse...
 
Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputs
 
About RNN
About RNNAbout RNN
About RNN
 
About RNN
About RNNAbout RNN
About RNN
 

Andere mochten auch

CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...
CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...
CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...Shuai Yuan
 
Dsp and the prediction
Dsp and the predictionDsp and the prediction
Dsp and the predictionSoohan Ahn
 
RTBMA ECIR 2016 tutorial
RTBMA ECIR 2016 tutorialRTBMA ECIR 2016 tutorial
RTBMA ECIR 2016 tutorialShuai Yuan
 
ブラックボックスなアドテクを機械学習で推理してみた Short ver
ブラックボックスなアドテクを機械学習で推理してみた Short verブラックボックスなアドテクを機械学習で推理してみた Short ver
ブラックボックスなアドテクを機械学習で推理してみた Short ver尚行 坂井
 
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LTあなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LTHiroaki Kudo
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論Taiji Suzuki
 
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のことHiroaki Kudo
 
機械学習でデジタル広告を変える! @デブサミ 2015autumn
機械学習でデジタル広告を変える! @デブサミ 2015autumn機械学習でデジタル広告を変える! @デブサミ 2015autumn
機械学習でデジタル広告を変える! @デブサミ 2015autumnKei Tateno
 
アドテクにおける機械学習技術 @Tokyo Data Night #tokyodn
アドテクにおける機械学習技術 @Tokyo Data Night #tokyodnアドテクにおける機械学習技術 @Tokyo Data Night #tokyodn
アドテクにおける機械学習技術 @Tokyo Data Night #tokyodnKei Tateno
 

Andere mochten auch (10)

CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...
CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...
CIKM 2013 Tutorial: Real-time Bidding: A New Frontier of Computational Advert...
 
Dsp and the prediction
Dsp and the predictionDsp and the prediction
Dsp and the prediction
 
RTBMA ECIR 2016 tutorial
RTBMA ECIR 2016 tutorialRTBMA ECIR 2016 tutorial
RTBMA ECIR 2016 tutorial
 
ブラックボックスなアドテクを機械学習で推理してみた Short ver
ブラックボックスなアドテクを機械学習で推理してみた Short verブラックボックスなアドテクを機械学習で推理してみた Short ver
ブラックボックスなアドテクを機械学習で推理してみた Short ver
 
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LTあなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論
 
「YDNの広告のCTRをオンライン学習で予測してみた」#yjdsw4
「YDNの広告のCTRをオンライン学習で予測してみた」#yjdsw4「YDNの広告のCTRをオンライン学習で予測してみた」#yjdsw4
「YDNの広告のCTRをオンライン学習で予測してみた」#yjdsw4
 
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
 
機械学習でデジタル広告を変える! @デブサミ 2015autumn
機械学習でデジタル広告を変える! @デブサミ 2015autumn機械学習でデジタル広告を変える! @デブサミ 2015autumn
機械学習でデジタル広告を変える! @デブサミ 2015autumn
 
アドテクにおける機械学習技術 @Tokyo Data Night #tokyodn
アドテクにおける機械学習技術 @Tokyo Data Night #tokyodnアドテクにおける機械学習技術 @Tokyo Data Night #tokyodn
アドテクにおける機械学習技術 @Tokyo Data Night #tokyodn
 

Ähnlich wie Sequential Selection of Correlated Ads by POMDPs

Sequential Selection of Correlated Ads by POMDPs

  • 1. Sequential Selection of Correlated Ads by POMDPs Shuai Yuan, Jun Wang University College London October 29, 2012
  • 2. Motivations and contributions
    Motivations,
      • help publishers gain more profit by displaying ads;
      • go further than offline, content-based matching of webpages and ads;
    Contributions,
      • a framework of ad selection for revenue optimisation;
      • formulating the sequential selection problem as a partially observable Markov decision process (POMDP) and providing exact and approximate solutions;
      • a public keyword-bid-ad-webpage dataset for reproducible research¹.
    ¹ http://www.computational-advertising.org
  • 3. Related works
    Contextual advertising,
      • A semantic approach to contextual advertising [Broder 2007]
      • Impedance coupling in content-targeted advertising [Ribeiro 2005]
      • Contextual advertising by combining relevance with click feedback [Chakrabarti 2008]
    Inventory management (contracts),
      • Targeted advertising on the Web with inventory management [Chickering 2003]
      • Revenue management for online advertising: Impatient advertisers [Fridgeirsdottir 2007]
      • Dynamic revenue management for online display advertising [Roels 2009]
    Optimal pricing model,
      • Pricing of Online Advertising: Cost-Per-Click-Through Vs. Cost-Per-Action [Hu 2010]
      • Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
      • Online advertising: Pay-per-view versus pay-per-click A comment [Fjell 2009]
      • Single period balancing of pay-per-click and pay-per-view online display advertisements [Kwon 2011]
  • 4. Related works (cont.)
    Ad scheduling,
      • Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
      • Scheduling of dynamic in-game advertising [Turner 2011]
    Multi-armed bandits,
      • Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
      • Multi-armed bandit problems with dependent arms [Pandey 2007]
    POMDPs,
      • A survey of POMDP applications [Cassandra 1998]
      • Monte Carlo POMDPs [Thrun 2000]
      • Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
  • 5. Problem statement - setup
    Figure: 1 webpage, 1 ad slot, M impressions at each time step. The payoff of ads follows $X \sim \mathcal{N}(\mu, \sigma_0^2 I)$. $\mu$ is generated by $\mu \sim \mathcal{N}(\theta, \Sigma)$.
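A minimal Python sketch of this generative model may help make the two sampling stages concrete. It assumes two ads and made-up values for θ, Σ and σ0² (none of these numbers come from the slides), and the helper names `cholesky`, `sample_mu` and `sample_payoffs` are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mu(theta, sigma, rng):
    """Draw the mean payoffs mu ~ N(theta, Sigma) as mu = theta + L z with z ~ N(0, I)."""
    low = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    return [theta[i] + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(theta))]


def sample_payoffs(mu, noise_var, rng):
    """Draw payoffs X ~ N(mu, sigma0^2 I): independent noise around each ad's mean."""
    sd = math.sqrt(noise_var)
    return [rng.gauss(m, sd) for m in mu]


# Illustrative values only (assumptions, not taken from the dataset).
rng = random.Random(0)
theta = [100.0, 120.0]                    # prior means of the two ads
sigma = [[400.0, 300.0], [300.0, 400.0]]  # prior covariance: the ads are correlated
mu = sample_mu(theta, sigma, rng)         # hidden mean payoffs
x = sample_payoffs(mu, 25.0, rng)         # one noisy payoff per ad
```

The hidden state is `mu`; the publisher only ever sees entries of `x` for the ad it selected.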
  • 6. Problem statement - graphical model
    [Influence diagram. Nodes per stage t = 1, ..., T: belief θ(t), Σ(t) with remaining horizon T−1, T−2, ..., 0; action s(t); mean payoff µ(t); observed payoff x(t). Shared nodes: θ, Σ and σ0².]
    Figure: The payoff model illustrated by an influence diagram representation with generative processes of a finite horizon POMDP. s(t) is the selection action. θ(t), Σ(t) is the belief at some stage.
  • 7. Problem statement - objective function
    To maximise the expected cumulative payoff over time,
    $$\pi^* = \arg\max_\pi \mathbb{E}\big[R^\pi(T)\big] = \arg\max_\pi \mathbb{E}\Big[\sum_{t=1}^{T} X_{s(t)}(t)\Big] = \arg\max_\pi \sum_{t=1}^{T} \mathbb{E}\big[X_{s(t)}(t)\big]$$
    $$= \arg\max_\pi \sum_{t=1}^{T} \int_x x_{s(t)}(t)\, p\big(x_{s(t)}(t) \mid \Psi(t)\big)\, dx = \arg\max_\pi \sum_{t=1}^{T} \theta_{s(t)}(t) \qquad (1)$$
    where,
      • s(t) is the selection decision;
      • Ψ(t) is the available information;
      • π is a selection policy and π* is the optimal one;
      • the “M impressions” factor is dropped from the objective function.
  • 8. Belief update
    Figure: Updating belief on ads’ performance over time (panels for t = 1, t = 2, ...).
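  • 9. Belief update - the selected ad
    We update the belief using Bayes’ theorem.
    $$p\big(x_1 \mid x_1(t), \Psi(t)\big) = \int p\big(x_1 \mid x_1(t), \Psi(t), \mu_1\big)\, p\big(\mu_1 \mid x_1(t), \Psi(t)\big)\, d\mu_1 \qquad (2)$$
    By “completing squares”,
    $$p\big(\mu_1 \mid x_1(t), \Psi(t)\big) \propto p\big(x_1(t) \mid \mu_1, \Psi(t)\big)\, p\big(\mu_1 \mid \Psi(t)\big) \propto \exp\Big(-\frac{(x_1(t) - \mu_1)^2}{2\sigma_0^2} - \frac{(\mu_1 - \theta_1(t))^2}{2\sigma_1^2(t)}\Big) \qquad (3)$$
    we obtain the new belief,
    $$\mu_1 \mid x_1(t) \sim \mathcal{N}\big(\theta_1(t+1), \sigma_1^2(t+1)\big) \qquad (4)$$
    $$\theta_1(t+1) = \frac{\sigma_1^2(t)\, x_1(t) + \sigma_0^2\, \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad \sigma_1^2(t+1) = \frac{\sigma_1^2(t)\, \sigma_0^2}{\sigma_1^2(t) + \sigma_0^2} \qquad (5)$$
    We write $\theta_i(t)$ and $\sigma_i^2(t)$ as shorthand for $\theta_i \mid \Psi(t)$ and $\sigma_i^2 \mid \Psi(t)$.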
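The update of Equation (5) is a precision-weighted average of the prior mean and the new observation. A small Python sketch, with a hypothetical function name and illustrative numbers that are not from the slides:

```python
def update_selected(theta, var, x, noise_var):
    """Posterior mean and variance of mu_1 after observing payoff x, following Eq. (5).

    theta, var : current belief theta_1(t), sigma_1^2(t)
    x          : observed payoff x_1(t)
    noise_var  : payoff noise sigma_0^2
    """
    denom = var + noise_var
    new_theta = (var * x + noise_var * theta) / denom
    new_var = var * noise_var / denom
    return new_theta, new_var


# Illustrative numbers: prior N(100, 400), noise variance 100, observed payoff 130.
theta1, var1 = update_selected(theta=100.0, var=400.0, x=130.0, noise_var=100.0)
print(theta1, var1)  # 124.0 80.0
```

Because the prior variance (400) is four times the noise variance (100), the new mean moves four fifths of the way from 100 towards the observation 130, and the variance shrinks from 400 to 80.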
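  • 10. Belief update - the correlated ad
    We also update the belief of non-selected ads,
    $$p\big(x_2 \mid x_1(t), \Psi(t)\big) = \int p\big(x_2 \mid \mu_2, x_1(t), \Psi(t)\big)\, p\big(\mu_2 \mid x_1(t), \Psi(t)\big)\, d\mu_2 \qquad (6)$$
    With the linear Gaussian property,
    $$\mu_1 \mid \mu_2 \sim \mathcal{N}\big(\theta_1 \mid \mu_2,\; \sigma_1^2 \mid \mu_2\big) \qquad (7)$$
    $$\theta_1 \mid \mu_2 = \theta_1 + \frac{\sigma_{1,2}}{\sigma_2^2}(\mu_2 - \theta_2), \qquad \sigma_1^2 \mid \mu_2 = \sigma_1^2 - \frac{\sigma_{1,2}^2}{\sigma_2^2} \qquad (8)$$
    we obtain the new belief on a correlated ad,
    $$\mu_2 \mid x_1(t) \sim \mathcal{N}\big(\theta_2(t+1), \sigma_2^2(t+1)\big) \qquad (9)$$
    $$\theta_2(t+1) = \theta_2(t) + \sigma_{1,2}\, \frac{x_1(t) - \theta_1(t)}{\sigma_1^2(t) + \sigma_0^2}, \qquad \sigma_2^2(t+1) = \sigma_2^2(t) - \frac{\sigma_{1,2}^2}{\sigma_1^2(t) + \sigma_0^2} \qquad (10)$$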
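  • 11. Belief update - expected payoff
    We also obtain the expected payoff of the selected ad,
    $$X_1 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_1(t+1),\; \sigma_0^2 + \sigma_1^2(t+1)\big) \qquad (11)$$
    and the expected payoff of the correlated ad,
    $$X_2 \mid x_1(t), \Psi(t) \sim \mathcal{N}\big(\theta_2(t+1),\; \sigma_0^2 + \sigma_2^2(t+1)\big) \qquad (12)$$
    The final objective function is,
    $$\pi^* = \arg\max_\pi \sum_{t=1}^{T} \theta_{s(t)}(t) \quad \text{subject to} \qquad (13)$$
    $$\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2} \qquad (14)$$
    $$\sigma_{s(t+1)}^2(t+1) = \sigma_{s(t+1)}^2(t) - \frac{\sigma_{s(t),s(t+1)}^2}{\sigma_{s(t)}^2(t) + \sigma_0^2} \qquad (15)$$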
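Equations (14) and (15) can be applied to every ad at once after a single observation. The Python sketch below is one way to do that for a belief held as a mean vector and covariance matrix. The function name is hypothetical, the numbers are illustrative, and the off-diagonal covariance update is the standard Gaussian conditioning rule, which is an assumption here because the slides only state the mean and variance updates.

```python
def update_belief(theta, sigma, i, x, noise_var):
    """Belief over all ads after selecting ad i and observing payoff x.

    Means follow Eq. (14) and diagonal variances follow Eq. (15).
    Off-diagonal covariances use standard Gaussian conditioning (assumed,
    not stated on the slides).
    """
    n = len(theta)
    denom = sigma[i][i] + noise_var
    gain = [sigma[j][i] / denom for j in range(n)]
    new_theta = [theta[j] + gain[j] * (x - theta[i]) for j in range(n)]
    new_sigma = [[sigma[j][k] - gain[j] * sigma[i][k] for k in range(n)]
                 for j in range(n)]
    return new_theta, new_sigma


# Illustrative belief over two correlated ads (values are assumptions).
theta = [100.0, 120.0]
sigma = [[400.0, 300.0], [300.0, 400.0]]
new_theta, new_sigma = update_belief(theta, sigma, i=0, x=130.0, noise_var=100.0)
```

With these numbers the selected ad's belief becomes mean 124 and variance 80, which matches Equation (5), and the non-selected ad's mean rises from 120 to 138 while its variance falls from 400 to 220, even though it was never shown.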
  • 12. POMDP formulation and solution
    [Diagram labels: (belief state), (observation & reward), (action), (hidden state).]
    Figure: The POMDP model for the revenue optimisation problem. (θ(t), Σ(t)) is the belief at some stage; x(t) is the observation and reward; s(t) is the action; (θ, Σ) is the hidden state. There is no state transition.
  • 13. Value iteration and MAB approximation
    The value function could be expressed as,
    $$s(t) = \arg\max_{s(t) \in N} V_{s(t)}(\Psi(t)) = \arg\max_{i \in N} \Big( \underbrace{\mathbb{E}(x_i)}_{\text{the expected immediate reward}} + \underbrace{\xi(\Psi(t), i)}_{\text{the expected future reward}} \Big) \qquad (16)$$
    The exact solution using value iteration²:
    $$V^*(\theta, \Sigma, T) = \max_{s(1) \in N} \mathbb{E}\Big[ X_{s(1)}(1) + V^*\big(\theta \mid X_{s(1)}(1),\; \Sigma \mid X_{s(1)}(1),\; T - 1\big) \Big] \qquad (17)$$
    The approximation based on multi-armed bandits³:
    $$\xi_{\text{UCB1-NORMAL}} = \sqrt{16 \cdot \frac{q_i - t_i\, \theta_i^2(t)}{t_i - 1} \cdot \frac{\ln(t - 1)}{t_i}} \qquad (18)$$
    ² R. E. Bellman. (1957) “Dynamic Programming”
    ³ Auer, P. et al. (2002) “Finite-time analysis of the multi-armed bandit problem”
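As a worked example of the UCB1-NORMAL bonus of Equation (18), the Python sketch below computes it for one ad. It takes $q_i$ to be the sum of squared payoffs observed for ad i and $t_i$ the number of times it was selected, as in Auer et al. (2002). The function name, the clamp at zero and the sample payoffs are assumptions made for illustration.

```python
import math


def ucb1_normal_bonus(q_i, t_i, theta_i, t):
    """Exploration bonus of Eq. (18) for one ad.

    q_i     : sum of squared payoffs observed for ad i
    t_i     : number of times ad i has been selected (needs t_i >= 2)
    theta_i : current mean estimate of ad i
    t       : current time step (needs t >= 2)
    """
    if t_i < 2 or t < 2:
        raise ValueError("the bonus needs t_i >= 2 and t >= 2")
    # Clamp at zero: an added guard (assumption) in case the mean estimate is
    # a belief mean rather than the plain sample mean.
    spread = max(q_i - t_i * theta_i ** 2, 0.0) / (t_i - 1)
    return math.sqrt(16.0 * spread * math.log(t - 1) / t_i)


# Illustrative payoffs for one ad, selected 4 times within the first 10 steps.
payoffs = [8.0, 12.0, 10.0, 14.0]
q = sum(p * p for p in payoffs)      # 504.0
mean = sum(payoffs) / len(payoffs)   # 11.0
bonus = ucb1_normal_bonus(q, len(payoffs), mean, t=10)
```

Here the sample variance term is (504 − 4 · 11²) / 3 = 20/3, so the bonus is sqrt(16 · 20/3 · ln 9 / 4), roughly 7.65. The bonus grows slowly with t and shrinks as the ad is selected more often.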
  • 14. Value iteration with Monte Carlo sampling⁴
    We use sampling to reduce the computational complexity,
      function VALUEFUNC(θ, Σ, t)
          array V ← 0                                        ▷ Expected reward vector.
          loop i ← 1 to N
              V[i] ← θ_i(t)                                  ▷ Expected immediate reward.
              if t < T then
                  for all s in SAMPLE(θ, Σ) do
                      [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, s, i)    ▷ New belief after selecting i and observing s. Equations 13.
                      V[i] ← V[i] + (1/M) · VALUEFUNC(θ′, Σ′, t + 1)₀
                  end for
              end if
          end loop
          return [MAX(V), MAXINDEX(V)]
      end function
    ⁴ Thrun, S. (2000) “Monte Carlo POMDPs”
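The recursion above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: payoffs are sampled from the predictive distribution N(θ_i, σ_i² + σ0²), `n_samples` is assumed to play the role of M in the 1/M averaging, the off-diagonal covariance update uses standard Gaussian conditioning, and all names and numbers are hypothetical.

```python
import math
import random


def update_belief(theta, sigma, i, x, noise_var):
    """Belief after selecting ad i and observing payoff x (Eqs. 14-15 on the diagonal)."""
    n = len(theta)
    denom = sigma[i][i] + noise_var
    gain = [sigma[j][i] / denom for j in range(n)]
    new_theta = [theta[j] + gain[j] * (x - theta[i]) for j in range(n)]
    new_sigma = [[sigma[j][k] - gain[j] * sigma[i][k] for k in range(n)]
                 for j in range(n)]
    return new_theta, new_sigma


def value_func(theta, sigma, t, horizon, noise_var, n_samples, rng):
    """Return (best expected cumulative payoff, best ad index) from stage t to the horizon."""
    values = []
    for i in range(len(theta)):
        v = theta[i]  # expected immediate reward
        if t < horizon:
            # Assumed sampler: payoff of ad i drawn from its predictive distribution.
            sd = math.sqrt(sigma[i][i] + noise_var)
            future = 0.0
            for _ in range(n_samples):
                x = rng.gauss(theta[i], sd)
                th2, sg2 = update_belief(theta, sigma, i, x, noise_var)
                future += value_func(th2, sg2, t + 1, horizon,
                                     noise_var, n_samples, rng)[0]
            v += future / n_samples  # Monte Carlo estimate of the future reward
        values.append(v)
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best


# Illustrative run: 2 correlated ads, 3 stages, 20 samples per action.
rng = random.Random(1)
theta = [100.0, 120.0]
sigma = [[400.0, 300.0], [300.0, 400.0]]
value, ad = value_func(theta, sigma, t=1, horizon=3,
                       noise_var=100.0, n_samples=20, rng=rng)
```

The cost grows as (N · n_samples)^(T−1), so the sketch is only practical for a handful of ads and a short horizon, which is the motivation for the bandit approximation.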
  • 15. Multi-armed bandit based approximation (cont.)
    The UCB1-NORMAL-COR algorithm:
      function PLAN(θ, Σ, Ψ(t))
          array V ← 0
          loop i ← 1 to N
              if t_i < 8 log t then                          ▷ t_i is the number of times ad i gets selected.
                  return i
              end if
          end loop
          [θ′, Σ′] ← UPDATEBELIEF(θ, Σ, Ψ(t))                ▷ New belief of all ads with all available information. Equations 13.
          loop i ← 1 to N
              V[i] ← θ′_i + sqrt(16 · (q_i − t_i θ′_i²) / (t_i − 1) · ln(t − 1) / t_i)    ▷ Expected reward.
          end loop
          return [MAX(V), MAXINDEX(V)]
      end function
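A compact Python sketch of the selection step follows. It assumes the belief means passed in have already been updated with all available information (the UPDATEBELIEF step), that `sq_sums[i]` holds $q_i$ as the sum of squared payoffs of ad i, and that only the chosen index is returned. The function name, the extra `counts[i] < 2` guard and the numbers are assumptions for illustration.

```python
import math


def plan(theta, counts, sq_sums, t):
    """Pick the next ad in the style of UCB1-NORMAL-COR.

    theta   : belief means of all ads, already updated with correlations
    counts  : counts[i] is t_i, how often ad i has been selected
    sq_sums : sq_sums[i] is q_i, the sum of squared payoffs of ad i
    t       : current time step
    """
    n = len(theta)
    # Forced exploration: any ad selected fewer than 8 log t times is shown first.
    # The counts[i] < 2 guard is an addition so the bonus below is well defined.
    for i in range(n):
        if counts[i] < 2 or counts[i] < 8 * math.log(t):
            return i
    log_term = math.log(max(t - 1, 1))
    scores = []
    for i in range(n):
        spread = max(sq_sums[i] - counts[i] * theta[i] ** 2, 0.0) / (counts[i] - 1)
        scores.append(theta[i] + math.sqrt(16.0 * spread * log_term / counts[i]))
    return max(range(n), key=scores.__getitem__)


# An ad that has never been shown is explored first.
first = plan([1.0, 2.0], counts=[0, 5], sq_sums=[0.0, 25.0], t=6)
# Once both ads are well explored with equal spread, the higher mean wins.
second = plan([1.0, 2.0], counts=[40, 40], sq_sums=[79.0, 199.0], t=81)
```

The difference from plain UCB1-NORMAL lies in where `theta` comes from: an observation on one ad also shifts the means of correlated ads, so fewer impressions are spent exploring ads whose performance can be inferred.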
  • 16. Experiment datasets
    [Diagram: advertisers, ad network/exchange and publishers, with data drawn from the Google AdWords Traffic Estimator service.]
      • publishers gain 68% of advertisers’ spending (2003);
      • data was collected from 12/2011 to 05/2012;
      • 512 different keywords, 310 with non-zero mean payoff, 8 categories;
      • 20% for training and 80% for testing;
      • we consider each keyword to be an ad.
  • 17. Competing algorithms
    We compare the following algorithms,
      • RANDOM policy, which selects candidates uniformly at random;
      • MYOPIC policy, which selects based on the expected immediate reward;
      • UCB1 policy, which assumes independence between arms and makes no assumption about the reward distribution;
      • UCB1-NORMAL policy, which assumes independence between arms and a Gaussian reward distribution;
      • VI-COR policy, which solves value iteration using Monte Carlo sampling; and
      • UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
  • 18. Results
    Datasets      MYOPIC  RANDOM  UCB1  UCB1-N  VI-COR  UCB1-N-COR
    Education       21.9    23.0  30.9    30.9   41.2*        27.6
    Finance-1       38.5    27.8  40.9    26.4   44.5         27.4
    Finance-2       22.1    16.5  30.6    22.8   38.0*        22.9
    Information     14.1    12.9  27.8    15.9   29.4         15.9
    P&O             41.6    30.4  50.5    31.4   72.9*        63.3
    Shopping-1      17.4    10.6  42.3    16.1   40.2         16.4
    Shopping-2      29.9    14.5  34.3    75.3   52.9         79.2*
    Shopping-3       9.7     4.3  21.9    18.3   27.3         19.4
    P&S             24.7    26.0  47.2    57.1   67.9*        59.9
    Medical         30.5    19.6  52.7    32.2   58.0*        33.5
    Table: The cumulative payoffs are averaged on 8 chunks then normalized w.r.t. the GOLDEN policy for a better representation. The one with highest cumulative payoff is in bold and with * if the difference with the second best is significant by Wilcoxon signed-rank test. P&O is “People & organisations” and P&S is “Products & services”.
  • 19. Results (cont.)
    Figure: Cumulative payoff on “People & organization” category, 5 candidates. (Curves: VI-COR, UCB1-Normal-COR, UCB1-Normal, UCB1, Golden, Myopic, Random.)
  • 20. Results (cont.)
    Figure: Comparison of accumulated payoffs on the 10 datasets. VI-COR always performed better than MYOPIC and UCB1-NORMAL-COR always performed better than UCB1-NORMAL across all datasets. (Bars: Myopic, VI-Cor, UCB1-Normal, UCB1-Normal-Cor; y-axis: normalized cumulative payoff; x-axis: Edu, F-1, F-2, Info, P&O, S-1, S-2, S-3, P&S, Med.)
  • 21. Results (cont.)
    Figure: Special case: the daily payoff of two candidates with a sudden change. (Series: “best phones”, “term insurance”; y-axis: daily payoff; x-axis: day.)
  • 22. Results (cont.)
    Figure: The impact of the noise factor $\sigma_0^2$ for the situation in the previous figure. (Curves: Golden, Myopic, VI-COR, UCB1-Normal-COR; y-axis: cumulative payoff; x-axis: noise factor $\sigma_0^2$, log scale.)
    $$\theta_{s(t+1)}(t+1) = \theta_{s(t+1)}(t) + \sigma_{s(t),s(t+1)}\, \frac{x_{s(t)}(t) - \theta_{s(t)}(t)}{\sigma_{s(t)}^2(t) + \sigma_0^2}$$
  • 23. Future works
      • correlated update: if ad a1 on webpage w1 was shown to user u1 and we observed its performance, what’s the belief on the performance of ad a2 on webpage w2 when shown to user u2, with correlations known?
      • multiple ads with diversification (another exploration and exploitation dilemma);
      • better solution for our continuous POMDP problem.