The contextual multi-armed bandit model is a popular framework for describing sequential decision making under uncertainty. We introduce a novel variant of the problem that describes the disruption of the system by the entrance of an external sponsor. Sponsored content is ubiquitous in the modern world: it is most visible in recommender systems, but it appears more broadly in any scheme involving lobbying. Introducing a sponsor has several consequences; in this paper we focus on the differences in assignment mechanisms between the standard learner and the sponsor. We might not have access to a description of the process that governs the willingness to sponsor: model specifications can be confidential between companies, the decision might be made by humans, or it might arise from a complicated system that is not fully modeled, such as auctions between advertisers. In particular, the sponsoring mechanism can be confounded. We use a tool from the causal inference literature on combining randomized controlled trials with observational studies, and adjust the Inverse Gap Weighting algorithm to account for a confounded sponsor targeting mechanism. We show in a simulation that this adjustment improves learning. This research proposes a novel tool for analyzing the interaction of a recommender system with a sponsor. Moreover, it shows that ignoring the act of sponsoring might lead to worse outcomes, interpreted for example as lower user engagement. Most importantly, the research presents an adjustment that tackles the problem by combining causal inference with sequential decision making.

- 1. Sponsored content in contextual bandits. Deconfounding Targeting Not At Random. Hubert Drążkowski, GRAPE|FAME, Warsaw University of Technology. MIUE 2023, September 22, 2023. Sections: Authoritarian Sponsor, Deconfounding, Experiment, Conclusions, References.
- 2. Motivational examples: recommender systems
  - Suggest the best ads/movies $a \in \{a_1, a_2, \dots, a_K\}$
  - Users $X_1, X_2, \dots, X_T$
  - Design of the study $\{n_{a_1}, n_{a_2}, \dots, n_{a_K}\}$, with $\sum_i n_{a_i} = T$
  - Measured satisfaction $\{R_t(a_1), \dots, R_t(a_K)\}_{t=1}^{T}$
- 3. The framework: exploration vs exploitation
  - Allocating limited resources under uncertainty
  - Sequential manner
  - Partial feedback (bandit feedback)
  - Adaptive (non-iid) data
  - Maximizing cumulative gain
  - Current actions do not change the future environment
- 9. The elements of the bandit model (see Lattimore and Szepesvári (2020))
  - Context $X_t \in \mathcal{X}$, with $X_t \sim D_X$
  - Actions $A_t \in \mathcal{A} = \{a_1, \dots, a_K\}$, with $A_t \sim \pi_t(a \mid x)$
  - Policy $\pi \in \Pi$, $\pi = \{\pi_t\}_{t=1}^{T}$, $\pi_t : \mathcal{X} \mapsto \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A}) := \{q \in [0, 1]^K : \sum_{a \in \mathcal{A}} q_a = 1\}$
  - Rewards $R_t \in \mathbb{R}_+$, with potential rewards $(R(a_1), R(a_2), \dots, R(a_K))$, observed reward $R_t = \sum_{k=1}^{K} \mathbb{1}(A_t = a_k) R(a_k)$, and $R_t \sim D_{R \mid A, X}$
  - History $H_t \in \mathcal{H}_t$, $H_t = \sigma\left(\{(X_s, A_s, R_s)\}_{s=1}^{t}\right)$
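The interaction protocol these elements describe can be sketched as a short loop. This is only an illustration: `run_bandit`, `uniform_policy`, and the callables passed in are hypothetical names standing in for $D_X$, $D_{R \mid A, X}$ and $\pi$, none of which are specified on the slide.

```python
import numpy as np

def run_bandit(rng, horizon, n_arms, policy, draw_context, draw_potential_rewards):
    """Simulate the interaction protocol and return the history H_T.

    policy(x, history) must return a length-n_arms probability vector pi_t(.|x).
    All callables are hypothetical stand-ins for D_X, D_{R|A,X} and pi.
    """
    history = []
    for _ in range(horizon):
        x = draw_context(rng)                       # X_t ~ D_X
        probs = policy(x, history)                  # pi_t(.|x), may depend on past rounds
        a = int(rng.choice(n_arms, p=probs))        # A_t ~ pi_t(.|x)
        potential = draw_potential_rewards(rng, x)  # (R(a_1), ..., R(a_K))
        r = float(potential[a])                     # bandit feedback: only R(A_t) is observed
        history.append((x, a, r))
    return history

def uniform_policy(n_arms):
    """Policy that ignores context and history (pure exploration)."""
    return lambda x, history: np.full(n_arms, 1.0 / n_arms)
```

The learner never sees the full vector `potential`, only the entry for the arm it played, which is what "partial feedback" means in this setting.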
- 21. Details
  - In short, $(X_t, A_t, R_t) \sim D(\pi_t)$
  - We know $\pi_t(a \mid x)$ (the propensity score)
  - We don't know $D_{X, \vec{R}}$
  - We have $\mathbb{1}(A_t = a) \perp\!\!\!\perp R(a) \mid X_t$
  - We want to choose $\pi$ to maximize $\mathbb{E}_{D(\pi)}\left[\sum_{t=1}^{T} R_t(A_t)\right]$
- 26. The flow of information
- 27. Inverse Gap Weighting
- 28. Outline: 1. Authoritarian Sponsor, 2. Deconfounding, 3. Experiment, 4. Conclusions
- 29. Authoritarian Sponsor model
  - The act of sponsoring
    - Recommender systems: marketing campaigns, testing products
    - Healthcare: funding experiments, lobbying doctors
  - The sponsor $(\text{€}, H)$ intervenes in an authoritarian manner:
    $$A_t = S_t \tilde{A}_t + (1 - S_t) \bar{A}_t, \qquad S_t \in \{0, 1\}, \qquad S_t \sim \text{€}(\cdot \mid X),$$
    $$\bar{A}_t \sim \pi_t(\cdot \mid X), \qquad \tilde{A}_t \sim H_t(\cdot \mid X),$$
    so that the action actually played is distributed as
    $$\mathbb{P}(A_t = a \mid X_t = x) = \text{€}_t(1 \mid x)\, H_t(a \mid x) + \text{€}_t(0 \mid x)\, \pi_t(a \mid x).$$
  - The lack of knowledge about the sponsor's policy $(\text{€}, H)$
    - Not sharing technology or strategy
    - Lost in human-to-algorithm translation
    - Hard-to-model processes such as auctions
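As a rough illustration of the intervention rule $A_t = S_t \tilde{A}_t + (1 - S_t)\bar{A}_t$, the sketch below draws one action and computes the resulting mixture over arms. The names `mixed_action`, `effective_policy`, `sponsor_probs` and `sponsor_rate` are hypothetical, and the sponsor's distributions are treated as known purely for simulation; in the setting of the talk the learner does not observe them.

```python
import numpy as np

def mixed_action(rng, learner_probs, sponsor_probs, sponsor_rate):
    """Draw (A_t, S_t) for one round of the authoritarian sponsor model.

    learner_probs: pi_t(.|x), the learner's distribution over arms.
    sponsor_probs: the sponsor's distribution over arms given x (assumed here).
    sponsor_rate:  probability that S_t = 1 given x (assumed here).
    """
    s = int(rng.random() < sponsor_rate)                              # S_t
    a_learner = int(rng.choice(len(learner_probs), p=learner_probs))  # action the learner would play
    a_sponsor = int(rng.choice(len(sponsor_probs), p=sponsor_probs))  # action the sponsor would play
    a = s * a_sponsor + (1 - s) * a_learner                           # the sponsor overrides when S_t = 1
    return a, s

def effective_policy(learner_probs, sponsor_probs, sponsor_rate):
    """Mixture distribution of the action that is actually played."""
    learner_probs = np.asarray(learner_probs, dtype=float)
    sponsor_probs = np.asarray(sponsor_probs, dtype=float)
    return sponsor_rate * sponsor_probs + (1.0 - sponsor_rate) * learner_probs
```

For instance, with a uniform learner over two arms, a sponsor that always plays arm 0, and a sponsoring rate of 0.4, `effective_policy` gives probabilities 0.7 and 0.3, so logged data no longer follow the learner's own propensities.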
- 32. Targeting mechanisms, introducing an unobserved confounder $Z$
  1. Targeting Completely At Random (TCAR)
     - $S(X) = S$, $H(a \mid X, R, Z) = H(a)$
     - kind of like MCAR
  2. Targeting At Random (TAR)
     - $S(X) = S(X)$, $H(a \mid X, R, Z) = H(a \mid X)$
     - kind of like MAR
  3. Targeting Not At Random (TNAR)
     - $H(a \mid X, R, Z) \Rightarrow R(a) \not\perp\!\!\!\perp A \mid X, S = 1$
     - kind of like MNAR
- 35. Causal interpretation [causal diagrams: Figure 1: TCAR; Figure 2: TAR; Figure 3: TNAR]
- 37. Data fusion (see Colnet et al. (2020)). Table 1: Differences and similarities between data sources [columns: RCT, OS; rows: internal validity, external validity, propensity score; of the cell marks only a "?" in the propensity score row survives]
- 38. Data fusion (see Colnet et al. (2020)). Table 2: Differences and similarities between data sources [columns: RCT, OS, Learner, Sponsor; rows: internal validity, external validity, propensity score; of the cell marks only "∼ ∼" in the external validity row and "? ?" in the propensity score row survive]
  - Unsolved challenge: sampling in interaction!
- 40. CATE
  - CATE: $\tau_{a_1,a_2}(x) = \mathbb{E}_{D_{R \mid A, X = x}}[R(a_1) - R(a_2)]$ and $\hat{\tau}_{a_1,a_2}(x) = \hat{\mu}_{a_1}(x) - \hat{\mu}_{a_2}(x)$
  - Assumptions
    - SUTVA: $R_t = \sum_{a \in \mathcal{A}} \mathbb{1}(A_t = a) R_t(a)$
    - Ignorability: $\mathbb{1}(A_t = a) \perp\!\!\!\perp R(a) \mid X_t, S_t = 0$
    - Ignorability of the study participation: $R_t(a) \perp\!\!\!\perp S_t \mid X_t$
    - TNAR: $R(a) \not\perp\!\!\!\perp A \mid X, S = 1$
  - Biased CATE on the sponsor sample: $\rho_{a_1,a_2}(x) = \mathbb{E}[R \mid A = a_1, X = x, S = 1] - \mathbb{E}[R \mid A = a_2, X = x, S = 1]$
  - Bias measurement: $\eta_{a_1,a_2}(x) = \tau_{a_1,a_2}(x) - \rho_{a_1,a_2}(x)$
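A short derivation may help show how these assumptions are used. It is a standard identification argument rather than something stated on the slide, and it additionally assumes positivity on the learner-driven rounds, i.e. $\pi_t(a \mid x) > 0$. For any arm $a$:

```latex
\begin{aligned}
\mathbb{E}[R(a) \mid X = x]
  &= \mathbb{E}[R(a) \mid X = x, S = 0]
     && \text{(ignorability of the study participation)} \\
  &= \mathbb{E}[R(a) \mid A = a, X = x, S = 0]
     && \text{(ignorability on } S = 0\text{)} \\
  &= \mathbb{E}[R \mid A = a, X = x, S = 0]
     && \text{(SUTVA)}
\end{aligned}
```

Taking the difference between arms $a_1$ and $a_2$ expresses $\tau_{a_1,a_2}(x)$ as a regression contrast on the $S = 0$ rounds. On the $S = 1$ rounds the second equality is not available under TNAR, which is why $\rho_{a_1,a_2}(x)$ can differ from $\tau_{a_1,a_2}(x)$ and the bias term $\eta_{a_1,a_2}(x)$ is needed.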
- 43. Two step deconfounding (see Kallus et al. (2018)), $\mathcal{A} = \{a_0, a_1\}$
  1. On the observational sample, use a metalearner to obtain $\hat{\rho}_{a_1,a_0}(X)$.
  2. Postulate a function $q_t(X, a_0)$ such that $\mathbb{E}[q_t(X, a_0) R \mid X = x, S = 0] = \tau_{a_1,a_0}(x)$, where
     $$q_t(X, a_0) = \frac{\mathbb{1}(A = a_1)}{\pi_t(a_1 \mid X)} - \frac{\mathbb{1}(A = a_0)}{\pi_t(a_0 \mid X)}.$$
  3. Using $q_t(X, a_0)$, apply the definition of $\eta_{a_1,a_0}(x)$ to adjust the $\hat{\rho}$ term by solving an optimization problem on the unconfounded sample:
     $$\hat{\eta}_{a_1,a_0} = \arg\min_{\eta} \sum_{t : S_t = 0} \left(q_t(x_t, a_0)\, r_t - \hat{\rho}_{a_1,a_0}(x_t) - \eta(x_t)\right)^2.$$
  4. Finally, $\hat{\tau}_{a_1,a_0}(x) = \hat{\rho}_{a_1,a_0}(x) + \hat{\eta}_{a_1,a_0}(x)$.
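The four steps above can be sketched with ordinary least squares. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the metalearner is taken to be a T-learner with quadratic features, $\eta$ is restricted to be linear in a scalar context, arms are coded 0 and 1, and the helper names (`poly_features`, `fit_ls`, `two_step_cate`) are hypothetical.

```python
import numpy as np

def poly_features(x, degree):
    """Columns 1, x, ..., x^degree for a 1-D context array."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

def fit_ls(features, target):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return coef

def two_step_cate(x_conf, a_conf, r_conf, x_rct, a_rct, r_rct, pi1_rct):
    """Return callables (rho_hat, tau_hat) for arms a0 = 0 and a1 = 1.

    *_conf: sponsor-driven rounds (S_t = 1), possibly confounded.
    *_rct:  learner-driven rounds (S_t = 0); pi1_rct holds pi_t(a1 | x_t).
    """
    # Step 1: T-learner on the confounded sample gives the biased contrast rho_hat.
    mu = {}
    for arm in (0, 1):
        mask = a_conf == arm
        mu[arm] = fit_ls(poly_features(x_conf[mask], 2), r_conf[mask])

    def rho_hat(x):
        return poly_features(x, 2) @ (mu[1] - mu[0])

    # Step 2: inverse-propensity weights q_t, so that E[q_t R_t | X, S = 0] = tau(X).
    q = np.where(a_rct == 1, 1.0 / pi1_rct, -1.0 / (1.0 - pi1_rct))

    # Step 3: least squares fit of a linear eta(x) to q_t r_t - rho_hat(x_t).
    eta = fit_ls(poly_features(x_rct, 1), q * r_rct - rho_hat(x_rct))

    # Step 4: corrected CATE estimate.
    def tau_hat(x):
        return rho_hat(x) + poly_features(x, 1) @ eta

    return rho_hat, tau_hat
```

The design choice worth noting is that the correction $\eta$ is fitted only on rounds with known propensities, while the flexible part $\hat{\rho}$ can use the larger confounded sample; the sketch is only as good as the assumption that a linear $\eta$ captures the bias.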
- 47. Deconfounded CATE IGW (D-CATE-IGW)
  - Let $b = \arg\max_a \hat{\mu}_a(x_t)$. Then
    $$\pi(a \mid x) = \begin{cases} \dfrac{1}{K + \gamma_m \left(\hat{\mu}^m_b(x) - \hat{\mu}^m_a(x)\right)} & \text{for } a \neq b \\[1ex] 1 - \sum_{c \neq b} \pi(c \mid x) & \text{for } a = b \end{cases} \;=\; \begin{cases} \dfrac{1}{K + \gamma_m \hat{\tau}_{b,a}(x)} & \text{for } a \neq b \\[1ex] 1 - \sum_{c \neq b} \pi(c \mid x) & \text{for } a = b \end{cases}$$
  - Each round/epoch, deconfound the CATE
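A minimal sketch of the inverse gap weighting rule above, written so that the gaps can come either from reward estimates $\hat{\mu}_a(x)$ or directly from (deconfounded) CATE estimates $\hat{\tau}_{b,a}(x)$. The function names are hypothetical, and the gaps are assumed non-negative with a zero entry for the greedy arm $b$.

```python
import numpy as np

def igw_from_gaps(gaps, gamma):
    """Inverse gap weighting from estimated gaps tau_hat_{b,a}(x) >= 0.

    gaps[a] is the estimated advantage of the greedy arm b over arm a,
    so gaps[b] = 0. gamma plays the role of gamma_m in the formula.
    """
    gaps = np.asarray(gaps, dtype=float)
    best = int(np.argmin(gaps))               # greedy arm b (zero gap)
    probs = 1.0 / (gaps.size + gamma * gaps)  # 1 / (K + gamma * gap) for a != b
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()           # remaining mass goes to b
    return probs

def igw_probs(mu_hat, gamma):
    """Same rule, starting from reward estimates mu_hat_a(x)."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    return igw_from_gaps(mu_hat.max() - mu_hat, gamma)
```

With two arms, estimates `[1.0, 0.0]` and `gamma = 2.0`, the non-greedy arm gets probability 1/(2 + 2·1) = 0.25 and the greedy arm 0.75; with `gamma = 0` the rule reduces to uniform exploration.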
- 49. Setup I
  - $S_t \sim \mathrm{Bern}(\rho)$
  - No overlap scenario: $X_t \mid S_t = 0 \sim \mathrm{Unif}([-1, 1])$, $U_t \mid S_t = 0 \sim N(0, 1)$
  - Full overlap scenario: $X_t \mid S_t = 0 \sim N(0, 1)$, $U_t \mid S_t = 0 \sim N(0, 1)$
  - Sponsor rounds:
    $$\begin{pmatrix} X_t \\ U_t \end{pmatrix} \Big|\, \{A_t, S_t = 1\} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & (2A_t - 1)\sigma_A \\ (2A_t - 1)\sigma_A & 1 \end{pmatrix}\right)$$
  - $\sigma_A \in \{0.6, 0.9\}$
  - $\rho \in \{0.3, 0.6\}$
  - Rewards:
    $$R_t(A_t) = 1 + A_t + X_t + 2A_t X_t + \tfrac{1}{2} X_t^2 + \tfrac{3}{4} A_t X_t^2 + 2U_t + \tfrac{1}{2}\epsilon_t, \qquad \epsilon_t \sim N(0, 1),$$
    so that $\tau(X_t) = \tfrac{3}{4} X_t^2 + 2X_t + 1$.
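The data-generating process above could be simulated roughly as follows. The slide does not say how $A_t$ is drawn in either regime, so the sketch assumes a fixed Bernoulli probability for both the sponsor (`sponsor_p1`) and the learner (`learner_p1`); these parameters and the function names are hypothetical.

```python
import numpy as np

def true_cate(x):
    """tau(x) = 3/4 x^2 + 2 x + 1, i.e. R(1) - R(0) at fixed x, u and noise."""
    return 0.75 * x**2 + 2.0 * x + 1.0

def reward(a, x, u, eps):
    """Reward equation from the setup; u is the unobserved confounder."""
    return 1 + a + x + 2 * a * x + 0.5 * x**2 + 0.75 * a * x**2 + 2 * u + 0.5 * eps

def draw_round(rng, rho, sigma_a, full_overlap, learner_p1=0.5, sponsor_p1=0.5):
    """Draw one round (S_t, A_t, X_t, U_t, R_t)."""
    s = int(rng.random() < rho)                # S_t ~ Bern(rho)
    if s == 1:
        a = int(rng.random() < sponsor_p1)     # assumed sponsor action rule
        c = (2 * a - 1) * sigma_a              # sign of the X-U correlation depends on A_t
        x, u = rng.multivariate_normal([0.0, 0.0], [[1.0, c], [c, 1.0]])
    else:
        a = int(rng.random() < learner_p1)     # assumed learner action rule
        x = rng.normal() if full_overlap else rng.uniform(-1.0, 1.0)
        u = rng.normal()
    r = reward(a, x, u, rng.normal())
    return s, a, float(x), float(u), float(r)
```

In the sponsor rounds the context and the hidden variable $U_t$ are correlated with a sign that depends on the arm, so comparing arms on those rounds mixes the treatment effect with the $2U_t$ term of the reward.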
- 53. Result I [Figure 4: Normed cumulative regret for different scenarios]
- 54. Result II [Figure 5: True and estimated CATE values for different scenarios]
- 56. Contribution
  1. Pioneering model for sponsored content in the contextual bandits framework
  2. Bandits not as experimental studies, but as observational studies
  3. Confounding scenario and deconfounding application
  4. D-CATE-IGW works
- 57. Future research
  - Theoretical
    - Mathematically model the complicated sampling, especially the flow of information
    - Consistency proof of the CATE estimator in this scenario
    - High probability regret bounds on D-CATE-IGW: $P(\mathrm{REWARD}(\pi) \geq \mathrm{BOUND}(\delta)) \geq 1 - \delta$
  - Empirical
    - More metalearners (X-learner, R-learner) (see Künzel et al. (2019))
    - Other deconfounding methods (see Wu and Yang (2022))
    - A more comprehensive empirical study
  - Expansion
    - Policy evaluation: $V(\pi) = \mathbb{E}_X \mathbb{E}_{A \sim \pi(\cdot \mid X)} \mathbb{E}_{R \mid A, X}[R]$, with $\hat{V}_t(\pi)$ computed on $\{(X_s, A_s, R_s)\}_{s=1}^{t} \sim D(H)$
- 67. The beginning ...
- 68. References Colnet, B., I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang (2020). Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047. Kallus, N., A. M. Puli, and U. Shalit (2018). Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems 31. Künzel, S. R., J. S. Sekhon, P. J. Bickel, and B. Yu (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116(10), 4156–4165. Lattimore, T. and C. Szepesvári (2020). Bandit Algorithms. Cambridge University Press. Wu, L. and S. Yang (2022). Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. In Conference on Causal Learning and Reasoning, pp. 904–926. PMLR.