1. The document discusses fairness constraints in the contextual bandit and classic bandit problems.
2. It shows that for classic bandits, Θ(k³) rounds are necessary and sufficient to achieve non-trivial regret under the fairness constraint.
3. For contextual bandits, it establishes a tight relationship between fairness and Knows What It Knows (KWIK) learning: KWIK learnability implies the existence of fair learning algorithms, and vice versa.
1. Introduction of “Fairness in Learning: Classic and Contextual Bandits”
authored by Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth
NIPS2016-Yomi
January 19, 2017
Presenter: Kazuto Fukuchi
2. Fairness in Machine Learning
Consequential decisions made with machine learning may lead to unfair treatment
E.g., Google’s ad suggestion system [Sweeney 13]
This talk: fairness in the contextual bandit problem
[Figure: searches for African-descent names yielded negative “Arrested?” ads, while European-descent names yielded neutral “Located” ads]
3. Individual Fairness
k persons
• Choose one person for some action
• E.g., granting a loan, hiring, admission, etc.
When can we preferentially choose one person?
Only if that person has the greatest ability; there must be no other reason for a preferential choice
[Figure: person with payback rate 90% > person with payback rate 60%]
4. Contextual Bandit Problem
Each round t:
1. Obtain a context x_j^t for each arm j
2. Choose one arm i^t ∈ [k]
3. Observe reward r_{i^t}^t, where E[r_j^t] = f_j(x_j^t) and r_j^t ∈ [0,1] a.s.
k arms with reward functions f_1, …, f_k, unknown to the learner
Goal: maximize the expected cumulative reward
E[Σ_t r_{i^t}^t] = E[Σ_t f_{i^t}(x_{i^t}^t)]
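The protocol above can be sketched as a simulation loop. The concrete reward functions, the Bernoulli reward draw, and the uniform policy below are illustrative assumptions; in the real problem the f_j are unknown to the learner.

```python
import random

K = 5  # number of arms

# Hypothetical reward functions f_j (unknown to the learner in practice).
def make_f(slope):
    return lambda x: max(0.0, min(1.0, slope * x))

f = [make_f(0.2 * (j + 1)) for j in range(K)]

def play_round(policy):
    """One round of the contextual bandit protocol."""
    x = [random.random() for _ in range(K)]   # 1. a context x_j^t per arm
    i = policy(x)                             # 2. choose an arm i^t in [k]
    # 3. observe a reward with E[r_j^t] = f_j(x_j^t) and r_j^t in [0, 1]
    r = 1.0 if random.random() < f[i](x[i]) else 0.0
    return x, i, r

# A trivial illustrative policy: pick an arm uniformly at random.
x, i, r = play_round(lambda x: random.randrange(K))
```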
5. Example: Linear Contextual Bandit
Define
C = { f_θ : f_θ(x) = ⟨θ, x⟩, θ ∈ ℝ^d, ‖θ‖ ≤ 1 }
𝒳 = { x ∈ ℝ^d : ‖x‖ ≤ 1 }
• Suppose f_j = f_{θ_j} ∈ C and x_j^t ∈ 𝒳
E.g., online recommendation
• θ_j: feature vector of product j
• x_j^t: feature vector of user t with respect to product j
• The score of user t for product j is the inner product ⟨x_j^t, θ_j⟩
6. Example: Classic Bandit
• Expected reward is E[r_j^t] = μ_j
• Set f_j(x_j^t) = μ_j for every context x_j^t
• Then the contextual bandit reduces to the classic bandit with arm means μ_1, …, μ_k
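As a tiny check of this reduction, one can wrap constant means μ_j (made-up values here) as context-ignoring functions f_j:

```python
# The classic k-armed bandit as a special case: each f_j ignores its
# context and returns a constant mu_j (the values are made up).
mus = [0.1, 0.6, 0.3, 0.8, 0.5]
f = [lambda x, m=m: m for m in mus]  # default arg pins each mu_j

print(f[1]("any context"))  # f_2 returns mu_2 = 0.6 for every context
print(f[3](None))           # f_4 returns mu_4 = 0.8
```

The default-argument trick (`m=m`) avoids Python's late binding of closure variables, so each lambda keeps its own μ_j.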
7. Regret
• History h^t: the record of the first t − 1 experiences
• contexts, arms chosen, and rewards observed
• A policy π: a mapping from x^t and h^t to a distribution over the arms [k]
• π_{j|h^t}^t: probability of choosing arm j at round t given history h^t
Regret: reward lost compared to the optimal policy
Regret(x^1, …, x^T) = Σ_t [ max_j f_j(x_j^t) − E_{i^t∼π^t}[ f_{i^t}(x_{i^t}^t) ] ]
An algorithm has regret bound R(T) if max_{x^1,…,x^T} Regret(x^1, …, x^T) ≤ R(T)
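The per-round term of this sum can be computed directly. The arm qualities and the uniform policy below are made-up numbers for illustration:

```python
def round_regret(f_values, pi):
    """Per-round regret: max_j f_j(x_j^t) minus the policy's
    expected reward sum_j pi_j * f_j(x_j^t)."""
    expected = sum(p * v for p, v in zip(pi, f_values))
    return max(f_values) - expected

# Uniform play over arms with qualities 0.2, 0.5, 0.6 loses
# 0.6 - (0.2 + 0.5 + 0.6)/3 per round; always playing the best arm loses 0.
reg = round_regret([0.2, 0.5, 0.6], [1/3, 1/3, 1/3])
```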
8. Fairness Constraint
It is unfair to preferentially choose one individual without an acceptable reason
A policy π is δ-fair if, with probability 1 − δ, for all rounds t and arms j, j′:
π_{j|h^t} > π_{j′|h^t} only if f_j(x_j^t) > f_{j′}(x_{j′}^t)
That is, one arm may be favored over another only when its quality is strictly larger
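The constraint is easy to check for a single round. This helper and its example numbers are illustrative, not part of the paper:

```python
def violates_fairness(pi, f_values, tol=1e-12):
    """True if some arm j is played with higher probability than an
    arm j' whose quality f_{j'}(x_{j'}^t) is at least as large."""
    K = len(pi)
    for j in range(K):
        for jp in range(K):
            if pi[j] > pi[jp] + tol and f_values[j] <= f_values[jp]:
                return True
    return False

# Favoring arm 0 although arm 1 has equal quality violates the
# constraint; splitting uniformly between them does not.
assert violates_fairness([0.7, 0.3], [0.5, 0.5])
assert not violates_fairness([0.5, 0.5], [0.5, 0.5])
```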
9. Intuition of the Fairness Constraint
• The optimal policy is fair
• But we cannot implement the optimal policy because f_1, …, f_k are unknown
[Figure: two groups of arms — the learner cannot distinguish which arm in the left group has the higher expected reward, while the right group’s expected reward is lower than the left group’s with high probability]
The fairness constraint forces the learner to choose an arm from the left group uniformly at random
10. Fairness in Classic Bandit
• Maintain confidence intervals for the expected rewards
• Choose uniformly at random from the “chained” group
[Figure: confidence intervals for arms 1–5; arms 1–3 overlap and form the chained group, while the expected rewards of arms 4 and 5 are below those of every arm in the chained group]
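A minimal sketch of the chaining step, assuming the confidence intervals are given. The linking rule here (add arms in decreasing order of upper bound while their intervals still overlap the chain) is a simplified illustration, not the paper's exact FairBandits procedure:

```python
def chained_group(intervals):
    """intervals[j] = (lower_j, upper_j) confidence interval for arm j.
    Start from the arm with the highest upper bound and keep adding
    arms whose intervals overlap the chain built so far."""
    order = sorted(range(len(intervals)), key=lambda j: -intervals[j][1])
    group = [order[0]]
    lowest = intervals[order[0]][0]
    for j in order[1:]:
        lo, hi = intervals[j]
        if hi >= lowest:      # interval overlaps the chain
            group.append(j)
            lowest = min(lowest, lo)
        else:                 # gap: remaining arms are worse w.h.p.
            break
    return sorted(group)

# Arms 0-2 overlap in a chain; arms 3 and 4 are separated below it.
ivals = [(0.6, 0.9), (0.5, 0.7), (0.45, 0.55), (0.1, 0.4), (0.0, 0.2)]
print(chained_group(ivals))  # → [0, 1, 2]
```

The algorithm then plays uniformly over this group, since no arm inside it can be shown better than another with high probability.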
12. Regret Upper Bound
If δ < 1/T, then FairBandits has regret
R(T) = O(√(k³ T ln(Tk/δ)))
• T = Ω(k³) rounds are required to obtain non-trivial regret, i.e., R(T)/T ≪ 1
• Without fairness: O(√(kT))
• k becomes k³ under the fairness constraint
• The dependence on T is optimal
13. Regret Lower Bound
Any fair algorithm experiences constant per-round regret for at least
T = Ω(k³ ln(1/δ)) rounds
• Constant per-round regret means the regret is still trivial: R(T)/T = Ω(1)
• Hence achieving non-trivial regret requires at least Ω(k³) rounds
• Combined with the upper bound, Θ(k³) rounds are necessary and sufficient
14. Fairness in Contextual Bandit
KWIK learnable ⇔ fair-bandit learnable
KWIK (Knows What It Knows) learning:
• Online regression
• On input x^t, the learner outputs either a prediction ŷ^t ∈ [0,1] or ŷ^t = ⊥
• ⊥ means “I don’t know”
• Only when ŷ^t = ⊥ does the learner observe feedback y^t with E[y^t] = f(x^t)
[Figure: the learner receives a feature x^t and answers either an accurate prediction ŷ^t ∈ [0,1] or “I don’t know”]
15. KWIK Learnable
A class C is (ε, δ)-KWIK learnable with bound m(ε, δ) if, for every f ∈ C:
1. ŷ^t ∈ {⊥} ∪ [f(x^t) − ε, f(x^t) + ε] for all t, with probability 1 − δ
2. Σ_{t=1}^∞ 𝕀[ŷ^t = ⊥] ≤ m(ε, δ)
Intuition:
• Every numeric prediction (ŷ^t ≠ ⊥) is accurate
• The learner answers ⊥ at most m(ε, δ) times
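A standard concrete instance of this definition: a deterministic target on a finite input set is KWIK learnable by pure memorization, with bound m(ε, δ) = |𝒳|. The sketch below uses None for ⊥ and a made-up two-point target:

```python
# Memorization KWIK learner for a deterministic target on a finite
# input set: answer from memory if seen, otherwise output ⊥ (None).
BOTTOM = None  # stands for the symbol ⊥

class MemorizingKWIK:
    def __init__(self):
        self.memory = {}

    def predict(self, x):
        # Numeric answer if known, otherwise ⊥.
        return self.memory.get(x, BOTTOM)

    def observe(self, x, y):
        # Feedback arrives only after outputting ⊥.
        self.memory[x] = y

target = {"a": 0.2, "b": 0.9}  # made-up deterministic target
learner = MemorizingKWIK()
bottoms = 0
for x in ["a", "b", "a", "b", "a"]:
    y_hat = learner.predict(x)
    if y_hat is BOTTOM:
        bottoms += 1
        learner.observe(x, target[x])
    else:
        assert y_hat == target[x]  # condition 1, here with ε = 0
assert bottoms <= len(target)      # condition 2: at most |X| ⊥'s
```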
16. KWIK Learnability Implies Fair Bandit Learnability
Suppose C is (ε, δ)-KWIK learnable with bound m(ε, δ).
Then, for δ ≤ 1/T, there is a δ-fair algorithm for f_j ∈ C with
R(T) = O( max{ k² m(ε*, min(δ, 1/T)/(T²k)), k³ ln(k/δ) } )
where
ε* = argmin_ε max{ εT, k·m(ε, min(δ, 1/T)/(T²k)) }
19. Intuition of KWIKToFair
• Predict each arm’s expected reward with a per-arm KWIK algorithm
• If no KWIK algorithm outputs ⊥, the chaining strategy from the classic bandit applies, using intervals of width 2ε* around the predicted rewards f_j(x_j^t)
[Figure: predicted expected rewards f_j(x_j^t) for arms 1–5, each with a band of width 2ε*]
20. Fair Bandit Learnability Implies KWIK Learnability
Suppose
• there is a δ-fair algorithm for f_j ∈ C with regret R(T, δ), and
• there exist f ∈ C and x^(ℓ) ∈ 𝒳 such that f(x^(ℓ)) = ℓε for ℓ = 1, …, 1/ε.
Then there is an (ε, δ)-KWIK algorithm for C whose bound m(ε, δ) is the solution of
m(ε, δ)·ε/4 = R(m(ε, δ), εδ/(2T))
21. An Exponential Separation Between Fair and Unfair Learning
• Boolean conjunctions: for x ∈ {0,1}^d,
C = { f | f(x) = x_{i_1} ∧ ⋯ ∧ x_{i_k}, 0 ≤ k ≤ d, i_1, …, i_k ∈ [d] }
• Without the fairness constraint, Boolean conjunctions admit R(T) = O(k² d)
• For this C, the KWIK bound is at least m(ε, δ) = Ω(2^d)
• Hence, for δ < 1/(2T), the worst-case regret of any fair algorithm is R(T) = Ω(2^d)
23. Intuition of FairToKWIK
• Divide the range of f(x^t) into intervals of width ε*, using known points x^(0), x^(1), x^(2), x^(3), x^(4) with f(x^(ℓ)) = ℓε*
• Using the fair algorithm, compare x^t against each x^(ℓ): let p_{ℓ,1} and p_{ℓ,2} be the probabilities of choosing the left and right arm, respectively
• A fair algorithm must split probability equally exactly when it cannot tell the two arms apart
• If p_{ℓ,1} ≠ p_{ℓ,2} for all ℓ except one (say ℓ = 3), then f(x^t) lies in the interval around x^(3): output 3ε*
• Otherwise, output ⊥
24. Conclusions
• Fairness in the contextual bandit and classic bandit problems
• δ-fair: with probability 1 − δ,
π_{j|h^t} > π_{j′|h^t} only if f_j(x_j^t) > f_{j′}(x_{j′}^t)
Results
• Classic bandits: Θ(k³) rounds are necessary and sufficient to achieve non-trivial regret
• Contextual bandits: a tight relationship with Knows What It Knows (KWIK) learning