Maxmin Q-learning: Controlling the Estimation Bias of Q-learning

Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Nächste SlideShare
Mdp
Mdp
Wird geladen in …3
×

Hier ansehen

1 von 9 Anzeige

Weitere Verwandte Inhalte

Diashows für Sie (20)

Ähnlich wie Maxmin qlearning controlling the estimation bias of qlearning (20)

Anzeige

Weitere von HyunKyu Jeon (20)

Aktuellste (20)

Anzeige

Maxmin qlearning controlling the estimation bias of qlearning

1. Maxmin Q-learning: Controlling the Estimation Bias of Q-learning (published as a conference paper at ICLR 2020, Qingfeng Lan et al.)
2. Maxmin Q-Learning

1. Definition of Q-learning
Q-learning is a model-free reinforcement learning algorithm that, in a finite MDP, learns the optimal policy telling the agent which action is good in a given state.
In Q-learning, the action taken in a state is the one with the maximum Q-value among the available actions, and that value is updated using the maximum Q-value over the actions available in the next state.

2. Procedure of Q-learning
Optimal policy: Q*(s, a) = E[ R_{t+1} + γ max_{a ∈ A} Q*(S_{t+1}, a) | S_t = s, A_t = a ], where R_t, S_t, A_t are the reward, state, and action at time t, respectively.
- Initialize the Q-table with arbitrary values.
- Select actions with an ε-greedy policy.
- Update the Q-value of the current state-action pair using the maximum Q-value over the next state's actions.
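To make the procedure concrete, here is a minimal tabular Q-learning sketch in Python. It is an illustrative sketch, not the paper's code: the environment interface (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) and all hyperparameters are assumptions.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Minimal tabular Q-learning sketch (hyperparameters are illustrative)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))   # Q-table initialized arbitrarily (here: zeros)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target uses the max over next-state action values
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```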
3. Maxmin Q-Learning

3. The Problem with Q-learning
One problem with Q-learning is that its action-value estimates are prone to overestimation, which stems from the max operator in the update target:
Q(S_t, A_t) ← Q(S_t, A_t) + α (Y_t^Q - Q(S_t, A_t)),  where  Y_t^Q := r_{t+1} + γ max_{a ∈ A} Q(S_{t+1}, a).
Even when the estimates are unbiased for every action, i.e. Q(S_{t+1}, a) = Q*(S_{t+1}, a) + e_a with zero-mean noise e_a (e.g. from stochasticity), so that E[Q(S_{t+1}, a)] = Q*(S_{t+1}, a), the max is biased upward:
max_{a ∈ A} Q(S_{t+1}, a) ≥ Q(S_{t+1}, a) for every a, hence
E[ max_{a ∈ A} Q(S_{t+1}, a) ] ≥ max_{a ∈ A} E[ Q(S_{t+1}, a) ] = max_{a ∈ A} Q*(S_{t+1}, a).
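A small numeric illustration of the inequality above. This is a sketch with made-up numbers (the true action values and the noise scale are arbitrary), only meant to show that E[max_a Q] exceeds max_a Q* even though each per-action estimate is unbiased.

```python
import numpy as np

# Monte Carlo illustration of the overestimation bias: even when every Q(s', a)
# is an unbiased estimate of Q*(s', a), E[max_a Q(s', a)] >= max_a Q*(s', a).
rng = np.random.default_rng(0)
q_star = np.zeros(4)              # true action values, all equal (illustrative)
noise_scale = 1.0                 # e_a ~ U(-1, 1), zero mean (illustrative)
samples = q_star + rng.uniform(-noise_scale, noise_scale, size=(100_000, q_star.size))
print("max_a Q*(s',a)        =", q_star.max())                 # 0.0
print("E[max_a Q(s',a)] (MC) =", samples.max(axis=1).mean())   # clearly > 0
```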
4. Maxmin Q-Learning

1. Definition of Double Q-learning
Double Q-learning is an algorithm that selects and updates actions using two separate Q-functions, in order to address the overestimation problem of standard Q-learning.

2. Structure of Double Q-learning
Actions are chosen based on both Q-functions, and each Q-function is updated toward a target built from the other Q-function's value at the next state.

Algorithm 1: Double Q-learning
  Initialize Q^A, Q^B, s
  repeat
    Choose a based on Q^A(s, ·) and Q^B(s, ·), observe r, s'
    Choose (e.g. randomly) either UPDATE(A) or UPDATE(B)
    if UPDATE(A) then
      Define a* = argmax_a Q^A(s', a)
      Q^A(s, a) ← Q^A(s, a) + α(s, a) (r + γ Q^B(s', a*) - Q^A(s, a))
    else if UPDATE(B) then
      Define b* = argmax_a Q^B(s', a)
      Q^B(s, a) ← Q^B(s, a) + α(s, a) (r + γ Q^A(s', b*) - Q^B(s, a))
    end if
    s ← s'
  until end

Annotations:
- Create and initialize two separate Q-functions (or networks).
- Choose an action based on the two Qs (the two estimates can be combined by, e.g., sum, average, or expectation), obtain the reward and the next state.
- If Q^A is updated, the target uses Q^B's value at the next state; if Q^B is updated, the target uses Q^A's value at the next state. Which one to update is chosen, e.g., at random.

Lemma 2 (used to prove convergence of Double Q-learning). Consider a stochastic process (ζ_t, Δ_t, F_t), t ≥ 0, where ζ_t, Δ_t, F_t : X → R satisfy
Δ_{t+1}(x_t) = (1 - ζ_t(x_t)) Δ_t(x_t) + ζ_t(x_t) F_t(x_t),
where x_t ∈ X and t = 0, 1, 2, .... Let P_t be a sequence of increasing σ-fields such that ζ_0 and Δ_0 are P_0-measurable and ζ_t, Δ_t and F_{t-1} are P_t-measurable, t = 1, 2, .... Assume that the following hold: 1) the set X is finite; 2) ζ_t(x_t) ∈ [0, 1], Σ_t ζ_t(x_t) = ∞, Σ_t (ζ_t(x_t))² < ∞ w.p.1, and ∀x ≠ x_t: ζ_t(x) = 0; 3) ||E{F_t | P_t}|| ≤ κ||Δ_t|| + c_t, where κ ∈ [0, 1) and c_t converges to zero w.p.1.

Summary: Q-learning suffers from the overestimation problem; Double Q-learning tries to resolve it by underestimating instead.
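For comparison with the Q-learning sketch above, here is a minimal tabular Double Q-learning sketch following the listing above. The environment interface and hyperparameters are the same illustrative assumptions as before; combining the two estimates by summation for action selection is just one of the options mentioned in the annotations, not a prescription from the paper.

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, episodes=500,
                      alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Minimal tabular Double Q-learning sketch (illustrative hyperparameters)."""
    rng = np.random.default_rng(seed)
    QA = np.zeros((n_states, n_actions))
    QB = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # act epsilon-greedily on a combination of the two estimates (here: the sum)
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(QA[s] + QB[s]))
            s_next, r, done = env.step(a)
            if rng.random() < 0.5:                       # randomly pick which table to update
                a_star = int(np.argmax(QA[s_next]))      # argmax chosen by QA ...
                target = r + (0.0 if done else gamma * QB[s_next, a_star])  # ... evaluated by QB
                QA[s, a] += alpha * (target - QA[s, a])
            else:
                b_star = int(np.argmax(QB[s_next]))
                target = r + (0.0 if done else gamma * QA[s_next, b_star])
                QB[s, a] += alpha * (target - QB[s, a])
            s = s_next
    return QA, QB
```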
5. Maxmin Q-Learning

3. Overestimation Bias Helps and Hurts
Overestimation bias in Q-value estimation is neither always harmful nor always beneficial; depending on the characteristics of the task environment, it sometimes helps and sometimes hurts.

Experimental environment: a simple episodic MDP (Sutton & Barto, 2018) with states A and B and two terminal states. The episode starts at A. Taking Right from A terminates with r = 0; taking Left from A moves to B with r = 0; from B, each of the 8 actions terminates with reward drawn from μ + U(-1, 1). When μ > 0, Left is the optimal action at A; when μ < 0, Right is optimal.

Figure 2 (from the paper): comparison of three algorithms on the simple MDP of Figure 1 with different values of μ, and thus different expected rewards. For μ = +0.1 (a), the optimal ε-greedy policy takes the Left action with 95% probability; for μ = -0.1 (b), it takes Left with 5% probability. The reported distance is the absolute difference between the learned policy's probability of taking Left and that of the optimal ε-greedy policy. All results were averaged over 5,000 runs.

Experimental results:
- (a) When μ > 0, the optimal policy takes Left with 95% probability (because of ε-greedy exploration).
- (b) When μ < 0, the optimal policy takes Left with 5% probability.
- The distance on the y-axis is the absolute difference between each model's probability of taking Left and the optimal policy's.
- In case (a), the more a model overestimates, the closer it gets to the optimal policy (overestimation helps).
- In case (b), the more a model underestimates, the closer it gets to the optimal policy (underestimation helps).
- Q-learning tends to overestimate and Double Q-learning tends to underestimate; both are biased.
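A sketch of the simple episodic MDP described above, using the same assumed environment interface as the earlier sketches. The class name, state encoding, and action indices are illustrative choices, not taken from the paper's code.

```python
import numpy as np

class SimpleMDP:
    """Sketch of the simple episodic MDP of Figure 1 (Sutton & Barto, 2018 style).

    State A (=0): action 0 = Left -> state B, r = 0; action 1 = Right -> terminal, r = 0.
    State B (=1): 8 actions, each terminating with r ~ mu + U(-1, 1).
    Class name and interface are illustrative, not from the paper's code.
    """
    A, B = 0, 1

    def __init__(self, mu=0.1, seed=0):
        self.mu = mu
        self.rng = np.random.default_rng(seed)
        self.state = self.A

    def reset(self):
        self.state = self.A
        return self.state

    def step(self, action):
        if self.state == self.A:
            if action == 0:                    # Left
                self.state = self.B
                return self.state, 0.0, False
            return self.state, 0.0, True       # Right: terminate with r = 0
        # state B: any of the 8 actions terminates with a noisy reward
        r = self.mu + self.rng.uniform(-1.0, 1.0)
        return self.state, r, True
```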
6. Maxmin Q-Learning

(Excerpt from the Double Q-learning paper, continued.) We use this lemma to prove convergence of Double Q-learning under conditions similar to those for Q-learning:

Theorem 1. Assume the conditions below are fulfilled. Then, in a given ergodic MDP, both Q^A and Q^B as updated by Double Q-learning (Algorithm 1) converge to the optimal value function Q* of the Bellman optimality equation with probability one, provided an infinite number of experiences (rewards and state transitions) for each state-action pair are given by a proper learning policy. The additional conditions are: 1) the MDP is finite, i.e. |S × A| < ∞; 2) γ ∈ [0, 1); 3) the Q-values are stored in a lookup table; 4) both Q^A and Q^B receive an infinite number of updates; 5) α_t(s, a) ∈ [0, 1], Σ_t α_t(s, a) = ∞, Σ_t (α_t(s, a))² < ∞ w.p.1, and ∀(s, a) ≠ (s_t, a_t): α_t(s, a) = 0; 6) ∀s, a, s': Var{R^s'_sa} < ∞. A "proper" learning policy ensures that each state-action pair is visited an infinite number of times; for instance, in a communicating MDP a random policy is proper.

Sketch of the proof. By the symmetry of the updates of Q^A and Q^B it suffices to show convergence for one of them. Apply Lemma 2 with P_t = {Q^A_0, Q^B_0, s_0, a_0, α_0, r_1, s_1, ..., s_t, a_t}, X = S × A, Δ_t = Q^A_t - Q*, ζ = α, and F_t(s_t, a_t) = r_t + γ Q^B_t(s_{t+1}, a*) - Q*(s_t, a_t), where a* = argmax_a Q^A(s_{t+1}, a). The first two conditions of the lemma clearly hold, and the fourth follows from the bounded variance of the rewards. It remains to show the third condition on the expected contraction of F_t. Write F_t(s_t, a_t) = F^Q_t(s_t, a_t) + γ (Q^B_t(s_{t+1}, a*) - Q^A_t(s_{t+1}, a*)), where F^Q_t = r_t + γ Q^A_t(s_{t+1}, a*) - Q*(s_t, a_t) is the value F_t would take under normal Q-learning. It is well known that E{F^Q_t | P_t} ≤ γ||Δ_t||, so identify c_t = γ Q^B_t(s_{t+1}, a*) - γ Q^A_t(s_{t+1}, a*); it then suffices to show that Δ^BA_t = Q^B_t - Q^A_t converges to zero. Depending on whether Q^B or Q^A is updated, the update of Δ^BA_t at time t is either
Δ^BA_{t+1}(s_t, a_t) = Δ^BA_t(s_t, a_t) + α_t(s_t, a_t) F^B_t(s_t, a_t), or
Δ^BA_{t+1}(s_t, a_t) = Δ^BA_t(s_t, a_t) - α_t(s_t, a_t) F^A_t(s_t, a_t).

1. Definition of Maxmin Q-learning
Maxmin Q-learning is a Q-learning model that, to improve on both the overestimation bias of standard Q-learning and the underestimation bias of Double Q-learning, maintains several Q-functions and updates an arbitrary subset of them using a target built from the element-wise minimum of those Q-functions.

2. Structure of Maxmin Q-learning
In Maxmin Q-learning, the action estimate is the maximum, over actions, of Q_min, where Q_min(s, a) is the minimum of the N Q-values for (s, a); Q_min is also used to build the update target.
- Create and initialize N action-value functions.
- Build Q_min from the minimum of the N Q-values for each state-action pair (a reduce-min over the estimators).
- Select a random subset S of {1, ..., N}.
- Compute the target Y using the max over actions of Q_min, and update the Q-values of the estimators in S toward this target.

Algorithm 1: Maxmin Q-learning
  Input: step size α, exploration parameter ε > 0, number of action-value functions N
  Initialize N action-value functions {Q^1, ..., Q^N} randomly
  Initialize empty replay buffer D
  Observe initial state s
  while the agent is interacting with the environment do
    Q_min(s, a) ← min_{k ∈ {1,...,N}} Q^k(s, a), ∀a ∈ A
    Choose action a ε-greedily based on Q_min
    Take action a, observe r, s'
    Store transition (s, a, r, s') in D
    Select a subset S of {1, ..., N} (e.g., randomly select one i to update)
    for i ∈ S do
      Sample a random mini-batch of transitions (s_D, a_D, r_D, s'_D) from D
      Get update target: Y^MQ ← r_D + γ max_{a' ∈ A} Q_min(s'_D, a')
      Update action-value Q^i: Q^i(s_D, a_D) ← Q^i(s_D, a_D) + α [Y^MQ - Q^i(s_D, a_D)]
    end for
    s ← s'
  end while
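A minimal tabular sketch of the Maxmin Q-learning algorithm above, under the same assumed environment interface as the earlier sketches. The tabular setting, the hyperparameters, and the choice of updating a single randomly selected estimator per step are illustrative; the paper's Mountain Car and benchmark experiments use tile coding and DQN-style networks, respectively.

```python
import numpy as np
from collections import deque

def maxmin_q_learning(env, n_states, n_actions, N=4, episodes=500,
                      alpha=0.1, gamma=0.99, epsilon=0.1,
                      buffer_size=10_000, batch_size=1, seed=0):
    """Minimal tabular Maxmin Q-learning sketch (illustrative hyperparameters)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N, n_states, n_actions))        # N action-value functions
    D = deque(maxlen=buffer_size)                 # replay buffer
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q_min = Q[:, s, :].min(axis=0)        # Q_min(s, .) = min over the N estimators
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(q_min))
            s_next, r, done = env.step(a)
            D.append((s, a, r, s_next, done))
            i = int(rng.integers(N))              # subset S: one randomly chosen estimator
            idxs = rng.choice(len(D), size=min(batch_size, len(D)), replace=False)
            for j in idxs:
                sd, ad, rd, sd_next, dd = D[j]
                q_min_next = Q[:, sd_next, :].min(axis=0)
                target = rd + (0.0 if dd else gamma * q_min_next.max())
                Q[i, sd, ad] += alpha * (target - Q[i, sd, ad])
            s = s_next
    return Q
```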
7. Maxmin Q-Learning

(Excerpt from the paper.) The idea is to maintain N estimates of the action values, Q^i, and use the minimum of these estimates in the Q-learning target: max_a' min_{i ∈ {1,...,N}} Q^i(s', a'). For N = 1 the update is simply Q-learning, and so likely has overestimation bias. As N increases, the overestimation decreases; for some N > 1, this maxmin estimator switches from an overestimate, in expectation, to an underestimate. Maxmin Q-learning uses a different mechanism to reduce overestimation bias than Double Q-learning; Maxmin Q-learning with N = 2 is not Double Q-learning. The full algorithm is summarized in Algorithm 1 and is a simple modification of Q-learning with experience replay: random subsamples of the observed data are used for each of the N estimators, to make them nearly independent. To train online, a replay buffer is kept; on each step a random estimator i is chosen and updated using a mini-batch from the buffer. Multiple such updates can be performed per step, just as in experience replay, so several estimators can be updated per step using different random mini-batches; in the experiments, to better match DQN, one update per step is used. Finally, it is straightforward to incorporate target networks to get Maxmin DQN, by maintaining a target network for each estimator.

To characterize the relation between the number of action-value functions and the estimation bias, write Q^i_sa for Q^i(s, a) and assume each Q^i_sa has a random approximation error e^i_sa: Q^i_sa = Q*_sa + e^i_sa, with e^i_sa ~ U(-τ, τ) for some τ > 0. The uniform-noise assumption was used by Thrun & Schwartz (1993) to demonstrate bias in Q-learning, and reflects that non-negligible positive and negative errors are possible. For N estimators sharing n_sa samples, τ is proportional to some function of n_sa/N. With M the number of actions applicable at state s', the estimation bias Z_MN for a transition (s, a, r, s') is
Z_MN := (r + γ max_a' Q_min(s', a')) - (r + γ max_a' Q*(s', a')) = γ (max_a' Q_min(s', a') - max_a' Q*(s', a')).

3. Characteristics of Maxmin Q-learning
Maxmin Q-learning is less biased than the other Q-learning variants, its bias can be partly controlled, and it is robust to the variance of the reward.

1) Maxmin Q-learning is less biased than Q-learning and Double Q-learning.
Figure 2 (simple MDP of Figure 1, repeated): (a) μ = +0.1, overestimation helps; (b) μ = -0.1, underestimation helps.
- Where overestimation helps, the degree of bias appears to be controllable through the number of action-value functions N (a Monte Carlo sketch follows below).
- Where underestimation helps, no clear difference across N is visible, but Maxmin Q-learning still appears less biased than Q-learning and Double Q-learning.

2) Maxmin Q-learning is robust to increasing reward variance.
Figure 3: comparison of four algorithms on Mountain Car under different reward variances. The lines in (a) show the average number of steps taken in the last episode, with one standard error; the lines in (b) show the number of steps to reach the goal position during training when the reward variance is σ² = 10. All results were averaged across 100 runs, with standard errors.
- Environment: Mountain Car (Gym).
- (a): average number of steps in the last episode as the reward variance gradually increases.
- (b): number of steps to reach the goal over training episodes when the reward mean is -1 and the variance is 10.
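A Monte Carlo sketch of how the number of estimators N controls the estimation bias, under the uniform-noise model from the excerpt above. The number of actions M, the noise scale τ, the discount γ, and the assumption that all true action values are equal are illustrative choices, not the paper's settings.

```python
import numpy as np

# Monte Carlo sketch of how the maxmin bias E[Z_MN] changes with N, under the
# noise model Q^i_{s'a} = Q*_{s'a} + e^i_{s'a}, e ~ U(-tau, tau).
rng = np.random.default_rng(0)
M, tau, gamma, trials = 8, 1.0, 1.0, 100_000
q_star = np.zeros(M)                              # equal true action values (illustrative)
for N in (1, 2, 4, 8):
    e = rng.uniform(-tau, tau, size=(trials, N, M))
    q_min = (q_star + e).min(axis=1)              # Q_min(s', .) over the N estimators
    bias = gamma * (q_min.max(axis=1) - q_star.max())
    print(f"N = {N}: estimated E[Z_MN] = {bias.mean():+.3f}")
# For M = 8 the bias is positive (overestimation) at N = 1 and decreases with N,
# eventually crossing into underestimation for larger N.
```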
8. Maxmin Q-Learning

4. Experimental Results in Various Environments
Figure 4: learning curves on the seven benchmark environments: (a) Catcher, (b) Lunarlander, (c) Space Invaders, (d) Breakout, (e) Pixelcopter, (f) Asterix, (g) Seaquest, plus (h) Pixelcopter and (i) Asterix with varying N. The depicted return is averaged over the last 100 episodes, and the curves are smoothed using an exponential average, to match previously reported results (Young & Tian, 2019). The results were averaged over 20 runs, with the shaded area representing one standard error. Plots (h) and (i) show the performance of Maxmin DQN on Pixelcopter and Asterix with different N, highlighting that larger N leads to slower early learning but better final performance in both environments.
9. Maxmin Q-Learning

5. Proofs for Maxmin Q-Learning
Recall the error model Q^i_sa = Q*_sa + e^i_sa with e^i_sa ~ U(-τ, τ) (Thrun & Schwartz, 1993), the estimation bias Z_MN = γ (max_a' Q_min(s', a') - max_a' Q*(s', a')) defined above, and that M is the number of actions applicable at state s'.

Theorem 1. Under the conditions stated above,
E[Z_MN] = γτ (1 - 2 t_MN),  where  t_MN := (M / (M + 1/N)) · ((M-1) / ((M-1) + 1/N)) · ... · (1 / (1 + 1/N)).
E[Z_MN] decreases monotonically in N, from E[Z_{M,N=1}] = γτ (M-1)/(M+1) down to E[Z_{M,N→∞}] = -γτ. In particular, Maxmin Q-learning is less prone to overestimation than Q-learning.

Corollary 1. Let σ² be the variance of the samples for (s, a), so that the estimator using all n_sa samples for a single estimate has Var[Q_sa] = σ²/n_sa. Assuming the samples are evenly allocated among the N estimators, then τ = sqrt(3 N σ² / n_sa) and
Var[Q_min_sa] = 12 N² / ((N+1)² (N+2)) · Var[Q_sa].
Under this uniform random noise assumption, for N ≥ 8, Var[Q_min_sa] < Var[Q_sa].

Lemma 1. Let X_1, X_2, ..., X_N be N i.i.d. random variables from an absolutely continuous distribution with probability density function f(x) and cumulative distribution function F(x). Denote μ := E[X_i] and σ² := Var[X_i] < +∞, and set X_{1:N} := min_{i ∈ {1,...,N}} X_i and X_{N:N} := max_{i ∈ {1,...,N}} X_i. Denote the PDF and CDF of X_{1:N} by f_{1:N}(x) and F_{1:N}(x), respectively. Then:
(i) μ - (N-1)σ/√(2N-1) ≤ E[X_{1:N}] ≤ μ, and E[X_{1:N+1}] ≤ E[X_{1:N}].
(ii) F_{1:N}(x) = 1 - (1 - F(x))^N and f_{1:N}(x) = N f(x) (1 - F(x))^(N-1).
(iii) F_{N:N}(x) = (F(x))^N and f_{N:N}(x) = N f(x) (F(x))^(N-1).
(iv) If X_1, ..., X_N ~ U(-τ, τ), then Var(X_{1:N}) = 4 N τ² / ((N+1)² (N+2)), and Var(X_{1:N+1}) < Var(X_{1:N}) ≤ Var(X_{1:1}) = σ² for all N ∈ {1, 2, ...}.

(Excerpt from the paper, Section 5, Experiments.) Robustness to reward variance is first investigated in Mountain Car, a simple environment that allows more exhaustive experiments, before evaluating performance in seven benchmark environments.
Robustness under increasing reward variance in Mountain Car. Mountain Car (Sutton & Barto, 2018) is a classic reinforcement-learning testbed in which the agent receives a reward of -1 per step with γ = 1 until the car reaches the goal position and the episode ends. In the experiment, the rewards are made stochastic with the same mean value: on each time step the reward signal is sampled from a Gaussian distribution N(-1, σ²). An agent should learn to reach the goal position in as few steps as possible.
Setup: each algorithm was trained for 1,000 episodes, and the number of steps to reach the goal position in the last training episode was used as the performance measure (the fewer steps, the better). All results were averaged over 100 runs. The key algorithm settings were the function approximator, step sizes, exploration parameter, and replay buffer size. All algorithms used ε-greedy with ε = 0.1 and a buffer size of 100. For each algorithm, the best step size was chosen from {0.005, 0.01, 0.02, 0.04, 0.08}, separately for each reward setting. Tile coding was used to approximate the action-value function, with 8 tilings and each tile covering 1/8th of the bounded distance in each dimension. For Maxmin Q-learning, one action-value function was randomly chosen to update at each step.
As shown in Figure 3, when the reward variance is small, the performance of Q-learning, Double Q-learning, Averaged Q-learning, and Maxmin Q-learning is comparable. As the variance increases, however, Q-learning, Double Q-learning, and Averaged Q-learning become much less stable than Maxmin Q-learning. When the variance is very high (σ = 50, see Appendix C.2 of the paper), Q-learning and Averaged Q-learning fail to reach the goal position within 5,000 steps, and Double Q-learning produces runs of more than 400 steps even after many episodes.
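A Monte Carlo sanity check of the closed forms above (Theorem 1's expected bias and Lemma 1 (iv)'s variance of the minimum). The parameter values and the assumption of equal true action values are illustrative, not from the paper.

```python
import numpy as np

# Checks: Theorem 1's E[Z_MN] = gamma * tau * (1 - 2 * t_MN) and
# Lemma 1 (iv)'s Var(X_{1:N}) = 4 * N * tau^2 / ((N + 1)^2 * (N + 2)).
rng = np.random.default_rng(0)
M, N, tau, gamma, trials = 8, 4, 1.0, 1.0, 200_000   # illustrative parameter values

t_MN = np.prod([i / (i + 1.0 / N) for i in range(1, M + 1)])
bias_closed = gamma * tau * (1.0 - 2.0 * t_MN)

e = rng.uniform(-tau, tau, size=(trials, N, M))
bias_mc = gamma * e.min(axis=1).max(axis=1).mean()   # Q* taken as 0 for all actions
print(f"E[Z_MN]:    closed form {bias_closed:+.4f} vs Monte Carlo {bias_mc:+.4f}")

x = rng.uniform(-tau, tau, size=(trials, N))
var_closed = 4.0 * N * tau**2 / ((N + 1) ** 2 * (N + 2))
var_mc = x.min(axis=1).var()
print(f"Var(X_1:N): closed form {var_closed:.4f} vs Monte Carlo {var_mc:.4f}")
```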
