DQN break

We present the first deep learning model to
successfully learn control policies directly from
high-dimensional sensory input using
reinforcement learning. The model is a
convolutional neural network, trained with a
variant of Q-learning, whose input is raw pixels
and whose output is a value function estimating
future rewards. We apply our method to seven
Atari 2600 games from the Arcade Learning
Environment, with no adjustment of the
architecture or learning algorithm. We find that it
outperforms all previous approaches on six of the
games and surpasses a human expert on three of
them.

Team members
주찬웅 (Naver)
박민철 (Kookmin University)
이건희 (Seoul National University of Science and Technology)
김상근 (Fast Campus)
김준태 (Daejeon University)

Input -> DQN -> Output

Algorithm
-DDQN (김준태)
-Multi-Step RL (이건희)

Architecture
-Dueling Network (김상근)
-NoisyNet (주찬웅)
separate processes

Memory
-PER (박민철)

Context
-DQN
-Double DQN
-PER
-Dueling
-Multi step
-Noisy Network

DQN
https://arxiv.org/pdf/1312.5602.pdf

How reinforcement learning proceeds

The agent collects episodes through its actions.
The Q-values and the policy are then updated from the collected episodes.
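
The collect-then-update loop above can be sketched with a small table instead of a neural network. This is only an illustration under assumptions: the names `epsilon_greedy` and `q_update`, the learning rate, the discount factor, and the toy numbers are all made up here and do not come from the slides.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q, s, a, r, s_next, done, lr=0.5, gamma=0.9):
    """Apply one Q-learning update from a single collected transition."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += lr * (target - q[s][a])


# Toy table with 2 states and 2 actions (hypothetical values).
q = [[0.0, 0.0], [0.0, 1.0]]
q_update(q, s=0, a=1, r=1.0, s_next=1, done=False)
greedy_action = epsilon_greedy(q[1], epsilon=0.0, rng=random.Random(0))
```

With these numbers the target is 1.0 + 0.9 * 1.0 = 1.9, so `q[0][1]` moves halfway from 0.0 to 0.95.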

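The curse of dimensionality becomes a problem.

CNN
Feature extraction
Q-function learning

If the network generalizes well,
it can predict Q-values accurately even in situations it has never seen.

However, it ends up giving overly high Q-values.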

Double DQN
https://arxiv.org/abs/1509.06461

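As the name suggests, Double DQN uses

the parameters of two networks.

One network is updated quickly and generates the episodes.
The other learns slowly,
which keeps the Q-value of any single action from growing excessively.

The parameters used to compute the target are updated only once every fixed number of iterations.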
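
To make the difference concrete, here is a minimal sketch of the two targets, following the Double DQN paper linked above: the online network chooses the next action and the slowly updated network evaluates it. The function names and the toy Q-values are assumptions for illustration only.

```python
def dqn_target(r, gamma, q_target_next, done):
    """Plain DQN: the same network both selects and evaluates the next action."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return r
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best]


# Hypothetical Q-values of the next state from each network.
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [1.5, 2.0, 4.0]

y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.9, q_online_next, q_target_next, done=False)
```

Here the plain target takes the largest value 4.0, while the Double DQN target uses the value 2.0 of the action the online network prefers, so the second target is smaller.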

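In life we are always weighing priorities.

Which comes first, me or my studies?

DQN learns from experiences drawn from the replay memory.

Since they are drawn at random,

what if we gave priorities to the memories used for learning?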

Prioritized Experience Replay
https://arxiv.org/abs/1511.05952

DQN
Stores experiences in the replay memory and learns from them.

DQN + PER
Assigns priorities to the experiences it should learn from.

How is the priority measured?

TD-error: the difference between the target built from the maximum Q-value of the next state and the currently predicted Q-value.
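
As a small worked example, the TD-error can be computed as below. The function name, the discount factor, and the numbers are assumptions chosen for illustration; the magnitude of the result is what would serve as the priority.

```python
def td_error(r, gamma, q_next, q_current, done=False):
    """TD-error: (reward + discounted max next Q-value) minus the predicted Q-value."""
    target = r if done else r + gamma * max(q_next)
    return target - q_current


# Hypothetical transition: reward 1.0, next-state Q-values [0.5, 2.0],
# and a current prediction of 1.5 for the action that was taken.
delta = td_error(r=1.0, gamma=0.99, q_next=[0.5, 2.0], q_current=1.5)
priority_magnitude = abs(delta)
```

With these values the target is 1.0 + 0.99 * 2.0 = 2.98, so the TD-error is 1.48.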

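Learning from the TD-error alone causes several problems.

First,
experiences that are not selected because their TD-error is low are never learned.

[Figure: experiences with TD-error = 3, 2, 1 are learned in descending order of TD-error.]

[Figure: after the memory is refilled with new experiences, the TD-errors are 5, 5, 3, 2, 1.]

Second,
training only on the selected experiences causes overfitting.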

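P(i) = p_i^α / Σ_k p_k^α

P(i): probability of sampling transition i
p_i: priority of transition i
α: how much prioritization is used

α = 0: uniform random sampling, the same as plain DQN
α = 1: sampling based entirely on TD-error
α = 0.6: a mix of DQN-style random sampling and TD-error prioritization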
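
The effect of the exponent can be checked with a few lines of code. This is a sketch under assumptions: the exponent is called `alpha` here, following the PER paper linked above, and the function name and the two toy priorities are made up.

```python
import random


def sampling_probabilities(priorities, alpha):
    """Turn priorities into sampling probabilities: p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


priorities = [4.0, 1.0]  # hypothetical priorities of two stored transitions

uniform = sampling_probabilities(priorities, alpha=0.0)   # same as random replay
greedy = sampling_probabilities(priorities, alpha=1.0)    # fully priority based
mixed = sampling_probabilities(priorities, alpha=0.5)     # somewhere in between

# Draw a minibatch of indices according to the mixed probabilities.
rng = random.Random(0)
batch = rng.choices(range(len(priorities)), weights=mixed, k=4)
```

With priorities 4 and 1, the probabilities are 0.5 and 0.5 for `alpha=0`, 0.8 and 0.2 for `alpha=1`, and 2/3 and 1/3 for `alpha=0.5`.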

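1. Proportional prioritization
The priority is p_i = |δ_i| + ε, where δ_i is the TD-error of transition i and ε is a constant slightly greater than 0.
ε keeps the priority from becoming 0 when the TD-error is 0.

2. Rank-based prioritization
The priority is p_i = 1 / rank(i), where rank(i) is the position of transition i when the replay memory is sorted by TD-error.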
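
The two priority definitions can be compared side by side. The sketch below follows the definitions in the PER paper linked above; the function names, the value of the small constant, and the toy TD-errors are assumptions.

```python
def proportional_priorities(td_errors, eps=0.01):
    """Proportional variant: |TD-error| plus a small constant so nothing gets 0."""
    return [abs(d) + eps for d in td_errors]


def rank_based_priorities(td_errors):
    """Rank-based variant: 1 / rank, where rank 1 is the largest |TD-error|."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


td_errors = [0.0, -3.0, 1.0]  # hypothetical TD-errors of three transitions

prop = proportional_priorities(td_errors)
ranked = rank_based_priorities(td_errors)
```

The transition with TD-error 0 still gets a priority of 0.01 in the proportional variant and 1/3 in the rank-based one, so it can still be sampled.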

PER updates on a probabilistic basis.

For probabilistic updates to be correct, all samples need to be drawn evenly.

Prioritized replay introduces bias.

Importance sampling weights?
The update uses the weighted TD-error instead of the raw TD-error.
Normalizing the weights as w / max(w) keeps every weight at most 1, so the correction only scales updates down.
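
A minimal sketch of the weights, assuming the form used in the PER paper linked above, w_i = (1 / (N * P(i)))^beta followed by division by the largest weight. The exponent name `beta` and the toy probabilities are assumptions; the slides do not show them.

```python
def importance_weights(probs, beta):
    """Importance sampling weights, normalized so that the largest weight is 1."""
    n = len(probs)
    raw = [(1.0 / (n * p)) ** beta for p in probs]
    top = max(raw)
    return [w / top for w in raw]


probs = [0.8, 0.2]  # hypothetical sampling probabilities of two transitions

full_correction = importance_weights(probs, beta=1.0)
no_correction = importance_weights(probs, beta=0.0)

# The weighted TD-error is what would be used in the update.
td_errors = [2.0, 2.0]
weighted = [w * d for w, d in zip(full_correction, td_errors)]
```

The frequently sampled transition gets weight 0.25 and the rarely sampled one gets weight 1.0, so the same TD-error of 2.0 contributes 0.5 and 2.0 respectively.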

Dueling DQN

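Q(s, a)
How good it is to take a particular action (a) in a particular state (s).

This value can be thought of in terms of two fundamental concepts.

The value function V(s),
which tells how good it is to be in a given state.

The advantage function A(s, a),
which tells how good a particular action is compared with the other actions.
A(s, a) = Q(s, a) - V(s)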
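
Going the other way, a dueling head has to rebuild Q from its V and A outputs. The sketch below assumes the mean-subtraction form from the Dueling Network paper, Q(s, a) = V(s) + A(s, a) - mean(A(s, ·)); that form, the function name, and the numbers are not shown in the slides.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Hypothetical outputs of the two streams for one state with three actions.
q_a = dueling_q(value=2.0, advantages=[1.0, 0.0, -1.0])
# Shifting every advantage by the same constant leaves Q unchanged.
q_b = dueling_q(value=2.0, advantages=[2.0, 1.0, 0.0])
```

Both calls give the Q-values 3.0, 2.0 and 1.0, which shows why the mean is subtracted: without it, V and A could trade a constant back and forth without changing Q.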

How did DeepMind come to split Q into V and A?

They say it came from DeepMind's intuition.

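TD learns from the reward received after taking just one step.
Monte Carlo updates only after the episode has ended.

How should we update to do better?

Just as repeating a three-day resolution over and over eventually adds up to the 10,000-hour rule,

what if we learned from the rewards received over several intermediate steps?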
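
The idea sits between the two extremes and can be written as an n-step return: sum the discounted rewards of the next n steps, then bootstrap from a value estimate. The function name, discount factor, and numbers below are assumptions for illustration.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of the given rewards plus a discounted bootstrap value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# One step with a bootstrap value behaves like TD.
one_step = n_step_return([1.0], gamma=0.5, bootstrap_value=4.0)
# Three steps with a bootstrap value: a multi-step target.
three_step = n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
# All rewards of a finished episode and no bootstrap behaves like Monte Carlo.
monte_carlo = n_step_return([1.0, 1.0, 1.0], gamma=0.5)
```

With these numbers the one-step target is 3.0, the three-step target is 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25, and the Monte Carlo return is 1.75.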

Multi-Step Reinforcement Learning:
A Unifying Algorithm

I'll go through this quickly.

Off-policy learning that needs no importance sampling

While training a DQN, there are times when learning stalls.

NOISY NETWORKS FOR
EXPLORATION

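Action-space noise vs. parameter-space noise

The equation we all know: y = wx + b

In a noisy layer, w and b become random variables,
each built from learnable parameters and noise variables.
The noise variables use factorised Gaussian noise!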

Factorised Gaussian noise
Gaussian variables
Gaussian variables for noise of the outputs
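
A minimal sketch of a noisy linear layer with factorised Gaussian noise, assuming the construction in the Noisy Networks paper: one Gaussian variable per input, one per output, each passed through f(x) = sgn(x) * sqrt(|x|), and the weight noise taken as their outer product. The names `mu_w`, `sigma_w`, `mu_b`, `sigma_b` and all numbers are assumptions for illustration.

```python
import math
import random


def f(x):
    """Scaling function applied to each Gaussian sample: sgn(x) * sqrt(|x|)."""
    return math.copysign(math.sqrt(abs(x)), x)


def factorised_noise(p, q, rng):
    """Build q-by-p weight noise and q bias noise from only p + q samples."""
    eps_in = [f(rng.gauss(0.0, 1.0)) for _ in range(p)]
    eps_out = [f(rng.gauss(0.0, 1.0)) for _ in range(q)]
    eps_w = [[eo * ei for ei in eps_in] for eo in eps_out]
    return eps_w, eps_out


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), element-wise noise."""
    out = []
    for j in range(len(mu_b)):
        acc = mu_b[j] + sigma_b[j] * eps_b[j]
        for i in range(len(x)):
            acc += (mu_w[j][i] + sigma_w[j][i] * eps_w[j][i]) * x[i]
        out.append(acc)
    return out


rng = random.Random(0)
eps_w, eps_b = factorised_noise(p=3, q=2, rng=rng)

mu_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mu_b = [0.5, -0.5]
zeros_w = [[0.0] * 3 for _ in range(2)]
zeros_b = [0.0, 0.0]

# With every sigma set to 0 the layer reduces to the ordinary y = wx + b.
y_plain = noisy_linear([1.0, 2.0, 3.0], mu_w, zeros_w, mu_b, zeros_b, eps_w, eps_b)
```

The point of the factorisation is cost: a layer with p inputs and q outputs needs p + q Gaussian samples instead of p * q + q.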

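OpenAI also published a paper on this.

Parameter Space Noise for Exploration

Gaussian noise is added to the model parameters.

The noise is controlled with a threshold on a distance.
Which distance?

This is how the paper puts it:

the distance between the perturbed and the non-perturbed policy.
Perturbed policy: the policy with noise. Non-perturbed policy: the policy without noise.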
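
One way to read the threshold and the distance is as a rule for adapting the noise scale: grow the noise while the two policies stay close, shrink it once they drift too far apart. The sketch below is an assumption-heavy illustration: it takes the distance to be a KL divergence between softmax policies over Q-values, and the names `adapt_sigma` and `scale`, the factor 1.01, and the toy Q-values are all chosen here rather than taken from the slides.

```python
import math


def softmax(qs):
    """Turn Q-values into action probabilities."""
    m = max(qs)
    exps = [math.exp(q - m) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adapt_sigma(sigma, distance, threshold, scale=1.01):
    """Increase the noise scale if the policies are close, else decrease it."""
    return sigma * scale if distance <= threshold else sigma / scale


# Hypothetical Q-values from the non-perturbed and the perturbed network.
pi_plain = softmax([1.0, 2.0, 0.5])
pi_noisy = softmax([1.1, 1.8, 0.6])

distance = kl_divergence(pi_plain, pi_noisy)
sigma_next = adapt_sigma(sigma=0.1, distance=distance, threshold=0.05)
```

The new scale is then used as the standard deviation of the Gaussian noise added to the parameters on the next perturbation.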

Putting all of the papers above together

DQN
Double DQN
Prioritised Experience Replay
Dueling Network Architecture
Multi-step Returns
Distributional RL
Noisy Network
(ha ha...)

DQN
Double DQN
Prioritised Experience Replay
Dueling Network Architecture
Multi-step Returns
Noisy Network

https://github.com/reinforcement-learning-kr/break_dqn
