2. TWO TYPES OF ENV
DETERMINISTIC: any action that is taken uniquely determines its outcome.
Example: N-puzzle, tic-tac-toe, chess
STOCHASTIC: outcomes are partly random; any game that involves dice is a good example, and the agent uses probabilities to maximize its performance on the task.
4. HOW TO SOLVE THE PROBLEM
The environment can be modeled as a graph where each state is a node, edges represent transition actions from one state to another, and edge weights are the received rewards. Then the agent can use a graph search algorithm such as A* to find the path with maximum total reward from the initial state.
A* : f = g + h
5. HOW TO SOLVE THE PROBLEM
A* : f = g + h
g : the number of nodes traversed from the start node to reach the current node.
h : the number of misplaced tiles, found by comparing the current state with the goal state, or the sum of the Manhattan distances of the misplaced tiles (a sketch follows below).
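A minimal A* sketch in Python for the 8-puzzle under these definitions; the state encoding, the manhattan heuristic, and the neighbors helper are illustrative assumptions, not code from the slides.

import heapq

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the blank tile

def manhattan(state):
    # h: sum of Manhattan distances of misplaced tiles to their goal cells
    return sum(abs(i // 3 - (v - 1) // 3) + abs(i % 3 - (v - 1) % 3)
               for i, v in enumerate(state) if v != 0)

def neighbors(state):
    # slide the blank up, down, left, or right to get successor states
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

def a_star(start):
    # f = g + h, with g = moves so far and h = the Manhattan heuristic
    frontier = [(manhattan(start), 0, start)]
    g_score = {start: 0}
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == GOAL:
            return g  # minimum number of moves
        for nxt in neighbors(state):
            if nxt not in g_score or g + 1 < g_score[nxt]:
                g_score[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + manhattan(nxt), g + 1, nxt))
    return None

print(a_star((1, 2, 3, 4, 5, 0, 7, 8, 6)))  # a state one move from the goal -> 1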
6. IS THERE ANY SOLUTION?
Peter Hart, Nils Nilsson and Bertram Raphael of Stanford Research
Institute first published the algorithm in 1968. It can be seen as an
extension of Edsger Dijkstra's 1959 algorithm. A* achieves better
performance by using heuristics to guide its search, and its performance depends heavily on the quality of the estimation (heuristic) function.
LET'S JUST DIG DEEPER INTO THE AI ...
8. MARKOV DECISION PROCESS FRAMEWORK
A Markov decision process (MDP) is a discrete time stochastic control process. It
provides a mathematical framework for modeling decision making in situations where
outcomes are partly random and partly under the control of a decision maker. MDPs
are useful for studying optimization problems solved via dynamic
programming and reinforcement learning.
9. MARKOV DECISION PROCESS FRAMEWORK
MDP consists of a tuple of 5 elements:
S : Set of states. At each time step the state of the environment is an element s ∈ S.
A : Set of actions. At each time step the agent chooses an action a ∈ A to perform.
p(s_{t+1} | s_t, a_t) : State transition model that describes how the environment state changes when the agent performs an action a in the current state s.
p(r_{t+1} | s_t, a_t) : Reward model that describes the real-valued reward the agent receives from the environment after performing an action. In an MDP the reward depends on the current state and the action performed.
𝛾 : discount factor that controls the importance of future rewards.
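As a rough illustration, the 5-element tuple can be written down as plain Python data; the two-state weather example and all of its numbers below are invented for this sketch.

# (S, A, P, R, gamma) for a tiny invented two-state MDP
S = ["sunny", "rainy"]                        # set of states
A = ["walk", "drive"]                         # set of actions
# state transition model p(s_{t+1} | s_t, a_t) as nested dicts
P = {"sunny": {"walk":  {"sunny": 0.8, "rainy": 0.2},
               "drive": {"sunny": 0.9, "rainy": 0.1}},
     "rainy": {"walk":  {"sunny": 0.3, "rainy": 0.7},
               "drive": {"sunny": 0.5, "rainy": 0.5}}}
# reward model: expected reward for performing action a in state s
R = {"sunny": {"walk": 2.0, "drive": 1.0},
     "rainy": {"walk": -1.0, "drive": 0.5}}
gamma = 0.9                                   # discount factor for future rewards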
10. MARKOV DECISION PROCESS FRAMEWORK
The way the agent chooses which action to perform is called the agent's policy, which is a function that maps the current environment state to an action. The policy is often denoted by the symbol 𝛑.
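Continuing the toy example above, a policy can be as simple as a lookup table from states to actions; the choices below are arbitrary and only for illustration.

# A deterministic policy pi : S -> A, here just a lookup table
pi = {"sunny": "walk", "rainy": "drive"}

def act(state):
    # the agent consults its policy to pick the next action
    return pi[state]

print(act("rainy"))  # -> drive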
11. INTRODUCTION TO MACHINE LEARNING
Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
K-Means
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms
GBM
XGBoost
LightGBM
CatBoost
Machine learning is the study of algorithms and mathematical models that computer systems use to progressively improve their performance on a specific task, with applications in areas such as health care and robotics.
12. 3 TYPES OF MACHINE LEARNING ALGORITHMS
Supervised learning: we generate a function that maps inputs to desired outputs. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions.
Unsupervised learning: we do not have any target or outcome variable to predict or estimate, e.g. discovering groups of users in social networks.
Reinforcement learning, the third type, is introduced on the next slide.
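A hedged sketch of the first two settings with scikit-learn (assuming it is installed); the toy data is made up.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: inputs X come with target labels y, and we fit a mapping X -> y
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))        # predicted label for an unseen input

# Unsupervised: no target variable, we only look for structure (clusters)
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                  # cluster assignment for each sample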
13. INTRODUCTION TO REINFORCEMENT LEARNING AND ITS ALGORITHMS
Q-Learning
SARSA
DQN
DDPG
OPENAI PPO
AN MDP IS THE FORMAL MODEL OF HOW RL WORKS
Typically, an RL setup is composed of two components: an agent and an environment.
14. INTRODUCTION TO NEURAL NETWORK
A neural network is a computer system modelled on the human brain and nervous system.
Nowadays this structure is used to tackle all kinds of problems, such as PDEs, wave equations, and game-playing agents.
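A minimal sketch of that structure: a two-layer feed-forward network in NumPy with made-up layer sizes and random weights, just to show how an input flows through layers of neuron-like units.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)   # hidden layer -> output layer

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU activations of the hidden "neurons"
    return h @ W2 + b2                 # raw output scores

print(forward(np.ones(4)))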
15. VALUE-ITERATION VS POLICY-ITERATION
These are two fundamental methods for solving MDPs. Both value-iteration and
policy-iteration assume that the agent knows the MDP model of the world (i.e. the
agent knows the state-transition and reward probability functions). Therefore, they
can be used by the agent to (offline) plan its actions given knowledge about the
environment before interacting with it.
In short, both value-iteration and policy-iteration are offline planning algorithms: the agent is assumed to have prior knowledge about the effects of its actions on the environment (the MDP model is known).
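A minimal value-iteration sketch on a toy two-state MDP (the states, names, and numbers are invented for illustration). Policy-iteration would instead alternate policy evaluation with greedy policy improvement; here the greedy policy is simply extracted at the end.

# Toy two-state MDP (invented numbers, same shape as the MDP slide)
S, A, gamma = ["sunny", "rainy"], ["walk", "drive"], 0.9
P = {"sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "drive": {"sunny": 0.9, "rainy": 0.1}},
     "rainy": {"walk": {"sunny": 0.3, "rainy": 0.7}, "drive": {"sunny": 0.5, "rainy": 0.5}}}
R = {"sunny": {"walk": 2.0, "drive": 1.0}, "rainy": {"walk": -1.0, "drive": 0.5}}

# Value iteration: repeat the Bellman optimality backup
# V(s) <- max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) * V(s') ]
V = {s: 0.0 for s in S}
for _ in range(200):
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()) for a in A)
         for s in S}

# Greedy policy extracted from the converged values
pi_star = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
           for s in S}
print(V, pi_star)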
16. Q-LEARNING ALGORITHM ON MDP
USING BELLMAN EQUATION
Q-learning does not assume that the agent knows anything about the state-transition and reward models. Instead, the agent discovers which actions are good and bad by trial and error.
In Q-learning the agent improves its behavior (online) by learning from the history of its interactions with the environment (the MDP).
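A minimal tabular Q-learning sketch on a toy MDP (all names and numbers are invented). The key line is the Bellman-style update that moves Q(s,a) toward r + gamma * max_a' Q(s',a').

import random

# Toy stochastic MDP with invented numbers (same shape as the MDP slide)
S, A = ["sunny", "rainy"], ["walk", "drive"]
gamma, alpha, eps = 0.9, 0.1, 0.1
P = {"sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "drive": {"sunny": 0.9, "rainy": 0.1}},
     "rainy": {"walk": {"sunny": 0.3, "rainy": 0.7}, "drive": {"sunny": 0.5, "rainy": 0.5}}}
R = {"sunny": {"walk": 2.0, "drive": 1.0}, "rainy": {"walk": -1.0, "drive": 0.5}}

Q = {s: {a: 0.0 for a in A} for s in S}   # the (state x action) Q-table
s = "sunny"
for _ in range(20000):
    # epsilon-greedy: usually exploit the current Q-table, sometimes explore
    a = random.choice(A) if random.random() < eps else max(A, key=lambda x: Q[s][x])
    r = R[s][a]                                                            # observed reward
    s2 = random.choices(list(P[s][a]), weights=list(P[s][a].values()))[0]  # sampled next state
    # Bellman-style update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
    s = s2

print(Q)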
18. SOLVE THE PROBLEM USING DQN(DRL)
Although Q-learning is a very powerful algorithm, its main weakness is lack of generality. If you
view Q-learning as updating numbers in a two-dimensional array (Action Space * State Space),
it, in fact, resembles dynamic programming. This indicates that for states that the Q-learning
agent has not seen before, it has no clue which action to take. In other words, the Q-learning agent does not have the ability to estimate values for unseen states. To deal with this problem, DQN gets rid of the two-dimensional array by introducing a neural network.
DQN leverages a neural network to estimate the Q-value function. The input to the network is the current state, while the output is the corresponding Q-value for each of the actions.
LET'S DIG DEEPER INTO THE CODE...
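A minimal sketch of the DQN idea with PyTorch, assuming it is available; the layer sizes, hyperparameters, and the dummy transition are invented. A full DQN additionally uses experience replay and a fixed target network, as listed on the improvements slide later.

import torch
import torch.nn as nn

n_state, n_action, gamma = 4, 2, 0.99   # made-up sizes for the sketch

# Q-network: replaces the (state x action) table with a function approximator
q_net = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_action))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(s, a, r, s_next, done):
    # one gradient step toward the Bellman target r + gamma * max_a' Q(s', a')
    q_sa = q_net(s)[a]
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max() * (1.0 - done)
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# dummy transition, only to show the call shape
td_update(torch.rand(n_state), a=0, r=1.0, s_next=torch.rand(n_state), done=0.0)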
19. REAL WORLD EXAMPLE
In 2013, DeepMind applied DQN to Atari games, as illustrated in the figure above. The input is the raw image of the current game situation. It passes through several layers, including convolutional layers as well as fully connected layers. The output is the Q-value for each of the actions the agent can take.
20. REAL WORLD EXAMPLE
AlphaGo combines an advanced tree search with deep neural networks. These
neural networks take a description of the Go board as an input and process it
through 12 different network layers containing millions of neuron-like connections.
One neural network, the “policy network,” selects the next move to play. The other
neural network, the “value network,” predicts the winner of the game.
DeepMind trained the neural networks on 30 million moves from games played by human experts, until the policy network could predict the human move 57 percent of the time (the previous record before AlphaGo was 44 percent).
Go is a game of profound complexity. There are roughly 10^170 possible positions, more than the number of atoms in the universe and more than a googol times larger than chess.
22. IMPROVEMENTS AND ALTERNATIVES
DQN IMPROVEMENTS
fixed Q-targets
double DQNs
dueling DQN (aka DDQN)
Prioritized Experience Replay (aka PER)
RL ALTERNATIVE
Evolution Strategies / Deep Neuroevolution as a Scalable
Alternative to Reinforcement Learning and DQN