
# Reinforcement Learning - DQN


Reinforcement Learning Deep-Q-Network presentation in an AI class at University of Mazandaran for solving the 8 puzzle game.

### Reinforcement Learning - DQN

1. WHY DO WE CREATE AI?
2. TWO TYPES OF ENVIRONMENT: DETERMINISTIC vs. STOCHASTIC. In a deterministic environment, any action that is taken uniquely determines its outcome (examples: the N-puzzle, tic-tac-toe, chess). In a stochastic environment, outcomes involve chance, so the agent uses probabilities to maximize its performance on a task (any game that involves dice is a good example).
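The contrast can be made concrete with two toy transition functions, a minimal sketch of my own (not code from the slides): the deterministic step always yields the same next state, while the stochastic step also depends on chance.

```python
import random

# Deterministic environment: the action alone determines the next state,
# as when sliding a tile in the N-puzzle.
def deterministic_step(state, action):
    return state + action

# Stochastic environment: the outcome also depends on chance,
# as in any game that involves dice.
def stochastic_step(state, action, rng=random):
    return state + action + rng.randint(1, 6)
```

Repeated calls to `deterministic_step` with the same arguments always agree; repeated calls to `stochastic_step` generally do not.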
3. DEFINE THE PROBLEM
4. HOW TO SOLVE THE PROBLEM The environment can be modeled as a graph where each state is a node, edges represent transition actions from one state to another, and edge weights are the received rewards. The agent can then use a graph search algorithm such as A* to find the path with maximum total reward from the initial state. A*: f = g + h
5. HOW TO SOLVE THE PROBLEM (continued) In A*, f = g + h, where g is the number of nodes traversed from the start node to reach the current node, and h is a heuristic estimate: either the number of misplaced tiles (comparing the current state with the goal state) or the sum of the Manhattan distances of the misplaced tiles.
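Both the Manhattan-distance heuristic and the A* loop fit in a few lines of Python; the sketch below is illustrative, with the 3x3 board encoded as a tuple (0 standing for the blank) and function names of my own choosing.

```python
import heapq

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank tile

def manhattan(state):
    """h: sum of Manhattan distances of each tile from its goal position."""
    dist = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = tile - 1  # tile t belongs at index t - 1 in GOAL
        dist += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return dist

def neighbors(state):
    """States reachable by sliding one tile into the blank."""
    idx = state.index(0)
    r, c = divmod(idx, 3)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[idx], s[j] = s[j], s[idx]
            yield tuple(s)

def astar(start):
    """A* with f = g + h; returns the number of moves to reach the goal."""
    frontier = [(manhattan(start), 0, start)]
    best_g = {start: 0}
    while frontier:
        f, g, state = heapq.heappop(frontier)
        if state == GOAL:
            return g
        for nxt in neighbors(state):
            ng = g + 1  # g grows by one per move traversed
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + manhattan(nxt), ng, nxt))
    return None  # unreachable configuration
```

For example, `astar((1, 2, 3, 4, 5, 6, 0, 7, 8))` solves a board that is two slides away from the goal.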
6. IS THERE ANY SOLUTION? Peter Hart, Nils Nilsson, and Bertram Raphael of Stanford Research Institute first published the algorithm in 1968. It can be seen as an extension of Edsger Dijkstra's 1959 algorithm. A* achieves better performance by using heuristics to guide its search, and its performance depends entirely on the estimation (heuristic) function. LET'S JUST DIG DEEPER INTO THE AI...
7. MARKOV DECISION PROCESS FRAMEWORK
8. MARKOV DECISION PROCESS FRAMEWORK A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning.
9. MARKOV DECISION PROCESS FRAMEWORK An MDP consists of a tuple of 5 elements: S: set of states; at each time step the state of the environment is an element s ∈ S. A: set of actions; at each time step the agent chooses an action a ∈ A to perform. p(s_{t+1} | s_t, a_t): state transition model that describes how the environment state changes when the agent performs action a in the current state s. p(r_{t+1} | s_t, a_t): reward model that describes the real-valued reward the agent receives from the environment after performing an action; in an MDP the reward depends on the current state and the action performed. 𝛾: discount factor that controls the importance of future rewards.
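The 5-tuple can be written down directly as plain data structures. The two-state MDP below is a toy of my own invention, purely to show the shape of each element.

```python
# A toy two-state MDP as the 5-tuple (S, A, P, R, gamma).
# All state/action names and numbers here are illustrative.
S = ["s0", "s1"]
A = ["stay", "go"]

# P[s][a] -> {next_state: probability}; each distribution sums to 1.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# R[s][a] -> expected immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.5, "go": 0.0},
}

gamma = 0.9  # discount factor: how much future rewards matter
```

Note the stochastic entry `P["s0"]["go"]`: the action "go" succeeds with probability 0.9 and fails (stays in "s0") with probability 0.1, which is exactly the "partly random, partly controlled" situation the slide describes.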
10. MARKOV DECISION PROCESS FRAMEWORK The way in which the agent chooses which action to perform is called the agent's policy: a function that takes the current environment state and returns an action. The policy is often denoted by the symbol 𝛑.
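In the simplest (deterministic) case, a policy is literally just a function from states to actions; this tiny sketch, with made-up states 0..3 and actions "right"/"stay", is only illustrative.

```python
GOAL_STATE = 3  # illustrative terminal state

def pi(state):
    """A deterministic policy: move right until the goal, then stay."""
    return "right" if state < GOAL_STATE else "stay"
```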
11. INTRODUCTION TO MACHINE LEARNING Machine learning is the study of algorithms and mathematical models that computer systems use to progressively improve their performance on a specific task, in domains such as health care and robotics. Common algorithms: Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN, K-Means, Random Forest, Dimensionality Reduction Algorithms, and Gradient Boosting algorithms (GBM, XGBoost, LightGBM, CatBoost).
12. 3 TYPES OF MACHINE LEARNING ALGORITHMS In supervised learning, we generate a function that maps inputs to desired outputs; the machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. In unsupervised learning, we do not have any target or outcome variable to predict or estimate, as in social-network analysis. The third type, reinforcement learning, is introduced on the next slide.
13. INTRODUCTION TO REINFORCEMENT LEARNING AND ITS ALGORITHMS Q-Learning, SARSA, DQN, DDPG, OpenAI PPO. The MDP is an example of how RL works. Typically, an RL setup is composed of two components: an agent and an environment.
14. INTRODUCTION TO NEURAL NETWORKS A neural network is a computer system modelled on the human brain and nervous system. Nowadays we solve many kinds of problems with this structure, such as PDEs, wave equations, and games (agents).
15. VALUE-ITERATION VS POLICY-ITERATION These are two fundamental methods for solving MDPs. Both value-iteration and policy-iteration assume that the agent knows the MDP model of the world (i.e., the agent knows the state-transition and reward probability functions). Therefore, they can be used by the agent to plan its actions (offline) given knowledge about the environment, before interacting with it.
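Value iteration can be sketched in a few lines once the model is known. The MDP below is a made-up 4-state chain (move left or right, reward 1 for reaching the terminal state), chosen only so the sketch is self-contained.

```python
def value_iteration(n_states=4, gamma=0.9, tol=1e-6):
    """Value iteration on a tiny deterministic chain MDP (illustrative).

    States 0..n_states-1; actions move left (-1) or right (+1), clipped to
    the chain; reward 1.0 only on entering the terminal last state.
    Requires knowing the transition and reward model, as the slide notes.
    """
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states - 1):  # terminal state keeps V = 0
            q_values = []
            for move in (-1, 1):
                nxt = min(max(s + move, 0), n_states - 1)
                reward = 1.0 if nxt == n_states - 1 else 0.0
                # Bellman optimality backup: r + gamma * V(s')
                q_values.append(reward + gamma * V[nxt])
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

On this chain the values come out as powers of gamma: V = [0.81, 0.9, 1.0, 0.0], reflecting how many discounted steps separate each state from the reward.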
16. Q-LEARNING ALGORITHM ON MDP USING THE BELLMAN EQUATION Q-learning does not assume that the agent knows anything about the state-transition and reward models. Instead, the agent discovers which actions are good and bad by trial and error. In Q-learning the agent improves its behavior (online) by learning from the history of its interactions with the environment (the MDP).
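Tabular Q-learning with the Bellman update can be sketched as follows. The environment (a 4-state chain with reward on reaching the last state) and all hyperparameters are illustrative choices of mine; the key point is that the agent only ever sees (s, a, r, s') samples, never the model.

```python
import random

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on an illustrative 4-state chain.

    Start at state 0; actions 0 (left) / 1 (right), clipped to the chain;
    reward 1.0 on entering terminal state 3. Trial and error via
    epsilon-greedy exploration; no transition/reward model is used.
    """
    rng = random.Random(seed)
    n_states, n_actions = 4, 2
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != 3:
            # epsilon-greedy: mostly exploit, sometimes explore
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            nxt = min(max(s + (1 if a == 1 else -1), 0), 3)
            r = 1.0 if nxt == 3 else 0.0
            # Bellman update:
            # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s][a] += alpha * (r + gamma * max(Q[nxt]) - Q[s][a])
            s = nxt
    return Q
```

After training, the greedy policy (argmax over actions) moves right in every non-terminal state, recovering the same values value iteration computes from the full model.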
17. SOLVE THE PROBLEM USING DQN (DRL)
18. SOLVE THE PROBLEM USING DQN (DRL) Although Q-learning is a very powerful algorithm, its main weakness is lack of generality. If you view Q-learning as updating numbers in a two-dimensional array (action space × state space), it in fact resembles dynamic programming. This means that for states the Q-learning agent has not seen before, it has no clue which action to take; in other words, a Q-learning agent cannot estimate values for unseen states. To deal with this problem, DQN gets rid of the two-dimensional array by introducing a neural network. DQN leverages a neural network to estimate the Q-value function: the input to the network is the current state, while the output is the corresponding Q-value for each action. LET'S DIG DEEPER INTO THE CODE...
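The essential move, replacing the table Q[s][a] with a parametric function q(s, a; w), can be shown without a full neural network. Below, a tiny linear model with hand-crafted features stands in for the network (a real DQN learns its features internally); everything here is an illustrative sketch, not the slides' implementation.

```python
# A parametric stand-in for the Q-table: q(s, a; w) = w_a . phi(s).
# Because it is a function of the state, it can output a value even for
# states never visited during training, which a table cannot.

def features(state):
    """Hand-crafted feature vector phi(s); a DQN would learn these."""
    return [1.0, state, state * state]

def q_value(weights, state, action):
    """q(s, a; w): dot product of the action's weight row with phi(s)."""
    return sum(w * f for w, f in zip(weights[action], features(state)))

def td_update(weights, s, a, r, s_next, alpha=0.01, gamma=0.9):
    """Semi-gradient TD step toward the target r + gamma * max_a' q(s',a';w),
    the same Bellman target DQN trains its network against."""
    n_actions = len(weights)
    target = r + gamma * max(q_value(weights, s_next, b) for b in range(n_actions))
    error = target - q_value(weights, s, a)
    for i, f in enumerate(features(s)):
        weights[a][i] += alpha * error * f
```

After updating on a single transition at state 1.0, the model already assigns a nonzero value to the unseen state 3.0: the generalization a Q-table lacks.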
19. REAL WORLD EXAMPLE In 2013, DeepMind applied DQN to Atari games, as illustrated in the figure on this slide. The input is the raw image of the current game situation. It goes through several layers, including convolutional as well as fully connected layers. The output is the Q-value for each of the actions the agent can take.
20. REAL WORLD EXAMPLE AlphaGo combines an advanced tree search with deep neural networks. These neural networks take a description of the Go board as input and process it through 12 different network layers containing millions of neuron-like connections. One neural network, the "policy network," selects the next move to play. The other neural network, the "value network," predicts the winner of the game. We trained the neural networks on 30 million moves from games played by human experts, until it could predict the human move 57 percent of the time (the previous record before AlphaGo was 44 percent). Go is a game of profound complexity: there are about 10^170 possible positions, more than the number of atoms in the universe, and more than a googol times larger than chess.
21. ML IMPLEMENTATION FRAMEWORKS
22. IMPROVEMENTS AND ALTERNATIVES DQN IMPROVEMENTS: fixed Q-targets, double DQNs, dueling DQN (aka DDQN), and Prioritized Experience Replay (aka PER). RL ALTERNATIVE: Evolution Strategies / Deep Neuroevolution as a scalable alternative to reinforcement learning and DQN.
23. THANKS!