A summary of Chapter 3: Finite Markov Decision Processes of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html
Markov Decision Process (MDP)
● A simplified, flexible formulation of the reinforcement learning problem
● Consists of states, actions, and rewards
  ○ States: the information available to the agent
  ○ Actions: the choices made by the agent
  ○ Rewards: the basis for evaluating choices
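As a concrete sketch, a finite MDP can be written down as plain data; the two states, two actions, and all numbers below are made up for illustration:

```python
# A finite MDP as plain data: states, actions, and for each (state, action)
# pair a list of possible (probability, next_state, reward) outcomes.
mdp = {
    "states": ["s0", "s1"],
    "actions": ["a0", "a1"],
    "dynamics": {
        ("s0", "a0"): [(1.0, "s0", 0.0)],
        ("s0", "a1"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
        ("s1", "a0"): [(1.0, "s0", 2.0)],
        ("s1", "a1"): [(1.0, "s1", -1.0)],
    },
}
```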
Agent-Environment Boundary
● Anything the agent cannot arbitrarily change is part of the environment
  ○ The agent might still know everything about the environment
● Different boundaries can be drawn for different purposes
(Figure: a robot's machinery, sensors, and battery are part of the environment; only its "brain" is the agent.)
Agent-Environment Interactions
1. Agent observes a state
2. Agent takes an action
3. Agent receives a reward and a new state
4. Agent takes another action
5. Repeat
Transition Probability
● p(s', r | s, a): the probability of reaching state s' with reward r by taking action a in state s
● Fully describes the dynamics of a finite MDP
● All other properties of the environment can be deduced from it
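As a sketch, p(s', r | s, a) can be stored as a lookup table; to be a valid distribution, its probabilities must sum to 1 over all (s', r) pairs for every (s, a). The entries below are hypothetical:

```python
# p[(s, a)] maps each (next_state, reward) outcome to its probability.
p = {
    ("s0", "a1"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s0", "a0"): {("s0", 0.0): 1.0},
}

# Valid dynamics: the outcome probabilities sum to 1 for every (s, a).
for (s, a), outcomes in p.items():
    total = sum(outcomes.values())
    assert abs(total - 1.0) < 1e-9, f"p(.|{s},{a}) sums to {total}"
```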
Expected Rewards
● r(s, a): the expected reward of taking action a in state s
● r(s, a, s'): the expected reward of arriving in state s' after taking action a in state s
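Both quantities can be deduced from the transition probabilities. A minimal sketch, reusing the hypothetical table format above (r(s, a) sums r·p over all outcomes; r(s, a, s') conditions on the next state):

```python
# Expected rewards derived from the dynamics table p(s', r | s, a).
p = {("s0", "a1"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2}}

def expected_reward(s, a):
    # r(s, a) = sum over (s', r) of r * p(s', r | s, a)
    return sum(prob * r for (s2, r), prob in p[(s, a)].items())

def expected_reward_given_next(s, a, s2):
    # r(s, a, s') = [sum over r of r * p(s', r | s, a)] / p(s' | s, a)
    outs = [(r, prob) for (nxt, r), prob in p[(s, a)].items() if nxt == s2]
    marginal = sum(prob for _, prob in outs)  # p(s' | s, a)
    return sum(r * prob for r, prob in outs) / marginal

print(expected_reward("s0", "a1"))                   # 0.8*1.0 + 0.2*0.0 = 0.8
print(expected_reward_given_next("s0", "a1", "s1"))  # 1.0
```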
Recycling Robot Example
● States: battery level (high or low)
● Actions
  ○ Search: high reward. The battery level may drop or become depleted.
  ○ Wait: low reward. The battery level does not change.
  ○ Recharge: no reward. The battery level is set to high.
● If the battery is depleted, the robot receives a -3 reward and the battery is set back to high (the robot must be rescued and recharged).
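These dynamics fit the same table format. A sketch, with assumed values for the probabilities α and β of keeping the battery level while searching, and assumed reward magnitudes r_search > r_wait (the book leaves all four as parameters):

```python
ALPHA, BETA = 0.7, 0.6       # assumed: P(battery stays high / stays low) while searching
R_SEARCH, R_WAIT = 2.0, 1.0  # assumed reward magnitudes, with R_SEARCH > R_WAIT

# p[(state, action)] -> {(next_state, reward): probability}
recycling_robot = {
    ("high", "search"):  {("high", R_SEARCH): ALPHA, ("low", R_SEARCH): 1 - ALPHA},
    ("low", "search"):   {("low", R_SEARCH): BETA,    # keeps searching
                          ("high", -3.0): 1 - BETA},  # depleted: rescued and recharged
    ("high", "wait"):    {("high", R_WAIT): 1.0},
    ("low", "wait"):     {("low", R_WAIT): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
}
```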
Designing Rewards
● Reward hypothesis
  ○ Goals and purposes can be represented as the maximization of cumulative reward
● Rewards should tell the agent what you want achieved, not how to achieve it
(Figure: example reward designs: +1 for each box, reward proportional to forward motion, always -1 per step.)
Episodic Tasks
● Interactions can be broken into episodes
● Episodes end in a special terminal state
● Each episode is independent
(Examples: a game episode finishes when the game ends; a maze episode finishes when the agent is out of the maze.)
Return for Episodic Tasks
● The return G_t is the sum of rewards from time step t:
  G_t = R_{t+1} + R_{t+2} + ... + R_T
● T: the time of termination
Return for Continuing Tasks
● The plain sum of rewards is almost always infinite
● Future rewards need to be discounted by a factor γ:
  G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
  ○ If γ = 0, the return only considers the immediate reward (myopic)
  ○ As γ approaches 1, future rewards count more heavily (farsighted)
Unified Notation for Return
● Cumulative reward: G_t = Σ_{k=t+1}^{T} γ^{k-t-1} R_k
● T can be a finite number or infinity
● Future rewards can be discounted with factor γ
  ○ If T = ∞, then γ must be less than 1
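Returns are easiest to compute backward from the end of a reward sequence, using the recursion G_t = R_{t+1} + γG_{t+1}. A minimal sketch (the reward sequence is made up):

```python
def returns(rewards, gamma):
    """G_t for every t, from rewards R_1..R_T, via G_t = R_{t+1} + gamma * G_{t+1}."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

print(returns([1.0, 0.0, -1.0, 5.0], gamma=0.9))  # G_0 first
# gamma = 1.0 recovers the undiscounted episodic return;
# gamma < 1 is required when T is infinite.
```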
Policy
● A mapping from states to probabilities of selecting each possible action
● π(a|s): the probability of selecting action a in state s
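Concretely, a stochastic policy can be sketched as a table of action probabilities per state and sampled from; the states, actions, and probabilities below are hypothetical:

```python
import random

# pi[s][a] = probability of selecting action a in state s
pi = {
    "s0": {"a0": 0.9, "a1": 0.1},
    "s1": {"a0": 0.5, "a1": 0.5},
}

def select_action(state):
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

print(select_action("s0"))  # "a0" roughly 90% of the time
```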
Optimal Policies and Value Functions
● An optimal policy π* is at least as good as any policy π: v_{π*}(s) ≥ v_π(s) for all states s
● There can be multiple optimal policies
● All optimal policies share the same optimal value functions:
  v_*(s) = max_π v_π(s)  and  q_*(s, a) = max_π q_π(s, a)
Solving the Bellman Optimality Equation
● A system of n equations in n unknowns, one per state (nonlinear because of the max)
● Solving it exactly yields the optimal policy
● Impractical in most environments
  ○ Requires knowing the dynamics of the environment
  ○ Requires extreme computational power
  ○ Requires the Markov property
→ In most cases, approximation is the best possible solution.
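One standard numerical solver is value iteration (covered in Chapter 4 of the book), which repeatedly applies the Bellman optimality backup v(s) ← max_a Σ_{s',r} p(s', r | s, a)[r + γv(s')]. A minimal sketch over the dynamics-table format used above:

```python
def value_iteration(dynamics, states, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(prob * (r + gamma * v[s2]) for (s2, r), prob in outs.items())
                for (st, _a), outs in dynamics.items()
                if st == s
            )
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < tol:
            return v

# e.g., with the recycling robot table sketched earlier:
# v_star = value_iteration(recycling_robot, ["high", "low"])
```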
Approximation
● Does not require complete knowledge of the environment
● Needs less memory and computational power
● Can focus learning on frequently encountered states
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content at
● github.com/seungjaeryanlee
● www.endtoend.ai