Q-Learning Algorithm
What, How and Why
by
SHAKEEB AHMAD
MUSTAFA AL-HAMMADI
SNEHA UKEY
SHAIKH ABUZAR
Background
• Q-learning falls under the umbrella of “Reinforcement Learning”
• Differences:
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (no labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).
• RL deals with agents that must sense and act upon their environment.
• This combines classical AI and machine learning techniques.
• Examples:
• A robot cleaning my room and recharging its battery
• Robo-soccer
• How to invest in shares
• Learning how to fly a helicopter
• Scheduling planes to their destinations etc.
What is Q-learning?
• A carrot-and-stick approach to learning
• If the AI does something we want to encourage, say it collects a coin in Mario, we give it a "carrot"
• If it does something we don't want, say the car drives into a wall in a racing game, we "punish" it
In technical terms
• A model-free algorithm that learns a policy telling an agent what action to take under what circumstances
• It seeks to find the best action to take given the current state. Q-learning is considered off-policy because it learns from actions taken outside the current policy, such as random exploratory actions, and therefore does not need to follow the policy it is learning about. More specifically, Q-learning seeks to learn a policy that maximizes the total reward.
• What is Q in Q-Learning?
• It stands for quality. Quality in this case represents how useful a given action
is in gaining some future reward.
Summing up
1. Reinforcement Learning is the process of learning by interacting with an environment through reward feedback
2. Q-Learning is a type of RL that optimizes the behavior of a system through trial and error
3. Q-learning updates its policy (state-action mapping) based on a
reward
Example
• Controlling A Walking Robot
• Agent: The program controlling a walking robot.
• Environment: The real world.
• Action: One out of four moves (1) forward; (2) backward; (3) left; and
(4) right.
• Reward: Positive when it approaches the target destination; negative
when it wastes time, goes in the wrong direction or falls down.
• In this example, the robot can teach itself to move more effectively by adapting its policy based on the rewards it receives, as sketched below.
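As a toy illustration of these four roles, here is a minimal Python sketch. The WalkingRobotEnv class, its one-dimensional world, and its reward values are hypothetical stand-ins for a real robot simulator, not part of the original example:

```python
import random

# Hypothetical environment illustrating the Agent/Environment/Action/Reward roles.
# A real walking robot would replace step() with sensor readings and motor commands.
class WalkingRobotEnv:
    ACTIONS = ["forward", "backward", "left", "right"]

    def __init__(self):
        self.position = 0   # distance travelled toward the target (toy 1-D world)
        self.target = 10

    def step(self, action):
        """Apply one action; return (new_state, reward)."""
        if action == "forward":
            self.position += 1
        elif action == "backward":
            self.position -= 1
        # "left"/"right" simply waste time in this 1-D toy world.
        if self.position == self.target:
            reward = +100          # reached the destination
        elif action == "forward":
            reward = +1            # making progress toward the target
        else:
            reward = -1            # wasted time or wrong direction
        return self.position, reward

# The agent-environment loop: sense the state, act, receive a reward.
env = WalkingRobotEnv()
state = env.position
for _ in range(20):
    action = random.choice(WalkingRobotEnv.ACTIONS)  # no policy yet: act randomly
    state, reward = env.step(action)
```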
6 important parameters
• We need an algorithm to learn (1) a policy (2) that will tell us how to interact (3) with an environment (4) under different circumstances (5) in such a way as to maximize rewards (6)
1. Learn — This implies we are not supposed to hand-code any particular strategy, but the algorithm should learn by itself.
2. Policy — This is the result of the learning. Given a State of
the Environment, the Policy will tell us how best to Interact with it so as
to maximize the Rewards.
3. Interact — This is nothing but the “Actions” the algorithm should
recommend we take under different circumstances.
Parameters (…continued)
4. Environment — This is the black box the algorithm interacts with. It is the game it is supposed to win. It's the world we live in; the universe and all the suns and stars and everything else that can influence the environment and its reaction to the action taken.
5. Circumstances — These are the different "States" the environment can be in.
6. Rewards — This is the goal, the purpose of interacting with the environment; the purpose of playing the game.
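As a rough sketch, assuming discrete states and actions, these six notions map onto simple Python types (the aliases below are illustrative, not from the original slides):

```python
from typing import Callable, Tuple

State = int        # 5. Circumstances: a state the environment can be in
Action = int       # 3. Interact: an action the agent can take
Reward = float     # 6. Rewards: scalar feedback from the environment

# 2. Policy: given a state, which action to take
Policy = Callable[[State], Action]

# 4. Environment: a black box mapping (state, action) to (next state, reward)
Environment = Callable[[State, Action], Tuple[State, Reward]]

# 1. Learn: produce a Policy by interacting with an Environment
Learner = Callable[[Environment], Policy]
```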
Implementation of Algorithm
• Q-learning at its simplest stores data in tables. This approach falters
with increasing numbers of states/actions since the likelihood of the
agent visiting a particular state and performing a particular action is
increasingly small.
• The algorithm, therefore, has a function that calculates the quality of
a state-action combination:
Q : S × A → ℝ
Q-Table or Q-Matrix
• Q-Table is just a fancy name for a simple lookup table where we store the maximum expected future reward for each action at each state.
• Basically, this table will guide us to the best action at each state.
• There are four possible actions at each non-edge tile: when the robot is at a state, it can move up, down, left, or right.
• So, let's model this environment in our Q-Table.
• In the Q-Table, the columns are the actions and the rows are the states.
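A minimal sketch of such a Q-table in Python, assuming a toy grid world with 16 states and the four moves above (the sizes are illustrative):

```python
import numpy as np

n_states, n_actions = 16, 4                  # rows = states, columns = actions
q_table = np.zeros((n_states, n_actions))    # initialized with no knowledge

# Reading the table: the best action in state s is the column with the
# highest score in row s.
def best_action(s):
    return int(np.argmax(q_table[s]))
```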
Q-Table (Continued)
• Each Q-table score will be the maximum
expected future reward that the robot will
get if it takes that action at that state.
• This is an iterative process, as we need to
improve the Q-Table at each iteration.
Process
• But the questions are:
• How do we calculate the values of the Q-table? [A: Q-functions]
• Are the values available or predefined? [A: Can be both]
• Q-function
• The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).
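For reference, the standard Q-learning update derived from the Bellman equation is

Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

where α is the learning rate and γ is the discount factor. A minimal Python sketch of one such update on a Q-table like the one above (the function name and default values are illustrative):

```python
import numpy as np

# One Q-learning update step (standard textbook form).
# alpha = learning rate, gamma = discount factor.
def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    best_next = np.max(q_table[s_next])      # greedy target: max over a'
    td_target = reward + gamma * best_next
    # Off-policy: the target uses the best next action even if the agent
    # actually explored randomly when it acted.
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```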
Process (…continued)
• In the case of the robot game, to reiterate, the scoring/reward structure is:
• power = +1
• mine = -100
• end = +100
• In the beginning, the epsilon rate will be high. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not yet know anything about the environment.
• As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.
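A common way to implement this schedule is epsilon-greedy action selection with a decaying epsilon. A small sketch, with illustrative decay constants:

```python
import random
import numpy as np

epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995   # illustrative constants

def choose_action(q_table, s, n_actions, epsilon):
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: pick a random action
    return int(np.argmax(q_table[s]))        # exploit: pick the best known action

# After each episode, shrink epsilon so the robot gradually shifts from
# exploring the environment to exploiting what it has learned.
def decayed(epsilon):
    return max(eps_min, epsilon * eps_decay)
```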
Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• The reward may be stochastic
• Reward is delayed (e.g. finding food in a maze)
• You may have no clue (model) about how the world responds to your
actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it
• How much time do you need to explore uncharted territory before you
exploit what you have learned?
Conclusion
• Reinforcement learning addresses a very broad and relevant question: How can we learn to survive in our environment?
• We have looked at Q-learning, which simply learns from experience. No model of the world is needed.
• There have been many successful real-world applications, often built with less development time and greater efficiency than hand-coded alternatives.