1. Q-learning is a type of reinforcement learning algorithm that seeks to learn the optimal policy for an agent to take actions in an environment to maximize rewards.
2. The algorithm works by maintaining a Q-table that contains Q-values representing the expected rewards for state-action pairs, which are updated using the Bellman equation as the agent interacts with the environment.
3. Over time, the Q-values converge and the agent learns the optimal policy, taking the best action in each state to maximize long-term reward without requiring a model of the environment.
2. Background
• Q-learning falls under the umbrella of “Reinforcement Learning”
• Differences:
• Supervised Learning: Immediate feedback (labels provided for every input.)
• Unsupervised Learning: No feedback (no labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).
• RL deals with agents that must sense and act upon their environment.
• This combines classical AI and machine learning techniques.
• Examples:
• A robot cleaning my room and recharging its battery
• Robo-soccer
• How to invest in shares
• Learning how to fly a helicopter
• Scheduling planes to their destinations etc.
3. What is Q-learning?
• A carrot and stick approach to learning
• If by chance the AI does something we want to encourage, say it gets a coin in Mario, we give it a "carrot"
• If it does something we don't want, say the car drives into a wall in a racing game, we "punish" it
4. In technical terms
• A model-free algorithm to learn a policy telling an agent what action
to take under what circumstances
• Seeks to find the best action to take given the current state. It's considered off-policy because the Q-learning function can learn from actions outside the current policy, such as random exploratory actions, so it does not have to follow the policy it is learning. More specifically, Q-learning seeks to learn a policy that maximizes the total reward.
• What is Q in Q-Learning?
• It stands for quality. Quality in this case represents how useful a given action
is in gaining some future reward.
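To make "quality" concrete, here is a toy illustration in Python; the states, actions, and values are invented for illustration and are not from the slides:

```python
# Toy illustration: Q maps (state, action) pairs to an estimate of how
# useful the action is for gaining future reward. All values invented.
Q = {
    ("near_coin", "jump"): 5.2,        # jumping near a coin has paid off before
    ("near_coin", "wait"): 0.1,
    ("near_wall", "accelerate"): -8.0, # driving into the wall was punished
}

def quality(state, action):
    """Return the learned usefulness of `action` in `state` (0.0 if unseen)."""
    return Q.get((state, action), 0.0)
```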
6. Summing up
1. Reinforcement Learning is the process of learning by interacting with an environment through positive and negative feedback
2. Q-Learning is a type of RL that optimizes the behavior of a system through trial and error
3. Q-learning updates its policy (state-action mapping) based on a
reward
7. Example
• Controlling A Walking Robot
• Agent: The program controlling a walking robot.
• Environment: The real world.
• Action: One of four moves: (1) forward, (2) backward, (3) left, or (4) right.
• Reward: Positive when it approaches the target destination; negative
when it wastes time, goes in the wrong direction or falls down.
• In this example, a robot can teach itself to move more effectively by adapting its policy based on the rewards it receives.
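As a rough sketch of this setup, here is a toy grid-world version in Python. The class name, grid size, and exact reward numbers are assumptions made for illustration; the slides only specify the four moves and the positive/negative reward scheme:

```python
# A minimal, illustrative version of the walking-robot example.
class WalkingRobotEnv:
    """Toy grid world: the robot starts at (0, 0) and must reach a target."""

    ACTIONS = ["forward", "backward", "left", "right"]

    def __init__(self, size=4):
        self.size = size
        self.target = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        """Apply one move; return (next_state, reward, done)."""
        x, y = self.pos
        if action == "forward":
            y = min(y + 1, self.size - 1)
        elif action == "backward":
            y = max(y - 1, 0)
        elif action == "left":
            x = max(x - 1, 0)
        elif action == "right":
            x = min(x + 1, self.size - 1)
        old_dist = abs(self.pos[0] - self.target[0]) + abs(self.pos[1] - self.target[1])
        new_dist = abs(x - self.target[0]) + abs(y - self.target[1])
        self.pos = (x, y)
        # Positive reward for approaching the target, negative for wasting time.
        reward = 1.0 if new_dist < old_dist else -1.0
        done = self.pos == self.target
        return self.pos, reward, done
```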
8. 6 important parameters
• We need an algorithm to learn (1) a policy (2) that will tell us how to interact (3) with an environment (4) under different circumstances (5) in such a way as to maximize rewards (6)
1. Learn — This implies we are not supposed to hand-code any particular
strategy but the algorithm should learn by itself.
2. Policy — This is the result of the learning. Given a State of
the Environment, the Policy will tell us how best to Interact with it so as
to maximize the Rewards.
3. Interact — This is nothing but the “Actions” the algorithm should
recommend we take under different circumstances.
9. Parameters (…continued)
4. Environment — This is the black box the algorithm interacts with. It is
the game it is supposed to win. It’s the world we live in. It’s the universe
and all the suns and the stars and everything else that can influence the
environment and its reaction to the action taken.
5. Circumstances — These are the different “States” the environment can
be in.
6. Rewards — This is the goal. The purpose of interacting with the environment. The purpose of playing the game.
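As a sketch of how these six pieces fit together, the standard agent-environment loop is shown below, reusing the illustrative WalkingRobotEnv from the earlier example. The random policy is just a placeholder for the policy that learning (1) is supposed to produce:

```python
import random

# WalkingRobotEnv is the illustrative class sketched in the earlier example.
env = WalkingRobotEnv()                      # (4) the Environment

state = env.reset()                          # (5) Circumstances: the current State
done = False
total_reward = 0.0
while not done:
    action = random.choice(env.ACTIONS)      # (2)+(3) the Policy chooses how to Interact
    state, reward, done = env.step(action)   # the Environment reacts with a new State
    total_reward += reward                   # (6) Rewards: the quantity to maximize
print("episode return:", total_reward)       # (1) Learning would adjust the policy to raise this
```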
11. Implementation of Algorithm
• Q-learning at its simplest stores data in tables. This approach falters
with increasing numbers of states/actions since the likelihood of the
agent visiting a particular state and performing a particular action is
increasingly small.
• The algorithm, therefore, has a function that calculates the quality of
a state-action combination:
𝑄 : 𝑆 × 𝐴 → ℝ
12. Q-Table or Q-Matrix
• Q-Table is just a fancy name for a simple lookup table where we store the maximum expected future reward for each action at each state.
• Basically, this table will guide us to the best action at each state.
• There are four possible actions at each non-edge tile: when the robot is at a state it can move up, down, right, or left.
• So, let’s model this environment in our Q-Table.
• In the Q-Table, the columns are the actions and the rows are the
states.
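Here is a minimal sketch of such a Q-Table in Python. The 16 states (one per tile of an assumed 4 × 4 grid) and 4 actions are illustrative sizes, not fixed by the slides:

```python
import numpy as np

n_states = 16    # e.g. a 4x4 grid, one state per tile
n_actions = 4    # up, down, left, right

# Rows are states, columns are actions; every Q-value starts at zero.
q_table = np.zeros((n_states, n_actions))

# The best action at a given state is the column with the largest value:
state = 5
best_action = int(np.argmax(q_table[state]))
```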
13. Q-Table (Continued)
• Each Q-table score will be the maximum
expected future reward that the robot will
get if it takes that action at that state.
• This is an iterative process, as we need to
improve the Q-Table at each iteration.
14. Process
• But the questions are:
• How do we calculate the values of the Q-table? [A: Q-functions]
• Are the values available or predefined? [A: Can be both]
• Q-function
• The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). The update it performs is sketched below.
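The update is the standard Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)], where α is the learning rate and γ the discount factor. A minimal sketch, assuming the NumPy Q-table from the previous section and arbitrary α and γ values:

```python
import numpy as np

alpha = 0.1   # learning rate (illustrative choice)
gamma = 0.9   # discount factor for future rewards (illustrative choice)

def q_update(q_table, state, action, reward, next_state):
    """Move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = np.max(q_table[next_state])    # value of the best next action
    td_target = reward + gamma * best_next     # Bellman target
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```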
15. Process (…continued)
• In the case of the robot game, to reiterate, the scoring/reward structure is:
• power = +1
• mine = -100
• end = +100
• In the beginning, the epsilon rate will be high. The robot will explore the environment and randomly choose actions; the logic behind this is that the robot does not yet know anything about the environment.
• As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.
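A minimal sketch of this epsilon-greedy explore/exploit schedule; the starting value, floor, and decay factor are illustrative assumptions:

```python
import random
import numpy as np

epsilon = 1.0          # start fully exploratory
epsilon_min = 0.01     # never stop exploring entirely
epsilon_decay = 0.995  # applied after every episode

def choose_action(q_table, state, n_actions):
    """With probability epsilon explore randomly; otherwise exploit the Q-table."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: random action
    return int(np.argmax(q_table[state]))    # exploit: best known action

def decay_epsilon():
    """Shift gradually from exploration to exploitation."""
    global epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```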
16. Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• The reward may be stochastic
• Reward is delayed (e.g. finding food in a maze)
• You may have no clue (model) about how the world responds to your
actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it
• How much time do you need to explore uncharted territory before you
exploit what you have learned?
17. Conclusion
• Reinforcement learning addresses a very broad and relevant question: how can we learn to survive in our environment?
• We have looked at Q-learning, which simply learns from experience. No model of the world is needed.
• Many successful real-world applications have been built on these ideas, often with less development time than hand-engineered solutions.