Andrii Prysiazhnyk: Why Amazon sellers are buying the RTX 3080: Dynamic pricing with RL
AI & BigData Online Day 2021
Website - http://aiconf.com.ua
Youtube - https://www.youtube.com/startuplviv
FB - https://www.facebook.com/aiconf
1. Why Amazon sellers are buying the RTX 3080:
Dynamic pricing with RL
2. What is dynamic pricing?
Dynamic pricing is the process of automatically adjusting the prices of products or services in real time to maximise income and other economic performance indicators.
4. Dynamic Pricing for e-commerce: benefits
● Stay ahead of competitors. Automated monitoring of competitors' prices lets you adapt quickly to a dynamic environment.
● Increase profits. By analysing the market, you can adjust the price of a product to generate more revenue. If demand for a product is low, you can boost it by lowering the price; if it is the peak season for a product, you can raise the price without hurting sales volume.
● Gain market insights. Continuous market examination allows retailers to stay aware of prevailing market trends and gain insights into consumer behaviour, which leads to better decision-making.
5. Sales forecasting
Sales data is time-series data that contains prices along with the corresponding sales and other features that could be useful in driving sales (such as inventory, advertising spend, competitors' prices, discounts, etc.).
6. Greedy strategy for Dynamic Pricing
A typical solution is to discretise a continuous price interval into a finite number of possible prices and choose the price that leads to the highest income or another objective.
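A minimal sketch of this greedy selection, assuming a hypothetical linear demand model in place of a trained sales forecast:

```python
import numpy as np

# Hypothetical demand model: sales fall linearly as the price rises.
# In practice this would be replaced by a trained sales-forecasting model.
def expected_sales(price, base_demand=100.0, slope=0.8):
    return max(base_demand - slope * price, 0.0)

def greedy_price(price_grid):
    """Pick the price on a discretised grid that maximises expected income."""
    incomes = [p * expected_sales(p) for p in price_grid]
    return price_grid[int(np.argmax(incomes))]

prices = np.linspace(10, 100, 91)  # discretised price interval
best = greedy_price(prices)        # price with the highest expected income
```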
7. Reinforcement Learning basics: MDP
An MDP is a discrete-time stochastic control process. It consists of a state space S, an action space A, a reward space R (a subset of the real numbers), and a dynamics function p.
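The dynamics function gives the joint probability of each next state and reward:

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$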
8. Reinforcement Learning basics: Agent-Environment interface
The agent and the environment (MDP) interact at each time step t = 0, 1, 2, … . The agent performs an action a based on the current state s; the MDP receives a and responds with a reward r and the next state s'. This loop, usually visualised as the agent-environment interface diagram, produces the following sequence:
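$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$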
9. Reinforcement Learning basics: Goals and Rewards
RL's objective is to maximise the cumulative reward that the agent receives in the long run (the reward hypothesis).
To formalise this idea, the notion of return is defined. The return is some function of a sequence of rewards. In the case of episodic tasks (those that terminate), it can be defined as:
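$$G_t = R_{t+1} + R_{t+2} + \dots + R_T$$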
In the case of continuing tasks, we should also consider the convergence of the infinite sum. That's why the following discounted return is used:
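$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1$$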
10. Reinforcement Learning basics: Policy and Value functions
A policy is a mapping from states to probabilities of selecting each possible action. Informally, a policy defines the rules of moving through the MDP.
We can reformulate the reinforcement learning task as finding "the best" policy. The question is what "the best" means.
To answer this question, the state-value and action-value functions are defined:
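$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$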
11. Reinforcement Learning basics: Bellman optimality equation
We say that policy π′ is better than policy π if
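$$v_{\pi'}(s) \ge v_\pi(s) \quad \text{for all } s \in S$$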
It can be shown that there exists a policy that is better than or equal to every other policy. Its state-values and action-values satisfy the following equations, the Bellman optimality equations for v and q:
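$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$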
12. Reinforcement Learning for dynamic pricing
In terms of RL concepts, actions are all possible prices, and states are the market conditions, excluding the current price of the product or service.
Usually, it is incredibly problematic to train an agent through interaction with a real-world market. There are two main reasons: time (the market yields feedback slowly) and exploration (trying suboptimal prices costs real money).
An alternative approach is to use a simulator of the environment.
13. Simulated sales data
In the example, we use simulated sales rather than real ones. This allows us to derive the greedy policy analytically and compare it with the RL agent's performance.
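A minimal sketch of such a simulator, assuming a noisy linear demand model and one pricing decision per week over a 52-step episode (the talk's exact demand model is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

class SalesSimulator:
    """Hypothetical market simulator: noisy linear demand,
    one episode = 52 weekly pricing decisions."""

    def __init__(self, base_demand=100.0, slope=0.8, noise_std=5.0, horizon=52):
        self.base_demand, self.slope, self.noise_std = base_demand, slope, noise_std
        self.horizon = horizon

    def reset(self):
        self.t = 0
        return self.t  # minimal state: the current week

    def step(self, price):
        demand = self.base_demand - self.slope * price + rng.normal(0.0, self.noise_std)
        sales = max(demand, 0.0)
        reward = price * sales  # weekly income
        self.t += 1
        return self.t, reward, self.t >= self.horizon  # state, reward, done
```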
14. Experiments
The graph below shows the performance of the random and greedy agents. An episodic task of 52 steps yields the following results:
15. Tabular Q-Learning
Q-learning is an off-policy temporal-difference control algorithm. Its main purpose is to iteratively find the action values of an optimal policy (the optimal action values).
The following update rule is used:
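$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$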
This update rule can also be treated as an iterative way of solving the Bellman optimality equations.
Also, before running this algorithm, we should discretise continuous variables.
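A minimal sketch of tabular Q-learning with a discretised price grid, reusing the hypothetical SalesSimulator above:

```python
import numpy as np

rng = np.random.default_rng(1)

prices = np.linspace(10, 100, 10)              # discretised action space
env = SalesSimulator()                         # hypothetical simulator from above
Q = np.zeros((env.horizon + 1, len(prices)))   # Q[state, action]

alpha, gamma, eps = 0.1, 1.0, 0.1              # step size, discount, exploration rate

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(len(prices)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = env.step(prices[a])
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])  # Q-learning update
        s = s_next
```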
16. Tabular Q-Learning: Results
As we can see, this approach outperforms the random agent but cannot outperform the greedy agent.
17. Deep Q-Network
The Deep Q-Network (DQN) algorithm is based on the same idea as tabular Q-learning. The main difference is that DQN uses a parametrised function to approximate the optimal action values.
The optimisation objective at iteration i looks as follows:
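$$L_i(\theta_i) = \mathbb{E}_{s,a}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$$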
The gradient of the objective is as follows:
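$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$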
18. Deep Q-Network: Results
As we can see, DQN outperforms all the other agents. Moreover, it was trained on a smaller number of episodes.
19. Policy Gradients
Instead of learning optimal action values and acting greedily with respect to them, policy gradient methods directly parametrise and optimise a policy.
The difficulty here is that the optimisation objective depends on the dynamics function p, which is unknown.
That is why policy gradient methods use the fact that the gradient of the objective is the expected value of a random variable, which can be approximated while acting in the environment:
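For example, in the episodic case the REINFORCE estimator uses

$$\nabla J(\theta) = \mathbb{E}_\pi\left[G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)\right]$$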
20. Policy Gradients: Results
Policy gradients outperform the greedy agent but do not perform as well as DQN. Likewise, policy gradients require far more episodes than DQN.
21. Conclusions
● Dynamic pricing can help retail players stay ahead of the market.
● RL-based dynamic pricing solves the problem of greedy decision-making.
● Training agents on a real market is time-consuming and can result in a loss of funds. That's why a simulator of the environment is used.