F. Stoica, E. M. Popa, A new Evolutionary Reinforcement Scheme for Stochastic Learning Automata, Proceedings of the 12th WSEAS International Conference on COMPUTERS, Heraklion, Greece, July 23-25, ISBN: 978-960-6766-85-5, ISSN: 1790-5109, pp. 268-273, 2008
A new Evolutionary Reinforcement Scheme for Stochastic Learning Automata
FLORIN STOICA, EMIL M. POPA
Computer Science Department
“Lucian Blaga” University Sibiu
Str. Dr. Ion Ratiu 5-7, 550012, Sibiu
ROMANIA
Abstract: - A stochastic automaton can perform a finite number of actions in a random environment. When a specific
action is performed, the environment responds by producing an environment output that is stochastically related to the
action. The aim is to design an automaton, using an evolutionary reinforcement scheme (the basis of the learning
process), that can determine the best action guided by past actions and responses. Using Stochastic Learning Automata
techniques, we introduce a decision/control method for intelligent vehicles receiving data from on-board sensors or
from the localization system of highway infrastructure.
Key-Words: - Stochastic Learning Automata, Reinforcement Learning, Intelligent Vehicle Control
1 Introduction
An automaton is a machine or control mechanism
designed to automatically follow a predetermined
sequence of operations or respond to encoded
instructions. The term stochastic emphasizes the adaptive nature of the automaton described here: it does not follow predetermined rules, but adapts to changes in its environment. This
adaptation is the result of the learning process. Learning
is defined as any permanent change in behavior as a
result of past experience, and a learning system should
therefore have the ability to improve its behavior with
time, toward a final goal.
The stochastic automaton attempts a solution of the
problem without any information on the optimal action
(initially, equal probabilities are attached to all the
actions). One action is selected at random, the response
from the environment is observed, action probabilities
are updated based on that response, and the procedure is
repeated. A stochastic automaton acting as described to
improve its performance is called a learning automaton.
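As an illustration, the following is a minimal sketch of this feedback loop in Java, assuming a P-model environment simulated by Bernoulli penalty probabilities; all names are ours, and the probability update is left abstract, since concrete schemes are discussed below.

import java.util.Random;

// Minimal sketch of the automaton/environment feedback loop (our own
// illustration; names and structure are not from the paper).
public class AutomatonLoop {
    static final Random rnd = new Random();

    // Simulated P-model environment: action i is penalized (response 1)
    // with probability c[i], rewarded (response 0) otherwise.
    static int environment(double[] c, int action) {
        return rnd.nextDouble() < c[action] ? 1 : 0;
    }

    // Select an action at random according to the probability vector p.
    static int selectAction(double[] p) {
        double u = rnd.nextDouble(), acc = 0;
        for (int i = 0; i < p.length; i++) {
            acc += p[i];
            if (u < acc) return i;
        }
        return p.length - 1;
    }

    public static void main(String[] args) {
        double[] c = {0.7, 0.2, 0.5};             // unknown to the automaton
        double[] p = {1.0 / 3, 1.0 / 3, 1.0 / 3}; // equal initial probabilities
        for (int n = 0; n < 1000; n++) {
            int action = selectAction(p);
            int beta = environment(c, action);
            // p = T(p, action, beta): apply a reinforcement scheme here
        }
    }
}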
The response values can be represented in three
different models. In the P-model, the response values are
either 0 or 1, in the S-model the response values are
continuous in the range (0, 1) and in the Q-model the
values belong to a finite set of discrete values in the
range (0, 1).
The environment can further be split into two types, stationary and nonstationary. In a stationary environment
the penalty probabilities will never change. In a
nonstationary environment the penalties will change
over time.
The algorithm that guarantees the desired learning
process is called a reinforcement scheme [5]. The
reinforcement scheme is the basis of the learning process
for learning automata. The general solution for
absolutely expedient schemes was found by
Lakshmivarahan and Thathachar [8]. However, these
reinforcement schemes are theoretically valid only in
stationary environments.
For a nonstationary automata environment resulting
from a changing physical environment, we propose a
new evolutionary reinforcement scheme, based on an
original co-mutation operator, called LR-Mijn.
2 Stochastic learning automata
Mathematically, the environment of a stochastic automaton is defined by a triple $\{\alpha, c, \beta\}$, where $\alpha = \{\alpha_1, \alpha_2, ..., \alpha_r\}$ represents a finite set of actions (the input to the environment), $\beta = \{\beta_1, \beta_2\}$ represents a binary response set, and $c = \{c_1, c_2, ..., c_r\}$ is a set of penalty probabilities, where $c_i$ is the probability that action $\alpha_i$ will result in an unfavourable response. Given that $\beta(n) = 0$ is a favourable outcome and $\beta(n) = 1$ is an unfavourable outcome at time instant $n$ $(n = 0, 1, 2, ...)$, the element $c_i$ of $c$ is defined mathematically by:

$$c_i = P(\beta(n) = 1 \mid \alpha(n) = \alpha_i), \quad i = 1, 2, ..., r$$
A learning automaton generates a sequence of
actions on the basis of its interaction with the
environment. If the automaton is “learning” in the
process, its performance must be superior to “intuitive”
methods [6]. Consider a stationary random environment
with penalty probabilities $\{c_1, c_2, ..., c_r\}$ as defined above. In order to describe the reinforcement schemes, we define $p(n)$, the vector of action probabilities:

$$p_i(n) = P(\alpha(n) = \alpha_i), \quad i = 1, ..., r$$
We define a quantity $M(n)$ as the average penalty for a given action probability vector:

$$M(n) = P(\beta(n) = 1 \mid p(n)) = \sum_{i=1}^{r} P(\beta(n) = 1 \mid \alpha(n) = \alpha_i) \cdot P(\alpha(n) = \alpha_i) = \sum_{i=1}^{r} c_i \, p_i(n)$$
An automaton is absolutely expedient if the expected value of the average penalty at one iteration step is less than it was at the previous step, for all steps: $E[M(n+1) \mid p(n)] < M(n)$ for all $n$ [7].
Updating action probabilities can be represented as
follows:
$$p(n+1) = T[p(n), \alpha(n), \beta(n)],$$

where $T$ is a mapping. The next action probability vector $p(n+1)$ is thus computed from the current vector $p(n)$, the action chosen at step $n$ and the response received from the environment. If $p(n+1)$ is a linear function of $p(n)$, the reinforcement scheme is said to be linear; otherwise it is termed nonlinear.
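As a textbook illustration of a linear mapping (the classical linear reward-penalty scheme $L_{R-P}$ described in [7], not the scheme proposed in this paper), with chosen action $\alpha_i$, $r$ actions and learning parameters $0 < a, b < 1$:

$$\beta(n) = 0: \quad p_i(n+1) = p_i(n) + a\,(1 - p_i(n)), \qquad p_j(n+1) = (1-a)\,p_j(n), \ \ j \neq i$$

$$\beta(n) = 1: \quad p_i(n+1) = (1-b)\,p_i(n), \qquad p_j(n+1) = \frac{b}{r-1} + (1-b)\,p_j(n), \ \ j \neq i$$

Both updates preserve $\sum_j p_j(n+1) = 1$, as any valid reinforcement scheme must.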
Absolutely expedient learning schemes are presently
the only class of schemes for which necessary and
sufficient conditions of design are available.
However, the need for learning and adaptation in
systems is mainly due to the fact that the environment
changes with time. The performance of a learning
automaton should be judged in such a context. If a
learning automaton with a fixed strategy is used in a
nonstationary environment, it may become less
expedient, and even non-expedient. The learning scheme
must have sufficient flexibility to track the better actions.
If the action probabilities of the learning automaton are functions of the status of the physical environment (as in the case of an Intelligent Vehicle Control System), the realization of a deterministic mathematical model of this physical environment may be extremely difficult, if not impossible.
For a nonstationary automata environment resulting
from a changing physical environment, we propose a
new evolutionary reinforcement scheme, based on an
original co-mutation operator, called LR-Mijn.
3 The LR-Mijn operator
A co-mutation operator called Mijn, capable of operating on a set of adjacent bits in a single step, was presented in [10]. Its properties were examined and compared against those of the bit-flip mutation. A simple Evolutionary Algorithm called MEA, based only on selection and Mijn, was experimentally run on some classical test functions widely known from the literature, in order to compare its performance against that of a classical Simple Genetic Algorithm (SGA). The obtained results proved the superiority of the MEA algorithm over the SGA [12].
The aim of this section is to introduce a new co-mutation
operator called LR-Mijn, derived from Mijn. The
LR-MEA algorithm, based on the LR-Mijn operator, was
used for the optimization of some well-known
benchmark functions and its performance was compared
against that offered by algorithms using the Mijn
operator. The obtained results proved the effectiveness
of our LR-Mijn operator [14].
Let us consider a generic alphabet $A = \{a_1, a_2, ..., a_s\}$ composed of $s \geq 2$ different symbols. The set of all sequences of length $l$ over the alphabet $A$ will be denoted by $\Sigma = A^l$. In the following we shall denote by $\sigma$ a generic string $\sigma = \sigma_{l-1} ... \sigma_0 \in \Sigma$, where $\sigma_q \in A$, $\forall q \in \{0, ..., l-1\}$.
The Mijn operator is capable of operating on a set of adjacent bits in one single step, starting from a randomly chosen position p and going left within the chromosome [10].
In this section we introduce a new co-mutation operator called LR-Mijn. The LR-Mijn operator finds the longest sequence of $\sigma_p$ elements, situated to the left or to the right of position $p$. If the longest sequence is to the left of $p$, LR-Mijn behaves like Mijn; otherwise, LR-Mijn operates on the set of bits starting from $p$ and going to the right.
By $\sigma(q,i)$ we denote that position $q$ within the sequence $\sigma$ holds the symbol $a_i$ of the alphabet $A$, and by $\sigma(p,i)^{left,m}_{right,n}$ we specify the presence of symbol $a_i$ on position $p$ within the sequence $\sigma$, between $right$ symbols $a_n$ on the right and $left$ symbols $a_m$ on the left. We suppose that

$$\sigma = \sigma(l-1) \ldots \sigma(p+left+1, m)\; \sigma(p,i)^{left,i}_{right,i}\; \sigma(p-right-1, n) \ldots \sigma(0).$$
Definition 3.1 Formally, the LR-Mijn operator is defined as follows:

(i) If $p \neq right$ and $p \neq l - left - 1$, then

$$\text{LR-M}_{ijn}(\sigma) = \begin{cases} \sigma(l-1) \ldots \sigma(p+left+1, i)\; \sigma(p,m)^{left,m}_{right,i}\; \sigma(p-right-1, n) \ldots \sigma(0) & \text{for } left > right \\[4pt] \sigma(l-1) \ldots \sigma(p+left+1, m)\; \sigma(p,n)^{left,i}_{right,n}\; \sigma(p-right-1, i) \ldots \sigma(0) & \text{for } left < right \end{cases}$$

and, for $left = right$, one of the two alternatives above is chosen, each with probability 0.5.
(ii) If $p = right$ and $p \neq l - left - 1$, then $\sigma = \sigma(l-1) \ldots \sigma(p+left+1, m)\; \sigma(p,i)^{left,i}_{right,i}$ and

$$\text{LR-M}_{ijn}(\sigma) = \sigma(l-1) \ldots \sigma(p+left+1, m)\; \sigma(p,k)^{left,i}_{right,k},$$

where $k \neq i$ is randomly chosen.
(iii) If $p \neq right$ and $p = l - left - 1$, then $\sigma = \sigma(p,i)^{left,i}_{right,i}\; \sigma(p-right-1, n) \ldots \sigma(0)$ and

$$\text{LR-M}_{ijn}(\sigma) = \sigma(p,k)^{left,k}_{right,i}\; \sigma(p-right-1, n) \ldots \sigma(0),$$

where $k \neq i$ is randomly chosen.
(iv) If $p = right$ and $p = l - left - 1$, then $\sigma = \sigma(p,i)^{left,i}_{right,i}$ and

$$\text{LR-M}_{ijn}(\sigma) = \begin{cases} \sigma(p,k)^{left,k}_{right,i} & \text{for } left > right, \text{ where } k \neq i \\[4pt] \sigma(p,k)^{left,i}_{right,k} & \text{for } left < right, \text{ where } k \neq i \end{cases}$$

and, for $left = right$, one of the two alternatives above is chosen, each with probability 0.5.
As an example, let us consider the binary case, the string σ = 11110000 and the randomly chosen application point p = 2. In this case σ2 = 0, so we have to find the longest sequence of 0 elements within σ, starting from position p. This sequence goes to the right, and since the end of the string is reached without meeting any occurrence of 1, the new string obtained after applying LR-Mijn is 11110111.
If we suppose instead that the randomly chosen point is p = 6, we have σ6 = 1. Again, the longest sequence of 1 elements within σ goes to the right, so we move toward the right bound and look for the first occurrence of a 0 in σ. At position 3 we have σ3 = 0, so we stop. We then replace the string components from position 6 down to position 3, and the new string obtained after applying LR-Mijn is 10001000.
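The following is a minimal Java sketch of this binary behavior as we read it from the definition above (class and method names are ours); it reproduces the two worked examples just given.

import java.util.Random;

// Sketch of LR-Mijn for binary strings; position 0 is the rightmost
// symbol, as in the paper (our own illustration).
public class LRMijn {
    static final Random rnd = new Random();

    public static String apply(String s, int p) {
        int l = s.length();
        char[] bits = new char[l];
        // bits[q] holds position q (rightmost character = position 0)
        for (int q = 0; q < l; q++) bits[q] = s.charAt(l - 1 - q);
        char c = bits[p];
        int left = 0, right = 0;
        while (p + left + 1 < l && bits[p + left + 1] == c) left++;     // run above p
        while (p - right - 1 >= 0 && bits[p - right - 1] == c) right++; // run below p
        boolean goLeft = left > right || (left == right && rnd.nextBoolean());
        if (goLeft) {
            // Flip the run and, if present, the first differing symbol beyond it.
            int hi = Math.min(p + left + 1, l - 1);
            for (int q = p; q <= hi; q++) bits[q] ^= 1;
        } else {
            int lo = Math.max(p - right - 1, 0);
            for (int q = lo; q <= p; q++) bits[q] ^= 1;
        }
        StringBuilder out = new StringBuilder();
        for (int q = l - 1; q >= 0; q--) out.append(bits[q]);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("11110000", 2)); // 11110111, as in the text
        System.out.println(apply("11110000", 6)); // 10001000, as in the text
    }
}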
In the following we will consider that $A$ is the binary alphabet, $A = \{0, 1\}$, and suppose that $\sigma$ represents exactly one integer or real variable. We shall call the value of string $\sigma$ the integer between 0 and $2^l - 1$ encoded by the string, and the jump length the difference between the values represented by LR-Mijn($\sigma$) (respectively Mijn($\sigma$)) and $\sigma$. In cases (ii), (iii) and (iv), the LR-Mijn operator produces so-called long jumps.
In the following we will present some properties of
the LR-Mijn operator, in order to evaluate the number of
long jumps performed by our new co-mutation operator.
Property 3.1 [14] LR-Mijn allows performing a number of long jumps equal to:

$$\gamma' = \frac{4 \cdot (2^l - 3) + c_l}{3},$$

where $c_l = 2$ for $l$ even and $c_l = 1$ for $l$ odd. The proof of this property can be found in [14].
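For instance, for $l = 8$ (so $c_l = 2$) this gives $\gamma' = (4 \cdot 253 + 2)/3 = 338$ long jumps, out of $l \cdot 2^l = 2048$ possible jumps.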
Property 3.2 [14] The relative number of long jumps caused by the LR-Mijn operator is equal to:

$$\rho' = \frac{\text{number of long jumps}}{\text{total number of possible jumps}} = \frac{\gamma'}{l \cdot 2^l} = \frac{4 \cdot (2^l - 3) + c_l}{3 \cdot l \cdot 2^l},$$

where $c_l$ is defined above.
Property 3.3 [14] The probability of performing long jumps by the LR-Mijn operator tends to $\frac{4}{3 \cdot l}$.
The most interesting point of the co-mutation operator presented above is that it allows long jumps, so the search can reach points very far from where it currently is. The LR-Mijn operator allows performing longer jumps than are possible with the classical Bit-Flip Mutation (BFM) used in GAs. This means that our co-mutation operator has about the same local-search capabilities as ordinary mutation, while also being able to jump to far regions of the search space that BFM cannot reach. Properties 3.1-3.3 show that the LR-Mijn operator performs more long jumps than Mijn [12, 14], which leads to better convergence of an Evolutionary Algorithm based on LR-Mijn in comparison with an algorithm based on the Mijn operator. This statement was verified through experimental results in [14].

The following section presents an Evolutionary Reinforcement Scheme based only on selection and LR-Mijn.
4 A new evolutionary reinforcement scheme
We denote by K the size of the population. The proposed learning algorithm, which updates the probability vector of the stochastic automaton, is described below:
function Learning-LR-Mijn(action) returns p[]
begin
  t = 0;
  - initialize population P(t) randomly with K elements;
  - evaluate P(t) by using the fitness function;
  while not Terminated
    for j = 1 to K-1 do
      - select randomly one element among the best T% from P(t);
      - mutate it using LR-Mijn, performing a mutation within the gene which contains the probability of the action selected by the automaton;
      - insert the new chromosome into the population P'(t);
    end for
    - choose the best element from P(t) and insert it into P'(t);
    - evaluate P'(t) using the fitness function;
    P(t+1) = P'(t);
    t = t + 1;
  end while
  - select the best chromosome from population P(t);
  - decode from that chromosome the real values representing the probabilities associated with the actions of the automaton;
  - return the obtained probability vector p[];
end
In the following, we denote by:
• $N$ - the number of actions of the corresponding automaton;
• $p_n^k$ - the probability of action $n$, stored within chromosome $k$;
• $\beta_n$ - the environment response for action $n$.
We define the fitness function as follows:

$$f : \{1, ..., K\} \to \mathbb{R}, \quad f(k) = \sum_{n=1}^{N} (1 - \beta_n) \cdot p_n^k$$

The relations

$$\sum_{n=1}^{N} p_n^k = 1, \quad \forall k \in \{1, ..., K\}$$

must hold after each application of LR-Mijn on any chromosome $k$ from the population. In order to accomplish this, we proceed as follows: for the variable of index $i$ (preferably one left unaltered by LR-Mijn), we calculate the sum

$$S^k = \sum_{n=1,\, n \neq i}^{N} p_n^k.$$

The probability vector is then modified as follows: if $S^k > 1$, then

$$p_n^k := \frac{p_n^k}{S^k} \ \ \forall n \in \{1, ..., N\} \setminus \{i\}, \qquad p_i^k = 0;$$

otherwise, $p_i^k = 1 - S^k$.
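As a compact illustration of these two formulas, consider the following sketch (our own; the chromosome is represented directly as a probability vector, and all names are assumptions):

// Sketch of the fitness function and the renormalization step
// (our own illustration of the formulas above; names are assumptions).
public class SchemeStep {

    // f(k) = sum over n of (1 - beta[n]) * p[n], for one chromosome.
    static double fitness(double[] p, double[] beta) {
        double f = 0;
        for (int n = 0; n < p.length; n++) f += (1 - beta[n]) * p[n];
        return f;
    }

    // Restore sum(p) == 1 after mutation, keeping index i untouched if possible.
    static void renormalize(double[] p, int i) {
        double s = 0;
        for (int n = 0; n < p.length; n++) if (n != i) s += p[n];
        if (s > 1) {
            for (int n = 0; n < p.length; n++) if (n != i) p[n] /= s;
            p[i] = 0;
        } else {
            p[i] = 1 - s;
        }
    }
}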
5 Using stochastic learning automata for Intelligent Vehicle Control
In this section, we present a method for intelligent
vehicle control, having as theoretical background
Stochastic Learning Automata. The aim here is to design
an automata system that can learn the best possible
action based on the data received from on-board sensors or from roadside-to-vehicle communications.
model, we assume that an intelligent vehicle is capable of two sets of actions, lateral and longitudinal. Lateral
actions are LEFT (shift to left lane), RIGHT (shift to
right lane) and LINE_OK (stay in current lane).
Longitudinal actions are ACC (accelerate), DEC
(decelerate) and SPEED_OK (keep current speed). An
autonomous vehicle must be able to “sense” the
environment around itself. Therefore, we assume that
there are four different sensors modules on board the
vehicle (the headway module, two side modules and a
speed module), in order to detect the presence of a
vehicle traveling in front of the vehicle or in the
immediately adjacent lane and to know the current speed
of the vehicle. These sensor modules evaluate the
information received from the on-board sensors or from
the highway infrastructure in the light of the current
automata actions, and send a response to the automata.
The response from the physical environment is a combination of outputs from the sensor modules.
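For concreteness, the two action sets could be encoded as follows (a sketch; only the action names come from the text, the Java representation is our own):

// Lateral and longitudinal action sets of the intelligent vehicle
// (action names from the text; the enum encoding is our own sketch).
enum LateralAction { LEFT, RIGHT, LINE_OK }
enum LongitudinalAction { ACC, DEC, SPEED_OK }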
Because an input parameter for the decision blocks is the action chosen by the stochastic automaton, it is necessary to use two distinct functions for mapping the outputs of the decision blocks into inputs for the two learning automata, namely the longitudinal automaton and the lateral automaton.
After updating the action probability vectors in both
learning automata, using the evolutionary reinforcement
scheme presented in section 4, the outputs from
stochastic automata are transmitted to the regulation
layer. The regulation layer handles the actions received
from the two automata in a distinct manner, using for
each of them a regulation buffer. If an action received
was rewarded, it will be introduced in the regulation
buffer of the corresponding automaton, else in buffer
will be introduced a certain value which denotes a
penalized action by the physical environment. The
regulation layer does not carry out the action chosen
immediately; instead, it carries out an action only if it is
recommended k times consecutively by the automaton,
where k is the length of the regulation buffer. After an
action is executed, the action probability vector is reinitialized to $1/r$, where $r$ is the number of actions, and the regulation buffer is also reinitialized.
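A possible sketch of such a regulation buffer (our own illustration, assuming integer-coded actions and a sentinel value for penalized actions):

// Sketch of the regulation layer's buffer: an action is carried out only
// if it has been recommended k times consecutively (our own illustration).
public class RegulationBuffer {
    public static final int PENALIZED = -1; // sentinel for a penalized action

    private final int k;   // buffer length: required consecutive recommendations
    private int last = PENALIZED;
    private int count = 0;

    public RegulationBuffer(int k) { this.k = k; }

    // Record the latest recommendation; returns the action to execute,
    // or PENALIZED while the k-in-a-row condition is not yet met.
    public int record(int action) {
        count = (action == last) ? count + 1 : 1;
        last = action;
        if (action != PENALIZED && count == k) {
            count = 0;           // buffer is reinitialized after execution
            last = PENALIZED;
            return action;
        }
        return PENALIZED;
    }
}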
6 Sensor modules
The four sensor modules mentioned above are decision blocks that calculate the response (reward/penalty) based on the last action chosen by the automaton. Table 1 describes the output of the decision blocks for the side sensors.
As seen in table 1, a penalty response is received from the left sensor module when the action is LEFT and there is a vehicle on the left or the vehicle is already traveling in the leftmost lane. The situation is similar for the right sensor module.
Left/Right Sensor Module

Actions  | Vehicle in sensor range or no adjacent lane | No vehicle in sensor range and adjacent lane exists
LINE_OK  | 0/0                                         | 0/0
LEFT     | 1/0                                         | 0/0
RIGHT    | 0/1                                         | 0/0

Table 1 Outputs from the Left/Right Sensor Module
The Headway (Frontal) Module is defined as shown
in table 2. If there is a vehicle at a close distance (<
admissible distance), a penalty response is sent to the
automaton for actions LINE_OK, SPEED_OK and
ACC. All other actions (LEFT, RIGHT, DEC) are
encouraged, because they may serve to avoid a collision.
Headway Sensor Module

Actions  | Vehicle in range (at a close frontal distance) | No vehicle in range
LINE_OK  | 1                                              | 0
LEFT     | 0                                              | 0
RIGHT    | 0                                              | 0
SPEED_OK | 1                                              | 0
ACC      | 1                                              | 0
DEC      | 0*                                             | 0

Table 2 Outputs from the Headway Module
The Speed Module compares the actual speed with the desired speed and, based on the action chosen, sends feedback to the longitudinal automaton.
The reward response indicated by 0* (from the Headway Sensor Module) is different from the normal reward response, indicated by 0: this reward response has a higher priority and must override a possible penalty from other modules.
Speed Sensor Module

Actions  | Speed: too slow | Acceptable speed | Speed: too fast
SPEED_OK | 1               | 0                | 1
ACC      | 0               | 0                | 1
DEC      | 1               | 0                | 0

Table 3 Outputs from the Speed Module
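Read as code, tables 2 and 3 amount to simple decision functions. A sketch (our own; action codes and method names are assumptions, with the 0* response encoded as 2, as in the simulator code of section 7):

// Decision blocks for the longitudinal automaton, transcribed from
// tables 2 and 3 (a sketch; action codes and names are our own).
public class LongitudinalModules {
    static final int LINE_OK = 0, LEFT = 1, RIGHT = 2,
                     SPEED_OK = 3, ACC = 4, DEC = 5;

    // Table 2: 1 = penalty, 0 = reward, 2 = high-priority reward (0*).
    static int frontModule(int action, boolean vehicleClose) {
        if (!vehicleClose) return 0;
        switch (action) {
            case LINE_OK: case SPEED_OK: case ACC: return 1;
            case DEC: return 2; // 0*: overrides penalties from other modules
            default:  return 0; // LEFT, RIGHT may serve to avoid a collision
        }
    }

    // Table 3: compare the actual speed against the desired speed.
    static int speedModule(int action, double speed, double desired, double tol) {
        boolean tooSlow = speed < desired - tol, tooFast = speed > desired + tol;
        switch (action) {
            case SPEED_OK: return (tooSlow || tooFast) ? 1 : 0;
            case ACC:      return tooFast ? 1 : 0;
            case DEC:      return tooSlow ? 1 : 0;
            default:       return 0;
        }
    }
}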
7 Implementation of a simulator
This section describes an implementation of a simulator for the Intelligent Vehicle Control System. The entire system was implemented in Java and is based on the JADE platform.

JADE is a middleware that facilitates the development of multi-agent systems and applications conforming to the FIPA standards for intelligent agents. Figure 1 shows the class diagram of the simulator. Each vehicle has an associated agent, responsible for its intelligent control.
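A per-vehicle agent skeleton might look as follows (a minimal sketch assuming the standard JADE Agent and TickerBehaviour classes; the class name and tick period are our own, and the actual simulator classes are those of figure 1):

import jade.core.Agent;
import jade.core.behaviours.TickerBehaviour;

// Minimal sketch of a per-vehicle JADE agent (our own illustration).
public class VehicleAgent extends Agent {
    @Override
    protected void setup() {
        // Periodically: read the sensor modules, let the automata choose
        // actions, update probabilities, pass actions to the regulation layer.
        addBehaviour(new TickerBehaviour(this, 100) {
            @Override
            protected void onTick() {
                // control step sketch goes here
            }
        });
    }
}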
The response of the physical environment is a combination of the outputs of all four sensor modules. The implementation of this combination for each automaton (longitudinal and lateral, respectively) is shown in figure 2 (the value 0* was substituted by 2).
A snapshot of the running simulator is shown in figure 3.
Fig. 1 The class diagram of the simulator
// environment response for the Longitudinal Automaton
public double reward(int action) {
    int combine;
    // combine the module outputs; a penalty (1) dominates a reward (0)
    combine = Math.max(speedModule(action), frontModule(action));
    // the high-priority reward 0* (encoded as 2) overrides any penalty
    if (combine == 2) combine = 0;
    return combine;
}

// environment response for the Lateral Automaton
public double reward(int action) {
    int combine;
    combine = Math.max(leftRightModule(action), frontModule(action));
    return combine;
}
Fig. 2 The physical environment response
Fig. 3 A scenario executed in the simulator
8 Conclusion
Reinforcement learning has attracted rapidly increasing
interest in the machine learning and artificial intelligence
communities. Its promise is beguiling - a way of
programming agents by reward and punishment without
needing to specify how the task (i.e., behavior) is to be
achieved. Reinforcement learning makes it possible, at least in principle, to bypass the problem of building an explicit model of the behavior to be synthesized and of its counterpart, a meaningful learning base (as required in supervised learning).
The evolutionary reinforcement scheme presented in this paper is suitable for nonstationary stochastic automata environments resulting from a changing physical environment. Used within a simulator of an Intelligent Vehicle Control System, this new reinforcement scheme has proved its efficiency.
References:
[1] A. Barto, S. Mahadevan, Recent advances in
hierarchical reinforcement learning, Discrete-Event
Systems journal, Special issue on Reinforcement
Learning, 2003.
[2] R. Sutton, A. Barto, Reinforcement learning: An
introduction, MIT-press, Cambridge, MA, 1998.
[3] O. Buffet, A. Dutech, F. Charpillet, Incremental
reinforcement learning for designing multi-agent
systems, In J. P. Müller, E. Andre, S. Sen, and C.
Frasson, editors, Proceedings of the Fifth
International Conference on Autonomous Agents, pp. 31–32, Montreal, Canada, 2001, ACM Press.
[4] J. Moody, Y. Liu, M. Saffell, K. Youn, Stochastic
direct reinforcement: Application to simple games
with recurrence, In Proceedings of Artificial
Multiagent Learning. Papers from the 2004 AAAI
Fall Symposium, Technical Report FS-04-02, 2004.
[5] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study of
Multiple Intelligent Vehicle Control using Stochastic
Learning Automata, TRANSACTIONS, the Quarterly
Archival Journal of the Society for Computer
Simulation International, volume 14, number 4,
December 1997.
[6] C. Ünsal, P. Kachroo, J. S. Bay, Multiple Stochastic
Learning Automata for Vehicle Path Control in an
Automated Highway System, IEEE Transactions on
Systems, Man, and Cybernetics -part A: systems and
humans, vol. 29, no. 1, January 1999.
[7] K. S. Narendra, M. A. L. Thathachar, Learning
Automata: an introduction, Prentice-Hall, 1989.
[8] S. Lakshmivarahan, M.A.L. Thathachar, Absolutely
Expedient Learning Algorithms for Stochastic
Automata, IEEE Transactions on Systems, Man and
Cybernetics, vol. SMC-6, pp. 281-286, 1973
[9] F. Stoica, E. M. Popa, An Absolutely Expedient
Learning Algorithm for Stochastic Automata,
WSEAS Transactions on Computers, Issue 2, Volume
6, February 2007, ISSN 1109-2750, pp. 229-235
[10] I. De Falco, A. Iazzetta, A. Della Cioppa, E.
Tarantino, Mijn Mutation Operator for Aerofoil
Design Optimisation, Research Institute on Parallel
Information Systems, National Research Council of
Italy, 2001
[11] I. De Falco, R. Del Balio, A. Della Cioppa, E.
Tarantino, A Comparative Analysis of Evolutionary
Algorithms for Function Optimisation, Research
Institute on Parallel Information Systems, National
Research Council of Italy, 1998
[12] I. De Falco, A. Iazzetta, A. Della Cioppa, E.
Tarantino, The Effectiveness of Co-mutation in
Evolutionary Algorithms: the Mijn operator, Research
Institute on Parallel Information Systems, National
Research Council of Italy, 2000
[13] H. Kita, A comparison study of self-adaptation in
evolution strategies and real-code genetic algorithms,
Evolutionary Computation, 9(2):223-241, 2001
[14] F. Stoica, D. Simian, C. Simian, A new co-mutation
genetic operator, In Proceedings of Applications of
Evolutionary Computing in Modeling and
Development of Intelligent Systems, within the 9th
WSEAS International Conference on
EVOLUTIONARY COMPUTING (EC '08), May
2008, Sofia, Bulgaria.