Automatic control based on Wasp Behavioral Model and Stochastic
Learning Automata
FLORIN STOICA, DANA SIMIAN
Computer Science Department
“Lucian Blaga” University Sibiu
Str. Dr. Ion Ratiu 5-7, 550012, Sibiu
ROMANIA
Abstract: - A stochastic automaton can perform a finite number of actions in a random environment. When a specific action is performed, the environment responds by producing an environment output that is stochastically related to the action. The aim is to design an automaton, using a reinforcement scheme based on the computational model of wasp behaviour, that can determine the best action guided by past actions and environment responses. Using Stochastic Learning Automata techniques, we introduce a decision/control method for intelligent vehicles receiving data from on-board sensors or from the localization system of the highway infrastructure.
Key-Words: - Stochastic Learning Automata, Wasp Behavioral Model, Intelligent Vehicle Control
1 Introduction
An automaton is a machine or control mechanism
designed to automatically follow a predetermined
sequence of operations or respond to encoded
instructions. The term stochastic emphasizes the adaptive nature of the automaton described here: it does not follow predetermined rules, but adapts to changes in its environment. This
adaptation is the result of the learning process. Learning
is defined as any permanent change in behavior as a
result of past experience, and a learning system should
therefore have the ability to improve its behavior with
time, toward a final goal.
The stochastic automaton attempts a solution of the
problem without any information on the optimal action
(initially, equal probabilities are attached to all the
actions). One action is selected at random, the response
from the environment is observed, action probabilities
are updated based on that response, and the procedure is
repeated. A stochastic automaton acting as described to
improve its performance is called a learning automaton.
The response values can be represented in three
different models. In the P-model, the response values are
either 0 or 1, in the S-model the response values are
continuous in the range (0, 1) and in the Q-model the
values belong to a finite set of discrete values in the
range (0, 1).
The environment can further be split into two types: stationary and nonstationary. In a stationary environment
the penalty probabilities will never change. In a
nonstationary environment the penalties will change
over time.
The algorithm that guarantees the desired learning
process is called a reinforcement scheme [5]. The
reinforcement scheme is the basis of the learning process
for learning automata. The general solution for
absolutely expedient schemes was found by
Lakshmivarahan and Thathachar [8]. However, these
reinforcement schemes are theoretically valid only in
stationary environments.
For a nonstationary automata environment resulting
from a changing physical environment, we propose a
new reinforcement scheme, based on the computational
model of wasp behaviour.
The aim of this paper is to define an agent-based
model for an Intelligent Vehicle Control System using
Stochastic Learning Automata with a learning process
driven by a reinforcement scheme with wasp-like
behaviour.
The remainder of this paper is organized as follows.
In section 2 we present the Stochastic learning automata
concept. Sections 3 and 4 contain aspects related to the
Intelligent Vehicle Control system. In section 5 we introduce the Wasp Behavioral Model. Our new reinforcement scheme is presented in section 6. Section 7 presents an implementation of a simulator for the Intelligent Vehicle Control System, and conclusions are given in section 8.
2 Stochastic learning automata
Mathematically, the environment of a stochastic automaton is defined by a triple {α, c, β}, where α = {α_1, α_2, ..., α_r} represents a finite set of actions being the input to the environment, β = {β_1, β_2} represents a binary response set, and c = {c_1, c_2, ..., c_r} is a set of penalty probabilities, where c_i is the probability that action α_i will result in an unfavourable response. Given that β(n) = 0 is a favourable outcome and β(n) = 1 is an unfavourable outcome at time instant n (n = 0, 1, 2, ...), the element c_i of c is defined mathematically by:

c_i = P(β(n) = 1 | α(n) = α_i), i = 1, 2, ..., r
A learning automaton generates a sequence of
actions on the basis of its interaction with the
environment. If the automaton is “learning” in the
process, its performance must be superior to “intuitive”
methods [6]. Consider a stationary random environment
with penalty probabilities { , ,..., } c1 c2 cr defined above.
In order to describe the reinforcement schemes, we define p(n), the vector of action probabilities:

p_i(n) = P(α(n) = α_i), i = 1, ..., r
We define a quantity M(n) as the average penalty
for a given action probability vector:
M(n) = P(β(n) = 1 | p(n))
     = Σ_{i=1}^{r} P(β(n) = 1 | α(n) = α_i) · P(α(n) = α_i)
     = Σ_{i=1}^{r} c_i p_i(n)
An automaton is absolutely expedient if the expected value of the average penalty at one iteration step is less than it was at the previous step, for all steps: E[M(n+1) | p(n)] < M(n) for all n [7].
Updating action probabilities can be represented as
follows:
p(n+1) = T[p(n), α(n), β(n)],
where T is a mapping. This formula says that the next action probability vector p(n+1) is computed from the current vector p(n), the action chosen α(n), and the environment response β(n). If p(n+1) is a linear function of p(n), the reinforcement scheme is said to be linear; otherwise it is termed nonlinear.
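To make this interaction loop concrete, the sketch below (in Java, the language of the simulator in section 7) pairs a P-model environment with a linear reward-inaction update. The update rule is a standard textbook choice used purely as an illustration of the mapping T, not the scheme introduced in section 6, and all names are ours.

import java.util.Arrays;
import java.util.Random;

// A minimal sketch of the automaton/environment loop; the linear
// reward-inaction update stands in for the generic mapping T.
public class LearningLoop {

    // P-model environment: returns 0 (reward) or 1 (penalty)
    interface Environment {
        int response(int action);
    }

    // sample an action according to the probability vector p
    static int chooseAction(double[] p, Random rng) {
        double x = rng.nextDouble(), cumulative = 0.0;
        for (int i = 0; i < p.length; i++) {
            cumulative += p[i];
            if (x < cumulative) return i;
        }
        return p.length - 1;   // guard against rounding error
    }

    static double[] run(Environment env, int r, int steps) {
        Random rng = new Random();
        double[] p = new double[r];
        Arrays.fill(p, 1.0 / r);            // equal initial probabilities
        double lambda = 0.1;                // learning rate (illustrative)
        for (int n = 0; n < steps; n++) {
            int a = chooseAction(p, rng);   // alpha(n)
            int beta = env.response(a);     // beta(n)
            if (beta == 0) {                // reward: move p toward action a
                for (int j = 0; j < r; j++)
                    p[j] = (j == a) ? p[j] + lambda * (1.0 - p[j])
                                    : (1.0 - lambda) * p[j];
            }
            // penalty: probabilities left unchanged (reward-inaction)
        }
        return p;
    }
}

In a stationary environment with fixed penalty probabilities c_i, such a loop tends to concentrate p on the action with the smallest c_i.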
Absolutely expedient learning schemes are presently
the only class of schemes for which necessary and
sufficient conditions of design are available.
However, the need for learning and adaptation in
systems is mainly due to the fact that the environment
changes with time. The performance of a learning
automaton should be judged in such a context. If a
learning automaton with a fixed strategy is used in a
nonstationary environment, it may become less
expedient, and even non-expedient. The learning scheme
must have sufficient flexibility to track the better actions.
If the action probabilities of the learning automata are functions of the status of the physical environment (as in the case of an Intelligent Vehicle Control System), building a deterministic mathematical model of this physical environment may be extremely difficult, if not impossible.
For a nonstationary automata environment resulting
from a changing physical environment, we propose a
new reinforcement scheme, based on the computational
model of wasp behaviour.
3 Using stochastic learning automata for
Intelligent Vehicle Control
In this section, we present a method for intelligent vehicle control, with Stochastic Learning Automata as its theoretical background. The aim here is to design an automata system that can learn the best possible action based on the data received from on-board sensors or from roadside-to-vehicle communications. For our model, we assume that an intelligent vehicle is capable of two sets of actions, lateral and longitudinal. Lateral
actions are LEFT (shift to left lane), RIGHT (shift to
right lane) and LINE_OK (stay in current lane).
Longitudinal actions are ACC (accelerate), DEC
(decelerate) and SPEED_OK (keep current speed). An
autonomous vehicle must be able to “sense” the
environment around itself. Therefore, we assume that
there are four different sensor modules on board the
vehicle (the headway module, two side modules and a
speed module), in order to detect the presence of a
vehicle traveling in front of the vehicle or in the
immediately adjacent lane and to know the current speed
of the vehicle. These sensor modules evaluate the
information received from the on-board sensors or from
the highway infrastructure in the light of the current
automata actions, and send a response to the automata.
The response from physical environment is a
combination of outputs from the sensor modules.
Because an input parameter for the decision blocks is the action chosen by the stochastic automaton, it is necessary to use two distinct functions for mapping the outputs of the decision blocks into inputs for the two learning automata, namely the longitudinal automaton and the lateral automaton.
After updating the action probability vectors in both learning automata, using the new reinforcement scheme presented in section 6, the outputs from the stochastic automata are transmitted to the regulation layer. The regulation layer handles the actions received from the two automata in a distinct manner, using a regulation buffer for each of them. If a received action was rewarded, it is introduced into the regulation buffer of the corresponding automaton; otherwise, a special value denoting an action penalized by the physical environment is introduced into the buffer. The regulation layer does not carry out the chosen action immediately; instead, it carries out an action only if it is recommended k times consecutively by the automaton, where k is the length of the regulation buffer. After an action is executed, the action probability vector is reset to 1/r for every action, where r is the number of actions, and the regulation buffer is also reinitialized.
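A minimal sketch of such a regulation buffer is given below; the class name and API are our own invention, and the encoding of a penalized action as -1 matches the one used later in the simulator code (Fig. 4).

import java.util.Arrays;

// A sketch of a regulation buffer of length k. PENALIZED marks an
// action that the physical environment penalized.
public class RegulationBuffer {
    public static final int PENALIZED = -1;
    private final int[] buf;

    public RegulationBuffer(int k) {
        buf = new int[k];
        clear();
    }

    // reinitialize the buffer (done after an action is executed)
    public void clear() {
        Arrays.fill(buf, PENALIZED);
    }

    // shift the buffer and record the latest rewarded action,
    // or PENALIZED if the environment penalized it
    public void push(int actionOrPenalized) {
        System.arraycopy(buf, 1, buf, 0, buf.length - 1);
        buf[buf.length - 1] = actionOrPenalized;
    }

    // true only if the action was recommended k times consecutively
    public boolean shouldExecute(int action) {
        for (int v : buf) if (v != action) return false;
        return true;
    }
}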
Our basic model for planning and coordination of lane
changing and speed control is shown in figure 1.
Fig. 1 The model of the Intelligent Vehicle Control
System
4 Sensor modules
The four sensor modules mentioned above act as teachers: they are decision blocks that calculate the response (reward/penalty) based on the last action chosen by the automaton. Table 1 describes the output of the decision blocks for the side sensors. As seen in table 1, a penalty response is received from the left sensor module when the action is LEFT and there is a vehicle in the left lane or the vehicle is already traveling in the leftmost lane. The situation is similar for the right sensor module.
           Left/Right Sensor Module
Actions    Vehicle in sensor     No vehicle in sensor
           range or no           range and adjacent
           adjacent lane         lane exists
LINE_OK    0/0                   0/0
LEFT       1/0                   0/0
RIGHT      0/1                   0/0
Table 1 Outputs from the Left/Right Sensor Module
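A direct transcription of Table 1 into code might look as follows (shown for the left sensor; the right one is symmetric). The method signature and its boolean inputs are assumptions on our part, since the paper lists only the combined response functions (Fig. 2).

// A sketch of the left decision block from Table 1.
// Returns 0 (reward) or 1 (penalty).
public class SideSensorModule {
    static final int LINE_OK = 0, LEFT = 1, RIGHT = 2;   // illustrative codes

    // vehicleInRange: a vehicle is detected in the adjacent left lane;
    // laneExists: an adjacent left lane exists at all
    public int leftModule(int action, boolean vehicleInRange, boolean laneExists) {
        if (action == LEFT && (vehicleInRange || !laneExists))
            return 1;   // penalty: shifting left is unsafe or impossible
        return 0;       // every other case is rewarded (the 0 entries of Table 1)
    }
}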
The Headway (Frontal) Module is defined as shown
in table 2. If there is a vehicle at a close distance (<
admissible distance), a penalty response is sent to the
automaton for actions LINE_OK, SPEED_OK and
ACC. All other actions (LEFT, RIGHT, DEC) are
encouraged, because they may serve to avoid a collision.
           Headway Sensor Module
Actions    Vehicle in range      No vehicle
           (at a close           in range
           frontal distance)
LINE_OK    1                     0
LEFT       0                     0
RIGHT      0                     0
SPEED_OK   1                     0
ACC        1                     0
DEC        0*                    0
Table 2 Outputs from the Headway Module
The Speed Module compares the actual speed with the desired speed and, based on the action chosen, sends feedback to the longitudinal automaton.
The reward response indicated by 0* (from the Headway Sensor Module) is different from the normal reward response, indicated by 0: this reward response has a higher priority and must override a possible penalty from other modules.
           Speed Sensor Module
Actions    Speed:      Acceptable    Speed:
           too slow    speed         too fast
SPEED_OK   1           0             1
ACC        0           0             1
DEC        1           0             0
Table 3 Outputs from the Speed Module
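Under the same assumptions, the headway and speed decision blocks of Tables 2 and 3 could be transcribed as below. The action codes and the speed tolerance are invented for the sketch; the value 2 stands for the overriding reward 0*, matching the substitution used in the paper's Fig. 2.

// Sketches of the headway (Table 2) and speed (Table 3) decision blocks.
public class LongitudinalModules {
    static final int SPEED_OK = 0, ACC = 1, DEC = 2,
                     LINE_OK = 3, LEFT = 4, RIGHT = 5;   // illustrative codes

    // Table 2: penalize LINE_OK, SPEED_OK and ACC when a vehicle is closer
    // than the admissible distance; DEC earns the overriding reward 0*,
    // encoded as 2; LEFT and RIGHT are encouraged
    public int frontModule(int action, boolean vehicleClose) {
        if (!vehicleClose) return 0;
        if (action == DEC) return 2;
        if (action == LINE_OK || action == SPEED_OK || action == ACC) return 1;
        return 0;
    }

    // Table 3: compare the actual speed with the desired speed
    public int speedModule(int action, double speed, double desired) {
        double tol = 5.0;                    // tolerance chosen arbitrarily
        if (speed < desired - tol)           // too slow: only ACC is rewarded
            return (action == ACC) ? 0 : 1;
        if (speed > desired + tol)           // too fast: only DEC is rewarded
            return (action == DEC) ? 0 : 1;
        return 0;                            // acceptable speed: all rewarded
    }
}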
5 Wasp Behavioral Model
Theraulaz et al. present a model for self-organization within a colony of wasps [15]. In a colony of wasps, each individual interacts with its local environment through a stimulus-response mechanism, which governs distributed task allocation. An individual wasp has a response threshold for each zone of the nest. Based on a wasp's threshold for a given zone and the amount of stimulus from brood located in that zone, the wasp may or may not become engaged in the task of foraging for this zone. A lower response threshold for a given zone amounts to a higher likelihood of engaging in the activity, given a stimulus. In [16], a model is discussed in which these thresholds remain fixed over time. Later, in [17], it is considered that a threshold for a given task decreases during time periods when that task is performed and increases otherwise. In [18], Cicirello and Smith present a system which incorporates aspects of the wasp model that have been ignored by other authors. They consider three ways in which the response thresholds are updated. The first two are analogous to those of the natural wasp model. The third is included to encourage a wasp associated with an idle machine to take whatever jobs are available rather than remain idle. The model of wasp behaviour also describes the nature of wasp-to-wasp interaction that takes place within the nest. When two individuals of the colony encounter each other, they may, with some probability, interact in a dominance contest. The wasp with the higher social rank will have a higher probability of dominating in the interaction. Wasps within the colony thus self-organize themselves into a dominance hierarchy. This aspect of the behaviour model is incorporated in [18]: when two or more of the wasp-like agents bid for a given job, the winner is chosen through a tournament of dominance contests.
The following section presents a new reinforcement scheme for stochastic learning automata.
6 A new reinforcement scheme
In this section we present our approach to implementing a new reinforcement scheme based on a computational model of wasp behavior.
Each stochastic automaton (namely the longitudinal automaton and the lateral automaton) has an associated wasp. Each wasp w has a set of response thresholds:

Θ_w = {θ_{w,1}, ..., θ_{w,r}}

where r denotes the number of actions of the automaton (as defined in section 2) and θ_{w,i} is the response threshold of wasp w for action i. An action i broadcasts to the automaton a stimulus S_i, equal to the number of occurrences of action i in the regulation buffer of the automaton. The stochastic automaton will pick the action i emitting the stimulus S_i with probability:

P(θ_{w,i}, S_i) = S_i^2 / (S_i^2 + θ_{w,i}^2)

The threshold values θ_{w,i} may vary in the range [θ_min, θ_max].
If the automaton executes action i, the threshold θ_{w,i} is updated as follows:

θ_{w,i} = θ_{w,i} − δ_1, δ_1 > 0

For each action j other than i, the threshold θ_{w,j} is updated according to:

θ_{w,j} = θ_{w,j} + δ_2, δ_2 > 0
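As a numerical illustration (the values are ours, not from the paper): if action i currently occupies S_i = 3 of the k = 5 slots of the regulation buffer and its threshold is θ_{w,i} = 2, the automaton selects i with probability 3^2/(3^2 + 2^2) = 9/13 ≈ 0.69; once the threshold has been lowered by δ_1 = 0.5 to 1.5, the same stimulus yields 9/(9 + 2.25) = 0.8, so actions that keep being recommended become progressively easier to select.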
7 Implementation of a simulator
This section describes an implementation of a simulator for the Intelligent Vehicle Control System. The entire system was implemented in Java and is based on the JADE platform.
JADE is a middleware that facilitates the
development of multi-agent systems and applications
conforming to FIPA standards for intelligent agents. Figure 3 shows the class diagram of the simulator. Each vehicle has an associated agent, responsible for its intelligent control.
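As an illustration of the JADE side alone, a vehicle agent can be reduced to the following sketch; the class name, the tick period and the body of the behaviour are our assumptions, not the simulator's actual code, which would invoke the learning() method of the two automata (Fig. 4) on each tick.

import jade.core.Agent;
import jade.core.behaviours.TickerBehaviour;

// A minimal JADE vehicle agent sketch.
public class VehicleAgent extends Agent {
    @Override
    protected void setup() {
        // perform one learning step every 100 ms (period chosen arbitrarily)
        addBehaviour(new TickerBehaviour(this, 100) {
            @Override
            protected void onTick() {
                // here the simulator would run the lateral and
                // longitudinal learning processes
                System.out.println(getLocalName() + ": learning step");
            }
        });
    }
}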
The response of the physical environment is a
combination of the outputs of all four sensor modules.
The implementation of this combination for each automaton (longitudinal and lateral, respectively) is shown in figure 2 (the value 0* was substituted by 2).
A snapshot of the running simulator is shown in figure 5.
// environment response for the Longitudinal Automaton
public double reward(int action) {
    int combine;
    combine = Math.max(speedModule(action), frontModule(action));
    // 2 encodes the overriding reward 0*: it wins the max() against
    // a penalty (1) and is then mapped back to a reward (0)
    if (combine == 2) combine = 0;
    return combine;
}

// environment response for the Lateral Automaton
public double reward(int action) {
    int combine;
    combine = Math.max(leftRightModule(action), frontModule(action));
    return combine;
}
Fig. 2 The physical environment response
Fig. 3 The class diagram of the simulator
The implementation of the learning process for the lateral automaton is given below:
// The learning process
public void learning() {
    int i, j;
    double f;
    boolean doIt;

    // choose an action and compute the environment response
    i = getAction();
    f = reward(i);

    // shift the regulation buffer and record the latest action:
    // the action index if rewarded, -1 if penalized
    for (int k = 1; k < HISTORY; k++)
        regulation_layer[k-1] = regulation_layer[k];
    if (f == 0)
        regulation_layer[HISTORY-1] = i;
    else
        regulation_layer[HISTORY-1] = -1;   // ignore the action!

    // execute the action only if it fills the whole buffer, i.e.
    // it was recommended HISTORY times consecutively
    doIt = true;
    for (int k = 0; k < HISTORY; k++)
        if (regulation_layer[k] != i) {
            doIt = false;
            break;
        }
    if (doIt) {
        init();   // all action probabilities become equal (1/r)
        switch (i) {
            case LEFT:
                auto.setLane(auto.getLane() - 1);
                break;
            case RIGHT:
                auto.setLane(auto.getLane() + 1);
                break;
        }
        return;
    }

    // update the action probabilities using the reinforcement
    // scheme based on the wasp-like computational model
    if (f == 0) {
        // reward: lower the threshold of the chosen action,
        // raise the thresholds of all the others
        if (t[i] - d1 >= t_min) t[i] = t[i] - d1;
        S[i] = getStimulus(i);
        p[i] = S[i]*S[i] / (S[i]*S[i] + t[i]*t[i]);
        for (j = 0; j < ACTIONS; j++)
            if (j != i) {
                if (t[j] + d2 <= t_max) t[j] = t[j] + d2;
                S[j] = getStimulus(j);
                p[j] = S[j]*S[j] / (S[j]*S[j] + t[j]*t[j]);
            }
    } else {
        // penalty: raise the threshold of the chosen action,
        // lower the thresholds of all the others
        if (t[i] + d2 <= t_max) t[i] = t[i] + d2;
        S[i] = getStimulus(i);
        p[i] = S[i]*S[i] / (S[i]*S[i] + t[i]*t[i]);
        for (j = 0; j < ACTIONS; j++)
            if (j != i) {
                if (t[j] - d1 >= t_min) t[j] = t[j] - d1;
                S[j] = getStimulus(j);
                p[j] = S[j]*S[j] / (S[j]*S[j] + t[j]*t[j]);
            }
    }
}
Fig. 4 The learning process for the lateral automaton
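The listing does not include getAction(). Because the wasp-based updates above do not keep the vector p summing to one, a plausible sketch (our assumption, written as members of the same class as the code in Fig. 4, using java.util.Random) normalizes on the fly during roulette-wheel selection:

// A plausible sketch of getAction(): roulette-wheel selection over the
// action probability vector p[], normalizing on the fly because the
// wasp-based updates do not keep p[] summing to 1.
private final java.util.Random rng = new java.util.Random();

public int getAction() {
    double total = 0.0;
    for (int j = 0; j < ACTIONS; j++) total += p[j];
    double x = rng.nextDouble() * total;
    double cumulative = 0.0;
    for (int j = 0; j < ACTIONS; j++) {
        cumulative += p[j];
        if (x < cumulative) return j;
    }
    return ACTIONS - 1;   // guard against floating-point rounding
}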
Fig. 5 A scenario executed in the simulator
8 Conclusion
Reinforcement learning has attracted rapidly increasing interest in the machine learning and artificial intelligence communities. Its promise is beguiling: a way of programming agents by reward and punishment without needing to specify how the task (i.e., behavior) is to be achieved. Reinforcement learning makes it possible, at least in principle, to bypass the problem of building an explicit model of the behavior to be synthesized and its counterpart, a meaningful learning base (supervised learning).
The new reinforcement scheme presented in this paper is suitable for a nonstationary stochastic automata environment resulting from a changing physical environment. Used within a simulator of an Intelligent Vehicle Control System, this new reinforcement scheme, based on a computational model of wasp behavior, has proved its efficiency.
Acknowledgment: This work was completed with the
support of the research grant of the Romanian Ministry
of Education and Research, CNCSIS 33/2007.
References:
[1] A. Barto, S. Mahadevan, Recent advances in
hierarchical reinforcement learning, Discrete-Event
Systems journal, Special issue on Reinforcement
Learning, 2003.
[2] R. Sutton, A. Barto, Reinforcement learning: An
introduction, MIT Press, Cambridge, MA, 1998.
[3] O. Buffet, A. Dutech, F. Charpillet, Incremental
reinforcement learning for designing multi-agent
systems, In J. P. Müller, E. Andre, S. Sen, and C.
Frasson, editors, Proceedings of the Fifth International Conference on Autonomous Agents, pp. 31–32, Montreal, Canada, 2001, ACM Press.
[4] J. Moody, Y. Liu, M. Saffell, K. Youn, Stochastic
direct reinforcement: Application to simple games
with recurrence, In Proceedings of Artificial Multiagent Learning, Papers from the 2004 AAAI Fall Symposium, Technical Report FS-04-02, 2004.
[5] C. Ünsal, P. Kachroo, J. S. Bay, Simulation Study of
Multiple Intelligent Vehicle Control using Stochastic
Learning Automata, TRANSACTIONS, the Quarterly
Archival Journal of the Society for Computer
Simulation International, volume 14, number 4,
December 1997.
[6] C. Ünsal, P. Kachroo, J. S. Bay, Multiple Stochastic
Learning Automata for Vehicle Path Control in an
Automated Highway System, IEEE Transactions on
Systems, Man, and Cybernetics -part A: systems and
humans, vol. 29, no. 1, January 1999.
[7] K. S. Narendra, M. A. L. Thathachar, Learning
Automata: an introduction, Prentice-Hall, 1989.
[8] S. Lakshmivarahan, M.A.L. Thathachar, Absolutely
Expedient Learning Algorithms for Stochastic
Automata, IEEE Transactions on Systems, Man and
Cybernetics, vol. SMC-6, pp. 281-286, 1973
[9] F. Stoica, E. M. Popa, An Absolutely Expedient
Learning Algorithm for Stochastic Automata,
WSEAS Transactions on Computers, Issue 2, Volume
6, February 2007, ISSN 1109-2750, pp. 229-235
[10] I. De Falco, A. Iazzetta, A. Della Cioppa, E.
Tarantino, Mijn Mutation Operator for Aerofoil
Design Optimisation, Research Institute on Parallel
Information Systems, National Research Council of
Italy, 2001
[11] I. De Falco, R. Del Balio, A. Della Cioppa, E.
Tarantino, A Comparative Analysis of Evolutionary
Algorithms for Function Optimisation, Research
Institute on Parallel Information Systems, National
Research Council of Italy, 1998
[12] I. De Falco, A. Iazzetta, A. Della Cioppa, E.
Tarantino, The Effectiveness of Co-mutation in
Evolutionary Algorithms: the Mijn operator, Research
Institute on Parallel Information Systems, National
Research Council of Italy, 2000
[13] H. Kita, A comparison study of self-adaptation in
evolution strategies and real-code genetic algorithms,
Evolutionary Computation, 9(2):223-241, 2001
[14] F. Stoica, D. Simian, C. Simian, A new co-mutation
genetic operator, In Proceedings of Applications of
Evolutionary Computing in Modeling and
Development of Intelligent Systems, within the 9th
WSEAS International Conference on
EVOLUTIONARY COMPUTING (EC '08), May
2008, Sofia, Bulgaria.
[15] G. Theraulaz, E. Bonabeau, J. Gervet, J. L. Deneubourg, Task differentiation in Polistes wasp colonies: a model for self-organizing groups of robots, From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, 1991, pp. 346–355.
[16] E. Bonabeau, G. Theraulaz, J. L. Deneubourg, Fixed response thresholds and the regulation of division of labor in insect societies, Bull. Math. Biol., vol. 60, 1998, pp. 753–807.
[17] G. Theraulaz, E. Bonabeau, J. L. Deneubourg, Response threshold reinforcement and division of labour in insect societies, Proc. R. Soc. London B, vol. 265, no. 1393, 1998, pp. 327–335.
[18] V. A. Cicirello, S. F. Smith, Wasp-like Agents for Distributed Factory Coordination, Autonomous Agents and Multi-Agent Systems, vol. 8, no. 3, 2004, pp. 237–267.
[19] D. Simian, F. Stoica, C. Simian, Models for a
Multi-Agent System Based on Wasp-Like Behaviour
for Distributed Patients Repartition, 9th WSEAS
International Conference on EVOLUTIONARY
COMPUTING (EC’08), Sofia, Bulgaria, May 2-4,
2008.