Black box optimization of restart strategies for the MetaMax algorithm
1. Czech Technical University in Prague
Faculty of Electrical Engineering
DIPLOMA THESIS
Bc. Viktor Kajml
Black box optimization: Restarting versus MetaMax
algorithm
Department of Cybernetics
Project supervisor: Ing. Petr Posik, Ph.D.
Prague, 2014
Abstrakt
Tato diplomová práce se zabývá vyhodnocením nového perspektivního optimalizačního algoritmu, nazvaného MetaMax. Hlavním cílem je zhodnotit vhodnost jeho použití pro řešení problémů optimalizace černé skříňky se spojitými parametry, obzvláště v porovnání s ostatními metodami běžně používanými v této oblasti. Za tímto účelem je MetaMax a vybrané tradiční restartovací strategie podrobně otestován na rozsáhlé sadě srovnávacích funkcí, za použití různých algoritmů lokálního prohledávání. Takto naměřené výsledky jsou poté porovnány a vyhodnoceny. Druhotným cílem je navrhnout a implementovat modifikace algoritmu MetaMax v jistých oblastech, kde je prostor pro zlepšení jeho výkonů.
Abstract
This diploma thesis is focused on evaluating a new promising multi-start optimization algorithm called MetaMax. The main goal is to assess its utility in the area of black-box continuous parameter optimization, especially in comparison with other strategies commonly used in this area. To achieve this, MetaMax and a selection of traditional restart strategies are thoroughly tested on a large set of benchmark problems, using multiple different local search algorithms. Their results are then compared and evaluated. An additional goal is to suggest and implement modifications of the MetaMax algorithm in certain areas where there seems to be room for improvement.
I would like to thank:
Mr. Petr Pošík for his help on this thesis
The Centre of Machine Perception at the Czech Technical University in Prague
for providing me with access to their computer grid
My friends and family for their support
1 Introduction
The goal of this thesis is to implement and evaluate the performance of the
MetaMax optimization algorithm, particularly in comparison with other commonly
used optimization strategies.
MetaMax was proposed by György and Kocsis in [GK11] and the results they
present seem very interesting and suggest that MetaMax might be a very competitive
algorithm. Our goal is to more closely evaluate its performance on problems from
the area of black-box continuous optimization, by performing a series of exhaustive
measurements and comparing the results with those of several commonly used restart
strategies.
This text is organized as follows: first, there is a short overview of the subjects of mathematical, continuous and black-box optimization, local search algorithms and multi-start strategies. This is meant as an introduction for readers who might not be familiar with these topics. Readers who already have knowledge of these fields might wish to skip forward to the following sections, which describe the MetaMax algorithm, the experimental setup, the optimization strategies used and the software implementation. In the last two sections, the measured results are summed up and evaluated.
The mathematical optimization problem is defined as selecting the best element, according to some criteria, from a set of feasible elements. The most common form of the problem is finding a set of parameters x_{1,opt}, ..., x_{d,opt}, where d is the problem dimension, for which the value of a given objective function f(x_1, ..., x_d) is minimal, that is f(x_{1,opt}, ..., x_{d,opt}) ≤ f(x_1, ..., x_d) for all possible values of x_1, ..., x_d.
Within this field of mathematical optimization, it is possible to define several subfields based on the properties of the parameters x_1, ..., x_d and the amount of information available about the objective function f.
Combinatorial optimization: The set of all possible solutions (possible combinations of the parameter values) is finite. Usually some subset of N^d.
Integer programming: All of the parameters are restricted to be integers: x_1, ..., x_d ∈ N. Can be considered to be a subset of combinatorial optimization.
Mixed integer programming: Some parameters are real-valued and some are integers.
Continuous optimization: The set of all possible solutions is infinite. Usually x_1, ..., x_d ∈ R.
Black-box optimization: Assumes that only a bare minimum of information about f is given. It can be evaluated at an arbitrary point x, returning the function value f(x), but besides that, no other properties of f are known. In order to solve this kind of problem, we have to resort to searching (the exact techniques are described in more detail later in this text). Furthermore, we are almost never guaranteed to find the exact solution, just one that is sufficiently close to it, and there is almost always a non-zero probability that even an approximate solution might not be found at all.
White box optimization deals with problems where we have some additional knowledge about f, for example its gradient, which can obviously be very useful when looking for its minimum.
In this text we will deal almost exclusively with black-box continuous optimization problems.
For a practical example of a black-box optimization problem, imagine the process of trying to design an airfoil which should have certain desired properties. It is possible to describe the airfoil by a vector of variables representing its various parameters - length, thickness, shape, etc. This will be the parameter vector x. Then, we can run an aerodynamic simulation with the airfoil described by x, evaluate how closely it matches the desired properties, and based on that, assign a function value f(x) to the parameter vector. In this way, the simulator becomes the black-box function f and the problem is transformed into the task of minimizing the objective function f. We can then use black-box optimization methods to find the parameter vector x_opt which will give us an airfoil with the desired properties.
This example hopefully sufficiently illustrates the fact that black-box optimization can be a very powerful tool, as it allows us to find reasonably good solutions even for problems which we might not be able to, or would not know how to, solve otherwise.
As already mentioned, the usual method for finding optima (the best possible set of parameters x_opt) in continuous mathematical optimization is searching. The structure of a typical local search algorithm is as follows:
Algorithm 1: Typical structure of a local search algorithm
1 Select a starting solution x0 somehow (most commonly randomly) from the set of feasible solutions.
2 Set current solution: xc ← x0
3 Get function value f(xc).
4 while stop condition not met do
5   Generate a set of neighbour solutions Xn similar to xc
6   Evaluate f at each xn ∈ Xn
7   Find the best neighbour solution x* = argmin_{xn ∈ Xn} f(xn)
8   if f(x*) < f(xc) then
9     Update the current solution xc ← x*
10  else
11    Modify the way of generating neighbour solutions
12 return xc
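The loop of algorithm 1 can be sketched in code. The following Python fragment is an illustrative toy, not the thesis implementation: the neighbourhood scheme (uniform perturbations), the contraction factor and all parameter values are assumptions.

```python
import random
random.seed(0)

def local_search(f, x0, step=0.5, shrink=0.5, n_neighbours=10, max_iters=200):
    """Toy version of algorithm 1: generate random neighbours around the
    current solution, move greedily, contract the step size on failure."""
    xc = list(x0)
    fc = f(xc)
    for _ in range(max_iters):
        # Generate a set of neighbour solutions Xn similar to xc.
        neighbours = [[xi + random.uniform(-step, step) for xi in xc]
                      for _ in range(n_neighbours)]
        x_best = min(neighbours, key=f)
        f_best = f(x_best)
        if f_best < fc:
            xc, fc = x_best, f_best      # update the current solution
        else:
            step *= shrink               # modify neighbour generation
    return xc, fc

# Usage: minimize the (unimodal) sphere function from an arbitrary start.
sphere = lambda x: sum(xi * xi for xi in x)
x_found, f_found = local_search(sphere, [3.0, -2.0])
```

On a unimodal function like this, the sketch converges toward the single optimum; on a multimodal function it would get stuck in the nearest basin of attraction, exactly as discussed below.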
In the case of continuous optimization, a solution is represented simply by a point in R^d. There are various ways of generating neighbour solutions. In general, two neighbouring solutions should be different from each other, but in some sense also similar. In continuous optimization, this usually means that the solutions are close in terms of Euclidean distance, but not identical.
The algorithm described above has the property that it always creates neighbour solutions close to the current solution and moves the current solution in the direction of decreasing f(x). This makes it a greedy algorithm, which works well in cases where the objective function is unimodal (has only one optimum), but for multimodal functions (functions with multiple local optima), the resulting behaviour will not be ideal. The algorithm will move in the direction of the nearest optimum (the optimum with the basin of attraction containing x0), but when it gets there it will not move any further, as at this point all the neighbour solutions will be worse than the current solution. Such an algorithm can therefore be relied on to find the nearest local optimum, but there is no guarantee that it will also be the global one. The global optimum will be found only when x0 happens to land in its basin of attraction.
The method which is most commonly used to overcome this problem is to run multiple instances of the local search algorithm from different starting positions x0. Then it is probable that at least one of them will start in the basin of attraction of the global optimum and will be able to find it.
There are various different multi-start strategies which implement this basic idea, with MetaMax, the main subject of this thesis, being one of them.
The problem of local search algorithms getting stuck in a local optimum is described more thoroughly in section 2. A detailed description of the MetaMax algorithm and its variations is given in section 3. The structure of the performed experiments is described in section 4. Finally, the measured results are presented and evaluated in section 5.
2 Problem description and related work
As mentioned in the previous section, local search algorithms have problems finding the global optimum of functions with multiple optima (also called multimodal functions). In this section we focus on this problem more thoroughly. We describe several common types of local search algorithms in more detail and discuss their susceptibility to getting stuck in a local optimum. Next, we describe several methods to overcome this problem.
2.1 Local search algorithms
The following descriptions of four commonly used kinds of local search algorithms should give the reader a more concrete idea about the functioning of local search algorithms than the very basic example described in algorithm 1.
Line search algorithms try to solve the problem of minimizing a d-dimensional function f by using a series of one-dimensional minimization tasks, called line searches. During each step of the algorithm, an imaginary line is created, starting at the current solution xc and going in a suitably chosen direction σ. Then, the line is searched for a point x with the minimal value of f(x), and the current solution is updated: xc ← x. In this way, the algorithm will eventually converge on the nearest local optimum of f.
The question remains - how to choose the search direction σ? The simplest algorithms just use a preselected set of directions (usually vectors in an orthonormal positive d-dimensional base) and loop through them on successive iterations. This method is quite simple to implement, but it has trouble coping with ill-conditioned functions.
An obvious idea might be to use information about the function's gradient to determine the search direction. However, this turns out not to be much more effective than simple alternating algorithms. The best results are achieved when information about both the function's gradient and its Hessian is used. Then, it is possible to get quite robust and well performing algorithms. Note that for black-box optimization problems, it is necessary to obtain the gradient by estimation, as it is not explicitly available.
Examples of this kind of algorithm are: the symmetric rank-one method, the gradient descent algorithm and the Broyden-Fletcher-Goldfarb-Shanno algorithm.
Pattern search algorithms closely fit the description given in algorithm 1. They generate the neighbour solutions xn ∈ Xn in defined positions (a pattern) relative to the current solution xc. If any of the neighbour solutions is found to be better than the current one, it then becomes the new current solution, the next set of neighbour solutions is generated around it, and so on.
If none of the neighbour solutions is found to be better (an unsuccessful iteration), then the pattern is contracted so that in the next step the neighbour solutions are generated closer to xc. In this way the algorithm will converge to the nearest local optimum (for proof, please see [KLT03]). Advanced pattern search algorithms use patterns which change size and shape according to various rules, both on successful and unsuccessful iterations.
Typical algorithms of this type are: compass search (or coordinate search), the Nelder-Mead simplex algorithm and the Luus-Jaakola algorithm.
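The contraction rule described above can be made concrete with a minimal compass search. This is an illustrative sketch, not taken from any of the cited works; the contraction factor 0.5 and the stopping tolerance are assumed values.

```python
def compass_search(f, x0, step=1.0, tol=1e-6, max_evals=10_000):
    """Sketch of compass (coordinate) pattern search: poll the 2*d axis
    directions around xc, move on improvement, halve the pattern size
    after an unsuccessful iteration."""
    xc, fc, d, evals = list(x0), f(x0), len(x0), 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(d):
            for sign in (+1.0, -1.0):
                xn = list(xc)
                xn[i] += sign * step
                fn = f(xn)
                evals += 1
                if fn < fc:
                    xc, fc, improved = xn, fn, True
        if not improved:
            step *= 0.5   # unsuccessful iteration: contract the pattern
    return xc, fc

# Usage: converges to the nearest (here: global) optimum of the sphere.
sphere = lambda x: sum(xi * xi for xi in x)
x_min, f_min = compass_search(sphere, [2.3, -1.7])
```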
Population based algorithms keep track of a number of solutions at one time, also called individuals, which together constitute a population. A new generation of solutions is generated each step, based on the properties of a set of selected (usually the best) individuals from the previous generation. Different algorithms vary in the exact implementation of this process.
For example, in the family of genetic algorithms, this process is designed to emulate natural evolution: the properties of each individual (in the case of continuous optimization, this means its position) are encoded into a genome, and new individuals are created by combining parts of the genomes of successful individuals from the previous generation, or by random mutation. Unsuccessful individuals are discarded, in an analogy with the natural principle of survival of the fittest.
Other population based algorithms, such as CMA-ES, take a somewhat more mathematical approach: new generations are populated by sampling a multivariate normal distribution, which is in turn updated every step, based on the properties of a number of the best individuals from the previous generation.
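The sampling-and-recombination idea can be sketched as a toy (mu, lambda) evolution strategy. This is a heavily simplified illustration: unlike real CMA-ES, it keeps an isotropic sampling distribution and replaces covariance and step-size adaptation with a crude fixed decay; all parameter values are assumptions.

```python
import random
random.seed(1)

def simple_es(f, mean, sigma=1.0, lam=20, mu=5, generations=100):
    """Toy (mu, lambda) evolution strategy: sample lam individuals around
    the mean, keep the mu best, average them into the new mean."""
    d = len(mean)
    for _ in range(generations):
        pop = [[m + sigma * random.gauss(0, 1) for m in mean]
               for _ in range(lam)]
        pop.sort(key=f)                  # best individuals first
        elite = pop[:mu]
        mean = [sum(ind[i] for ind in elite) / mu for i in range(d)]
        sigma *= 0.97                    # crude stand-in for step-size adaptation
    return mean, f(mean)

# Usage: the population mean drifts toward the optimum of the sphere.
sphere = lambda x: sum(xi * xi for xi in x)
m_found, f_found = simple_es(sphere, [4.0, -3.0])
```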
4
Swarm intelligence algorithms are based on the observation that it is possible to get quite well performing optimization algorithms by trying to emulate natural behaviours, such as the flocking of birds or fish schools. Each solution represents one member of a swarm and moves around the search space according to a simple set of rules. For example, it might try to keep a certain minimal distance from other flock members, while also heading in the direction with the best values of f(x). The specific rules vary a great deal between different algorithms, but in general even a simple individual behaviour is often enough to result in quite complex collective emergent behaviour. Because swarm intelligence algorithms keep track of multiple individuals/solutions during each step, they can also be considered to be a subset of population based algorithms.
Some examples of this class of algorithms are the particle swarm optimization algorithm and the fish school search algorithm.
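The movement rules can be illustrated with a minimal particle swarm optimizer. The inertia and attraction coefficients below are common textbook values, not parameters used in this thesis.

```python
import random
random.seed(2)

def pso(f, bounds, n=20, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm sketch: every particle is pulled toward its
    personal best position and the swarm-wide best position."""
    d = len(bounds)
    xs = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    vs = [[0.0] * d for _ in range(n)]
    pbest = [list(x) for x in xs]
    pbest_f = [f(x) for x in xs]
    g = min(range(n), key=lambda i: pbest_f[i])
    gbest, gbest_f = list(pbest[g]), pbest_f[g]
    for _ in range(iters):
        for i in range(n):
            for j in range(d):
                vs[i][j] = (w * vs[i][j]
                            + c1 * random.random() * (pbest[i][j] - xs[i][j])
                            + c2 * random.random() * (gbest[j] - xs[i][j]))
                xs[i][j] += vs[i][j]
            fx = f(xs[i])
            if fx < pbest_f[i]:          # update the personal best
                pbest[i], pbest_f[i] = list(xs[i]), fx
                if fx < gbest_f:         # update the swarm best
                    gbest, gbest_f = list(xs[i]), fx
    return gbest, gbest_f

sphere = lambda x: sum(xi * xi for xi in x)
g_found, f_found = pso(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```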
Pattern search and line search algorithms have the property that they always choose neighbour solutions close to the current solution and move in the direction of decreasing f(x). Thus, as was already described in the previous section, they are able to find only the local optimum which is nearest to their starting position x0. Population based and swarm intelligence algorithms might be somewhat less susceptible to this behaviour in the case where the initial population is spread over a large area of the search space. Then there is a chance that some individuals might land near the global optimum and eventually pull the others towards it.
There are several modifications of local search algorithms specifically designed to overcome the problem of getting stuck in a local optimum. We shall now describe two basic ones - simulated annealing and tabu search. The main idea behind them is to limit the local search algorithm's greedy behaviour by sometimes taking steps other than those which lead to the greatest decrease of f(x).
Simulated annealing implements the above mentioned idea in a very straightforward way: during each step, the local search algorithm may select any of the generated neighbour solutions with a non-zero probability, thus possibly not selecting the best one.
The probability P of choosing a particular neighbour solution xn is a function of f(xc), f(xn) and s, where s is the number of steps already taken by the algorithm. Usually, it increases with the value of ∆f = f(xc) − f(xn), so that the best neighbour solutions are still likely to be picked the most often. The probability of choosing a neighbour solution other than the best one also usually decreases as s increases, so that the algorithm behaves more randomly in the beginning and then, as time goes on, settles down to a more predictable behaviour and converges to the nearest optimum. This is somewhat similar to the metallurgical process of annealing, from which the algorithm takes its name.
It is possible to apply this method to almost any of the previously mentioned local search algorithms, simply by adding the possibility of choosing neighbour solutions which are not the best. In practice, the exact form of P(f(xc), f(xn), s) has to be fine-tuned for a given problem in order to get good results. Therefore, this algorithm is of limited usefulness in the area of black-box optimization.
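The acceptance rule just described can be sketched as follows. The Metropolis-style probability exp(-∆f/T), the proposal scheme and the geometric cooling schedule are common illustrative choices, not the specific form used by any algorithm in this thesis.

```python
import math
import random
random.seed(3)

def simulated_annealing(f, x0, step=1.0, t0=1.0, cooling=0.99, iters=2000):
    """Sketch of simulated annealing: a worse neighbour is accepted with
    probability exp(-delta_f / T); the temperature T decays each step, so
    the search is random early on and nearly greedy later."""
    xc, fc, t = list(x0), f(x0), t0
    best, best_f = list(xc), fc
    for _ in range(iters):
        xn = [xi + random.uniform(-step, step) for xi in xc]
        fn = f(xn)
        # Always accept improvements; accept worse moves with P = exp(-df/T).
        if fn < fc or random.random() < math.exp(-(fn - fc) / t):
            xc, fc = xn, fn
            if fc < best_f:
                best, best_f = list(xc), fc
        t *= cooling
    return best, best_f

sphere = lambda x: sum(xi * xi for xi in x)
x_found, f_found = simulated_annealing(sphere, [4.0, 4.0])
```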
Tabu search works by keeping a list of previously visited solutions, called the tabu list. It selects potential moves only from the set of neighbour solutions which are not on this list, even if it means choosing a solution which is worse than the current one. The selected solution is then added to the tabu list and the oldest entry in the tabu list is deleted. The list therefore works in a way similar to a cyclic buffer.
This method was originally designed for solving combinatorial optimization problems and it requires certain modifications in order to be useful in the area of continuous parameter optimization. At the very least, it is necessary to modify the method to not only discard neighbour solutions which are on the tabu list, but also solutions which are close to them. Without this, the algorithm would not work very well, as the probability of generating the exact same solution twice in R^d is quite small.
There is a multitude of advanced variations of this basic method; for example, it is possible to add aspiration rules, which override the tabu status of solutions that would lead to a large decrease in f(x). For a detailed description of tabu search adapted for continuous optimization, please see [CS00].
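The continuous-space modification mentioned above (rejecting solutions close to tabu entries) can be sketched like this; the rejection radius and list length are illustrative assumptions.

```python
from collections import deque

def is_tabu(x, tabu_list, radius):
    """Continuous-space tabu check: a candidate is tabu if it lies within
    `radius` (Euclidean distance) of any recently visited solution."""
    return any(sum((a - b) ** 2 for a, b in zip(x, t)) ** 0.5 < radius
               for t in tabu_list)

# The tabu list works like a cyclic buffer: a bounded deque drops the
# oldest entry automatically once its length limit is reached.
tabu = deque(maxlen=3)
for point in ([0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]):
    tabu.append(point)
# [0.0, 0.0] has been pushed out; only the 3 newest points remain.
```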
2.2 Multi-start strategies
Multi-start strategies allow local search algorithms to be used effectively on functions with multiple local optima, without making any modification to the way they work. The basic idea is that if we run a search algorithm multiple times, each time from a different starting position x0, then it is probable that at least one of the starting positions will be in the basin of attraction of the global optimum and thus the corresponding local search algorithm will be able to find it. Of course, the probability of this depends on the number of algorithm instances that are run, relative to the number and properties of the function's optima. It is possible to think about multi-start strategies as meta-heuristics, running above, and controlling, multiple instances of local search algorithm sub-heuristics.
Restart strategies are a subset of multi-start strategies where multiple instances are run one at a time, in succession. The most basic implementation of a restart strategy is to take the total allowed resource budget (usually a set number of objective function evaluations), evenly divide it into multiple slots, and use each of them to run one instance of a local search algorithm. A very important choice is the length of a single slot. The optimal length largely depends on the specific problem and the type of algorithm used. If the length is set too low, then the algorithm might not have enough time to converge to its nearest optimum. If it is too long, then there is a possibility that resources will be wasted on running instances which are stuck in local optima and can no longer improve.
Of course, all of the time slots do not have to be of the same length. A good strategy for black-box optimization is to start with a low length and keep increasing it for each subsequent slot. In this way, a reasonable performance can be achieved even if we are unable to choose the most suitable slot length for a given problem in advance.
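Such an increasing-slot schedule can be computed as in the sketch below; the doubling growth factor is just one possible choice, not a recommendation from the cited literature.

```python
def restart_schedule(total_budget, first_slot, growth=2.0):
    """Split a total evaluation budget into slots of increasing length
    (first_slot, first_slot * growth, ...), stopping once the next slot
    no longer fits into the remaining budget."""
    slots, slot, used = [], float(first_slot), 0
    while used + int(slot) <= total_budget:
        slots.append(int(slot))
        used += int(slot)
        slot *= growth
    return slots

# Usage: a budget of 10000 evaluations with a first slot of 500
# evaluations yields slots of 500, 1000, 2000 and 4000.
slots = restart_schedule(10_000, 500)
```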
A different restart strategy is to keep each instance going as long as it needs until it converges to an optimum. The most universal way to detect convergence is to look for stagnation of the objective function values over a number of past function evaluations (or past local search algorithm iterations). If the best objective function value found so far does not improve by at least the limit tf over the last hf function evaluations, then the current algorithm instance is terminated and a new one is started. For convenience, in the subsequent text we will call hf the function value history length and tf the function value history tolerance. An example of this restart condition is given in figure 1: the best solution found after v function evaluations is marked as x*_v and its corresponding function value as f(x*_v). In the figure, we see that the restart condition is triggered because at the last function evaluation m, the following is true: f(x*_{m−hf}) ≤ f(x*_m) + tf
Figure 1: Restart condition based on function value stagnation
Displays the objective function value f(x_v) (dashed black line) of evaluation v, and the best objective function value reached after v function evaluations f(x*_v) (solid black line), over the interval 0..m function evaluations. The values f(x*_m), f(x*_m) + tf and m − hf are highlighted.
It is, of course, necessary to choose specific values of hf and tf, but usually it is not overly difficult to find a combination which works well for a large set of problems.
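The stagnation condition from figure 1 can be checked with a few lines of code. This is a sketch of the rule, not the thesis implementation; the history is assumed to be stored as the best-so-far value after each evaluation.

```python
def should_restart(best_history, h_f, t_f):
    """Restart condition from figure 1: trigger a restart when the best
    value has improved by less than t_f over the last h_f evaluations.
    best_history[v] is the best function value found after v evaluations."""
    if len(best_history) <= h_f:
        return False          # not enough history collected yet
    return best_history[-1 - h_f] <= best_history[-1] + t_f

# Usage: steady improvement does not trigger a restart; stagnation does.
improving = [5.0, 4.0, 3.0, 2.0, 1.0]
stagnant = [5.0, 4.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```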
Various different ways of detecting convergence, and corresponding restart conditions, can be used: for example, reaching zero gradient for line search algorithms, reaching a minimal pattern size for pattern search algorithms, etc.
There are also various ways of choosing the starting position x0 for new local search algorithm instances. The simplest one is to choose x0 by sampling a uniform random distribution over the set of all feasible solutions. This is very easy to implement and often gives good results. However, it is also possible to use information gained by the previous instances when choosing x0 for a new one.
A simple algorithm which utilizes this idea is the iterated search: the first instance i1 is started from an arbitrary position and run until it converges (or until it exhausts a certain amount of resources) and returns the best solution it has found, x*_{i1}. Then, the starting position for the next instance is selected from the neighbourhood N(x*_{i1}). Note that N is a qualitatively different neighbourhood than what the instance i1 might be using to generate neighbour solutions each step. It is usually much larger, the goal being to generate the new starting point for instance i2 by perturbing the best solution of i1 enough to move it to a different basin of attraction. If the new instance finds a solution x*_{i2} better than x*_{i1}, then the next instance is started from the neighbourhood N(x*_{i2}). If f(x*_{i2}) ≥ f(x*_{i1}) and a better solution is not found, then the next instance is started from the neighbourhood N(x*_{i1}) again. This is repeated until a stop condition is triggered. An obvious assumption that this method makes is that the minima of the objective function are grouped close together. If this is not the case, then it might be better to use uniform random sampling.
The big question is how to choose the size of the neighbourhood N. Too small, and the new instance might fall into the same basin of attraction as the previous one. Too big, and the results will be similar to choosing the starting position uniformly randomly. Another method, called the variable neighbourhood search, which can, in a way, be considered an improved version of the iterated search, tackles this problem by using multiple neighbourhood structures N1, ..., Nk of varying sizes, where N1 is the smallest and the following neighbourhoods are successively larger, with Nk being the largest. The restarting procedure is the same as with iterated search, with the following modification: if a local search algorithm instance ik, started from the neighbourhood N1(x*_{i_{k−1}}), does not improve the current best solution, then the algorithm tries starting the next instance from N2(x*_{i_{k−1}}), then N3(x*_{i_{k−1}}), and so on. The structure of a basic variable neighbourhood search, as given in [HM03], page 10, is described in algorithm 2. This algorithm can also be used as a description of iterated search, if the set of neighbourhood structures contains only one element.
Yet another group of methods which aim to prevent local search algorithms from getting stuck in local optima is based on the idea that it is not necessary to run multiple local search algorithm instances one after another; they can instead be run at the same time. Then, it is possible to evaluate the expected performance of each instance based on the results it has obtained so far and allocate the resources to the best (or most promising) ones. This is somewhat similar to the well known multi-armed bandit problem.
The basic implementation of this idea is called the explore and exploit strategy. It involves initially running all of its k algorithm instances until a certain fraction of the resource budget is expended. This is the exploration phase. Then, the best
Algorithm 2: Variable neighbourhood search
input: initial position x0, set of neighbourhood structures N1, ..., Nk of increasing size
1 x* ← local_search(x0)
2 k ← 1
3 while stop condition not met do
4   Generate random point x from Nk(x*)
5   y* ← local_search(x)
6   if f(y*) < f(x*) then
7     x* ← y*
8     k ← 1
9   else
10    k ← k + 1
11 return x*
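Algorithm 2 can be sketched in Python as follows. The Gaussian shaking widths standing in for the neighbourhood structures, and the crude greedy inner search standing in for the local search algorithm, are both illustrative assumptions; the sketch also caps k at the largest neighbourhood instead of failing.

```python
import random
random.seed(0)

def greedy_search(f, x0, step=0.1, iters=300):
    """Crude greedy random search, a stand-in for any local search algorithm."""
    xc, fc = list(x0), f(x0)
    for _ in range(iters):
        xn = [xi + random.gauss(0, step) for xi in xc]
        fn = f(xn)
        if fn < fc:
            xc, fc = xn, fn
    return xc, fc

def vns(f, x0, sigmas=(0.1, 0.5, 2.0), rounds=30):
    """Sketch of algorithm 2: the neighbourhood structures N1..Nk are
    Gaussian perturbations of increasing width `sigmas`."""
    x_best, f_best = greedy_search(f, x0)
    k = 0
    for _ in range(rounds):
        # Shake: random point from the neighbourhood Nk of the incumbent.
        y0 = [xi + random.gauss(0, sigmas[k]) for xi in x_best]
        y, fy = greedy_search(f, y0)
        if fy < f_best:
            x_best, f_best, k = y, fy, 0          # success: back to N1
        else:
            k = min(k + 1, len(sigmas) - 1)       # failure: larger neighbourhood
    return x_best, f_best

sphere = lambda x: sum(xi * xi for xi in x)
x_found, f_found = vns(sphere, [3.0, 3.0])
```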
instance is selected and run until the rest of the resource budget is used up - the exploitation phase.
There is, again, an obvious trade-off between the amounts of resources allocated to each phase. The exploration phase should be long enough so that, when it ends, it is possible to reliably identify the best instance. On the other hand, it is necessary to have enough resources left for the exploitation phase, in order for the selected best instance to converge to the optimum. In practice, it is actually not that difficult to find a balance between these two phases that gives good results for a wide range of problems.
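The two phases can be sketched as follows. The instance interface (a stateful callable that spends n evaluations and returns the best value found so far) and the 30% exploration fraction are assumptions made for illustration.

```python
def explore_and_exploit(instances, budget, explore_frac=0.3):
    """Sketch of the explore and exploit strategy. Each instance is a
    stateful callable: instance(n) spends n more evaluations and returns
    the best function value it has found so far."""
    k = len(instances)
    explore_budget = int(budget * explore_frac)
    # Exploration phase: spread part of the budget evenly over all instances.
    scores = [inst(explore_budget // k) for inst in instances]
    best = min(range(k), key=lambda i: scores[i])
    # Exploitation phase: the remaining budget goes to the best instance.
    return instances[best](budget - explore_budget)

def make_toy_instance(local_opt):
    """Toy instance whose best value converges to `local_opt` as its
    evaluation count grows."""
    state = {"evals": 0}
    def run(n):
        state["evals"] += n
        return local_opt + 1.0 / (1 + state["evals"])
    return run

# Usage: three instances stuck in basins of depth 3.0, 0.0 and 5.0;
# the strategy identifies and exploits the best one.
instances = [make_toy_instance(v) for v in (3.0, 0.0, 5.0)]
result = explore_and_exploit(instances, 1000)
```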
Methods like this, which run multiple local search algorithm instances at the same time, belong to the group of portfolio algorithms. We should, however, note that portfolio algorithms are usually used in a somewhat different way than described here. Most commonly, they run multiple instances of different local search algorithms, each of which is well suited for a different kind of problem. This allows the portfolio algorithm to select instances of the algorithm which is able to solve the given problem most efficiently, even without knowing its properties a priori.
The MetaMax algorithm, which is the main subject of this thesis, is also a portfolio algorithm. However, we use it running only one kind of local search algorithm at a time, to allow for a more fair and direct comparison with restart strategies, which typically use only one kind of local search algorithm.
3 MetaMax algorithm and its variants
The MetaMax algorithm is a multi-start portfolio strategy presented by György and Kocsis in [GK11]. There are, actually, three versions of the algorithm, which differ in certain details. They are called MetaMax(k), MetaMax(∞) and MetaMax, and they will be described in detail in this section.
Please note that while in this text we usually presume all optimization problems to be minimization problems, the text in [GK11] assumes a maximization task. Therefore, while describing the workings of the MetaMax algorithm in this section, we will keep to the convention of [GK11], but in the rest of the text we will refer to minimization tasks as usual. Our implementation of MetaMax was modified to work with minimization tasks.
György and Kocsis demonstrate ([GK11], page 413, equation 2) that the convergence of an instance of a local search algorithm, after s steps, can be optimistically estimated with large probability as:

lim_{t→∞} f(x*_t) ≤ f(x*_s) + g_σ(s)    (1)

where f(x*_s) is the best function value obtained by the local search algorithm instance up until step s and g_σ(s) is a non-increasing, non-negative function with lim_{s→∞} g_σ(s) = 0. Note that the notation used here is a little different than in [GK11], but the meaning is the same.
In practice, the exact form of g_σ(s) is not known, so the right side of equation 1 has to be approximated as:

f(x*_s) + c·h(s)    (2)

where c is an unknown constant and h(s) is a positive, monotone, decreasing function with the following properties:

h(0) = 1,  lim_{s→∞} h(s) = 0    (3)

One possible simple form of this function is h(s) = e^(−s). In the subsequent text, we shall call this function the estimate function. György and Kocsis do not use this name in their work; in fact, they do not use any name for this function at all and refer to it simply as the h function. However, we think that this is not very convenient, hence we picked a suitable name.
Based on equations 1 and 2, it is possible to create a strategy that allocates resources only to those instances which are estimated to converge the most quickly, i.e. those which maximize the value of expression 2 for a certain range of the constant c. The problem of finding these instances can be solved effectively by transforming it into the problem of finding the upper right convex hull of a set of points, in the following way:
We assume that there are k instances in total and that each instance Ai keeps track of the number of steps si it has taken, the position x_{i,si} of the best solution it has found so far and its corresponding function value f(x*_{i,si}). If we represent the set of the local search algorithm instances Ai, i = 1, ..., k by a set of points:

P : {(h(si), f(x*_{i,si})), i = 1, ..., k}    (4)

then the instances which maximize the value of expression 2 for a certain range of c correspond to those points which lie on the upper right convex hull of the set P. Because the term upper right convex hull is not quite standard, we should clarify that we understand it to mean the intersection of the upper convex hull and the right convex hull.
Note that, presumably for simplicity, the authors of [GK11] assumed only local search algorithms which use the same number of function evaluations every step. For algorithms where this is not true, it makes more sense to instead set si equal to the number of function evaluations used by the instance i so far. We believe that this is a better way to measure the use of resources by individual instances, which is also confirmed in [PG13].
György and Kocsis suggest using a form of estimate function which changes based on the amount of resources used by all the local search algorithm instances, in order to encourage more exploratory behaviour as the MetaMax algorithm progresses. Therefore, in our implementation, we use the following estimate function, which is recommended in [GK11]:

h(vi, vt) = e^(−vi/vt)    (5)

where vi is the number of function evaluations used by instance i and vt is the total number of function evaluations used by all of the instances combined.
The simplest of the three MetaMax variants is MetaMax(k). It uses k local search algorithm instances and is described in algorithm 3. For convenience and improved readability, we will use a simplified notation when describing the MetaMax variants:
vi for the number of function evaluations used by local search algorithm instance i so far
xi for the position of the best solution found by instance i so far
fi for the function value of xi
In the descriptions, we also assume that the estimate function h is a function of only one variable.
Algorithm 3: MetaMax(k)
input: function to be optimized f, number of algorithm instances k and a monotone non-decreasing function h with properties as given in equation 3
1  Step each of the k local search algorithm instances A_i and update their variables v_i, x_i and f_i
2  while stop conditions not met do
3    For i = 1, ..., k, select algorithm A_i if there exists c > 0 so that: f_i + c h(v_i) >= f_j + c h(v_j) for all j = 1, ..., k so that (v_i, f_i) != (v_j, f_j). If there are multiple algorithms with identical v and f, then select only one of them at random.
4    Step each selected A_i and update its variables v_i, x_i and f_i.
5    Find the best instance: b = argmin_{i=1,...,k}(f_i).
6    Update the best solution: x* <- x_b.
7  return x*
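The selection rule on line 3 can be checked directly, without constructing the convex hull explicitly: for each instance, the condition "there exists c > 0" reduces to intersecting one interval constraint on c per competing instance. Below is a minimal Python sketch of this idea, following the rule exactly as written in the algorithm (larger f_i + c h(v_i) preferred); the function name and list-based representation are our own, and the random tie-breaking among identical pairs is omitted for brevity:

```python
import math


def selected_indices(f_vals, h_vals):
    """Return indices i for which some c > 0 makes f_i + c*h_i >= f_j + c*h_j
    for every j with a distinct (f_j, h_j) pair; this is the upper right
    convex hull condition from the text, checked per instance."""
    selected = []
    n = len(f_vals)
    for i in range(n):
        lo, hi = 0.0, math.inf  # feasible interval for c
        ok = True
        for j in range(n):
            if (f_vals[j], h_vals[j]) == (f_vals[i], h_vals[i]):
                continue  # identical pairs are skipped (tie-breaking omitted)
            df = f_vals[i] - f_vals[j]
            dh = h_vals[i] - h_vals[j]
            if dh > 0:
                lo = max(lo, -df / dh)   # constraint c >= (f_j - f_i)/(h_i - h_j)
            elif dh < 0:
                hi = min(hi, df / -dh)   # constraint c <= (f_i - f_j)/(h_j - h_i)
            elif df < 0:
                ok = False               # equal h, strictly worse f: no c works
                break
        if ok and hi > 0 and lo <= hi:   # some c > 0 satisfies all constraints
            selected.append(i)
    return selected
```

For three instances with f = (0.0, 0.2, 2.0) and h = (0.4, 0.3, 0.1), the middle instance is dominated in the optimistic-estimate sense for every c and is therefore not selected.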
As with a priori scheduled restart strategies, there is the question of choosing the right number of instances (parameter k) to use. The other two versions of the algorithm, MetaMax and MetaMax(∞), get around this problem by gradually increasing the number of instances, starting with a single one and adding a new one every round. Thus, the number of instances tends to infinity as the algorithm keeps running. This makes it possible to prove that the algorithm is consistent, that is, that it will almost surely find the global optimum if kept running for an infinite amount of time.

Please note that in some literature, such as [Neu11], the term asymptotically complete is used instead of consistent, but both terms mean the same thing. Also note that we use the word round to refer to a step of the MetaMax algorithm, in order to avoid confusion with steps of the local search algorithms. MetaMax and MetaMax(∞) are described in algorithms 5 and 4 respectively, also using the simplified notation.
Algorithm 4: MetaMax(∞)
input: function to be optimized f, monotone non-decreasing function h with properties as given in equation 3
1  r <- 1
2  while stop conditions not met do
3    Add a new local search algorithm instance A_r, step it once and initialize its variables v_r, x_r and f_r
4    For i = 1, ..., r, select algorithm A_i if there exists c > 0 so that: f_i + c h(v_i) >= f_j + c h(v_j) for all j = 1, ..., r so that (v_i, f_i) != (v_j, f_j). If there are multiple algorithms with identical v and f, then select only one of them at random.
5    Step each selected A_i and update its variables v_i, x_i and f_i.
6    Find the best instance: b = argmin_{i=1,...,r}(f_i).
7    Update the best solution: x* <- x_b.
8    r <- r + 1
9  return x*
MetaMax and MetaMax(∞) differ only in one point (lines 6 and 7 in algorithm 5): if, after stepping all selected instances, the best instance is a different one than in the previous round, MetaMax will step it until it overtakes the old best instance in terms of used resources.

In [GK11] it is shown that MetaMax asymptotically approaches the performance of its best local search algorithm instance as the number of rounds increases. Theoretical analysis suggests that the number of instances increases at a rate of Ω(√v_t), where v_t is the total number of used function evaluations. However, practical results give a rate of growth of only Ω(v_t / log v_t). Based on this, it can also be estimated ([GK11], page 439) that to find the global optimum x_opt, MetaMax needs only a logarithmic factor more function evaluations than a local search algorithm instance which would start in the basin of attraction of x_opt.
Note a small difference in the way MetaMax and MetaMax(∞) are described in algorithms 5 and 4 from their descriptions in [GK11]. There, a new algorithm instance A_r is added with f_r = 0 and s_r = 0, and it takes at most one step during the round in which it is added. This is possible because in [GK11] a non-negative objective function f and a maximization task are assumed. Therefore, an algorithm instance
Algorithm 5: MetaMax
input: function to be optimized f, monotone non-decreasing function h with properties as given in equation 3
1  r <- 1
2  while stop conditions not met do
3    Add a new local search algorithm instance A_r, step it once and initialize its variables v_r, x_r and f_r
4    For i = 1, ..., r, select algorithm A_i if there exists c > 0 so that: f_i + c h(v_i) >= f_j + c h(v_j) for all j = 1, ..., r so that (v_i, f_i) != (v_j, f_j). If there are multiple algorithms with identical v and f, then select only one of them at random.
5    Step each selected A_i and update its variables v_i, x_i and f_i.
6    Find the best instance: b_r = argmin_{i=1,...,r}(f_i).
7    If b_r != b_{r-1}, step instance A_{b_r} until v_{b_r} >= v_{b_{r-1}}
8    Update the best solution: x* <- x_{b_r}.
9    r <- r + 1
10 return x*
can be added without taking any steps first and assigned a function value f_r = 0, which is guaranteed not to be better than any of the function values of the other instances. We are, however, dealing with a minimization problem with a known target value (see [Han+13b]) but no upper bound on f and, consequently, no worst possible value of f. Therefore, we made a small change and step the new instance A_r immediately after it is added. It can then also be stepped a second time, during step 4 in algorithms 5 and 4. We believe that this has no significant impact on performance.
3.1 Suggested modifications
MetaMax and MetaMax(∞) will add a new instance each round for as long as they are running, with no limit on the maximum number of instances. The authors of [GK11] state that the worst-case computational overhead of MetaMax and MetaMax(∞) is O(r^2), where r is the number of rounds. For the purpose of optimizing functions where each function evaluation uses up a large amount of computational time (for which MetaMax was primarily designed), the overhead will be negligible compared to the time spent calculating function values and will not present a significant problem. However, in comparison with restart strategies, which typically have almost no overhead, this is still a disadvantage for MetaMax. Therefore, it would be desirable to come up with some mechanism that would improve its computational complexity.
An obvious solution would be to limit the total number of instances which can be added, or to slow down the rate at which they are added, so that there will never be too many of them. However, this would make MetaMax and MetaMax(∞) behave basically in the same way as MetaMax(k) and lose their main property, which is the consistency based on always generating new instances.

A better solution would be to add a mechanism which discards one of the already existing instances every time a new one is added, and therefore keeps the total number of instances at any given time constant. The important question is: which one of the existing instances should be discarded?
We propose the following approach: discard the instance which has not been selected for the longest time. If there are multiple instances which qualify, discard the one with the worst function value. The rationale behind this discarding mechanism is that MetaMax most often selects (allocates the most resources to) those instances which have the best optimistic estimate of convergence. Therefore, the instances which are selected the least often will likely not give very good results in the future, and so they make good candidates for deletion. An alternative method would be to discard the absolute worst instance (in terms of the best objective function value found so far), which is even simpler, but we feel that it does not follow as naturally from the principles behind MetaMax. Therefore, for most of our experiments we use the discarding of the least recently selected instances.
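The discarding rule can be stated compactly: among all instances, pick the one with the oldest last-selection round, breaking ties by the worst best-so-far value. A small sketch follows; the names `last_selected_round` and `index_to_discard` are ours, and "worst" means largest, since we minimize:

```python
def index_to_discard(last_selected_round, f_vals):
    """Index of the instance that has not been selected for the longest time;
    ties are broken by the worst (largest) best-so-far function value."""
    # sort key: oldest selection round first, then largest f value
    return min(range(len(f_vals)),
               key=lambda i: (last_selected_round[i], -f_vals[i]))
```

For example, with last-selection rounds (3, 1, 1, 5), the two instances last selected in round 1 qualify, and the one with the larger best-so-far value is discarded.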
Another area where we think it might be beneficial to modify the workings of MetaMax is the mechanism of selecting instances to be stepped in each round. The original mechanism has two possible disadvantages. Firstly, it is not invariant to monotone transformations of the objective function values. By this we mean a mapping f(x) → f̃(x) which is itself only a function of the value of f(x) and not of the parameter vector x. The monotone property means that if f(x_1) ≤ f(x_2) then f̃(x_1) ≤ f̃(x_2) for all possible x_1 and x_2. Such a monotone transformation will not change the location of the optima of f(x). It will also not change the direction of the gradient of f(x) for any x, but not necessarily its magnitude. An example of such a transformation is given in figure 2.

Logically, it would not make much sense to require an optimization algorithm to be invariant to an objective function value transformation which is not monotone, as it could change the position of the function's optima.
The second possible disadvantage of the convex hull based instance selection mechanism is that it also behaves differently based on the choice of the estimate function h. This is not as great a disadvantage as the first one, because f(x) is given, while h can be chosen freely. However, it would still be beneficial if we could entirely remove the need to choose h.
To overcome these problems, we propose a new instance selection mechanism. It uses the same representation of local search algorithm instances as a set of points P, given in equation 4, but it selects those instances which correspond to non-dominated points of P in the sense of maximizing f_i and maximizing h(v_i) (or, analogically, maximizing f_i and minimizing v_i). This method is clearly invariant to both monotone transformations of objective function values f → f̃ and different choices of h, as determining non-dominated points depends only on their ordering along the axes f_i and h(v_i), which is always preserved due to the fact that both f → f̃ and h are monotone. Moreover, the points which lie on the upper right convex hull of P, and thus maximize the optimistic estimate f_i + c h(v_i), are always non-dominated, and thus will always be selected.
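The proposed mechanism amounts to a standard non-dominated filter over the pairs (f_i, v_i). A sketch of such a filter follows (the helper is our own, assuming the stated orientation: f_i is maximized and v_i minimized):

```python
def non_dominated_indices(f_vals, v_vals):
    """Indices of instances not dominated by any other instance: j dominates i
    if f_j >= f_i and v_j <= v_i, with at least one inequality strict."""
    n = len(f_vals)
    keep = []
    for i in range(n):
        dominated = any(
            f_vals[j] >= f_vals[i] and v_vals[j] <= v_vals[i]
            and (f_vals[j] > f_vals[i] or v_vals[j] < v_vals[i])
            for j in range(n)
        )
        if not dominated:
            keep.append(i)
    return keep
```

With f = (0.0, 0.2, 2.0) and v = (10, 20, 40), all three instances are non-dominated, illustrating that this filter typically selects more instances than the convex hull rule.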
Figure 2: Example of a monotone transformation of f(x)
Displays a 3D mesh plot of a Rastrigin-like function f(x) in the top left, the transformed function f(x)^3 in the top right, and their respective contour plots on the bottom. It is clear that the shape of the contours is the same, but their heights are not.
A possible disadvantage of the proposed mechanism is that at each round it selects many more points than the original convex hull mechanism. This might result in selecting instances with low convergence estimates too often, and in not dedicating enough resources to the more promising ones. A visual comparison of the two selection mechanisms and a demonstration of the influence of the choice of estimate function upon the selection are presented in figure 3.
Figure 3: MetaMax selection mechanisms
Compares the original selection mechanism, based on finding the upper convex hull (left sub-figures), with the newly proposed mechanism, based on selecting non-dominated points (right sub-figures). Also demonstrates the effects of a monotone transformation of the objective function values on the selection, with f(x) used for the upper sub-figures and f(x)^3 for those on the bottom. Selected points are marked as red diamonds, connected by a red line. Unselected points are marked as filled black circles.
4 Experimental setup

All of the experiments were conducted using the COCO (Comparing Continuous Optimizers) framework [Han13a], which is an open-source set of tools for the systematic evaluation and comparison of real-parameter optimization strategies. It provides a set of 24 benchmark functions of different types, chosen to thoroughly test the limits and capabilities of optimization algorithms. Also included are tools for running experiments on these functions and for logging, processing and visualising the measured data. The library for running experiments is provided in versions for C, Java, R, Matlab and Python. The post-processing part of the framework is available for Python only.

The benchmark functions are divided into 6 groups according to their properties. They are briefly described in table 1. For a detailed description, please see [Han+13a]. There are also multiple instances defined for each function, which are created by applying various transformations to the base formula.
We shall now briefly explain some of the function properties mentioned in table 1. As already mentioned, the terms unimodal and multimodal refer to functions with a single optimum and with multiple local optima, respectively.

Name   Functions  Description
separ  1-5        Separable functions
lcond  6-9        Functions with low or moderate conditionality
hcond  10-14      Unimodal functions with high conditionality
multi  15-19      Multimodal structured functions
mult2  20-24      Multimodal functions with weak global structure

Table 1: Benchmark function groups
Conditionality describes how much the function's gradient changes depending on direction. Simply put, functions with high conditionality (also called ill-conditioned functions) grow, at certain points, rapidly in some directions but slowly in others. This often means that the gradient points away from the local optimum, which presents a difficult problem for some local search algorithms. To give a more visual description, one can imagine that 3D graphs of two-dimensional ill-conditioned functions usually form sharp ridges, while those of well-conditioned functions form gentle round hills.

Separable functions have the following form: f(x_1, x_2, ..., x_d) = f(x_1) + f(x_2) + ... + f(x_d), which means that they can be minimized by minimizing d one-dimensional functions, where d is the number of dimensions of the separable function.
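This property can be illustrated with a tiny sketch: a separable function is minimized by solving each coordinate independently. The helper below uses grid search over each coordinate purely for illustration; the names are our own:

```python
def minimize_separable(components, grids):
    """Minimize f(x) = sum_i f_i(x_i) one coordinate at a time: each
    one-dimensional component f_i is minimized over its own grid."""
    return [min(grid, key=f_i) for f_i, grid in zip(components, grids)]
```

For instance, f(x_1, x_2) = (x_1 - 1)^2 + (x_2 + 2)^2 is minimized at (1, -2), found by two independent one-dimensional searches.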
In order to exhaustively evaluate the performance of the selected strategies, we decided to make the following series of measurements for each strategy:

1. Using four different local search algorithms: compass search, the Nelder-Mead method, BFGS and CMA-ES, in order to evaluate the effect of algorithm choice.

2. Using all of the 24 noiseless benchmark functions available in the COCO framework, to measure performance on a wide variety of different problems.

3. Using the following dimensionalities: d = 2, 3, 5, 10, 20, to see how much the performance is affected by the number of dimensions.

4. Using the first fifteen instances of each function. According to [Han+13b], this number is sufficient to provide statistically sound data.

The resource budget for minimizing a single function instance (a single trial) was set to 10^5 d, meaning 100,000 times the number of dimensions of the instance.
The reasons for choosing these four local search algorithms are as follows: the compass search algorithm was chosen for its simplicity, in order to allow us to evaluate whether MetaMax can improve the performance of such a basic algorithm. The Nelder-Mead method was chosen as a more sophisticated representative of the group of pattern search algorithms than compass search. BFGS was selected as a typical line search method. Finally, CMA-ES is there to represent population-based algorithms. It is also the most advanced of the four algorithms, and we thus expect that it will perform the best of the four. For a more detailed description of these algorithms, please see section A.
4.1 Used multi-start strategies

In this section, we describe the selected MetaMax and restart strategies, which were evaluated using the methods described above. For convenience, we assigned a shorthand name to each used strategy, so that we can write, for example, csa-h-10d instead of "objective function stagnation based restart strategy with history length 10d using the compass search algorithm", which is impractically verbose. The shorthand names have the following form: abbreviation of the used local search algorithm, dash, used multi-start strategy, dash, strategy parameters. A list of all used strategies and their shorthand names is given in table 3.
We chose two commonly used restart strategies to compare with MetaMax: a fixed restart strategy, with a set amount of resources allocated to each local search algorithm run, and a dynamic restart strategy, with a restart condition based on objective function value stagnation.

The performance of these two strategies largely depends on the combination of the problem being solved and the strategy parameters. Therefore, we decided to use six fixed restart strategies and six function value stagnation restart strategies with different parameters:
• Fixed restart strategies
  Run lengths: n_f = 100d, 200d, 500d, 1000d, 2000d, 5000d evaluations
  Shorthand names: algorithm-f-n_f

• Function value stagnation restart strategies
  Function value history lengths: h_f = 2d, 5d, 10d, 20d, 50d, 100d evaluations
  Function value tolerance: t_f = 10^-10
  Shorthand names: algorithm-h-h_f
Note that the parameters depend on the number of dimensions d of the measured function. This is consistent with the fact that the total resource budget of the strategy also depends on d, and with the expectation that for higher dimensionalities the used local search algorithms will need longer runs to converge.

The rationale behind the chosen parameter values is the following: with the function evaluation budget of 10^5 d, run lengths longer than 5000d would give us fewer than 20 restarts per trial. This would result in a very low chance of finding the global optimum on most of the benchmark functions, some of which can have up to 10^d optima. Also, it is probable that most local search algorithms will converge long before using up all 5000d function evaluations, and the rest of the allocated resources would then be essentially wasted on running an instance which cannot improve any more. Conversely, run lengths smaller than 100d are probably not long enough to allow most local search algorithm instances to converge, and so there would be little sense in using them.
The choice of the upper bound of the function value history length h_f as 100d is based on a similar idea: for values greater than 100d, the restart condition would trigger too long after the local search algorithm has already converged, and so we would be needlessly wasting resources on it. The choice of the lower bound of h_f depends on the used algorithm. For a restart strategy to function properly, h_f has to be greater than, or at least equal to, the number of function evaluations that the used local search algorithm uses during one step. The above stated value of h_f = 2d is the minimal value for which the Nelder-Mead and BFGS algorithms work properly. For the other two algorithms, the minimal value is h_f = 5d. We decided to base the function value history length on the number of used function evaluations, rather than on the number of taken steps, because it allows for a more direct comparison of the performance of the same strategy using two different algorithms.
Choosing the value of the function stagnation tolerance t_f involved a little more guesswork. There is a target function value defined for all of the benchmark functions, which is equal to the function value at the global optimum, f(x_opt), plus a tolerance value f_tol = 10^-8. That is, a function instance is considered to be solved if we find some point x with f(x) ≤ f(x_opt) + f_tol. We based our choice of the function stagnation tolerance parameter t_f = 10^-10 on f_tol. Setting the value of t_f one hundred times lower than f_tol should make it large enough to reliably detect convergence, while not being so large as to trigger the restart condition prematurely, while the local search algorithm is still converging.
The goal of using multiple strategies with different parameter values is to have, for each measured dimensionality, at least one fixed restart strategy and one function value stagnation based strategy that performs well on the set of all functions.

For easier comparison of the results of the fixed restart strategies, we represent them all together by choosing only the results of the best performing strategy for each dimensionality and collecting them into a "best of" collection of results, which we will refer to by the shorthand name algorithm-f-comb. This represents the results of running a fixed restart strategy which is able to choose the optimal run length (from the set of six used run lengths) based on the dimensionality of the function being solved. The results of the function value stagnation strategies are represented in an analogous way, under the name algorithm-h-comb.
Besides the already mentioned restart strategies, we decided to add four more, each based on a restart condition specific to one of the used local search algorithms. The shorthand names for these strategies are algorithm-special. They are described in table 2.
In order to save computing time, and as per the recommendation in [Han+13b], we used an additional termination criterion that halts the execution of a restart strategy after 100 restarts, even if the resource budget has not yet been exhausted and the solution has not been found. This does not impact the accuracy of the measurements, as 100 restarts is enough to provide a statistically significant amount of data, and the metrics which we use (see subsection 4.2) are not biased against the results of runs which did not use up the entire resource budget. In fact, the fixed restart strategies f-100d, f-200d and f-500d always reach 100 restarts before they can fully exhaust their resource budgets.
The idea of using the original, pure versions of the MetaMax and MetaMax(∞) algorithms, which keep adding local search algorithm instances without limit, proved to be impractical due to their excessive computational resource requirements (for the length of experiments that were planned). Therefore, we performed measurements using only the modified versions of MetaMax and MetaMax(∞), with the added
Algorithm: Compass search
Description: Restart when the variable a, which affects how far from the current solution the algorithm generates neighbour solutions, decreases below 10^-10. It naturally decreases as the algorithm converges, so checking its value makes for a good restart condition.

Algorithm: Nelder-Mead
Description: We chose a condition similar to the one mentioned above. A restart is triggered when the distance between the two points of the simplex which are the farthest apart from each other decreases below 10^-10. The rationale is similar as above: the simplex keeps growing smaller as the algorithm converges. It might be more mathematically proper to check the area (or volume, or hyper-volume, depending on the dimensionality) of the simplex, but we discarded this idea out of concern that it might be too computationally intensive.

Algorithm: BFGS
Description: The restart condition is triggered if the norm of the gradient is smaller than 10^-10. Since the algorithm already uses information about the gradient, it makes sense to also use it for detecting convergence.

Algorithm: CMA-ES
Description: The recommended settings for CMA-ES given in [Han11] suggest using 9 different restart conditions. Here we use these recommended settings. Note that when using CMA-ES with the other restart strategies, we use only a single restart condition and the additional ones are disabled. In a sense, we are not using the algorithm to its full potential, but this allows for a more direct comparison with the other local search algorithms.

Table 2: Algorithm specific restart strategies
mechanism (described in subsection 3.1) for limiting the maximum number of instances. For all MetaMax strategies, we used the recommended form of the estimate function: h = e^(-v_i / v_t). Measurements were performed using the following MetaMax strategies:
1. MetaMax(k), with k=20, k=50 and k=100. This gives the same total number of local search algorithm instances as using fixed restart strategies with run lengths equal to 5000d, 2000d and 1000d respectively. This makes it possible to evaluate the degree to which the MetaMax mechanism of selecting the most promising instances improves performance over these corresponding restart strategies. The expectation is that the success rate for MetaMax(k) will not increase, because the number of instances, and thus the ability to explore the search space, stays the same. However, MetaMax(k) should converge faster than the fixed restart strategies, because it should be able to identify the best instances and allocate resources to them appropriately.

2. MetaMax and MetaMax(∞) with the maximum number of instances set to 100. This should allow us to assess the benefits of the mechanism of adding new instances (and deleting old ones), by comparing the results with MetaMax(k), which uses the same number of instances each round but does not add or delete any. Here, we would expect an increase in the success rate on multimodal functions, as the additional instances generated each round should allow the algorithms to explore the search space more thoroughly. However, the limit of 100 instances will possibly still not be enough to get a good success rate on multimodal problems with high dimensionality.

3. MetaMax and MetaMax(∞) with the maximum number of instances set to 50d. This should allow the algorithms to scale better with the number of dimensions and, hopefully, further improve their performance. The number 50d was chosen as a reasonable compromise between computation time and expected performance. We expect to get the best results here.
The shorthand names for the MetaMax variants were chosen as algorithm-k-X for MetaMax(k), algorithm-m-X for MetaMax and algorithm-i-X for MetaMax(∞), where X is the maximum allowed number of instances (or, equivalently, the value of k for MetaMax(k)).
Fixed restart strategies
f-100d    Run length = 100d evaluations
f-200d    Run length = 200d evaluations
f-500d    Run length = 500d evaluations
f-1000d   Run length = 1000d evaluations
f-2000d   Run length = 2000d evaluations
f-5000d   Run length = 5000d evaluations
f-comb    Combined fixed restart strategy

Function value stagnation restart strategies
h-2d      History length = 2d evaluations
h-5d      History length = 5d evaluations
h-10d     History length = 10d evaluations
h-20d     History length = 20d evaluations
h-50d     History length = 50d evaluations
h-100d    History length = 100d evaluations
h-comb    Combined function value stagnation restart strategy

Other restart strategies
special   Special restart strategy specific to each algorithm, see table 2

MetaMax variants
k-20      MetaMax(k) with k=20
k-50      MetaMax(k) with k=50
k-100     MetaMax(k) with k=100
k-50d     MetaMax(k) with k=50d
m-100     MetaMax with maximum number of instances = 100
m-50d     MetaMax with maximum number of instances = 50d
i-100     MetaMax(∞) with maximum number of instances = 100
i-50d     MetaMax(∞) with maximum number of instances = 50d

Table 3: Tested multi-start strategies
There is a number of additional interesting aspects of the MetaMax variants which would be worth testing and evaluating. For example:

• Comparison of MetaMax and MetaMax(∞) with and without the limit on the maximum number of instances.
• Performance of different methods of discarding old instances.
• Influence of different choices of the estimate function on performance.
• Performance of our proposed alternative method for selecting instances.
However, it was not practically possible (mainly time-wise) to perform full-sized experiments (with a 10^5 d function evaluation budget) which would test all of these features. Therefore, we decided to make a series of smaller measurements, with the maximum number of function evaluations per trial set to 5000d, using only the dimensionalities d=5, d=10 and d=20, and using only the BFGS algorithm. This should allow us to test these features at least in a limited way and see if any of them warrant further attention. More specifically, we made the following series of measurements:

1. MetaMax and MetaMax(∞) without a limit on the maximum number of instances
2. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the most inactive instances
3. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the worst instances
4. MetaMax(k) with k=5d, k=10d and k=20d

These measurements were repeated three times: the first time using the recommended form of the estimate function h_1(v_i, v_t) = e^(-v_i / v_t), the second time with a simplified function h(v_i) = e^(v_i), and the third time using the proposed alternative instance selection method, based on selecting non-dominated points.
4.2 Used metrics

In this section, we describe the various metrics that were used to compare the results of different strategies. The simplest one is the success rate. For a set of trials U (usually of one strategy running on one or more benchmark functions) and a chosen target value t, it can be defined as:

SR(U, t) = |{u ∈ U : f_best(u) ≤ t}| / |U|    (6)

where |U| is the number of trials and |{u ∈ U : f_best(u) ≤ t}| is the number of trials which have found a solution at least as good as t. In the rest of this text we use a mean success rate, averaged over a set of target values T:

SR_m(U, T) = (1 / |T|) Σ_{t ∈ T} SR(U, t)    (7)
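Equations 6 and 7 translate directly into code. A short sketch, with function names of our own choosing, where each trial is represented simply by its best function value found:

```python
def success_rate(fbest_per_trial, t):
    """SR(U, t) as in equation 6: the fraction of trials whose best found
    value reached the target t."""
    hits = sum(1 for f in fbest_per_trial if f <= t)
    return hits / len(fbest_per_trial)


def mean_success_rate(fbest_per_trial, targets):
    """SR_m(U, T) as in equation 7: the success rate averaged over a set
    of target values."""
    return sum(success_rate(fbest_per_trial, t) for t in targets) / len(targets)
```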
The main metric used in the COCO framework is the expected running time, or ERT. It estimates the expected number of function evaluations that a selected strategy will take to reach a target function value t for the first time, over a set of trials U. It is defined as:

ERT(U, t) = (1 / |{u ∈ U : f_best(u) ≤ t}|) Σ_{u ∈ U} evals(u, t)    (8)

where evals(u, t) is the number of function evaluations used by trial u to reach target t, or the total number of evaluations used by u if it never reached t. The expression |{u ∈ U : f_best(u) ≤ t}| is the number of successful trials for target t. If there were no such trials, then ERT(U, t) = ∞. In the rest of this text we will use ERT averaged over a set of target values T, in a similar way to what is described in equation 7. We will also usually compute it using a set of trials obtained by running the same strategy on multiple different functions, usually all functions in one of the function groups described in table 1.
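Equation 8 can be sketched as follows; the tuple-based representation of a trial is our own: the best value reached, the evaluations used to first reach the target, and the total evaluations of the trial:

```python
def ert(trials, t):
    """ERT(U, t) as in equation 8. Each trial is (fbest, evals_to_t, total):
    evals_to_t counts evaluations until t was first reached (used when
    fbest <= t); total is the whole trial budget, used otherwise."""
    successes = sum(1 for fb, _, _ in trials if fb <= t)
    if successes == 0:
        return float('inf')  # no trial reached the target
    spent = sum(e2t if fb <= t else tot for fb, e2t, tot in trials)
    return spent / successes
```

For example, one successful trial that reached the target after 100 evaluations plus one unsuccessful trial of 500 evaluations gives an ERT of 600.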
For comparing two or more strategies in terms of success rates and expected running times, we use graphs of the empirical cumulative distribution function of run lengths, or ECDF. Such a graph displays on the y-axis the percentage of trials for which the ERT (averaged over a set of target values T) is lower than the number of evaluations x, where x is the corresponding value on the x-axis. It can also be said that for each x it shows the expected average success rate if a function evaluation budget equal to x was used. For easier comparison of ECDF graphs across different dimensionalities, the values on the x-axis are divided by the number of dimensions. The function displayed in the graph can then be defined as:

y(x) = (1 / (d |T| |U|)) Σ_{u ∈ U} |{t ∈ T : ERT(t, u) ≤ x}|    (9)

An example ECDF graph, like the ones used throughout the rest of the text, is given in figure 4. It shows the ERTs of two sets of trials measured by running two different strategies on the set of all benchmark functions, for d=10, averaged over a set of 50 target values. The target values are logarithmically distributed in the interval [10^-8; 10^2]. We use this same set of target values in all our ECDF graphs.
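A single point of such a graph can be sketched as a simple counting operation; the sketch below follows our own reading of the per-dimension scaling, namely that a (trial, target) pair counts as reached once its run length is at most x·d, and the function name is ours:

```python
def ecdf_point(run_lengths, x, d):
    """Fraction of (trial, target) pairs whose run length (evaluations needed
    to reach the target; float('inf') if never reached) is at most x * d."""
    hits = sum(1 for rl in run_lengths if rl <= x * d)
    return hits / len(run_lengths)
```

Evaluating this for a grid of x values and connecting the points produces the ECDF curve.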
The marker × denotes the median number of function evaluations of unsuccessful trials, divided by the number of dimensions. Values to the right of this marker are (mostly) estimated using bootstrapping (for details of the bootstrapping method, please refer to [Han+13b]). The fact that we use 15 trials for each strategy-function pair means that the estimate is reliable only up to about fifteen times the number of evaluations marked by ×. This fact should be kept in mind when evaluating the results. The thick orange line in the plot represents the best results obtained during the 2009 BBOB workshop for the same set of problems and is provided for reference.
Since we are dealing with a very large amount of measured results, it would be desirable to have a method of comparing them that is even more concise than ECDF graphs. To this end, we use a metric called the aggregate performance index (API), defined
[ECDF plot: proportion of trials vs. log10(evaluations/D), functions f1-24, 10-D; showing bfgs-k-100, nm-k-100 and the best 2009 reference]
Figure 4: Example ECDF graph
Comparison of the results of MetaMax(k), with k=100, using the BFGS and Nelder-Mead local search algorithms, on the set of all benchmark functions. The strategy using BFGS clearly outperforms the other one, both in terms of success rate and speed of convergence.
by Mr. Pošík in a yet unpublished (at the time of writing this text) article [Poš13]. It is based on the idea that the ECDF graph of the results of an ideal strategy, which solves the given problem instantly, would be a straight horizontal line across the top of the plot. Conversely, for the worst possible strategy imaginable, the graph would be a straight line along the bottom. It is apparent that the area above (or below) the graph makes for quite a natural measure of effectiveness of different strategies.
Given a set of ERTs A, their aggregate performance index can be computed as:

API(A) = exp( 1/|A| · Σ_{a∈A} log10(a) )    (10)
For the purposes of computing API, the ERTs of unsuccessful trials, which are by definition ∞, have to be replaced with a value that is higher than the ERT of any successful trial. The choice of this value determines how much the unsuccessful trials are penalized and thus affects the final API score. For our purposes, we chose the value 10^8·d.
Since we are computing API from the area above the graph, this means that the lower its value, the better the corresponding strategy performs. Using API essentially allows us to represent results of a set of trials by a single number and to easily compare performances of different optimization strategies.
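Equation (10), together with the penalty convention above, can be sketched as follows. Note that, as in the formula, exp is combined with log10; this is a direct transcription, not an independent design choice.

```python
# A minimal sketch of equation (10): unsuccessful trials (ERT = infinity)
# are replaced by the penalty value 1e8 * d before aggregating.
import math

def api(erts, d):
    """Aggregate performance index of a set of ERTs; lower is better."""
    penalty = 1e8 * d
    capped = [min(a, penalty) for a in erts]  # cap infinite ERTs
    mean_log = sum(math.log10(a) for a in capped) / len(capped)
    return math.exp(mean_log)

# Usage: one quick and one slow successful trial, in d = 1.
print(api([100.0, 10000.0], 1))  # exp of the mean log10, i.e. exp(3)
```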
4.3 Implementation details
The software side of this project was implemented mostly in Python, with parts
in C. The original plan was to write the project purely in Python, which was cho-
24
41. sen because of its ease of use and availability of many open-source scientic and
mathematical libraries. However, during the project it was found out that a pure
Python code performs too slowly and would not allow us to make all the necessary
measurements. Therefore, parts of the program had to be changed over to C, which
has improved performance to a reasonable level.
The used implementations of the BFGS and Nelder-Mead algorithms are based on code from the open-source Scipy library. They were modified to allow running the algorithms in single steps, which is necessary in order for them to work with MetaMax. An open-source implementation of CMA-ES was used, available at [Han13b]. The implementation of MetaMax was written based on its description in [GK11]. It was, however, necessary to make several small changes to it, mainly because it is designed with a maximization task in mind while we needed to use it for minimization problems. For finding upper convex hulls we used Andrew's algorithm with some additional pre- and post-processing, to get the exact behaviour described in [GK11].
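The upper-hull half of Andrew's monotone chain algorithm can be sketched as follows; the pre- and post-processing specific to [GK11] is omitted, and the point data are purely illustrative.

```python
# A sketch of the upper convex hull via Andrew's monotone chain algorithm.
# Points are (x, y) tuples; the result lists the hull vertices from the
# leftmost to the rightmost point.

def cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means a left turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    pts = sorted(set(points))  # sort by x, then y; drop duplicates
    hull = []
    for p in pts:
        # Pop while the last two hull points and p make a non-right turn,
        # which keeps only the upper boundary.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

print(upper_hull([(0, 0), (1, 2), (2, 1), (3, 3), (4, 0)]))
# -> [(0, 0), (1, 2), (3, 3), (4, 0)]
```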
For a description of the source code, please see the file source/readme.txt on the attached CD.
5 Results
In this section we will evaluate the results of the selected multi-start strategies. We decided to split the results into four subsections based on the used local search algorithm. We present and compare the results mainly using tables, which list APIs and success rates for different groups of functions and different dimensionalities. For convenience, the best results are highlighted with bold text. We also show ECDF graphs to illustrate particularly interesting results. Results of the smaller experiments described at the end of section 4.1 and results of timing measurements are summarized in subsection 5.5.
The values of success rates and APIs shown in this section are computed only using data bootstrapped up to the value of 10^5·d function evaluations. In our opinion, these values represent the real performance of the selected strategies better than if we were to use fully bootstrapped data, which are estimated to a large degree and therefore not so statistically reliable. In ECDF graphs, bootstrapped results are shown up to 10^7·d evaluations. All of the APIs and success rates are averaged over a set of multiple targets, as described in subsection 4.3.
The measured data are provided in their entirety on the attached CD (see section B) in the form of tarballed Python pickle files, which can be processed using the BBOB post-processing framework. It was not possible to provide the data in their original form, as text files, because their total size would be in the order of gigabytes, which would clearly not fit on the attached medium.
5.1 Compass search
Table 4 summarizes which of the used fixed restart and function value stagnation restart strategies were best for each dimensionality and were chosen for the best-of result collections cs-f-comb and cs-h-comb. Table 5 then compares these two sets of results together with results obtained by the compass search specific restart strategy cs-special.
It is apparent that for the best strategies the values of run length and function value history length increase with the number of dimensions. This is not unexpected, as compass search uses 2d or 2d-1 function evaluations at each step.
Dimensionality | Fixed     | Stagnation based
d=2            | cs-f-100d | cs-h-5d
d=3            | cs-f-100d | cs-h-5d
d=5            | cs-f-200d | cs-h-10d
d=10           | cs-f-500d | cs-h-10d
d=20           | cs-f-500d | cs-h-20d

Table 4: Compass search - best restart strategies for each dimensionality
The comparison of the best restart strategies suggests that all of them have quite similar overall performance, with cs-h-comb being a little better than the others in terms of success rate and cs-f-comb in terms of API. In the subsequent tables, we will provide results of cs-f-comb for reference, as an example of a well tuned restart strategy.
None of the strategies performs very well on multimodal and highly conditioned functions. This is to be expected, as the compass search algorithm is known to have trouble with ill-conditioned problems, and multimodal problems are difficult to solve for any algorithm.

[Table 5: Compass search - results of restart strategies. Lists log10 API and success rates [%] of cs-f-comb, cs-h-comb and cs-special for the function groups separ, lcond, hcond, multi, mult2 and all, in 2D, 3D, 5D, 10D and 20D.]
A comparison of the results of three MetaMax(k) strategies with corresponding fixed restart strategies which use the same total number of local search algorithm instances is given in table 6. They confirm our expectations and show that, overall, MetaMax(k) converges faster than a comparable fixed restart strategy. The only exception is the group separ. This can be explained by the fact that functions from this group are very simple and can generally be solved by a single, or only very few, runs of the local search algorithm. In this case, the MetaMax mechanism of selecting multiple instances each round is more of a hindrance than a benefit.
In terms of success rate, MetaMax(k) is always as good as or even better than the comparable fixed restart strategy, with the improvement being especially obvious on the groups lcond and mult2. Of the three tested variants of MetaMax(k), cs-m-100 is the best overall. However, it is not better than a well tuned restart strategy like cs-f-comb.
Figure 5 shows a behaviour which was observed across all function groups and dimensionalities when comparing MetaMax(k) with corresponding fixed restart strategies: at first, MetaMax(k) converges much more slowly than the restart strategy, as it is still in the phase of initialising all of its instances. However, as soon as this is finished, it starts converging quickly and overtakes the restart strategy for a certain interval. After that, its rate of convergence slows down again and it ends up with a success rate (for 10^5·d function evaluations) similar to that of the restart strategy. This effect seems to get less pronounced with an increasing number of dimensions.
[Figure: ECDF plot for f1-24, 5-D, showing cs-f-2000d and cs-k-50 against the best-2009 reference.]
Figure 5: Compass search - ECDF comparing MetaMax(k) with an equivalent fixed restart strategy
Results of comparing cs-k-100, cs-m-100 and cs-i-100 are shown in table 7. It is apparent that, using the same number of instances at a time, MetaMax and MetaMax(∞) clearly outperform MetaMax(k) on all function groups, both in terms of speed of convergence and success rates.
In general, they also provide results at least as good as, or better than, the best restart strategies. There is almost no difference between the performance of cs-m-100 and cs-i-100, which corresponds with the results presented in [GK11]. Differences in performance seem to diminish with increasing dimensionality and, for d=10 and d=20, all of the MetaMax strategies which use 100 instances perform almost the same.
The ECDF graph in figure 6 shows an interesting behaviour, where cs-m-100 and cs-i-100 start converging right away and overtake cs-k-100 while it is still in the process of initializing all of its instances. After that, MetaMax(k) catches up and for a certain interval all of the strategies perform the same. Then MetaMax(k) stops converging, the other two strategies overtake it again and ultimately achieve better success rates. The sudden stop in MetaMax(k) convergence presumably happens when all of its best instances have already found their local optima, after which there is no possibility of finding better solutions without adding new instances, which MetaMax(k) cannot do.
[Figure: ECDF plot for f1-24, 5-D, showing cs-m-100, cs-i-100 and cs-k-100 against the best-2009 reference.]
Figure 6: Compass search - ECDF of MetaMax variants using 100 instances
In the next set of measurements, using cs-m-50d, cs-i-50d and cs-k-50d, it became apparent that the increased limit on the maximum number of instances does not cause any noticeable increase in performance for MetaMax and MetaMax(∞). The performance of MetaMax(k) was somewhat improved, but overall it is still worse than the other two MetaMax variants and slightly worse than the best restart strategies. These results are also presented in table 7.
The ECDF graph in figure 7 shows results of cs-k-50d and cs-m-50d compared with the collection of best fixed restart strategy results cs-f-comb. We have omitted cs-i-50d, as its performance is very similar to that of cs-m-50d.
In conclusion, we can say that, using the compass search algorithm, MetaMax and MetaMax(∞) perform better than even well tuned restart strategies, and that increasing the maximum number of allowed instances does not have any significant effect on their performance.
[Figure: ECDF plot for f1-24, 10-D, showing cs-m-50d, cs-k-50d and cs-f-comb against the best-2009 reference.]
Figure 7: Compass search - ECDF of MetaMax variants using 50d instances
5.2 Nelder-Mead method
The best restart strategies for each dimensionality are listed in table 8 and their results are compared in table 9.
For the fixed restart strategies we see the expected behaviour, where the run lengths of the best strategies increase with the number of dimensions. However, there seem to be only two best objective function stagnation based strategies: nm-h-10d and nm-h-100d. Interestingly enough, the switch between them occurs between d=5 and d=10, which is also the point where the overall performance of the Nelder-Mead algorithm decreases dramatically.
Dimensionality | Fixed      | Stagnation based
d=2            | nm-f-100d  | nm-h-10d
d=3            | nm-f-100d  | nm-h-10d
d=5            | nm-f-500d  | nm-h-10d
d=10           | nm-f-1000d | nm-h-100d
d=20           | nm-f-5000d | nm-h-100d

Table 8: Nelder-Mead - best restart strategies for each dimensionality
The algorithm performs very well for a low number of dimensions (d=2, d=3 and to some extent also d=5), with results for these dimensionalities approaching those of the best algorithms from the 2009 BBOB conference. On the other hand, the performance for higher dimensionalities is very poor, especially on the group hcond. The three best-of restart strategies, compared in table 9, are all quite evenly matched, with nm-special being the best overall by a small margin and nm-f-comb being the worst.
The comparison of MetaMax(k) with corresponding fixed restart strategies, given in table 10, shows that MetaMax(k) performs better on multimodal functions, while its performance on the remaining function groups is worse than that of the corresponding restart strategies. This is the opposite of what was observed when using the compass search algorithm and can be explained by the fact that the Nelder-Mead algorithm, unlike compass search, can handle ill-conditioned problems very quickly and with a high success rate (at least for low dimensionalities). Therefore, there is no need for the MetaMax mechanism of selecting multiple instances each round, as almost any instance is capable of finding the global optimum. Selecting more than one at the same time only serves to decrease the rate of convergence.
Overall, the three tested MetaMax(k) strategies perform only slightly better than the corresponding fixed restart strategies and are clearly worse than the best restart strategies, such as nm-special.
Table 11 shows the results of the other tested MetaMax strategies. Unfortunately, the measurements for all dimensionalities were not finished in time before the deadline of this thesis; therefore table 11 contains only partial results for some strategies.
For the dimensionalities where the results of all the strategies are available, it is apparent that nm-m-100 and nm-i-100 outperform nm-k-100, both in terms of success rates and API. There are no significant differences in performance between MetaMax variants using 100 and 50d local search algorithm instances, as well as no observable differences between the performance of MetaMax and MetaMax(∞).
In comparison with the restart strategy nm-special, MetaMax and MetaMax(∞) have better success rates on the function groups separ, multi and mult2 and, as a result, a better overall success rate. MetaMax and MetaMax(∞) also converge faster on multi and mult2, but are slower on lcond and hcond. The overall result is that they are better than the best restart strategy, in terms of API, for d=2 and d=3, but are worse for d=5.
Unfortunately, we cannot make comparisons for higher dimensionalities, where the results for MetaMax and MetaMax(∞) are not available. However, based on the fact that the advantage in performance of MetaMax over the restart strategy is lower in d=3 than in d=2, and that the restart strategy is better in d=5, we can extrapolate that MetaMax would likely also perform worse for higher dimensionalities. Even if there was an improvement in performance, the fact remains that the Nelder-Mead method performs so badly in higher dimensionalities that it is unlikely that MetaMax could improve it to a practical level.
5.3 BFGS
Results of the restart strategies bfgs-f-comb, bfgs-h-comb and bfgs-special are shown in table 13. The best fixed and objective function stagnation based restart strategies for each dimension, which were used to make bfgs-f-comb and bfgs-h-comb, are listed in table 12.
Dimensionality | Fixed        | Stagnation based
d=2            | bfgs-f-100d  | bfgs-h-2d
d=3            | bfgs-f-100d  | bfgs-h-2d
d=5            | bfgs-f-200d  | bfgs-h-2d
d=10           | bfgs-f-1000d | bfgs-h-2d
d=20           | bfgs-f-1000d | bfgs-h-2d

Table 12: BFGS - best restart strategies for each dimensionality
For the selected best fixed restart strategies, we see the ordinary behaviour where run lengths increase with dimensionality. However, for the stagnation based restart strategies, bfgs-h-2d is apparently the best for all dimensionalities. This is quite unusual but, in hindsight, not entirely unexpected. It has to do with the way our implementation of BFGS works: at the beginning of each step, the algorithm estimates the gradient of the objective function using the finite difference method. This involves evaluating the objective function at a set of neighbouring solutions, which are very close to the current solution. The number of these neighbour solutions is always 2d - one for each vector in a positive orthonormal basis of the search space. A very quick way to detect convergence is to check if the objective function values at these points are all worse than the function value at the current solution. As it turns out, this is precisely what bfgs-h-2d does, and also the reason why it works so well.
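The check described above can be sketched as follows. This is a hypothetical standalone helper, not the actual thesis code; it assumes the 2d probe points are the current solution shifted by ±eps along each coordinate axis.

```python
# A sketch of the convergence check attributed to bfgs-h-2d: the 2d points
# probed for the finite-difference gradient estimate double as a stagnation
# test - if none of them improves on the current solution, restart.
import numpy as np

def neighbours_all_worse(f, x, eps=1e-8):
    """True if every probe point x +/- eps*e_i is worse than (or equal to) x."""
    fx = f(x)
    d = len(x)
    for i in range(d):
        step = np.zeros(d)
        step[i] = eps
        # Any improving neighbour means the search has not stagnated yet.
        if f(x + step) < fx or f(x - step) < fx:
            return False
    return True

# Usage: a sphere function; the origin is its minimum, so all probes are worse.
sphere = lambda x: float((x ** 2).sum())
print(neighbours_all_worse(sphere, np.array([0.0, 0.0])))  # True
```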
In contrast with the surprisingly good results of bfgs-h-2d, the special restart strategy, which is based on monitoring the value of the norm of the estimated gradient, performs very poorly and is clearly the worst of all the tested restart strategies. The other two strategies, bfgs-h-comb and bfgs-f-comb, have a very similar performance, with bfgs-h-comb being slightly better.
Overall, BFGS has excellent results on ill-conditioned problems, even exceeding the performance of the best algorithms from the BBOB 2009 conference for certain dimensionalities on the group hcond, which is illustrated in figure 9. However, it performs quite poorly on multimodal functions (multi and mult2).
Table 14 sums up the results comparing MetaMax(k) with corresponding fixed restart strategies. In terms of success rate, both types of strategies perform the same on all function groups. In terms of rate of convergence, expressed by the values of API, the results are similar to those observed when using the Nelder-Mead method: MetaMax(k) strategies perform better on multi and mult2, but worse on separ, lcond and hcond.
The overall performance of MetaMax(k) across all function groups is worse than that of the corresponding fixed restart strategies and consequently also worse than the performance of the best restart strategies.