Black box optimization of restart strategies for the MetaMax algorithm

Czech Technical University in Prague
Faculty of Electrical Engineering

DIPLOMA THESIS

Bc. Viktor Kajml

Black box optimization: Restarting versus MetaMax
algorithm

Department of Cybernetics

Project supervisor: Ing. Petr Posik, Ph.D.

Prague, 2014

Abstrakt
Tato diplomová práce se zabývá vyhodnocením nového perspektivního optimaliza£ního algoritmu, nazvaného MetaMax. Hlavním cílem je zhodnotit vhodnost jeho
pouºití pro °e²ení problém· optimalizace £erné sk°í¬ky se spojitými parametry, obzvlá²t¥
v porovnání s ostatními metodami b¥ºn¥ pouºívanými v této oblasti. Za tímto ú£elem
je MetaMax a vybrané tradi£ní restartovací strategie, podrobn¥ otestován na rozsáhlé
sad¥ srovnávacích funkcí, za pouºití r·zných algoritm· lokálního prohledávání. Takto
nam¥°ené výsledky jsou poté porovnány a vyhodnoceny. Druhotným cílem je navrhnout
a implementovat modikace algoritmu MetaMax v jistých oblastech, kde je prostor
pro zlep²ení jeho výkon·.

Abstract
This diploma thesis is focused on evaluating a new promising multi-start optimization algorithm called MetaMax. The main goal is to assess its utility
it in the area of black-box continuous parameter optimization, especially in
comparison with other strategies commonly used in this area. To achieve this,
MetaMax and a selection of traditional restart strategies are thoroughly tested
on a large set of benchmark problems and using multiple dierent local search
algorithms. Their results are then compared and evaluated. An additional
goal is to suggest and implement modications of the MetaMax algorithm,
in certain areas where it seems that there could be a potential room for improvement.

I would like to thank:
Mr. Petr Po²ík for his help on this thesis
The Centre of Machine perception at the Czech Technical University in Prague
for providing me with access to their computer grid
My friends and family for their support

Contents
1

Introduction

1

2

Problem description and related work

3

2.1

Local search algorithms . . . . . . . . . . . . . . . . . . . . . . . . . .

3

2.2

Multi-start strategies . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

3

MetaMax algorithm and its variants

3.1
4

Suggested modications

. . . . . . . . . . . . . . . . . . . . . . . . .

Experimental setup

9

13
16

4.1

18

Used metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

4.3
5

Used multi-start strategies . . . . . . . . . . . . . . . . . . . . . . . .

4.2

Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . .

24

Results

25

5.1

25

Nelder-Mead method . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

5.3

BFGS

36

5.4

CMA-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

5.5
6

Compass search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5.2

Additional results . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

46

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Conclusion

48

A Used local search algorithms

51

A.1

Compass search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

A.2

Nelder-Mead algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .

51

A.3

BFGS

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

A.4

CMA-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

B

CD contents

56

C

Acknowledgements

56

i

List of Tables
1

Benchmark function groups

. . . . . . . . . . . . . . . . . . . . . . .

17

2

Algorithm specic restart strategies . . . . . . . . . . . . . . . . . . .

20

3

Tested multi-start strategies . . . . . . . . . . . . . . . . . . . . . . .

21

4

Compass search - best restart strategies for each dimensionality

. . .

26

5

Compass search - results of restart strategies . . . . . . . . . . . . . .

26

6

Compass search - results of MetaMax(k) and corresponding xed restart
strategies

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

7

Compass search - results of MetaMax strategies

. . . . . . . . . . . .

30

8

Nelder-Mead - best restart strategies for each dimensionality . . . . .

31

9

Nelder-Mead - results of restart strategies . . . . . . . . . . . . . . . .

32

10

Nelder-Mead - results of MetaMax(k) and corresponding xed restart
strategies

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

11

Nelder-Mead - results of MetaMax strategies . . . . . . . . . . . . . .

35

12

BFGS - best restart strategies for each dimensionality . . . . . . . . .

36

13

BFGS - results of restart strategies

37

14

BFGS - results of MetaMax(k) and corresponding xed restart strategies 38

15

BFGS - results of MetaMax strategies . . . . . . . . . . . . . . . . . .

40

16

CMA-ES - best restart strategies for each dimensionality

. . . . . . .

41

17

CMA-ES - results of restart strategies . . . . . . . . . . . . . . . . . .

42

18

CMA-ES - results of MetaMax(k) and corresponding xed restart
strategies

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

43

19

CMA-ES - results of MetaMax strategies . . . . . . . . . . . . . . . .

45

20

CD contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

56

ii

List of Figures
1

Restart condition based on function value stagnation

2

Example of monotone transformation of

3

MetaMax selection mechanisms

4

Example ECDF graph

5

Compass search - ECDF comparing MetaMax(k) with an equivalent
. . . . . . . . . . . . . . . . . . . . . . . . . . .

27

6

Compass search - ECDF of MetaMax variants using 100 instances . .

29

7

Compass search - ECDF of MetaMax variants using 50d instances . .

31

8

Nelder-Mead - ECDF comparing MetaMax(k) strategies

. . . . . . .

32

9

BFGS - ECDF of the best restart strategies

. . . . . . . . . . . . . .

37

10

BFGS - ECDF of MetaMax variants using 50d instances

. . . . . . .

39

11

CMA-ES - ECDF of function value stagnation based restart strategies

41

12

CMA-ES - ECDF comparison of MetaMax variants using 50d instances 44

13

MetaMax timing measurements

14

ECDF comparing MetaMax strategies using dierent instance selec-

xed restart strategy

tion methods
15

. . . . . . . . .

7

. . . . . . . . . . . . . .

15

. . . . . . . . . . . . . . . . . . . . .

16

. . . . . . . . . . . . . . . . . . . . . . . . . .

24

f (x)

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Nelder-Mead algorithm in 2D

47
48

. . . . . . . . . . . . . . . . . . . . . .

52

1

Typical structure of a local search algorithm . . . . . . . . . . . . . . .

2

2

Variable neighbourhood search

. . . . . . . . . . . . . . . . . . . . . .

9

3

MetaMax(k)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

4

MetaMax(∞) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

5

MetaMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

6

Compass search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

7

Nelder-Mead method . . . . . . . . . . . . . . . . . . . . . . . . . . . .

53

8

BFGS algorithm

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

9

CMA-ES algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

55

List of Algorithms

iii

1

Introduction
The goal of this thesis is to implement and evaluate the performance of the

MetaMax optimization algorithm, particularly in comparison with other commonly
used optimization strategies.
MetaMax was proposed by György and Kocsis in [GK11] and the results they
present seem very interesting and suggest that MetaMax might be a very competitive
algorithm. Our goal is to more closely evaluate its performance on problems from
the area of black-box continuous optimization, by performing a series of exhaustive
measurements and comparing the results with those of several commonly used restart
strategies.
This text is organized as follows: Firstly there is a short overview of the subjects
of mathematical, continuous and black-box optimization, local search algorithms and
multi-start strategies. This is meant as an introduction for readers who might not
be familiar with these topics. Readers who already have knowledge of these elds
might wish to skip forward to the following sections, which describe the MetaMax
algorithm, the experimental setup, used optimization strategies and the software
implementation. In the last two sections, the measured results are summed up and
evaluated.
The mathematical optimization problem is dened as selecting the best element,
according to some criteria, from a set of feasible elements. Most common form of

x1,opt , . . . , xd,opt , where d is the problem
dimension, for which a value of a given objective function f (x1 , . . . , xd ) is minimal,
that is f (x1,opt , . . . , xd,opt ) ≤ f (x1 , . . . , xd ) for all possible values of x1 , . . . , xd .
the problem is nding a set of parameters

Within this eld of mathematical optimization, it is possible to dene several
subelds based on the properties of the parameters
information available about the objective function

x1 , . . . , x d

and the amount of

f.

The set of all possible solutions (possible combid
nations of the parameter values) is nite. Usually some subset of N .

Combinatorial optimization:

Integer programming:

All of the parameters are restricted to be integers:

x1 , . . . , xd ∈ N. Can be considered to be a subset of combinatorial optimization.
Mixed integer programming:

Some parameters are real-valued and some are

integers.
Continuous optimization:

The set of all possible solutions is innite. Usually

x1 , . . . , x d ∈ R .
black-box optimization:

about

f

Assumes that only a bare minimum of information

is given. It can evaluated at an arbitrary point

tion value

f (x),

but besides that, no other properties of

x, returning the funcf are known. In order

to solve this kind of problems, we have to resolve to searching (The exact techniques are be described in more detail later in this text). Furthermore we are
almost never guaranteed to nd the exact solution, just one that is suciently
close to it, and there is almost always a non-zero probability that even an
approximate solution might not be found at all.

1

White box optimization

knowledge about

f,

deals with problems where we have some additional

for example its gradient, which can obviously be very

useful when looking for its minimum.
In this text we will almost exclusively with black-box continuous optimization
problems.
For a practical example of a black-box optimization problem, imagine the process
of trying to design an airfoil, which should have certain desired properties. It is possible to describe the airfoil by a vector of variables representing its various parameters
- length, thickness, shape, etc. This will be the parameter vector
run aerodynamic simulation with the airfoil described by

x,

x.

Then, we can

evaluate how closely it

matches the desired properties, and based on that, assign a function value

f (x)

the parameter vector. In this way, the simulator becomes the black-box function
and the problem is transformed into task of minimizing the objective function
We can then use black-box optimization methods to nd the parameter vector

to

f
f.

xopt

which will give us an airfoil with the desired properties.
This example hopefully suciently illustrates the fact that, black-box optimization can be a very powerful tool, as it allows us to nd reasonably good solutions
even for problems which we might not be able to, or would not know how to, solve
otherwise.
As already mentioned, the usual method for nding optima (the best possible
set of parameters

xopt )

in continuous mathematical optimization is searching. The

structure of a typical local search algorithm is as follows:

Algorithm 1: Typical structure of a local search algorithm

1

Select a starting solution

x0

somehow (most commonly randomly) from the

set of feasible solutions.

2
3
4
5
6
7
8
9
10
11
12

xc ← x0
f (xc ).

Set current solution:
Get function value
while

Stop condition not met

do

Generate a set of neighbour solutions
Evaluate

f

at each

Xn

similar to

xc

xn ∈ X n

Find the best neighbour solution
∗
if f (x ) f (xc ) then
Update the current solution

x∗ = argmaxxn ∈Xn f (x)

xc ← x∗

else

Modify the way of generating neighbour solutions
return

xc

In the case of continuous optimization, a solution is represented simply by a
d
point in R . There are various ways of generating neighbour solutions. In general,
two neighbouring solutions should be dierent from each other, but in some sense
also similar. In continuous optimization, this usually means that the solutions are
close in terms of Euclidean distance, but not identical.

2

The algorithm described above has the property that it always creates neighbour
solutions close to the current solution and moves the current solution in the direction
of decreasing

f (x). This makes it a greedy algorithm, which works well in cases where

the objective function is unimodal (has only one optimum), but for multimodal
functions (functions with multiple local optima), the resulting behaviour will not be
ideal. The algorithm will move in the direction of the nearest optimum (the optimum

x0 ),

with basin of attraction containng

but when it gets there it will not move any

further, as at this point, all the neighbour solutions will be worse than the current
solution. Such algorithm can therefore be relied on to nd the nearest local optimum,
but there is no guarantee that it will also be the global one. The global optimum
will be found only when

x0

happens to land in its basin of attraction.

The method which is most commonly used to overcome this problem, is to run
multiple instances of the local search algorithm from dierent starting positions

x0 .

Then it is probable that at least one of them will start in the basin of attraction of
the global optimum and will be able nd it.
There are various dierent multi-start strategies which implement this basic idea,
with MetaMax, the main subject of this thesis, being one of them.
More thorough description of the local search algorithms problem of getting stuck
in a local optimum is described in section 2. Detailed description of the MetaMax
algorithm and its variations is given in section 3. Structure of the performed experiments is described in section 4. Finally, the measured results are presented and
evaluated in section 5.

2

Problem description and related work
As mentioned in the previous section, local search algorithms have problems

nding the global optimum of functions with multiple optima (also called multimodal
functions). In this section we focus on this problem more thoroughly. We describe
several common types of local search algorithms in more detail and discuss their
susceptibility to getting stuck in a local optimum. Next, we will describe several
methods to overcome this problem.

2.1 Local search algorithms
Following are the descriptions of four commonly used kinds of local search algorithms, which we hope will give the reader a more concrete idea about the functioning
of local search algorithms, than the very basic example described in algorithm 1.
Line search algorithms try to solve the problem of minimizing a d-dimensional

function

f

by using a series of one-dimensional minimization tasks, called line

searches. During each step of the algorithm, an imaginary line is created starting at the current solution

xc

and going in a suitably chosen direction

σ . Then,

x, with the minimal value of f (x), and the
current solution is updated: xc ← x. In this way, the algorithm will eventually
converge on a nearest local optimum of f .
the line is searched for a point

3

The question remains - How to chose the search direction

σ?

The most simple

algorithms just use a preselected set of directions (usually vectors in an orthonormal positive d-dimensional base) and loop through them on successive
iterations. This method is quite simple to implement, but it has trouble coping
with ill-conditioned functions.
An obvious idea might be to use information about the functions gradient for
determining the search direction. However, this turns out not to be much more
eective than simple alternating algorithms. The best results are achieved when
information about both the functions gradient and its Hessian is used. Then,
it is possible to get quite robust and well performing algorithms. Note, that
for black-box optimization problems, it is necessary to obtain the gradient by
estimation, as it is not explicitly available.
Examples of this kind of algorithm are: Symmetric rank one method, gradient
descent algorithm and Broyden-Fletcher-GoldfarbShanno algorithm
Pattern search algorithms closely t the description given in algorithm 1. They

generate the neighbour solutions
relative to the current solution

xn ∈ X n

in dened positions (a pattern)

xc . If any of the neighbour solutions is found to

be better than the current one, it then becomes the new current solution, the
next set of neighbour solutions is generated around it and so on.
If none of the neighbour solutions is found to be better (an unsuccessful iteration), then the pattern is contracted so that in the next step the neighbour
solutions are generated closer to

xc .

In this way the algorithm will converge to

the nearest local optimum (for proof, please see [KLT03]). Advanced pattern
search algorithms use patterns, which change size shape according to various
rules, both on successful and unsuccessful iterations.
Typical algorithms of this type are: Compass search (or coordinate search),
Nelder-Mead simplex algorithm and Luus-Jakola algorithm.
Population based algorithms keep track of a number of solutions, also called

individuals, at one time, which together constitute a population. A new generation of solutions is generated each step, based on the properties of a set of
selected (usually the best) individuals from the previous generation. Dierent
algorithms vary in the exact implementation of this process.
For example, in the family of genetic algorithms, this process is designed to
emulate natural evolution: Properties of each individual (in case of continuous
optimization, this means its position) are encoded into a genome and new
individuals are created by combining parts of genomes of successful individuals
from the previous generation, or by random mutation. Unsuccessful individuals
are discarded, in an analogy with the natural principle of survival of the ttest.
Other population based algorithms, such as CMA-ES take a somewhat more
mathematical approach: New generations are populated by sampling a multivariate normal distribution, which is in turn updated every step, based on the
properties of a number of best individuals from the previous generation.

4

Swarm intelligence algorithms are based on the observation, that it is possible to

get quite well performing optimization algorithms by trying to emulate natural
behaviours, such as ocking of birds or sh schools. Each solution represents one
member of a swarm and moves around the search space according to a simple
set of rules. For example, it might try to keep certain minimal distance form
other ock members, while also heading in the direction with the best values
of

f (x).

The specic rules vary a great deal between dierent algorithms, but

in general even a simple individual behaviour is often enough to result in quite
complex collective emergent behaviour. Because swarm intelligence algorithms
keep track of multiple individuals/solutions during each step, they can also be
considered to be a subset of population based algorithms.
Some examples of this class of algorithms are the Particle swarm optimization
algorithm and the Fish school search algorithm.
Pattern search and line search algorithms, have the property that they always
choose neighbour solutions close to the current solution and they move in direction
of decreasing

f (x).

Thus, as was already described in the previous section, they are

able to nd only the local optimum, which is nearest to their starting position

x0 .

Population based and swarm intelligence algorithms might be somewhat less susceptible to this behaviour, in the case where the initial population is spread over a
large area of the search space. Then there is a chance that some individuals might
land near to the global optimum, and eventually pull the others towards it.
There are several modications of local search algorithms specically designed to
overcome the problem of getting stuck in a local optimum. We shall now describe
two basic ones - Simulated annealing and Tabu search. The main idea behind them,
is to limit the local search algorithms greedy behaviour by sometimes taking steps
other than those, which lead to the greatest decrease of

f (x).

Simulated annealing implements the above mentioned idea in a very straightfor-

ward way: During each step, the local search algorithm may select any of the
generated neighbour solutions with a non-zero probability, thus possibly not
selecting the best one.
The probability
of

f (xc ), f (xn ),

P

of choosing a particular neighbour solution

and

s,

where

s

xn

is a function

is the number of steps already taken by the

algorithm. Usually, it increases with the value of

∆f = f (xc ) − f (xn ),

so that

the best neighbour solutions are still likely to be picked the most often. The
probability of choosing a neighbour solution other than the best one also usually
decreases as

s

increases, so that the algorithm behaves more randomly in the

beginning and then, as the time goes on, settles down to a more predictable
behaviour and converges to the nearest optimum. This is somewhat similar
to the metallurgical process of annealing, from where the algorithm takes its
name.
It is possible to apply this method to almost any of the previously mentioned local search algorithms, simply by adding the possibility of choosing neighbour solutions, which are not the best. In practice, the exact form of

5

P (f (xc ), f (xn ), s)

has to be ne-tuned for a given problem in order to get good results. Therefore,
this algorithm is of limited usefulness in the area of black-box optimization.
Tabu search works by keeping list of previously visited solutions, which is called

the tabu list. It selects potential moves only from the set of neighbour solutions,
which are not on the this list, even if it means choosing a solution, which is
worse than the current one. The selected solution is then added to the tabu
list and the oldest entry in the Tabu list is deleted. The list therefore works in
a way similar to a cyclic buer.
This method has been originally designed for solving combinatorial optimization problems and it requires certain modications in order to be useful in the
area of continuous parameter optimization. At the very least, it is necessary
to modify the method to not only discard neighbour solutions which are on
the tabu list, but also solutions which are close to them. Without this, the
algorithm would not work very well, as the probability of generating the exact
d
same solution twice in R is quite small.
There is a multitude of advanced variations of this basic method, for example
it is possible to add aspiration rules, which override the tabu status of solutions
that would lead to a large decrease in

f (x).

For a detailed description of tabu

search adapted for continuous optimization, please see [CS00].

2.2 Multi-start strategies
Multi-start strategies allow eectively using local search algorithms on functions
with multiple local optima without making any modication to the way they work.
The basic idea is, that if we run a search algorithm multiple times, each time from
a dierent starting position

x0 ,

then it is probable that at least one of the starting

positions will be in the basin of attraction of the global optimum and thus the
corresponding local search algorithm will be able to nd it. Of course, the probability
of this depends on the number of algorithm instances that are run, relative to the
number and properties of the functions optima. It is possible to think about multistart strategies as meta-heuristics, running above, and controlling multiple instances
of local search algorithm sub-heuristics.
Restart strategies are a subset of multi-start strategies, where multiple instances
are run one at a time in succession. The most basic implementation of a restart
strategy is to take the total amount of allowed resource budget (usually a set number
of objective function evaluations), evenly divide it into multiple slots, and use each
of them to run one instance of a local search algorithm. A very important choice
is deciding the length of a single slot. The optimal length largely depends on the
specic problem and type of used algorithm. If the length is set too low, then the
algorithm might not have enough time to converge to its nearest optimum. If it is too
long, then there is a possibility that resources will be wasted on running instances
which are stuck in local optima and can no longer improve.
Of course, all of the time slots do not have to be of the same length. A good
strategy for black-box optimization is to start with low length and keep increasing

6

it for each subsequent slot. In this way, a reasonable performance can be achieved
even if we are unable to choose the most suitable slot length for a given problem in
advance.
A dierent restart strategy is to keep each instance going as long as it needs
to until it converges to an optimum. The most universal way to detect convergence
is to look for stagnation of values of the objective function over a number of past
function evaluations (or past local search algorithm iterations). If the best objective
function value found so far does not improve by at least the limit

tf

over the last

hf

function evaluations, then the current algorithm instance is terminated and new one
is started. For convenience, in the subsequent text we will call

hf

the function value

history length and tf the function value history tolerance. An example of this restart
condition is given in gure 1: The best solution found after v function evaluations
∗
∗
is marked as xv and its corresponding function value as f (xv ). In the gure, we see
that the restart condition is triggered because at the last function evaluation m, the
∗
∗
following is true: f (xm−h ) ≤ f (xm ) + tf
f

m−hf
3000

f(xv )

2500
2000
1500
1000
5000

5

10

15

20

v

25

30

35

∗
f(xm ) + tf
∗
f(xm )

Figure 1: Restart condition based on function value stagnation

Displays the objective function value f (xv ) (dashed black line) of evaluation v , and the best objective function value reached after v function
evaluations f (x∗ ) (solid black line), over the interval 0, m function
s
evaluations. The values f (x∗ ), f( x∗ ) + tf and m − hf are highlighted.
m
m
It is, of course, necessary to choose specic values of

hf

and

tf

but usually it is

not overly dicult to nd a combination which works well for a large set of problems.
Various dierent ways of detecting convergence and corresponding restart conditions can be used. For example, reaching zero-gradient for line-search algorithms,

7

reaching minimal size of pattern for pattern search algorithms, etc.
There are also various ways of choosing the starting position
search algorithm instances. The simplest one is to choose

x0

x0

for new local

by sampling random

uniform distribution over the set of all feasible solutions. This is very easy to implement and often gives good results. However, it is also possible to use information
gained by the previous instances, when choosing

x0

for a new one.

A simple algorithm, which utilizes this idea is the Iterated search: The rst instance

i1

is started from an arbitrary position and it is run until it converges (or

until it exhausts certain amount of resources) and returns the best solution it has
∗
found xi1 . Then, the starting position for the next instance is selected from the neigh∗
bourhood N of xi . Note, that N is a qualitatively dierent neighbourhood, than
1

what the instance

i1

might be using to generate neighbourhood solutions each step.

It is usually much larger, with the goal being to generate the new starting point for
instance

i2

by perturbing the best solution of

i1

enough, to move it to a dierent
x∗2 better than x∗1 , then the
i
i
∗
∗
∗
next instance is started from the neighbourhood N (xi ). If f (xi ) ≥ f (xi ) and a
basin of attraction. If the new instance nds a solution

2

2

1

better solution is not found, then the next instance is started from the neighbour∗
hood N (xi ) again. This is repeated until a stop condition is triggered. An obvious
1
assumption, that this method makes, is that the minima of the objective function
are grouped close together. If this is not the case, then it might be better to use
uniform random sampling.
The big question is, how to choose the size of the neighbourhood

N?

Too small,

and the new instance might fall into the same basin of attraction as the previous one.
Too big, and the results will be similar to choosing the starting position uniformly
randomly. Another method, called the Variable neighbourhood search, which can, in
a way, be considered to be an improved version of the iterated search, tackles this

N1 , . . . , Nk of varying sizes,
N1 is the smallest and the following neighbourhoods are successively larger,
with Nk being the largest. The restarting procedure is the same as with iterated
search, with the following modication: If a local search algorithm instance ik , started
∗
from the neighbourhood N1 (xi
) does not improve the current best solution, then
k−1
∗
the algorithm tries starting the next instance from N2 (xi
), then N3 (x∗k−1 ), and so
i
k−1

problem by using multiple neighbourhood structures
where

on. The structure of a basic variable neighbourhood search, as given in [HM03], page
10, is described in algorithm 2. This algorithm can also be used as a description of
iterated search, if the set of neighbourhood structures contains only one element.
Yet another group of methods, which aim to prevent local search algorithms
from getting stuck in local optima is based on the idea that it is not necessary to
run multiple local search algorithm instances one after another, but they can be run
at the same time. Then, it is possible to evaluate the expected performance of each
instance based on the results it obtained so far and allocate the resources to the best
(or most promising) ones. This is somewhat similar to the well known multi-armed
bandit problem.
The basic implementation of this idea is called the explore and exploit strategy.
It involves initially running all of its

k

algorithm instances until a certain fraction

of its resource budget is expended. This is the exploration phase. Then, the best

8

Algorithm 2: Variable neighbourhood search

: initial position

input

x0 ,

set of neighbourhood structures

N1 , . . . , Nk

of

increasing size

1
2
3
4
5
6
7
8
9
10
11

x∗ ← local_search(x0 )
k←1
while

Stop condition not met

Generate random point
y ∗ ← local_search(x )
∗
∗
if f (y ) f (x ) then
∗
∗

x

do

from

Nk (x∗ )

x ←y
k←1

else

k ←k+1
return

x∗

instance is selected and run until the rest of the resource budget is used up - The
exploitation phase.
There is, again, an obvious trade o between the amount of resources allocated
to each phase. The exploration phase should be long enough, so that, when it ends,
it is possible to reliably identify the best instance. On the other hand, it is necessary
to have enough resources left in the exploitation phase in order for the selected best
instance to converge to the optimum. In practice, it is actually not that dicult to
nd balance between these two phases, that gives good results for a wide range of
problems.
Methods like this, which run multiple local search algorithm instances at the
same time, belong into the group of portfolio algorithms. We should, however, note
that portfolio algorithms are usually used in a somewhat dierent way than described
here. Most commonly, they run multiple instances of dierent local search algorithms,
each of which is well suited for a dierent kind of problem. This allows the portfolio
algorithm to select instances of that algorithm, which is able to solve the given
problem the most eciently, even without knowing its properties a priori.
The MetaMax algorithm, which is the main subject of this thesis, is also an portfolio algorithm. However, we use it only running one kind of local search algorithm at
a time, to allow for a more fair and direct comparison with restart strategies, which
typically use only one kind of local search algorithm.

3

MetaMax algorithm and its variants
The MetaMax algoithm is a multi-start portfolio strategy presented by György

and Koscis in [GK11]. There are, actually, three versions of the algorithm, which
dier in certain details. They are called MetaMax(k), MetaMax(∞) and MetaMax
and they will be described in detail in this section.
Please note, that while in this text we usually presume all optimization prob-

9

lems to be minimization problems, the text in [GK11] assumes a maximization task.
Therefore, while describing the workings of MetaMax algorithm in this section, we
will keep to the convention in [GK11], but in the rest of the text we will refer to
minimization tasks as usual. Our implementation of MetaMax was modied to work
with minimization tasks.
György and Kocsis demonstrate ([GK11], page 413, equation 2) that convergence
of an instance of a local search algorithm, after

s

steps, can be optimistically esti-

mated with large probability as:

lim f (x∗ ) ≤ f (x∗ ) + gσ (s)
s
s

(1)

s→∞
Where

f ( x∗ )
s

is the best function value obtained by the local search algorithm in-

stance up until the step

s

and

gσ (s)

is a non increasing, non negative function with

lims→∞ g(s) = 0. Note, that the notation used here is a little dierent than in [GK11],
but the meaning is the same.
In practice, the exact form of

g(s)

is not known, so the right side of equation 1

has to be approximated as:

f (x∗ ) + ch(s)
s
Where

(2)

c is an unknown constant and h(s) is a positive, monotone, decreasing function

with the following properties:

h(0) = 1,

lims→∞ h(s) = 0

(3)

h(s) = e−s .

In the subsequent text, we

One possible simple form of this function is

shall call this function the estimate function. György and Kocsis do not use this
name in their work. In fact, they do not use any name for this function at all and
refer to it simply as

h

function. However, we think that this is not very convenient,

hence we picked a suitable name.
Based on equations 1 and 2, it is possible to create a strategy that allocates
resources only to those instances, which are estimated to converge the most quickly
and maximize the value of expression 2 for a certain range of the constant

c.

The

problem of nding these instances can be solved eectively by transforming it into
a problem of nding the upper right convex hull of a set of points in the following
way:

Ai
si , the position xi,si of the best solution
∗
it has found so far and its corresponding function value f (xi,s ). If we represent the
i
set of the local search algorithm instances Ai , i = 1, . . . , k by a set of points:
We assume, that there is

k

number of instances in total and that each instance

keeps track of the number of steps it has taken

P : {(h(si ), f (x∗ i )), i = 1, . . . , k}
i,s

(4)

Then the instances which minimize the value of expression 2 for a certain range of

c

correspond to those points, which lie on the upper right convex hull of the set

P.

Because the term upper right convex hull is not quite standard, we should clarify
that we understand it to mean an intersection of the upper convex hull and the right
convex hull.

10

Note, that presumably for simplicity, the authors of [GK11] assumed only local
search algorithms which use the same number of function evaluations every step. For
algorithms where this is not true, it makes more sense to set
of function evaluations used by the instance

i

si

equal to the number

so far instead. We believe that this is

a better way to measure the use of resources by individual instances, which is also
conrmed in [PG13].
György and Kocsis suggest using a form of estimate function, which changes based
on the amount of resources used by all the local search algorithm instances, in order
to encourage more exploratory behaviour as the MetaMax algorithm progresses.
Therefore, in our implementation, we use the following estimate function, which is
recommended in [GK11]:

h(vi , vt ) = e−vi /vt
Where

vi

(5)

is the number of function evaluations used by instance

i and vt

is the total

number of function evaluations used by all of the instances combined.
The simplest of the three MetaMax variants is MetaMax(k). It uses

k local search

algorithm instances and is described in algorithm 3. For convenience and improved
readability, we will use simplied notation, when describing MetaMax variants:

vi

for number of function evaluations used by local search algorithm instance

i

so

far

xi

for position of the best solution found by instance

fi

for function value of

i

so far

xi

In the descriptions, we also assume that the estimate function

h is a function of only

one variable.
Algorithm 3: MetaMax(k)

: function to be optimized

input

f,

number of algorithm instances

monotone non-decreasing function

h

k

and a

with properties as given in

equation 3

1

Step each of the
variables

2
3

while

v i , xi

k

local search algorithm instances

and

Ai

and update their

fi

stop conditions not met

do

i = 1, . . . , k , select algorithm Ai if there exists c 0 so that:
fi + ch(vi ) fj + ch(vj ) for all j = 1, . . . , k so that (vi , fi ) = (vj , fj ). If
there are multiple algorithms with identical v and f , then select only one
For

of them at random.

4
5
6
7

Step each selected

Ai

and update its variables

vi , xi

and

fi .

b = argmin1,...,k (fi ).
∗
solution: x ← xb .

Find the best instance:
Update the best
return

x∗

As with a priori scheduled restart strategies, there is the question of choosing
the right number of instances (parameter

k)

11

to use. The other two versions of the

algorithm - MetaMax and MetaMax(∞) get around this problem by gradually increasing the number of instances, starting with a single one and adding a new one
every round. Thus, the number of instances tends to innity as the algorithm keeps
running. This allows to prove that the algorithm is consistent. That is, it will almost
surely nd the global optimum if kept running for an innite amount of time.
Please note, that in some literature, such as [Neu11], the term asymptotically
complete is used, instead of consistent, but both of them mean the same thing. Also
note, that we use the word round to refer to a step of the MetaMax algorithm, in order
to avoid confusion with steps of local search algorithms. MetaMax and MetaMax(∞)
are described in algorithms 5 and 4 respectively, also using the simplied notation.

Algorithm 4: MetaMax(
input

∞)


f,


h

with

properties as given in equation 3

1
2
3

r←1
while


do

Add a new local search algorithm instance

Ar ,

step it once and initialize

vr , xr and fr
For i = 1, . . . , r , select algorithm Ai if there exists c 0 so that:
fi + ch(vi ) fj + ch(vj ) for all j = 1, . . . , r so that (vi , fi ) = (vj , fj ). If
its variables

4

of them at random.

5
6
7
8
9

Step each selected

Ai



vi , xi

and

fi .

b = argmin1,...,r (fi ).
x∗ ← x b .

Update the best solution:

r ←r+1
return

x∗

MetaMax and MetaMax(∞) dier only in one point (lines 6 and 7 in algorithm
5): If, after stepping all selected instances, the best instance is a dierent one than
in the previous round, MetaMax will step it until it overtakes the old best instance
in terms of used resources.
In [GK11] it is shown that MetaMax asymptotically approaches the performance
of its best local search algorithm instance as the number of rounds increases. Theoretical analysis suggests that the number of instances increases at a rate of
where

√
Ω( vt ),

vt

is the total number of used function evaluations. However, practical results
vt
Ω( logvt ). Based on this, it can also be estimated ([GK11],
page 439) that to nd the global optimum xopt , MetaMax needs only a logarithmic
give a rate of growth only of

factor more function evaluations than a local search algorithm instance, which would
start in the basin of attraction of

xopt .

Note a small dierence in the way MetaMax and MetaMax(∞) are described
in algorithms 5 and 4 from their descriptions in [GK11]. There, a new algorithm
instance

Ar

is added with

fr = 0

and

sr = 0

and takes at most one step during the

round that it is added. This is possible because in [GK11] a non-negative objective
function

f

and a maximization task are assumed. Therefore, an algorithm instance

12

Algorithm 5: MetaMax
input


f,


h

with

properties as given in equation 3

1
2
3

r←1
while


do

Ar , step it once and initialize
vr , xr and fr
For i = 1, . . . , r , select algorithm Ai if there exists c 0 so that:
fi + ch(vi ) fj + ch(vj ) for all j = 1, . . . , r so that (vi , fi ) = (vj , fj ). If
Add a new local search algorithm instance
its variables

4

of them at random.

5
6
7
8
9
10

Step each selected

Ai


vi , xi

and

fi .

br = argmin1,...,r (fi ).
If br = br−1 step instance Abr until vbr ≥ vbr−1
∗
Update the best solution: x ← xb .
r ←r+1


return

x∗

can be added without taking any steps rst, and assigned a function value

fr = 0,

which is guaranteed to not be better than any of the function values of the other
instances.
We are, however, dealing with a minimization problem with a known target value
(see [Han+13b]) but no upper bound on
of

f.

f

and, consequently, no worst possible value

Therefore, we made a little change and step the new instance

Ar

immediately

after it is added. It can then also be stepped second time, during step 4 in algorithms
5 and 4. We believe, that this has no signicant impact on performance.

3.1 Suggested modications
MetaMax and MetaMax(∞) will add a new instance each round as long as
they are running, with no limit on the maximum number of instances. The authors of [GK11] state that the worst-case computational overhead of MetaMax and
2
MetaMax(∞) is O(r ), where r is the number of rounds. For the purpose of optimizing functions, where each function evaluation uses up a large amount of computational time (for which MetaMax was primarily designed), the overhead will be
negligible compared to the time spent calculating function values and will not present
a signicant problem. However, in comparison with restart strategies which have typically almost no overhead this is still a disadvantage for MetaMax. Therefore, it would
be desirable to come up with some mechanism that would improve its computational
complexity.
An obvious solution would be to limit the total number of instances which can
be added or slow down the rate at which they are added so that there will never be
too many of them. However, this would make MetaMax and MetaMax(∞) behave
basically in the same way as MetaMax(k) and lose their main property, which is the

13

consistency based on always generating new instances.
A better solution would be to add a mechanism which would discard one of
already existing instances every time a new one is added and therefore keep the total
number of instances at any given time constant. The important question is: Which
one of the existing instances should be discarded?
We propose the following approach: Discard the instance which has not been
selected for the longest time. If there are multiple instances which qualify, discard the
one with the worst function value. The rationale behind this discarding mechanism
is that MetaMax most often selects (allocates the most resources to) those instances,
which have the best optimistic estimate of convergence. Therefore, the instances
which are selected the least often will likely not give very good results in the future,
and so make good candidates for deletion. An alternative, method may also be to
discard the absolute worst instance (in terms of the best objective function value
found so far). Which is even simpler, but we feel that it does not follow so naturally
from the principles behind MetaMax. Therefore, for most of our experiments we will
use the discarding of least selected instances.
Another area where we think it might be benecial to modify the workings of
MetaMax, is the mechanism of selecting instances to be stepped in each round.
The original mechanism has two possible disadvantages: Firstly, it is not invariant
to monotone transformation of the objective function values. By this we mean a

f (x) → f (x), which itself is only a function of the value of f (x) and not
the parameter vector x. The monotone property meaning, that if f (x1 ) f (x2 ) then
f (x1 ) f (x2 ) for all possible x1 and x2 . Such a monotone transformation will not
change the location of the optima of f (x). I will also not change the direction of
gradient of f (x) for any x, but not necessarily its magnitude. An example of such
mapping

transformation is given in gure 2.
Logically, it would not make much sense to require an optimization algorithm to
be invariant to an objective function value transformation, which is not monotone,
as it could change the position of the functions optima.
The second possible disadvantage of the convex hull based instance selection
mechanism is that it also behaves dierently based on the choice of the estimate
function

h.

given, while

This is not such a great disadvantage as the rst one, because

h

f (x)

is

can be chosen freely. However, it would still be benecial if we could

entirely remove the need to choose

h.

To overcome these problems, we propose a new instance selection mechanism. It
uses the same representation of local search algorithm instances as a set of points

P,

given in equation 4 but it select those instances, which correspond to non-dominated
in the sense of maximizing fi and maximizing h(vi ) (or analogically
fi and minimizing vi ). This method is clearly invariant to both monotone
transformation of objective function values f → f and dierent choices of h, as
determining non-dominated points depends only on their ordering along the axes fi
and h(xi ), which will always be preserved due to the fact that both f → f and h
are monotone. Moreover, the points which lie on the right upper convex hull of P ,
and thus maximize the optimistic estimate fi + ch(vi ), are always non-dominated,
points of

P

maximizing

and thus will always be selected.

14

1
3 2 1
10
0 1 2
2
3 3

2

15
10
5
30

1
3 2 1
10
0 1 2
2
3 3

2

2

1

1

0

0

1

1

2

4000
3000
2000
1000
0
23

2

3

3

2

1

0

1

3

2

3

2

1

1

0

Figure 2: Example of monotone transformation of

2

f (x)

Displays a 3D mesh plot of a Rastrigin like function f (x) in the top left, a
transformed function f (x)3 in the top right and their respective contour
plots on the bottom. It is clear, that the shape of the contours is the
same, but their heights are not.
A possible disadvantage of the proposed algorithm is, that at each round it selects many more points than the original convex hull mechanism. This might result
in selecting instances with low convergence estimate too often, and not dedicating
enough resources to the more promising ones. A visual comparison of the two selection mechanisms and demonstration of the inuence of choice of estimate function
upon selection are presented in gure 3.

15

fi
fi3

0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
0
1
2
3
4
5

1e8

1e25

0.1 0.2 0.3 0.4
h(vi )

0.1 0.2 0.3 0.4
h(vi )

Figure 3: MetaMax selection mechanisms

Compares the original selection mechanism based on nding upper convex hull (left sub-gures), with the new proposed mechanism based on
selecting non-dominated points (right sub-gures). Also demonstrates the
eects of monotone transformation of the objective function values on the
selection, with f (x) for the upper sub-gures and f (x)3 for those on the
bottom. Selected points are marked as red diamonds, connected by a red
line. Unselected points are marked as lled black circles.

4

Experimental setup
All of the experiments were conducted using the COCO (Comparing continuous

optimizers) framework [Han13a], which is an open-source set of tools for systematic
evaluation and comparison of real-parameter optimization strategies. It provides a set
of 24 benchmark functions of dierent types, chosen to thoroughly test the limits and
capabilities of optimization algorithms. Also included are tools for running experiments on these functions and logging, processing and visualising the measured data.
The library for running experiments is provided in versions for C, Java, R, Matlab
and Python. The post processing part of the framework is available for Python only.
The benchmark functions are divided into 6 groups, according to their properties.
They are briey described in table 1. For detailed description, please see [Han+13a].
There are also multiple instances dened for each function, which are created by
applying various transformations to the base formula.
We shall now briey explain some of the functions properties mentioned in table
1. As already mentioned, the terms unimodal and multimodal refer to functions with

16

Name

separ
lcond
hcond
multi
mult2

Functions

Description

1-5

Separable functions

6-9

Functions with low or moderate conditionality

10-14

Unimodal functions with high conditionality

15-19

Multimodal structured functions

20-24

Multimodal functions with weak global structure
Table 1: Benchmark function groups

single optimum and multiple local optima respectively.
Conditionality describes how much the functions gradient changes depending on
direction. Simply put, functions with high conditionality (also called ill-conditioned
functions), at certain points, grow rapidly in some directions but slowly in others.
This often means that the gradient points away from the local optimum, which
presents a dicult problem for some local search algorithms. To give a more visual
description, one can imagine that 3D graphs of two-dimensional ill conditioned functions usually form sharp ridges, while those of well conditioned functions form gentle
round hills.

f (x1 , x2 , . . . , xd ) = f (x1 ) + f (x2 ) +
...+f (xd ), which means that they can be minimized by minimizing d one-dimensional
functions, where d is the number of dimensions of the separable function.
Separable functions have the following form:

In order to exhaustively evaluate performance of the selected strategies, we decided to make the following series of measurements for each strategy:
1. Using four dierent local search algorithms - Compass search, Nelder-Mead
method, BFGS and CMA-ES. In order to evaluate the eect of algorithm
choice.
2. Using all of the 24 noiseless benchmark functions available in the COCO framework, to measure performance on a wide variety of dierent problems.
3. Using the following dimensionalities : d = 2, 3, 5, 10, 20. To see how much is
the performance aected by the number of dimensions.
4. Using the rst fteen instances of each function. According to [Han+13b], this
number is sucient to provide statistically sound data.
Resource budget for minimizing a single function instance (a single trial) was set to
105 d, meaning 100000 times the number of dimensions of the instance.
The reasons for choosing the four local search algorithms are: Compass search
algorithm was chosen for its simplicity, in order to allow us to evaluate whether
MetaMax can improve performance of such a basic algorithm. Nelder-Mead method
was chosen as a more sophisticated representative of the group of pattern search
algorithms, than compass search. BFGS was selected as a typical line search method.
Finally, CMA-ES is there to represent population based algorithms. It is also the most
advanced of the four algorithms and thus we expect that it will perform the best
of the four selected algorithm. For a more detailed description of these algorithms,
please see section A.

17

4.1 Used multi-start strategies
In this section, we describe the selected MetaMax and restart strategies, which
were evaluated using the methods described above. For convenience, we assigned a
shorthand name to each used strategy, so that we can write, for example csa-h-10d,
instead of objective function stagnation based restart strategy with history length
10d using compass search algorithm, which is impractically verbose. The shorthand
names have the following form: abbreviation of the used local search algorithm, dash,
used multi-start strategy, dash, strategy parameters. A list of all used strategies and
their shorthand names is given in table 3.
We chose two commonly used restart strategies to compare with MetaMax: a xed
restart strategy with a set number of resources allocated to each local search algorithm run, and a dynamic restart strategy with restart condition based on objective
function value stagnation.
Performance of these two strategies largely depends on the combination of the
problem being solved and the strategy parameters. Therefore, we decided to use six
xed restart strategies and six dynamic restart function stagnation strategies with
dierent parameters:

•

Fixed restart strategies

Run lengths:

nf

= 100d, 200d, 500d, 1000d, 2000d, 5000d evaluations.

Shorthand names:

•

algorithm-f-nf

Function value stagnation restart strategies

hf =
tf = 10−10
algorithm-h-hf

Function value history lengths:

2d, 5d, 10d, 20d, 50d, 100d evaluations

Function value tolerance:
Shorthand names:

Note, that the parameters depend on the number of dimensions of the measured
function d. This is consistent with the fact that the total resource budget of the
strategy also depends on d and that we can expect that for higher dimensionalities,
the used local search algorithms will need longer runs to converge.
The rationale behind choosing the used parameter values is the following: Using
5
the function evaluation budget of 10 d, run lengths longer than 5000d would give us
less than 20 restarts per trial. This would result in a very low chance of nding the
global optimum on most of the benchmark functions, some of which can have up to
10d optima. Also, it is probable that most local search algorithms will converge a
long time before using up all 50000d function evaluations and then the rest of the
allocated resources would be essentially wasted on running an instance which cannot
improve any more. Conversely, run lengths smaller than 100d are probably not long
enough to allow most local search algorithm instances to converge and so there would
be little sense in using them.
The choice of the upper bound of the function value history length

hf

as 100d

is based on a similar idea: For values greater than 100d the restart condition would
trigger too long after the local search algorithm has already converged, and so we
would be needlessly wasting resources on it. The choice of the lower bound of
depends on the used algorithm. For a restart strategy to function properly,

18

hf

hf

has to

be greater, or at least as much, as the number of function evaluations that the used
local search algorithm uses during one step. The above stated value of

hf

= 2d is

the minimal value for which the Nelder-Mead and BFGS algorithms work properly.
For the other two algorithms, the minimal value is

hf

= 5d. We decided to base the

function value history length on number of used function evaluations, rather than on
number of taken steps, because it allows for a more direct comparison of performance
of the same strategy using two dierent algorithms.
Choosing the value of the function stagnation tolerance

tf

involved a little bit

more guesswork. There is target function value dened for all of the benchmark
functions, which is equal to the function value at their global optimum f (xopt ) plus
−8
a tolerance value ftol = 10 . That is, the function instance is considered to be
solved if we nd some point

x

f (x) ≤ f (xopt ) + ftol . We
tf = 10−10 on ftol .

with

the function stagnation tolerance parameter

tf

one hundred times lower than

ftol

based our choice of
Setting the value of

should make it large enough to reliably detect

convergence, while not being too large to trigger the reset condition prematurely,
when the local search algorithm is still converging.
The goal of using multiple strategies with dierent parameter values is to have
at least one xed restart and one objective function value stagnation based strategy,
that performs well on the set of all functions, for each measured dimensionality.
For easier comparison of results of the xed restart strategies, we represent them
all together, by choosing only the results of the best performing strategies for each
dimensionality and collecting them into a best of collection of results, which we
will refer to by the shorthand name

algorithm-f-comb.

This represents the results

of running a xed restart strategy, which is able to choose the optimal run length
(from the set of six used run lengths), based on dimensionality of the function being
solved. The results of objective function value stagnation strategies are represented
in an analogous way, under the name

algorithm-h-comb.

Besides the already mentioned restart strategies, we decided to add four more,
each based on a restart condition specic to one of the used local search algorithms.
Shorthand names for these strategies are

algorithm-special.

They are described

in table 2.
In order to save computing time, and as per recommendation in [Han+13b], we
used an additional termination criterion that halts the execution of a restart strategy
after 100 restarts, even if the resource budget has not yet been exhausted and the
solution has not been found. This does not impact the accuracy of the measurements,
as 100 restarts is enough to provide statistically signicant amount of data and the
metrics which we use (see subsection 4.2) are not biased against results of runs
which did not use up the entire resource budget. In fact, the xed restart strategies

f-100d, f-200d

and

f-500d

always reach 100 restarts before they can fully exhaust

their resource budgets.
The idea of using the original pure versions of MetaMax and MetaMax(∞)
algorithms, which keep adding local search algorithm instances without limit, proved
to be impractical due to its excessive computational resource requirements (for the
length of experiments that were planned). Therefore, we performed measurements
using only the modied versions of MetaMax and MetaMax(∞) with the added

19

Algorithm
Compass search

Description

Restart when the variable

a,

which aects how far from the cur-

rent solution the algorithm generates neighbour solutions, decreases
−10
below 10
. It naturally decreases as the algorithm converges, so
checking its value makes for a good restart condition.
Nelder-Mead

We chose a similar condition to the one mentioned above. Restart
is triggered when distance between the two points of the simplex
−10
which are the farthest apart from each other decreases below 10
.
The rationale is similar as above: the simplex keeps growing smaller
as the algorithm converges. It might be more mathematically proper
to check the area (or volume, or hyper-volume, depending on the
dimensionality) of the simplex, but we discarded this idea out of
concern that it might be too computationally intensive.

BFGS

Restart condition is triggered if the norm of the gradient is smaller
−10
than 10
. Since the algorithm already uses information about the
gradient, it makes sense also to use it for detecting convergence.

CMA-ES

The recommended settings for CMA-ES given in [Han11] suggest
using 9 dierent restart conditions. Here we use these recommended
settings. Note that when using CMA-ES with the other restart
strategies, we use only a single restart condition and the additional
ones are disabled. In a sense, we are not using the algorithm to its
full potential, but this allows for a more direct comparison with
other local search algorithms.
Table 2: Algorithm specic restart strategies

mechanism (described in subsection 3.1) for limiting maximum number of instances.
For all MetaMax strategies, we used the recommended form of estimate function:
h = e−vi /vt . Measurements were performed using the following MetaMax strategies:
1. MetaMax(k), with k=20, k=50 and k=100. This gives the same total number
of local search algorithm instances as when using xed restart strategies with
run lengths equal to 5000d, 2000d and 1000d respectively. This makes it possible to evaluate the degree to which the MetaMax mechanism of selecting the
most promising instances improves the performance over these corresponding
restart strategies. The expectation is, that the success rate for MetaMax(k) will
not increase, because the number of instances and thus the ability to explore
the search space stays the same. However, MetaMax(k) should converge faster
than the xed restart strategies, because it should be able to identify the best
instances and allocate resources to them appropriately.
2. MetaMax and MetaMax(∞) with the maximum number of instances set to
100. This should allow us to asses the benets of the mechanism of adding new
instances (and deleting old ones), by comparing the results with MetaMax(k),
which uses the same number of instances each round, but does not add or
delete any. Here, we would expect an increase in success rate on multimodal

20

functions, as the additional instances ,generated each round, should allow the
algorithms to explore the search space more thoroughly. However, the limit of
100 instances will possibly still not be enough to get a good success rate for
multimodal problems with high dimensionality.
3. MetaMax and MetaMax(∞) with maximum number of instances set to 50d.
This should allow the algorithms to scale better with the number of dimensions and, hopefully, further improve their performance. The number 50d was
chosen as a reasonable compromise between computation time and expected
performance. We expect to get the best results here.

algorithm-k-X for MetaMax(k), algorithm-m-X for MetaMax and algorithm-i-X for MetaMax(∞), where
X is the maximum allowed number of instances (or, equivalently, the value of k for
Shorthand names for MetaMax variants were chosen as

MetaMax(k)).

Fixed restart strategies

f-100d
f-200d
f-500d
f-1000d
f-2000d
f-5000d
f-comb

Run length = 100d evaluations
Combined xed restart strategy
Function value stagnation restart strategies

h-2d
h-5d
h-10d
h-20d
h-50d
h-100d
h-comb

History length = 2d evaluations
Combined function value stagnation restart strategy
Other restart strategies

special

Special restart strategy specic to each algorithm, see table 2
MetaMax variants

k-20
k-50
k-100
k-50d
m-100
m-50d
i-100
i-50d

MetaMax(k) with k=20
MetaMax(k) with k=50d
MetaMax with maximum number of instances = 100
MetaMax with maximum number of instances = 50d
MetaMax(∞) with maximum number of instances = 100
MetaMax(∞) with maximum number of instances = 50d
Table 3: Tested multi-start strategies

21

There is a number of additional interesting aspects of the MetaMax variants,
which would be worth testing and evaluating. For example:

•

Comparison of MetaMax and MetaMax(∞) with the limit on maximum number of instances and without it.

•

Performance of dierent methods of discarding old instances.

•

Inuence of dierent choices of estimate function on performance.

•

Performance of our proposed alternative method for selecting instances.

However, it was not practically possible (mainly time-wise) to perform full sized
5
(10 d function evaluatoin budget) experiments which would test all of these features.
Therefore, we decided to make a series of smaller measurements, with the maximum
number of function evaluations per trial set to 5000d, using only dimensionalities
d=5, d=10 and d=20 and using only the BFGS algorithm. This should allow us to
test these features at least in a limited way and see if any of them warrant further
attention. More specically, we made the following series of measurements:
1. MetaMax and MetaMax(∞) without limit on maximum number of instances
2. MetaMax and MetaMax(∞) with maximum instance limit 5d, 10d and 20d,
discarding the most inactive instances
3. MetaMax and MetaMax(∞) with maximum instance limit 5d, 10d and 20d,
discarding the worst instances
4. MetaMax(k) with k=5d, k=10d and k=20d
These measurements were repeated three times, rst time using the recommended
−v /v
form of the estimate function h1 (vi , vt ) = e i t , second time with a simplied
vi
function h(vi ) = e and the third time using the proposed alternate instance selection
method, based on selecting non-dominated points.

4.2 Used metrics
In this section, we describe the various metrics that were used to compare results
of dierent strategies. The simplest one is the success rate. For a set of trials

U

(usually of one strategy running on one or more benchmark functions) and a chosen
target value

t,

it can be dened as:

SR(U, t) =

(6)

|u : u ∈ U, fbest (u) ≤ t| is the number of trials
which have found a solution at least as good as t. In the rest of this text we use a
mean success rate, averaged over a set of target values T
Where

|U |

|{u ∈ U : fbest (u) ≤ t}|
|U |

is the number of trials and

SRm (U, T ) =

1
|T |
22

SR(U, t)
t∈T

(7)

The main metric used in the COCO framework is the expected running time,
or ERT. It estimates the expected number of function evaluations that a selected
strategy will take to reach a target function value
trials

U.

for the rst time, over a set of

It is dened as:

ERT (U, t) =
Where

t

evals(u, t)

1
evals(u, t)
|{u ∈ U : fbest (u) ≤ t}| u∈U

(8)

u to reach
u if it never reached t. Expression
successful trials for target t. If there were

is the number of function evaluations used by trial

target t, or the total number of evaluations used by

|{u ∈ U : fbest (U ) ≤ t}| is the number of
no such trials, then ERT (U, t) = ∞. In the rest o this text we will use ERT averaged
over a set of target values T , in a similar way to what is described in equation 7. We
will also usually compute it using a set trials obtained by running the same strategy
on multiple dierent functions, usually all functions in one of the function groups
described in table 1.

For comparing two or more strategies in terms of success rates and expected
running times, we use graphs of the empirical cumulative distributive function of
run lengths, or ECDF. Such a graph displays on the y-axis the percentage of trials
for which ERT (averaged over a set of target values
evaluations

x,

that for each

where

x

x

T)

is lower than the number of

is the corresponding value on the x-axis. It can also be said,

it shows the expected average success rate, if a function evaluation

budget equal to

x

was used. For easier comparison of ECDF graphs across dierent

dimensionalities, the values on the x-axis are divided by the number of dimensions.
The function displayed in the graph can then be dened as:

y(x) =

1
|{t ∈ T : ERT (t, u) ≤ x}|
d|T ||U | u∈U

(9)

An example ECDF graph, like ones that are used throughout the rest of the text,
is given in gure 4. It shows ERTs of two sets of trials measured by running two
dierent strategies on the set of all benchmark functions, for d=10 and averaged
over a set of 50 target values. The target values are logarithmically distributed in
−8
2
the interval 10 ; 10 . We use this same set of target values in all our ECDF
graphs.
The marker

× denotes the median number of function evaluations of unsuccessful

trials, divided by the number of dimensions. Values to the right of this marker are
(mostly) estimated using bootstrapping (for details of the bootstrapping method,
please refer to [Han+13b]). The fact that we use 15 trials for each strategy-function
pair means, that the estimate is reliable only up to about fteen times the number of
evaluations marked by

×. This is a fact that should be kept in mind when evaluating

the results. The thick orange line in the plot is represents the best results obtained
during the 2009 BBOB workshop for the same set of problems and is provided for
reference.
Since we are dealing with a very large amount of measured results, it would be
desirable to have a method of comparing them, that is even more concise than ECDF
graphs. To this end, we use metric called aggregate performance index (API), dened

23

Proportion of trials

1.0 f1-24,10-D
best 2009
0.8
0.6
bfgs-k-100
0.4
0.2
nm-k-100
0.00 1 2 3 4 5 6 7 8
log10evaluations/D
Figure 4: Example ECDF graph

Comparison of the results of MetaMax(k), with k=100, using BFGS and
Nelder-Mead local search algorithms, on the set of all benchmark functions. The strategy using BFGS clearly outperforms the other one, both
in terms of success rate and speed of convergence.
by Mr. Po²ík in a yet unpublished (at the time of writing this text) article [Po²13]. It
is based on the idea that the ECDF graph of the results of an ideal strategy, which
solves the given problem instantly, would be a straight horizontal line across the top
of the plot. Conversely, for the worst possible strategy imaginable, the graph would
be a straight line along the bottom. It is apparent, that the area above (or bellow)
the graph makes for quite a natural measure of eectiveness of dierent strategies.
Given a set of ERTs

A,

their aggregate performance index can be computed as:

AP I(A) = exp

1
log a
|A| a∈A 10

(10)

For the purposes of computing API, the ERTs of unsuccessful trials which are by
denition

∞

have to be replaced with a value that is higher than ERT of any suc-

cessful trial. The choice of this value determines how much the unsuccessful trials
are penalized and thus aects the nal ERT score. For our purposes, we chose the
8
value 10 d.
Since we are computing API from the area above the graph this means that the
lower its value the better the corresponding strategy performs. Using API essentially
allows us to represent results of a set of trials by a single number and to easily
compare performances of dierent optimization strategies.

4.3 Implementation details
The software side of this project was implemented mostly in Python, with parts
in C. The original plan was to write the project purely in Python, which was cho-

24

sen because of its ease of use and availability of many open-source scientic and
mathematical libraries. However, during the project it was found out that a pure
Python code performs too slowly and would not allow us to make all the necessary
measurements. Therefore, parts of the program had to be changed over to C, which
has improved performance to a reasonable level.
The used implementations of BFGS and Nelder-Mead algorithms, are based on
the code from open-source Scipy library. They were modied to allow running the
algorithms in single steps, which is necessary in order for them to work with MetaMax. An open-source implementation of CMA-ES was used, available at [Han13b].
Implementation of MetaMax was written based on its description in [GK11]. It was
however, necessary to make several little changes to it mainly because it is designed
with a maximization task in mind but we needed to use it for minimization problems.
For nding upper convex hulls we used Andrew's algorithm with some additional pre
and post processing, to get the exact behaviour described in [GK11].
For description of the source code, please see the le

source/readme.txt on the

attached CD.

5

Results
In this section we will evaluate results of the selected multi-start strategies. We

decided to split the results into four subsections based on used local search algorithm. We present and compare the results mainly using tables, which list APIs and
success rates for dierent groups of functions and dierent dimensionalities. For convenience, the best results are highlighted with bold text. We also show ECDF graphs
to illustrate particularly interesting results. Results of the smaller experiments, described at the end of section 4.1 and results of timing measurements are summarized
in subsection 5.5.
The values of success rates and APIs, which are shown in this section, are com5
puted only using data bootstrapped up to the value of 10 d function evaluations.
In our opinion, these values represent the real performance of the selected strategies
better than if we were to use fully bootstrapped data, which are estimated to a large
degree and therefore not so statistically reliable. In ECDF graphs, bootstrapped re7
sults are shown up to 10 d evaluations. All of the APIs and success rates are averaged
over a set of multiple targets, as described in subsection 4.3.
The measured data are provided in their entirety on the attached CD (see section
B) in the form of tarballed Python pickle les, which can be processed using the
BBOB post processing framework. It was not possible to provide the data in their
original form, as text les, because their total size would be in the order of gigabytes,
which would clearly not t on the attached medium.

5.1 Compass search
Table 4 summarizes which of the used xed restart and function value stagnation
restart strategies were best for each dimensionality and chosen for the best-of result
collections

cs-f-comb

and

cs-h-comb.

Table 5 then compares these two sets of re-

25

sults together with results obtained by the compass search specic restart strategy

cs-special.
It is apparent, that for the best strategies the values of run length and function
value history length increase with the number of dimensions. This is not unexpected
as compass search uses 2d or 2d-1 function evaluations at each step.
Dimensionality

d=3
d=5
d=10
d=20

Stagnation based

cs-f-100d
cs-f-100d
cs-f-200d
cs-f-500d
cs-f-500d

d=2

Fixed

cs-h-5d
cs-h-5d
cs-h-10d
cs-h-10d
cs-h-20d

Table 4: Compass search - best restart strategies for each dimensionality
The comparison of the best restart strategies suggests, that all of them have quite

cs-h-comb being a little better than the others in
cs-f-comb in terms of API. In the subsequent tables, we
cs-f-comb for reference, as an exaple of a well tuned restart

similar overall performance, with
terms of success rate and
will provide results of
strategy.

None of the strategies performs very well on multimodal and highly conditioned
functions. This is to be expected, as the compass search algorithm is known to have
trouble with ill conditioned problems and multimodal problems are dicult to solve

mult2

multi

hcond

lcond

separ

all

for any algorithm.

cs-f-comb
cs-h-comb
cs-special
cs-f-comb
cs-h-comb
cs-special
cs-f-comb
cs-h-comb
cs-special
cs-f-comb
cs-h-comb
cs-special
cs-f-comb
cs-h-comb
cs-special
cs-f-comb
cs-h-comb
cs-special

2D

3D

3.75

4.46

3.82
4.16

2.58

2.69
2.63
3.84

3.49

4.69
4.92

3.03

3.11
3.09

4.18

4.22

log10 API

5D

5.52

5.62
5.80
4.33

4.03

4.32
5.34

10D
6.31

20D
6.90

Success rate [%]
2D 3D 5D 10D 20D
85
74
53
41
34

6.28

6.86

6.91
5.50

85

74

100

100

48
69

4.86

5.43

100

100

84

5.43

100

100

100

84
98

6.39
4.93
4.95
6.35

7.44

5.15

5.97

6.99

56

44

43

66

39

37
62

66

63

66

63

84

72

51

69
63

50

4.18

5.38

100

100

72

5.38

5.99

6.63

7.03

82

52

43

35

33

5.17

6.59

7.34

7.80

80

26

10

44
74

18
17
66

14

76

70

52

4.10

4.71
3.86

3.36

3.89

5.21
6.08

4.48

4.74
4.89

6.50
6.45

6.95
6.99

5.31

5.38
5.76

6.91
6.89

7.53
7.61
6.30

6.08

6.20

7.31
7.30

7.98
7.96
6.85

6.63

6.65

47
37
79
72
80

100

85

35
32
63

66

72

29
31

55

Table 5: Compass search - results of restart strategies

26

64

26

3.63

6.07
6.20

7.26

84

68

4.19

5.40
5.83
4.29

6.27

78

28
28

11
10
42
51

40

26
26
7
8
36

50
50

Comparison of the results of three MetaMax(k) strategies with corresponding
xed restart strategies which use the same total number of local search algorithm
instances is given in table 6. They conrm our expectations, and show that, overall,
MetaMax(k) converges faster than a comparable xed restart strategy. The only
exception being the group

separ.

This can be explained by the fact that functions

from this group are very simple and can be generally solved by a single, or only very
few, runs of the local search algorithm. In this case, the MetaMax mechanism of
selecting multiple instances each round is more of a hindrance than a benet.
In terms of success rate, MetaMax(k) is always as good or even better than the
comparable xed restart strategy, with the improvement being especially obvious on
the groups

lcond

and

mult2.

Of the three tested variants of MetaMax(k),

cs-m-100

ever, it is not better than a well tuned restart strategy like

is the best overall. How-

cs-f-comb.

Figure 5 shows a behaviour which was observed across all function groups and dimensionalities when comparing MetaMax(k) with corresponding xed restart strategies: At rst, MetaMax(k) converges much more slowly than the restart strategy, as
it is still in the phase of initialising all of its instances. However, as soon as this is
nished, it starts converging quickly and overtakes the restart strategy for a certain
interval. After that, its rate of convergence slows down again and it ends up with
5
success rate (for 10 d function evaluations) similar to that of the restart strategy.


This eect seems to get less pronounced with increasing number of dimensions.

1.0 f1-24,5-D
best 2009
0.8
0.6
cs-f-2000d
0.4
0.2
cs-k-50
0.00 1 2 3 4 5 6 7 8
log10evaluations/D

Figure 5: Compass search - ECDF comparing MetaMax(k) with an equivalent xed
restart strategy
Results of comparing

cs-k-100, cs-m-100

and

cs-i-100

are shown in table 7.

It is apparent, that using the same number of instances at a time MetaMax and
MetaMax(∞) clearly outperform MetaMax(k) on all function groups, both in terms
of speed of convergence and success rates.
In general, they also provide results at least as good, or better than the best

27

all
separ
lcond
hcond
multi
mult2

cs-f-1000d
cs-f-2000d
cs-f-5000d
cs-k-20
cs-k-50
cs-k-100
cs-f-comb
cs-f-1000d
cs-f-2000d
cs-f-5000d
cs-k-20
cs-k-50
cs-k-100
cs-f-comb
cs-f-1000d
cs-f-2000d
cs-f-5000d
cs-k-20
cs-k-50
cs-k-100
cs-f-comb
cs-f-1000d
cs-f-2000d
cs-f-5000d
cs-k-20
cs-k-50
cs-k-100
cs-f-comb
cs-f-1000d
cs-f-2000d
cs-f-5000d
cs-k-20
cs-k-50
cs-k-100
cs-f-comb
cs-f-1000d
cs-f-2000d
cs-f-5000d
cs-k-20
cs-k-50
cs-k-100
cs-f-comb

2D
4.60
4.82
5.18
4.55
4.16
4.00

3D
5.20
5.45
5.71
5.30
5.01
4.84

3.75

4.46

3.28
3.54
4.08
3.38
2.82
2.68

2.58

4.56
4.75
4.83
4.19

3.83

3.85
3.84
5.32
5.56
5.97
5.32
5.14
5.03

4.19

5.26
5.44
5.94
5.27
4.80
4.58

4.29

4.59
4.79
5.00
4.51
4.12

3.82

3.86

3.72
4.22
4.74
4.16
3.88
3.85

3.03

5.22
5.37
5.38
5.12
4.98
4.54

log10 API

5D
5.85
5.94
6.08
5.87
5.75
5.62

10D
6.36
6.46
6.59
6.42
6.37
6.29

4.58
4.61
4.86
4.62
4.63
4.71

6.31
5.01
5.08
5.18
5.21
5.24
5.29

4.33

4.93

5.52

5.75
5.84
5.96
5.70
5.58
5.45

6.33
6.45
6.53

6.27

7.11

7.03

4.18

5.34

6.29
6.39
6.52
6.35
6.23
6.13

5.38

5.99

6.63

6.18
6.34
6.51
6.15
5.93
5.81

6.81
6.87
7.00
6.76
6.62

6.47

4.99
5.23
5.66
5.09
4.48

6.59
5.80
5.94
6.05
5.88
5.66

4.25

5.30

5.17

4.48

5.31

6.90
5.44

5.49
5.51
5.75
5.81
5.80
5.50
7.42
7.36
7.48

6.28
6.28
6.35
6.66
6.89
6.93
6.83
6.79
6.70

5.92
6.06
6.21
5.95
5.77
5.67

20D
7.01
6.96
7.13
6.99
6.98
6.97

7.38
7.44
7.48
7.34
7.26

7.23

7.34
6.43
6.42
6.80
6.40
6.24

5.94

6.30

7.22
7.29
7.44
7.14
7.23
7.32
7.27
7.21
7.17
7.82
7.85
7.91
7.76
7.74

7.70

7.80
7.40
6.94
7.48
7.06
6.96
6.94

6.85

Success rate [%]
3D 5D 10D 20D
61 46
41
30
55 45
39 35
48 43
36
28
52 45
41 35
58 47
42 35
62 49 43 35
85
74
53
41
34
100
100
68
65 63
100
84 68
65 63
100
68 67
65
62
84
68 67
65
62
100
84 68
65 63
100
84 68 66 63
100
100
69
66
62
80
60 54
50
27
78
56 52
44
31
82
57 50
46
26
86
58 55
50 41
88
63 57
50
37
88
78 62 52
34
84 84 63
50
26
55
40 37 35
30
52
37 33
27
28
40
35 30
27
26
47
36 33
29
28
52
42 36
31
31
56
45 40
34 33
2D
74
72
67
67
74
75

82

62
59
45
50
62
62

80

72
70
70
70
72
73

80

52

35
31
26
27
31
34

63

70
69
54
69
70
70

74

43

22
21
17
18
21
25

26

52
52
52
52
54
54

66

35
14

13
12
12
13

33
10
10

9
9

10

14

10

14

10

44
47
34
50
50

51

42

18

41

18
34
34
34
36

Table 6: Compass search - results of MetaMax(k) and corresponding xed restart
strategies

28

restart strategies. There is almost no dierence between the performance of
and

cs-i-100,

cs-m-100

which corresponds with results presented in [GK11]. Dierences in

performance seem to diminish with increasing dimensionality and, for d=10 and
d=20. all of the MetaMax strategies which use 100 instances perform almost the
same.
ECDF graph in gure 6 shows an interesting behaviour, where

cs-i-100

start converging right away and overtake

cs-k-100

cs-m-100

and

while it is still in the

process of initializing all of its instances. After that, MetaMax(k) catches up and for
a certain interval all of the strategies perform the same. Then, MetaMax(k) stops
converging, the other two strategies overtake it again and ultimately achieve better
success rates. The sudden stop in MetaMax(k) convergence presumably happens
when all of its best instances have already found their local optima, after which
there is no possibility of nding better solutions without adding new instances, which


MetaMax(k) cannot do.

1.0 f1-24,5-D
best 2009
0.8
cs-m-100
0.6
0.4
cs-i-100
0.2
cs-k-100
0.00 1 2 3 4 5 6 7 8
log10evaluations/D

Figure 6: Compass search - ECDF of MetaMax variants using 100 instances
In the next set of measurements using

cs-m-50d, cs-i-50d

and

cs-k-50d,

it

became apparent that the increased limit on maximum number of instances does not
cause any noticeable increase in performance for MetaMax and MetaMax(∞). The
performance of MetaMax(k) was somewhat improved, but overall it is still worse than
the other two MetaMax variants and slightly worse than the best restart strategies.
These results are also presented in table 7.

cs-k-50d and cs-m50d compared with
the collection of best xed restart strategy results cs-f-comb. We have omitted
cs-i-50d as its performance is very similar to that of cs-m-50d.
ECDF graph in gure 7 shows results of

In conclusion, we can say that using the compass search algorithm MetaMax
and MetaMax(∞) perform better than even well tuned restart strategies, and that
increasing the maximum number of allowed instances does not have any signicant
eect on their performance.

29

all
separ
lcond
hcond
multi
mult2

cs-k-100
cs-m-100
cs-i-100
cs-k-50d
cs-m-50d
cs-i-50d
cs-f-comb
cs-k-100
cs-m-100
cs-i-100
cs-k-50d
cs-m-50d
cs-i-50d
cs-f-comb
cs-k-100
cs-m-100
cs-i-100
cs-k-50d
cs-m-50d
cs-i-50d
cs-f-comb
cs-k-100
cs-m-100
cs-i-100
cs-k-50d
cs-m-50d
cs-i-50d
cs-f-comb
cs-k-100
cs-m-100
cs-i-100
cs-k-50d
cs-m-50d
cs-i-50d
cs-f-comb
cs-k-100
cs-m-100
cs-i-100
cs-k-50d
cs-m-50d
cs-i-50d
cs-f-comb

2D
4.00
3.33

3.34
3.97
3.35
3.35
3.75
2.68
2.44

2.42

2.67
2.45
2.45
2.58
3.85

3.14

3.18
4.05
3.21
3.20
3.84
5.03

4.14

4.22
5.00
4.16
4.24
4.19
4.58
3.61

3.58

4.32
3.59

3.58

4.29
3.82
3.30
3.28
3.84
3.30

3.27

3.86

3D
4.84
4.19
4.15
4.73
4.14

4.16
4.46
3.85
3.01
3.04
3.54

2.99

3.03
3.03
4.54
3.88
3.85
4.46

3.82

3.90
4.18
5.67
5.22

5.16

5.62
5.21

5.16

5.38
5.81
4.70

4.57

5.70
4.62
4.59
5.17
4.25
4.09
4.08
4.25

4.02

4.05
4.48

log10 API

5D
5.62
5.21

5.14

5.64
5.22
5.23
5.52
4.71
4.19

3.83

4.79
4.16
4.21
4.33
5.45

5.09

5.11
5.48
5.17
5.17
5.34
6.13

5.84

5.89
6.14
5.88
5.88
5.99
6.47
6.13
6.16
6.44

6.12

6.14
6.59
5.30
4.76

4.73

5.29
4.77
4.75
5.31

10D
6.29
6.15

6.16
6.29

20D
6.97
6.88
6.87

6.31
5.29
5.22
5.27
5.49
5.23
5.20

7.01
6.88
6.88
6.90
5.80
5.85
5.78
6.14
5.82
5.87

4.93

5.50

6.15
6.15

6.28
6.14
6.10
6.19
6.09

6.06

6.35
6.70

6.53
6.53

6.67

6.53

6.58
6.63
7.23
7.09

7.29

7.32
7.32
7.48
7.33

2D
75
92

91
76
91
91
85

74
84

100

100

100

100

100

100

100

100

100

100

100

100

100

78

69
84
84
69
62

100

92

70

88
99
84

100

7.03

82

7.68

90

7.17

7.68

7.08

7.68

89
68
89

7.68

90

7.08

7.09
7.34
5.94
5.78
5.80
5.94

5.77

5.80
6.30

7.69

62

7.80
6.94
6.51

6.47

6.53
6.49
6.49
6.85

80
73

66

64

100

7.44
7.17
7.12
7.18
7.29
7.17
7.15
7.70

79

50
61
61
53
68
84

98
84
56
80
77
56
79
77

7.29

Success rate [%]
3D 5D 10D 20D
62
49
43
35
78
61 46 39
79
79

91
80
90
91
84
45
57
58
46
58

37

36

52
34
68

74

74

90

75

72
80

48

74

60

90

90

62
68
68
63
40

47
41
47
47
43
25
35
34
26

63
70
74
74
72
74

90

70

74

34
26
54

71
71

55

71
71

66

Table 7: Compass search - results of MetaMax strategies

30

46

45

39

46

37
38

46

39

41

66

34
63
63
63

66

64

66
66

66

63

66

64

56

62
34
33
34
28
32

57

35

40

35

66

52
55
56

57

50
34

40

38

40

38
35
14
19
19
16

20

19
14
51

53

52
52
52
52
42

26
33

33
32
33
34
33
10

12
12

11

12
12

10
34

50
50
50
50
50

36


1.0 f1-24,10-D
best 2009
0.8
cs-m-50d
0.6
0.4
cs-k-50d
0.2
cs-f-comb
0.00 1 2 3 4 5 6 7 8
log10evaluations/D

Figure 7: Compass search - ECDF of MetaMax variants using 50d instances

5.2 Nelder-Mead method
The best restart strategies for each dimensionality are listed in table 8 and their
results are compared in table 9.
For the xed restart strategies we see the expected behaviour, where run lengths
of the best strategies increase with the number of dimensions. However, there seem
to be only two best objective function stagnation based strategies:

nm-h-10d

and

nm-h-100d. Interestingly enough, the switch between them occurs between d=5 and
d=10, which is also the point where the overall performance of the Nelder-Mead
algorithm decreases dramatically.
Dimensionality

d=2
d=3
d=5
d=10
d=20

Fixed

nm-f-100d
nm-f-100d
nm-f-500d
nm-f-1000d
nm-f-5000d

Stagnation based

nm-h-10d
nm-h-10d
nm-h-10d
nm-h-100d
nm-h-100d

Table 8: Nelder-Mead - best restart strategies for each dimensionality
The algorithm performs very well for low number of dimensions - d=2, d=3 and
to some extent also d=5. With results for these dimensionalities approaching those
of the best algorithms from the 2009 BBOB conference. On the other hand, the
performance for higher dimensionalities is very poor, especially on the group

hcond.

The three best-of restart strategies, compared in table 9, are all quite evenly matched
with

nm-special being the best overall by a small margin and nm-f-comb being the

worst.
The comparison of MetaMax(k) with corresponding xed restart strategies, given
in table 10, shows that MetaMax(k) performs better on multimodal functions and

31

all
separ
lcond
hcond
multi
mult2

nm-f-comb
nm-h-comb
nm-special
nm-f-comb
nm-h-comb
nm-special
nm-f-comb
nm-h-comb
nm-special
nm-f-comb
nm-h-comb
nm-special
nm-f-comb
nm-h-comb
nm-special
nm-f-comb
nm-h-comb
nm-special

2D
3.03
2.91

2.95
2.55
2.51

2.46

2.54

2.35

2.45

3D
3.92
3.89
3.88
3.22

3.88
3.55
3.49
3.33

log10 API

5D
4.97

4.71

4.84
4.47

4.38

4.50
4.35

4.07

2.07
1.96
4.59

2.86
2.50

4.14
3.24
3.15

2.48

3.12

4.27

5.82

1.95

4.52
3.42

3.22

3.26

3.31

6.07
6.05
3.85

3.83

3.92

7.07

6.89

6.96
5.58

4.95

5.32

7.79
7.71

Success rate [%]
2D 3D 5D 10D 20D
92
81
62
45
18
92
76 67
45
19
92
79
63 51 24
100
100
64
44
40
100
68 66
44
40
100
84 66 55 41
100
82
78
52
6
100
84 81 56
4
100
86
80 56 10
100
100
100
68
18
100
100
100
68
20

7.20

100

10D
6.31
6.19

20D
7.71
7.61

6.04

7.49

6.02
5.99

5.68

5.95

5.80

5.87
5.65
5.40

5.01

7.62

7.55

7.57
6.24
6.13

6.02

7.07

6.80

6.83
8.16
8.19

8.06

8.04

7.99

8.02
7.60
7.47

7.46

74
76

100

41

46

16

20

85

84

18
55

86

84

72

86

84

78

41

100

56

82

10

11

41
7
7

10
51
51

20

52

20

7

16

Table 9: Nelder-Mead - results of restart strategies

worse on the other function groups.
It is also apparent that increasing the number of used instances for MetaMax(k)
leads to a higher overall success rate and faster convergence on multi-modal problems
but slower convergence on ill-conditioned functions, as apparent from the ECDF


graph in gure 8.

1.0 f10-14,10-D
best 2009
0.8
nm-k-20
0.6
0.4
nm-k-50
0.2
0.00 1 2 3 4 5 6 7 8 9nm-k-100
log10evaluations/D
Figure 8: Nelder-Mead - ECDF comparing MetaMax(k) strategies

In fact, the performance of the tested MetaMax(k) strategies on ill-conditioned

32

functions is worse than that of the corresponding restart strategies. This is the opposite of what was observed when using the compass search algorithm and can be
explained by the fact that the Nelder-Mead algorithm, unlike compass search, can
handle ill-conditioned problems very quickly and with high success rate (at least for
low dimensionalities). Therefore, there is no need for the MetaMax mechanism of
selecting multiple instances each round, as almost any instance is capable of nding
the global optimum. Selecting more than one at the same time only serves to decrease
the rate of convergence.
Overall, the three tested MetaMax(k) strategies perform only slightly better than
the corresponding xed restart strategies and are clearly worse than the best restart
strategies, such as

nm-special.

Table 11 shows the results of other tested MetaMax strategies. Unfortunately,
measurements for all dimensionalities were not nished in time, before the deadline
of this thesis, therefore table 11 contains only partial results for some strategies.
For the dimensionalities where the results of all the strategies are available, it
is apparent that

nm-m-100

and

nm-i-100

outperform

nm-k-100,

both in terms of

success rates and API. There are no signicant dierences in performance between
MetaMax variants using 100 and 50d local search algorithm instances, as well as no
observable dierences between performance of MetaMax and MetaMax(∞).

nm-special, MetaMax and MetaMax(∞)
have better success rates on function groups separ, multi and mult2 and as a result,
In comparison with the restart strategy

a better overall success rate.
MetaMax and MetaMax(∞) also converge faster on
slower on

lcond

and

hcond.

multi

and

mult2,

but are

The overall result being that they are better than the

best restart strategy ,in terms of API, for d=2 and d=3, but are worse for d=5.
Unfortunately, we cannot make comparisons for higher dimensionalities, where the
results for MetaMax and MetaMax(∞) are not available. However, based on the fact
that the advantage in performance of MetaMax over the restart strategy is lower in
d=3 than in d=2, and that the restart strategy is better in d=5, we can extrapolate
that MetaMax would likely also perform worse for higher dimensionalities. Even if
there was an improvement in performance, the fact remains, that the Nelder-Mead
method has such a bad performance in higher dimensionalities, that it is unlikely
that MetaMax could improve it to a practical level.

33

all
separ
lcond
hcond
multi
mult2

nm-f-1000d
nm-f-2000d
nm-f-5000d
nm-k-20
nm-k-50
nm-k-100
nm-special
nm-f-1000d
nm-f-2000d
nm-f-5000d
nm-k-20
nm-k-50
nm-k-100
nm-special
nm-f-1000d
nm-f-2000d
nm-f-5000d
nm-k-20
nm-k-50
nm-k-100
nm-special
nm-f-1000d
nm-f-2000d
nm-f-5000d
nm-k-20
nm-k-50
nm-k-100
nm-special
nm-f-1000d
nm-f-2000d
nm-f-5000d
nm-k-20
nm-k-50
nm-k-100
nm-special
nm-f-1000d
nm-f-2000d
nm-f-5000d
nm-k-20
nm-k-50
nm-k-100
nm-special

2D
3.64
3.82
4.03
3.96
3.76
3.58

3D
4.27
4.38
4.52
4.56
4.52
4.38

2.95

3.88

3.24
3.59
3.65
3.85
3.61
3.21

3.95
4.02
4.07
4.32
4.35
4.38

log10 API

5D
5.07
5.14
5.40
5.26
5.22
5.23

10D
6.31
6.33
6.52
6.54
6.40
6.45

20D
7.76
7.71
7.71
7.65
7.68
7.75

2D
81
75
71
72
81
85

4.84

6.04

7.49

92

4.59
4.56
4.74
4.86
4.93
4.98

6.02
6.01
6.08
6.05
6.14
6.24

7.27
7.04
7.07
6.91
7.03
7.30

Success rate [%]
3D 5D 10D 20D
70
61
45
14
69
61
45
17
65
56
42
18
65
60
38
17
66
61
45
17
71
62
44
15
79

63

51

24

100

67
66
65
65
66
68

84
69
68
69
84

64
63
62
62
63
64

44
46
49
46
45
43

29
40
40
40
40
31

2.46

3.55

4.50

5.68

6.83

100

84

66

55

41

2.45

3.31

4.14

5.87

8.06

100

86

80

56

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

68
68
69
68
68
67

10

100

100

100

82

41

10

8

3.01
3.05
3.13
3.46
3.40
3.44

1.92

1.99
2.00
2.59
2.74
2.85
1.96
5.77
6.11
6.56
6.00
5.36
5.15

4.52

4.12
4.24
4.62
3.82
3.61

3.22

3.26

3.49
3.50
3.60
3.98
4.03
4.00

2.47
2.46

2.44

3.10
3.24
3.33
2.48
6.63
6.78
6.91
6.59
6.40
6.25

6.05

4.68
4.95
5.42
4.70
4.50

3.86

3.92

4.44
4.60
4.86
4.70
4.77
4.81

3.26
3.30
3.51
3.77
3.93
4.02

3.12

7.14
7.21
7.32
7.00
6.93

6.88

6.96
5.81
5.93
6.46
5.84
5.46
5.36

5.32

5.95
5.94
6.15
6.25
6.34
6.42

5.65
5.59
5.74
5.77
5.84
5.96

5.01

7.62
7.64
7.69
7.50
7.46

7.43

7.57
6.24
6.41
6.87
7.07
6.19
6.18

6.02

8.24
8.20
8.16
8.20
8.19
8.21

7.99
7.87
7.79
7.87
7.93
8.00

7.20

7.98
8.02
8.04
7.92
7.88

7.87

8.02

7.42

7.52
7.60
7.47
7.48
7.45
7.46

84
82
80
80
84
84

54
40
23
28
52
55

78

85
84
84
84
84
84

86

80
80
77
78
80
80

24
20
18
17
21
25

41

83
83
68
68
68
83

84

77
77
76
76
77
78

16
14
11
13
14
16

18

53
53
36
52
53
55

56

52
53
52
52
54
53

2
4
6
4
4
4

9
14
18
14
12
9

10

7
7
7
7

10

8

10

8
9

10

51
50
34
18
50
51

52

7

22

18
16
18
18
20
20

Table 10: Nelder-Mead - results of MetaMax(k) and corresponding xed restart
strategies

34

all
separ
lcond
hcond
multi
mult2

nm-k-100
nm-m-100
nm-i-100
nm-k-50d
nm-m-50d
nm-i-50d
nm-special
nm-k-100
nm-m-100
nm-i-100
nm-k-50d
nm-m-50d
nm-i-50d
nm-special
nm-k-100
nm-m-100
nm-i-100
nm-k-50d
nm-m-50d
nm-i-50d
nm-special
nm-k-100
nm-m-100
nm-i-100
nm-k-50d
nm-m-50d
nm-i-50d
nm-special
nm-k-100
nm-m-100
nm-i-100
nm-k-50d
nm-m-50d
nm-i-50d
nm-special
nm-k-100
nm-m-100
nm-i-100
nm-k-50d
nm-m-50d
nm-i-50d
nm-special

2D
3.58
2.89
2.94
3.58

3D
4.38
3.76
3.76
4.42

2.85

3.74

2.88
2.95
3.21
2.62
2.65
3.18
2.55
2.61

3.76
3.88
4.38
3.39

3.36

3.44
2.74
2.76
3.40
2.65
2.76

4.42
3.38
3.40
3.55
4.00
3.73
3.80
4.07
3.73
3.80

2.45

3.31

2.46

2.85
2.54
2.63
2.84
2.52
2.64

1.96

5.15
3.88
3.90
5.19
3.84

3.67

4.52
3.22

2.64

2.72
3.25
2.66
2.70
3.26

3.33
3.15
3.21
3.39
3.13
3.22

2.48

6.25
4.96
4.91
6.26
4.92

4.89

6.05
3.86
3.55
3.51
3.88
3.53

3.50

3.92

log10 API

5D
5.23
4.96
4.95
5.23
4.93
-

10D
6.45
6.55
-

20D
7.75
7.83
-

4.84

6.04

7.49

4.98
4.82
4.81
5.05
4.79
-

4.50

4.81
4.71
4.70
4.89
4.67
-

4.14

4.02
4.00
4.02
4.11
4.03
-

3.12

6.88
6.52
6.51
6.77

6.50

6.96
5.36
4.70
4.68
5.26

4.60

5.32

6.24
6.33
-

5.68

6.42
6.52
-

5.87

5.96
6.35
-

5.01

7.43
-

7.37

7.57
6.18
6.19
-

6.02

7.30
7.52
-

6.83

8.21
8.26
-

8.06

8.00
8.06
-

7.20

7.87
-

7.82

8.02

7.45

7.60
7.46

2D
85
96
96
85
96
98

92

100
100
100
100
100
100
100

84

100
100

86

100
100

Success rate [%]
3D 5D 10D 20D
71
62
44
15
89
69
89
69
71
62
43
13
89
69
89
79
63 51 24
68
64
43
31
100
67
100
66
68
64
44
28
100
66
100
84
66 55 41
80
78
53
4
86
80
85 80
80
78
55
2
85 80
86
-

100

86

80

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

100

55
82
82
55
82

90

78
84

25
72

41

12

9

16

26

26

73

86

86

82

24

100

84

100

73

100

100

-

67
56
-

10

25
18

73

41
83
85
85
83
85

100

56

84

18
55
73
74
55

76

56

Table 11: Nelder-Mead - results of MetaMax strategies

35

10
10
51
51
-

52

9
7
8
7

20

16
-

20

5.3 BFGS
Results of restart strategies

bfgs-f-comb, bfgs-h-comb

and

bfgs-special

are

shown in table 13. Best xed and objective function stagnation based restart strategies for each dimension, which were used to make

bfgs-hfcomb

and

bfgs-h-comb,

are listed in table 12.
Dimensionality

d=2
d=3
d=5
d=10
d=20

Fixed

Stagnation based

bfgs-f-100d
bfgs-f-100d
bfgs-f-200d
bfgs-f-1000d
bfgs-f-1000d

bfgs-h-2d
bfgs-h-2d
bfgs-h-2d
bfgs-h-2d
bfgs-h-2d

Table 12: BFGS - best restart strategies for each dimensionality

For the selected best xed restart strategies, we see an ordinary behaviour where
run lengths increase with dimensionality. However, for the stagnation based restart
strategies,

bfgs-h-2d

is apparently the best for all dimensionalities. This is quite

unusual but, in hindsight, not entirely unexpected. It has to do with the way our
implementation of BFGS works: At the begging of each step, the algorithm estimates
gradient of the objective function by using the nite dierence method. This involves
evaluating the objective function at a set of neighbouring solutions, which are very
close to the current solution. The number of these neighbour solutions is always 2d one for each vector in a positive orthonormal basis of the search space. A very quick
way to detect convergence is to check, if objective function values at these points are
worse than function value at the current solution. As it turns out, this is precisely
what

bfgs-h-2d

does and also the reason why it works so well.

In contrast with the surprisingly good results of

bfgs-h-2d,

the special restart

strategy, which is based on monitoring value of the norm of the estimated gradient,
performs very poorly and is clearly the worst of all the tested restart strategies. The
other two strategies
with

bfgs-h-comb

bfgs-h-comb and bfgs-f-comb have a very similar performance,

being slightly better.

Overall, BFGS has excellent results on ill-conditioned problems, even exceeding
the performance of the best algorithms from the BBOB 2009 conference for certain
dimensionalities on the group

hcond,

which is illustrated in gure 9. However, it

performs quite poorly on multimodal functions (multi and

mult2).

Table 14 sums up the results comparing MetaMax(k) with corresponding xed
restart strategies. In terms of success rate, both types of strategies perform the
same on all function groups. In terms of rate of convergence, expressed by values of
API, the results are similar to those observed when using the Nelder-Mead method:
MetaMax(k) strategies perform better on

hcond

and

separ.

multi

and

mult2,

but worse on

lcond,

The overall performance of MetaMax(k) across all function groups is worse than
that of the corresponding xed restart strategies and consequently also worse than
performance of the best restart strategies.

36

Black box optimization of restart strategies for the MetaMax algorithm

Black box optimization of restart strategies for the MetaMax algorithm

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (13)

Ähnlich wie Black box optimization of restart strategies for the MetaMax algorithm

Ähnlich wie Black box optimization of restart strategies for the MetaMax algorithm (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Black box optimization of restart strategies for the MetaMax algorithm