Czech Technical University in Prague
Faculty of Electrical Engineering

DIPLOMA THESIS

Bc. Viktor Kajml

Black box optimization: Restarting versus MetaMax algorithm

Department of Cybernetics

Project supervisor: Ing. Petr Pošík, Ph.D.

Prague, 2014
Abstrakt

This diploma thesis deals with the evaluation of a promising new optimization algorithm called MetaMax. The main goal is to assess its suitability for solving black-box optimization problems with continuous parameters, especially in comparison with other methods commonly used in this area. To this end, MetaMax and selected traditional restart strategies are thoroughly tested on a large set of benchmark functions, using several different local search algorithms. The measured results are then compared and evaluated. A secondary goal is to propose and implement modifications of the MetaMax algorithm in certain areas where there is room for improving its performance.

Abstract

This diploma thesis is focused on evaluating a new promising multi-start optimization algorithm called MetaMax. The main goal is to assess its utility in the area of black-box continuous parameter optimization, especially in comparison with other strategies commonly used in this area. To achieve this, MetaMax and a selection of traditional restart strategies are thoroughly tested on a large set of benchmark problems, using multiple different local search algorithms. Their results are then compared and evaluated. An additional goal is to suggest and implement modifications of the MetaMax algorithm in certain areas where there seems to be potential room for improvement.
I would like to thank:
Petr Pošík for his help with this thesis
The Centre of Machine Perception at the Czech Technical University in Prague for providing me with access to their computer grid
My friends and family for their support
Contents

1 Introduction
2 Problem description and related work
   2.1 Local search algorithms
   2.2 Multi-start strategies
3 MetaMax algorithm and its variants
   3.1 Suggested modifications
4 Experimental setup
   4.1 Used multi-start strategies
   4.2 Used metrics
   4.3 Implementation details
5 Results
   5.1 Compass search
   5.2 Nelder-Mead method
   5.3 BFGS
   5.4 CMA-ES
   5.5 Additional results
6 Conclusion
A Used local search algorithms
   A.1 Compass search
   A.2 Nelder-Mead algorithm
   A.3 BFGS
   A.4 CMA-ES
B CD contents
C Acknowledgements
List of Tables

1 Benchmark function groups
2 Algorithm specific restart strategies
3 Tested multi-start strategies
4 Compass search - best restart strategies for each dimensionality
5 Compass search - results of restart strategies
6 Compass search - results of MetaMax(k) and corresponding fixed restart strategies
7 Compass search - results of MetaMax strategies
8 Nelder-Mead - best restart strategies for each dimensionality
9 Nelder-Mead - results of restart strategies
10 Nelder-Mead - results of MetaMax(k) and corresponding fixed restart strategies
11 Nelder-Mead - results of MetaMax strategies
12 BFGS - best restart strategies for each dimensionality
13 BFGS - results of restart strategies
14 BFGS - results of MetaMax(k) and corresponding fixed restart strategies
15 BFGS - results of MetaMax strategies
16 CMA-ES - best restart strategies for each dimensionality
17 CMA-ES - results of restart strategies
18 CMA-ES - results of MetaMax(k) and corresponding fixed restart strategies
19 CMA-ES - results of MetaMax strategies
20 CD contents
List of Figures

1 Restart condition based on function value stagnation
2 Example of monotone transformation of f(x)
3 MetaMax selection mechanisms
4 Example ECDF graph
5 Compass search - ECDF comparing MetaMax(k) with an equivalent fixed restart strategy
6 Compass search - ECDF of MetaMax variants using 100 instances
7 Compass search - ECDF of MetaMax variants using 50d instances
8 Nelder-Mead - ECDF comparing MetaMax(k) strategies
9 BFGS - ECDF of the best restart strategies
10 BFGS - ECDF of MetaMax variants using 50d instances
11 CMA-ES - ECDF of function value stagnation based restart strategies
12 CMA-ES - ECDF comparison of MetaMax variants using 50d instances
13 MetaMax timing measurements
14 ECDF comparing MetaMax strategies using different instance selection methods
15 Nelder-Mead algorithm in 2D

List of Algorithms

1 Typical structure of a local search algorithm
2 Variable neighbourhood search
3 MetaMax(k)
4 MetaMax(∞)
5 MetaMax
6 Compass search
7 Nelder-Mead method
8 BFGS algorithm
9 CMA-ES algorithm
1 Introduction
The goal of this thesis is to implement and evaluate the performance of the MetaMax optimization algorithm, particularly in comparison with other commonly used optimization strategies.

MetaMax was proposed by György and Kocsis in [GK11], and the results they present seem very interesting and suggest that MetaMax might be a very competitive algorithm. Our goal is to evaluate its performance more closely on problems from the area of black-box continuous optimization, by performing a series of exhaustive measurements and comparing the results with those of several commonly used restart strategies.

This text is organized as follows: first, there is a short overview of the subjects of mathematical, continuous and black-box optimization, local search algorithms and multi-start strategies. This is meant as an introduction for readers who might not be familiar with these topics. Readers who already have knowledge of these fields might wish to skip forward to the following sections, which describe the MetaMax algorithm, the experimental setup, the used optimization strategies and the software implementation. In the last two sections, the measured results are summed up and evaluated.
The mathematical optimization problem is defined as selecting the best element, according to some criteria, from a set of feasible elements. The most common form of the problem is finding a set of parameters x1,opt, ..., xd,opt, where d is the problem dimension, for which the value of a given objective function f(x1, ..., xd) is minimal, that is f(x1,opt, ..., xd,opt) ≤ f(x1, ..., xd) for all possible values of x1, ..., xd.

Within this field of mathematical optimization, it is possible to define several subfields based on the properties of the parameters x1, ..., xd and the amount of information available about the objective function f.

Combinatorial optimization: The set of all possible solutions (possible combinations of the parameter values) is finite, usually some subset of N^d.

Integer programming: All of the parameters are restricted to be integers: x1, ..., xd ∈ N. Can be considered to be a subset of combinatorial optimization.

Mixed integer programming: Some parameters are real-valued and some are integers.

Continuous optimization: The set of all possible solutions is infinite, usually x1, ..., xd ∈ R.

Black-box optimization: Assumes that only a bare minimum of information about f is given. It can be evaluated at an arbitrary point x, returning the function value f(x), but besides that, no other properties of f are known. In order to solve this kind of problem, we have to resort to searching (the exact techniques are described in more detail later in this text). Furthermore, we are almost never guaranteed to find the exact solution, just one that is sufficiently close to it, and there is almost always a non-zero probability that even an approximate solution might not be found at all.

White-box optimization deals with problems where we have some additional knowledge about f, for example its gradient, which can obviously be very useful when looking for its minimum.
In this text we will deal almost exclusively with black-box continuous optimization problems.
For a practical example of a black-box optimization problem, imagine the process of trying to design an airfoil which should have certain desired properties. It is possible to describe the airfoil by a vector of variables representing its various parameters - length, thickness, shape, etc. This will be the parameter vector x. Then, we can run an aerodynamic simulation with the airfoil described by x, evaluate how closely it matches the desired properties, and based on that, assign a function value f(x) to the parameter vector. In this way, the simulator becomes the black-box function f and the problem is transformed into the task of minimizing the objective function f. We can then use black-box optimization methods to find the parameter vector x_opt which will give us an airfoil with the desired properties.

This example hopefully illustrates sufficiently that black-box optimization can be a very powerful tool, as it allows us to find reasonably good solutions even for problems which we might not be able to, or would not know how to, solve otherwise.
As already mentioned, the usual method for finding optima (the best possible set of parameters x_opt) in continuous mathematical optimization is searching. The structure of a typical local search algorithm is as follows:

Algorithm 1: Typical structure of a local search algorithm

1  Select a starting solution x0 somehow (most commonly randomly) from the set of feasible solutions.
2  Set the current solution: xc ← x0
3  Get the function value f(xc).
4  while Stop condition not met do
5      Generate a set of neighbour solutions Xn similar to xc
6      Evaluate f at each xn ∈ Xn
7      Find the best neighbour solution x* = argmin_{xn ∈ Xn} f(xn)
8      if f(x*) < f(xc) then
9          Update the current solution xc ← x*
10     else
11         Modify the way of generating neighbour solutions
12 return xc

In the case of continuous optimization, a solution is represented simply by a point in R^d. There are various ways of generating neighbour solutions. In general, two neighbouring solutions should be different from each other, but in some sense also similar. In continuous optimization, this usually means that the solutions are close in terms of Euclidean distance, but not identical.
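
To make the structure above concrete, the main loop of algorithm 1 can be sketched in Python. This is a minimal illustration only; the neighbour-generation rule (uniform perturbations) and the step-size halving on unsuccessful iterations are placeholder choices, not the specific rules of any algorithm discussed later.

import random

def local_search(f, x0, max_evals=1000, step=1.0):
    """Minimal sketch of algorithm 1: greedy local search in R^d."""
    xc, fc = list(x0), f(x0)
    evals = 1
    while evals < max_evals and step > 1e-12:
        # Generate a few neighbour solutions similar to xc (line 5).
        neighbours = [[xi + random.uniform(-step, step) for xi in xc]
                      for _ in range(4)]
        values = [f(xn) for xn in neighbours]
        evals += len(neighbours)
        best = min(range(len(values)), key=values.__getitem__)
        if values[best] < fc:
            xc, fc = neighbours[best], values[best]  # successful: move (line 9)
        else:
            step *= 0.5  # unsuccessful: change how neighbours are generated (line 11)
    return xc, fc

For example, local_search(lambda x: sum(xi * xi for xi in x), [2.0, -1.5]) will converge towards the minimum of the sphere function at the origin.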

The algorithm described above has the property that it always creates neighbour solutions close to the current solution and moves the current solution in the direction of decreasing f(x). This makes it a greedy algorithm, which works well in cases where the objective function is unimodal (has only one optimum), but for multimodal functions (functions with multiple local optima), the resulting behaviour will not be ideal. The algorithm will move in the direction of the nearest optimum (the optimum whose basin of attraction contains x0), but when it gets there it will not move any further, as at this point all the neighbour solutions will be worse than the current solution. Such an algorithm can therefore be relied on to find the nearest local optimum, but there is no guarantee that it will also be the global one. The global optimum will be found only when x0 happens to land in its basin of attraction.

The method most commonly used to overcome this problem is to run multiple instances of the local search algorithm from different starting positions x0. Then it is probable that at least one of them will start in the basin of attraction of the global optimum and will be able to find it. There are various different multi-start strategies which implement this basic idea, with MetaMax, the main subject of this thesis, being one of them.

A more thorough description of the local search algorithms' problem of getting stuck in a local optimum is given in section 2. A detailed description of the MetaMax algorithm and its variations is given in section 3. The structure of the performed experiments is described in section 4. Finally, the measured results are presented and evaluated in section 5.

2 Problem description and related work
As mentioned in the previous section, local search algorithms have problems finding the global optimum of functions with multiple optima (also called multimodal functions). In this section we focus on this problem more thoroughly. We describe several common types of local search algorithms in more detail and discuss their susceptibility to getting stuck in a local optimum. Next, we describe several methods to overcome this problem.

2.1 Local search algorithms
Following are descriptions of four commonly used kinds of local search algorithms, which we hope will give the reader a more concrete idea about the functioning of local search algorithms than the very basic example described in algorithm 1.
Line search algorithms try to solve the problem of minimizing a d-dimensional function f by using a series of one-dimensional minimization tasks, called line searches. During each step of the algorithm, an imaginary line is created, starting at the current solution xc and going in a suitably chosen direction σ. Then, the line is searched for a point x with the minimal value of f(x), and the current solution is updated: xc ← x. In this way, the algorithm will eventually converge on a nearest local optimum of f.

The question remains - how to choose the search direction σ? The simplest algorithms just use a preselected set of directions (usually vectors in an orthonormal positive d-dimensional base) and loop through them on successive iterations. This method is quite simple to implement, but it has trouble coping with ill-conditioned functions.

An obvious idea might be to use information about the function's gradient for determining the search direction. However, this turns out not to be much more effective than the simple alternating algorithms. The best results are achieved when information about both the function's gradient and its Hessian is used. Then, it is possible to get quite robust and well performing algorithms. Note that for black-box optimization problems, it is necessary to obtain the gradient by estimation, as it is not explicitly available.

Examples of this kind of algorithm are: the symmetric rank-one method, the gradient descent algorithm and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
Pattern search algorithms closely fit the description given in algorithm 1. They generate the neighbour solutions xn ∈ Xn in defined positions (a pattern) relative to the current solution xc. If any of the neighbour solutions is found to be better than the current one, it then becomes the new current solution, the next set of neighbour solutions is generated around it, and so on.

If none of the neighbour solutions is found to be better (an unsuccessful iteration), then the pattern is contracted, so that in the next step the neighbour solutions are generated closer to xc. In this way the algorithm will converge to the nearest local optimum (for proof, please see [KLT03]). Advanced pattern search algorithms use patterns which change size and shape according to various rules, both on successful and unsuccessful iterations.

Typical algorithms of this type are: compass search (or coordinate search), the Nelder-Mead simplex algorithm and the Luus-Jaakola algorithm.
Population based algorithms keep track of a number of solutions, also called individuals, at one time, which together constitute a population. A new generation of solutions is generated each step, based on the properties of a set of selected (usually the best) individuals from the previous generation. Different algorithms vary in the exact implementation of this process.

For example, in the family of genetic algorithms, this process is designed to emulate natural evolution: properties of each individual (in the case of continuous optimization, this means its position) are encoded into a genome, and new individuals are created by combining parts of genomes of successful individuals from the previous generation, or by random mutation. Unsuccessful individuals are discarded, in an analogy with the natural principle of survival of the fittest.

Other population based algorithms, such as CMA-ES, take a somewhat more mathematical approach: new generations are populated by sampling a multivariate normal distribution, which is in turn updated every step, based on the properties of a number of the best individuals from the previous generation.

Swarm intelligence algorithms are based on the observation that it is possible to get quite well performing optimization algorithms by trying to emulate natural behaviours, such as the flocking of birds or fish schools. Each solution represents one member of a swarm and moves around the search space according to a simple set of rules. For example, it might try to keep a certain minimal distance from other flock members, while also heading in the direction with the best values of f(x). The specific rules vary a great deal between different algorithms, but in general even a simple individual behaviour is often enough to result in quite complex collective emergent behaviour. Because swarm intelligence algorithms keep track of multiple individuals/solutions during each step, they can also be considered to be a subset of population based algorithms.

Some examples of this class of algorithms are the particle swarm optimization algorithm and the fish school search algorithm.
Pattern search and line search algorithms have the property that they always choose neighbour solutions close to the current solution and move in the direction of decreasing f(x). Thus, as was already described in the previous section, they are able to find only the local optimum which is nearest to their starting position x0. Population based and swarm intelligence algorithms might be somewhat less susceptible to this behaviour in the case where the initial population is spread over a large area of the search space. Then there is a chance that some individuals might land near the global optimum, and eventually pull the others towards it.

There are several modifications of local search algorithms specifically designed to overcome the problem of getting stuck in a local optimum. We shall now describe two basic ones - simulated annealing and tabu search. The main idea behind them is to limit the local search algorithm's greedy behaviour by sometimes taking steps other than those which lead to the greatest decrease of f(x).

Simulated annealing implements the above mentioned idea in a very straightforward way: during each step, the local search algorithm may select any of the generated neighbour solutions with a non-zero probability, thus possibly not selecting the best one.

The probability P of choosing a particular neighbour solution xn is a function of f(xc), f(xn) and s, where s is the number of steps already taken by the algorithm. Usually, it increases with the value of ∆f = f(xc) − f(xn), so that the best neighbour solutions are still likely to be picked the most often. The probability of choosing a neighbour solution other than the best one also usually decreases as s increases, so that the algorithm behaves more randomly in the beginning and then, as time goes on, settles down to a more predictable behaviour and converges to the nearest optimum. This is somewhat similar to the metallurgical process of annealing, from which the algorithm takes its name.

It is possible to apply this method to almost any of the previously mentioned local search algorithms, simply by adding the possibility of choosing neighbour solutions which are not the best. In practice, the exact form of P(f(xc), f(xn), s) has to be fine-tuned for a given problem in order to get good results. Therefore, this algorithm is of limited usefulness in the area of black-box optimization.
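
As an illustration, one common concrete choice of P (not prescribed by the text above) is the Metropolis rule with a temperature that decreases with s. A minimal sketch, assuming minimization and an exponential cooling schedule with illustrative parameters t0 and cooling:

import math
import random

def accept(f_c, f_n, s, t0=1.0, cooling=0.99):
    """Metropolis-style acceptance rule P(f(xc), f(xn), s) for minimization.

    Better neighbours are always accepted; worse ones are accepted with a
    probability that shrinks both with the loss delta_f and with the step
    count s, so the behaviour gets less random over time.
    """
    delta_f = f_n - f_c              # positive when the neighbour is worse
    if delta_f <= 0:
        return True
    temperature = t0 * cooling ** s  # decreases as the search progresses
    return random.random() < math.exp(-delta_f / temperature)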
Tabu search works by keeping a list of previously visited solutions, which is called the tabu list. It selects potential moves only from the set of neighbour solutions which are not on this list, even if it means choosing a solution which is worse than the current one. The selected solution is then added to the tabu list and the oldest entry in the tabu list is deleted. The list therefore works in a way similar to a cyclic buffer.

This method was originally designed for solving combinatorial optimization problems, and it requires certain modifications in order to be useful in the area of continuous parameter optimization. At the very least, it is necessary to modify the method to discard not only neighbour solutions which are on the tabu list, but also solutions which are close to them. Without this, the algorithm would not work very well, as the probability of generating the exact same solution twice in R^d is quite small.

There is a multitude of advanced variations of this basic method; for example, it is possible to add aspiration rules, which override the tabu status of solutions that would lead to a large decrease in f(x). For a detailed description of tabu search adapted for continuous optimization, please see [CS00].
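
A sketch of the continuous-space adaptation described above, keeping a fixed-length tabu list and rejecting neighbours within a Euclidean exclusion radius of any stored solution; the list length and radius are illustrative parameters, not values from the thesis:

import math
from collections import deque

def make_tabu_filter(max_len=50, radius=0.1):
    """Returns (is_allowed, add): a tabu check and a tabu-list update."""
    tabu = deque(maxlen=max_len)  # oldest entries fall out, like a cyclic buffer

    def is_allowed(x):
        # Reject solutions close to a tabu entry, not just identical ones,
        # since exact repeats are essentially impossible in R^d.
        return all(math.dist(x, t) > radius for t in tabu)

    def add(x):
        tabu.append(tuple(x))

    return is_allowed, add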

2.2 Multi-start strategies
Multi-start strategies allow effectively using local search algorithms on functions with multiple local optima without making any modification to the way they work. The basic idea is that if we run a search algorithm multiple times, each time from a different starting position x0, then it is probable that at least one of the starting positions will be in the basin of attraction of the global optimum and thus the corresponding local search algorithm will be able to find it. Of course, the probability of this depends on the number of algorithm instances that are run, relative to the number and properties of the function's optima. It is possible to think about multi-start strategies as meta-heuristics, running above, and controlling, multiple instances of local search algorithm sub-heuristics.

Restart strategies are a subset of multi-start strategies, where multiple instances are run one at a time in succession. The most basic implementation of a restart strategy is to take the total amount of allowed resource budget (usually a set number of objective function evaluations), evenly divide it into multiple slots, and use each of them to run one instance of a local search algorithm. A very important choice is deciding the length of a single slot. The optimal length largely depends on the specific problem and the type of used algorithm. If the length is set too low, then the algorithm might not have enough time to converge to its nearest optimum. If it is too long, then there is a possibility that resources will be wasted on running instances which are stuck in local optima and can no longer improve.

Of course, all of the time slots do not have to be of the same length. A good strategy for black-box optimization is to start with a low length and keep increasing it for each subsequent slot. In this way, a reasonable performance can be achieved even if we are unable to choose the most suitable slot length for a given problem in advance.
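
A fixed restart strategy with geometrically growing slot lengths might then be sketched as follows, reusing the local_search routine sketched in section 1; the initial slot length and growth factor are illustrative choices:

def restart_strategy(f, random_x0, total_budget, first_slot=100, growth=2):
    """Sketch of a restart strategy with increasing slot lengths."""
    best_x, best_f = None, float("inf")
    used, slot = 0, first_slot
    while used < total_budget:
        budget = min(slot, total_budget - used)  # one slot per instance
        x, fx = local_search(f, random_x0(), max_evals=budget)
        used += budget
        if fx < best_f:
            best_x, best_f = x, fx
        slot *= growth                           # next instance runs longer
    return best_x, best_f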
A different restart strategy is to keep each instance going for as long as it needs until it converges to an optimum. The most universal way to detect convergence is to look for stagnation of the values of the objective function over a number of past function evaluations (or past local search algorithm iterations). If the best objective function value found so far does not improve by at least the limit tf over the last hf function evaluations, then the current algorithm instance is terminated and a new one is started. For convenience, in the subsequent text we will call hf the function value history length and tf the function value history tolerance. An example of this restart condition is given in figure 1: the best solution found after v function evaluations is marked as x*_v and its corresponding function value as f(x*_v). In the figure, we see that the restart condition is triggered because at the last function evaluation m, the following is true: f(x*_{m−hf}) ≤ f(x*_m) + tf.

Figure 1: Restart condition based on function value stagnation
Displays the objective function value f(x_v) (dashed black line) of evaluation v, and the best objective function value reached after v function evaluations f(x*_v) (solid black line), over the interval [0, m] function evaluations. The values f(x*_m), f(x*_m) + tf and m − hf are highlighted.
It is, of course, necessary to choose specific values of hf and tf, but usually it is not overly difficult to find a combination which works well for a large set of problems.
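
The stagnation test itself reduces to comparing the current best value with the best value hf evaluations earlier. A sketch, assuming best_history is the recorded sequence of best-so-far values, one entry per function evaluation:

def stagnated(best_history, h_f, t_f=1e-10):
    """Restart condition of figure 1: true when the best objective value
    improved by less than t_f over the last h_f evaluations."""
    m = len(best_history)              # evaluations performed so far
    if m <= h_f:
        return False                   # not enough history yet
    # best_history is non-increasing, so this checks
    # f(x*_{m-h_f}) <= f(x*_m) + t_f
    return best_history[m - 1 - h_f] <= best_history[-1] + t_f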
Various different ways of detecting convergence, and corresponding restart conditions, can be used: for example, reaching zero gradient for line search algorithms, reaching a minimal pattern size for pattern search algorithms, etc.
There are also various ways of choosing the starting position x0 for new local search algorithm instances. The simplest one is to choose x0 by sampling a random uniform distribution over the set of all feasible solutions. This is very easy to implement and often gives good results. However, it is also possible to use information gained by the previous instances when choosing x0 for a new one.

A simple algorithm which utilizes this idea is the iterated search: the first instance i1 is started from an arbitrary position and is run until it converges (or until it exhausts a certain amount of resources) and returns the best solution it has found, x*_i1. Then, the starting position for the next instance is selected from the neighbourhood N of x*_i1. Note that N is a qualitatively different neighbourhood than what the instance i1 might be using to generate neighbour solutions each step. It is usually much larger, the goal being to generate the new starting point for instance i2 by perturbing the best solution of i1 enough to move it to a different basin of attraction. If the new instance finds a solution x*_i2 better than x*_i1, then the next instance is started from the neighbourhood N(x*_i2). If f(x*_i2) ≥ f(x*_i1) and a better solution is not found, then the next instance is started from the neighbourhood N(x*_i1) again. This is repeated until a stop condition is triggered. An obvious assumption that this method makes is that the minima of the objective function are grouped close together. If this is not the case, then it might be better to use uniform random sampling.
The big question is how to choose the size of the neighbourhood N. Too small, and the new instance might fall into the same basin of attraction as the previous one. Too big, and the results will be similar to choosing the starting position uniformly at random. Another method, called the variable neighbourhood search, which can, in a way, be considered an improved version of the iterated search, tackles this problem by using multiple neighbourhood structures N1, ..., Nk of varying sizes, where N1 is the smallest and the following neighbourhoods are successively larger, with Nk being the largest. The restarting procedure is the same as with iterated search, with the following modification: if a local search algorithm instance ik, started from the neighbourhood N1(x*_{ik−1}), does not improve the current best solution, then the algorithm tries starting the next instance from N2(x*_{ik−1}), then N3(x*_{ik−1}), and so on. The structure of a basic variable neighbourhood search, as given in [HM03], page 10, is described in algorithm 2. This algorithm can also be used as a description of iterated search, if the set of neighbourhood structures contains only one element.
Algorithm 2: Variable neighbourhood search

input: initial position x0, set of neighbourhood structures N1, ..., Nk of increasing size

1  x* ← local_search(x0)
2  k ← 1
3  while Stop condition not met do
4      Generate random point x from Nk(x*)
5      y* ← local_search(x)
6      if f(y*) < f(x*) then
7          x* ← y*
8          k ← 1
9      else
10         k ← k + 1
11 return x*

Yet another group of methods which aim to prevent local search algorithms from getting stuck in local optima is based on the idea that it is not necessary to run multiple local search algorithm instances one after another; they can be run at the same time. Then, it is possible to evaluate the expected performance of each instance, based on the results it has obtained so far, and allocate the resources to the best (or most promising) ones. This is somewhat similar to the well known multi-armed bandit problem.

The basic implementation of this idea is called the explore and exploit strategy. It involves initially running all of its k algorithm instances until a certain fraction of the resource budget is expended. This is the exploration phase. Then, the best instance is selected and run until the rest of the resource budget is used up - the exploitation phase.
There is, again, an obvious trade-off between the amount of resources allocated to each phase. The exploration phase should be long enough so that, when it ends, it is possible to reliably identify the best instance. On the other hand, it is necessary to have enough resources left in the exploitation phase in order for the selected best instance to converge to the optimum. In practice, it is actually not that difficult to find a balance between these two phases that gives good results for a wide range of problems.
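
A sketch of this strategy, again reusing the earlier local_search routine; note that for simplicity it restarts the chosen instance from its best point instead of resuming the instance's internal state, which a real implementation would preserve:

def explore_and_exploit(f, random_x0, total_budget, k=10, explore_frac=0.5):
    """Sketch of the explore and exploit strategy with k instances."""
    explore_each = int(total_budget * explore_frac) // k
    # Exploration phase: give every instance the same small budget.
    results = [local_search(f, random_x0(), max_evals=explore_each)
               for _ in range(k)]
    best_x, best_f = min(results, key=lambda r: r[1])
    # Exploitation phase: spend the remaining budget on the most promising one.
    remaining = total_budget - k * explore_each
    x, fx = local_search(f, best_x, max_evals=remaining)
    return (x, fx) if fx < best_f else (best_x, best_f)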
Methods like this, which run multiple local search algorithm instances at the same time, belong to the group of portfolio algorithms. We should, however, note that portfolio algorithms are usually used in a somewhat different way than described here. Most commonly, they run multiple instances of different local search algorithms, each of which is well suited for a different kind of problem. This allows the portfolio algorithm to select instances of the algorithm which is able to solve the given problem the most efficiently, even without knowing its properties a priori.

The MetaMax algorithm, which is the main subject of this thesis, is also a portfolio algorithm. However, we use it running only one kind of local search algorithm at a time, to allow for a more fair and direct comparison with restart strategies, which typically use only one kind of local search algorithm.

3 MetaMax algorithm and its variants
The MetaMax algorithm is a multi-start portfolio strategy presented by György and Kocsis in [GK11]. There are, in fact, three versions of the algorithm, which differ in certain details. They are called MetaMax(k), MetaMax(∞) and MetaMax, and they will be described in detail in this section.

Please note that while in this text we usually presume all optimization problems to be minimization problems, the text in [GK11] assumes a maximization task. Therefore, while describing the workings of the MetaMax algorithm in this section, we will keep to the convention in [GK11], but in the rest of the text we will refer to minimization tasks as usual. Our implementation of MetaMax was modified to work with minimization tasks.
György and Kocsis demonstrate ([GK11], page 413, equation 2) that the convergence of an instance of a local search algorithm, after s steps, can be optimistically estimated with large probability as:

    lim_{t→∞} f(x*_t) ≤ f(x*_s) + g_σ(s)    (1)

where f(x*_s) is the best function value obtained by the local search algorithm instance up until step s, and g_σ(s) is a non-increasing, non-negative function with lim_{s→∞} g_σ(s) = 0. Note that the notation used here is a little different than in [GK11], but the meaning is the same.

In practice, the exact form of g_σ(s) is not known, so the right side of equation 1 has to be approximated as:

    f(x*_s) + c·h(s)    (2)

where c is an unknown constant and h(s) is a positive, monotone, decreasing function with the following properties:

    h(0) = 1,    lim_{s→∞} h(s) = 0    (3)

One possible simple form of this function is h(s) = e^(−s). In the subsequent text, we shall call this function the estimate function. György and Kocsis do not use this name in their work. In fact, they do not use any name for this function at all and refer to it simply as the h function. However, we think that this is not very convenient, hence we picked a suitable name.
Based on equations 1 and 2, it is possible to create a strategy that allocates resources only to those instances which are estimated to converge the most quickly, i.e. which maximize the value of expression 2 for a certain range of the constant c. The problem of finding these instances can be solved effectively by transforming it into the problem of finding the upper right convex hull of a set of points, in the following way:

We assume that there are k instances in total and that each instance Ai keeps track of the number of steps si it has taken, the position x_{i,si} of the best solution it has found so far, and its corresponding function value f(x*_{i,si}). We represent the set of local search algorithm instances Ai, i = 1, ..., k by a set of points:

    P = { (h(si), f(x*_{i,si})), i = 1, ..., k }    (4)

Then the instances which maximize the value of expression 2 for a certain range of c correspond to those points which lie on the upper right convex hull of the set P. Because the term upper right convex hull is not quite standard, we should clarify that we understand it to mean the intersection of the upper convex hull and the right convex hull.

Note that, presumably for simplicity, the authors of [GK11] assumed only local search algorithms which use the same number of function evaluations every step. For algorithms where this is not true, it makes more sense to instead set si equal to the number of function evaluations used by instance i so far. We believe that this is a better way to measure the use of resources by individual instances, which is also confirmed in [PG13].
György and Kocsis suggest using a form of estimate function which changes based on the amount of resources used by all the local search algorithm instances, in order to encourage more exploratory behaviour as the MetaMax algorithm progresses. Therefore, in our implementation, we use the following estimate function, which is recommended in [GK11]:

    h(vi, vt) = e^(−vi/vt)    (5)

where vi is the number of function evaluations used by instance i and vt is the total number of function evaluations used by all of the instances combined.
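
The selection step can be sketched as follows, keeping the maximization convention of [GK11]. Instances are represented as the points (h(vi), fi) of equation 4, and the vertices of the upper right convex hull are found with a monotone-chain scan; the random tie-breaking between identical points is omitted for brevity.

def cross(o, a, b):
    """z-component of (a - o) x (b - o); >= 0 means a left turn or collinear."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def select_instances(points):
    """Indices of instances on the upper right convex hull of P.

    points: list of (h(v_i), f_i) pairs. A point is returned exactly when
    some c > 0 makes f_i + c*h(v_i) strictly maximal over all instances.
    """
    order = sorted(range(len(points)), key=lambda i: points[i])
    hull = []                                  # upper hull, by increasing h
    for i in order:
        while len(hull) >= 2 and cross(points[hull[-2]],
                                       points[hull[-1]], points[i]) >= 0:
            hull.pop()
        hull.append(i)
    # Vertices left of the maximum f are optimal only for c <= 0, so keep
    # the part of the upper hull from the best f to the largest h.
    top = max(range(len(hull)), key=lambda j: points[hull[j]][1])
    return hull[top:]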
The simplest of the three MetaMax variants is MetaMax(k). It uses k local search algorithm instances and is described in algorithm 3. For convenience and improved readability, we will use simplified notation when describing the MetaMax variants:

vi for the number of function evaluations used by local search algorithm instance i so far

xi for the position of the best solution found by instance i so far

fi for the function value of xi

In the descriptions, we also assume that the estimate function h is a function of only one variable.
Algorithm 3: MetaMax(k)

input: function to be optimized f, number of algorithm instances k and a monotone non-increasing function h with properties as given in equation 3

1  Step each of the k local search algorithm instances Ai and update their variables vi, xi and fi
2  while stop conditions not met do
3      For i = 1, ..., k, select algorithm Ai if there exists c > 0 so that: fi + c·h(vi) > fj + c·h(vj) for all j = 1, ..., k so that (vi, fi) ≠ (vj, fj). If there are multiple algorithms with identical v and f, then select only one of them at random.
4      Step each selected Ai and update its variables vi, xi and fi.
5      Find the best instance: b = argmax_{i=1,...,k}(fi).
6      Update the best solution: x* ← xb.
7  return x*
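
With this selection routine, one round of MetaMax(k) can be driven as sketched below, still under the maximization convention. The instance objects with a step() method and v, x, f attributes are assumed purely for illustration; the estimate function is the one from equation 5.

import math

def metamax_k_round(instances):
    """One round of algorithm 3 over a list of already-stepped instances."""
    v_total = sum(a.v for a in instances)      # evaluations used by everyone
    points = [(math.exp(-a.v / v_total), a.f)  # h(v_i, v_t) from equation 5
              for a in instances]
    for i in select_instances(points):
        instances[i].step()                    # updates a.v, a.x and a.f
    best = max(instances, key=lambda a: a.f)   # line 5 of algorithm 3
    return best.x, best.f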

As with a priori scheduled restart strategies, there is the question of choosing the right number of instances (parameter k) to use. The other two versions of the algorithm - MetaMax and MetaMax(∞) - get around this problem by gradually increasing the number of instances, starting with a single one and adding a new one every round. Thus, the number of instances tends to infinity as the algorithm keeps running. This makes it possible to prove that the algorithm is consistent; that is, it will almost surely find the global optimum if kept running for an infinite amount of time.

Please note that in some literature, such as [Neu11], the term asymptotically complete is used instead of consistent, but both mean the same thing. Also note that we use the word round to refer to a step of the MetaMax algorithm, in order to avoid confusion with steps of local search algorithms. MetaMax and MetaMax(∞) are described in algorithms 5 and 4 respectively, also using the simplified notation.

Algorithm 4: MetaMax(∞)

input: function to be optimized f, monotone non-increasing function h with properties as given in equation 3

1  r ← 1
2  while stop conditions not met do
3      Add a new local search algorithm instance Ar, step it once and initialize its variables vr, xr and fr
4      For i = 1, ..., r, select algorithm Ai if there exists c > 0 so that: fi + c·h(vi) > fj + c·h(vj) for all j = 1, ..., r so that (vi, fi) ≠ (vj, fj). If there are multiple algorithms with identical v and f, then select only one of them at random.
5      Step each selected Ai and update its variables vi, xi and fi.
6      Find the best instance: b = argmax_{i=1,...,r}(fi).
7      Update the best solution: x* ← xb.
8      r ← r + 1
9  return x*

MetaMax and MetaMax(∞) differ only in one point (lines 6 and 7 in algorithm 5): if, after stepping all selected instances, the best instance is a different one than in the previous round, MetaMax will step it until it overtakes the old best instance in terms of used resources.

In [GK11] it is shown that MetaMax asymptotically approaches the performance of its best local search algorithm instance as the number of rounds increases. Theoretical analysis suggests that the number of instances increases at a rate of Ω(√vt), where vt is the total number of used function evaluations. However, practical results give a rate of growth of only Ω(vt/log vt). Based on this, it can also be estimated ([GK11], page 439) that to find the global optimum x_opt, MetaMax needs only a logarithmic factor more function evaluations than a local search algorithm instance which would start in the basin of attraction of x_opt.

Algorithm 5: MetaMax

input: function to be optimized f, monotone non-increasing function h with properties as given in equation 3

1  r ← 1
2  while stop conditions not met do
3      Add a new local search algorithm instance Ar, step it once and initialize its variables vr, xr and fr
4      For i = 1, ..., r, select algorithm Ai if there exists c > 0 so that: fi + c·h(vi) > fj + c·h(vj) for all j = 1, ..., r so that (vi, fi) ≠ (vj, fj). If there are multiple algorithms with identical v and f, then select only one of them at random.
5      Step each selected Ai and update its variables vi, xi and fi.
6      Find the best instance: br = argmax_{i=1,...,r}(fi).
7      If br ≠ br−1, step instance A_br until v_br ≥ v_{br−1}
8      Update the best solution: x* ← x_br.
9      r ← r + 1
10 return x*

Note a small difference in the way MetaMax and MetaMax(∞) are described in algorithms 5 and 4 from their descriptions in [GK11]. There, a new algorithm instance Ar is added with fr = 0 and sr = 0, and takes at most one step during the round in which it is added. This is possible because in [GK11] a non-negative objective function f and a maximization task are assumed. Therefore, an algorithm instance can be added without taking any steps first, and assigned a function value fr = 0, which is guaranteed not to be better than any of the function values of the other instances.

We are, however, dealing with a minimization problem with a known target value (see [Han+13b]) but no upper bound on f and, consequently, no worst possible value of f. Therefore, we made a small change and step the new instance Ar immediately after it is added. It can then also be stepped a second time, if selected during step 4 of algorithms 5 and 4. We believe that this has no significant impact on performance.

3.1 Suggested modifications

MetaMax and MetaMax(∞) will add a new instance each round for as long as they are running, with no limit on the maximum number of instances. The authors of [GK11] state that the worst-case computational overhead of MetaMax and MetaMax(∞) is O(r²), where r is the number of rounds. For the purpose of optimizing functions where each function evaluation uses up a large amount of computational time (for which MetaMax was primarily designed), the overhead will be negligible compared to the time spent calculating function values and will not present a significant problem. However, in comparison with restart strategies, which typically have almost no overhead, this is still a disadvantage for MetaMax. Therefore, it would be desirable to come up with some mechanism that would improve its computational complexity.

An obvious solution would be to limit the total number of instances which can be added, or to slow down the rate at which they are added, so that there will never be too many of them. However, this would make MetaMax and MetaMax(∞) behave basically in the same way as MetaMax(k) and lose their main property, which is the consistency based on always generating new instances.
A better solution would be to add a mechanism which would discard one of the already existing instances every time a new one is added, and therefore keep the total number of instances at any given time constant. The important question is: which one of the existing instances should be discarded?

We propose the following approach: discard the instance which has not been selected for the longest time. If there are multiple instances which qualify, discard the one with the worst function value. The rationale behind this discarding mechanism is that MetaMax most often selects (allocates the most resources to) those instances which have the best optimistic estimate of convergence. Therefore, the instances which are selected the least often will likely not give very good results in the future, and so make good candidates for deletion. An alternative method may be to discard the absolute worst instance (in terms of the best objective function value found so far), which is even simpler, but we feel that it does not follow so naturally from the principles behind MetaMax. Therefore, for most of our experiments we will use the discarding of the least selected instances.
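
A sketch of the proposed discarding rule, under the minimization convention used in our experiments; last_selected_round is a bookkeeping map (instance → round of last selection) that a modified MetaMax loop would have to maintain:

def instance_to_discard(instances, last_selected_round):
    """Instance to drop when a new one is added: the least recently selected
    one, with ties broken by the worst (largest) best-so-far value."""
    return max(instances,
               key=lambda a: (-last_selected_round[a], a.f))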
Another area where we think it might be beneficial to modify the workings of MetaMax is the mechanism of selecting instances to be stepped in each round. The original mechanism has two possible disadvantages. Firstly, it is not invariant to monotone transformations of the objective function values. By this we mean a mapping f(x) → f'(x) which is itself only a function of the value of f(x) and not of the parameter vector x. The monotone property means that if f(x1) < f(x2) then f'(x1) < f'(x2) for all possible x1 and x2. Such a monotone transformation will not change the location of the optima of f(x). It will also not change the direction of the gradient of f(x) for any x, but not necessarily its magnitude. An example of such a transformation is given in figure 2.

Logically, it would not make much sense to require an optimization algorithm to be invariant to an objective function value transformation which is not monotone, as it could change the position of the function's optima.
The second possible disadvantage of the convex hull based instance selection mechanism is that it also behaves differently based on the choice of the estimate function h. This is not as great a disadvantage as the first one, because f(x) is given, while h can be chosen freely. However, it would still be beneficial if we could entirely remove the need to choose h.

To overcome these problems, we propose a new instance selection mechanism. It uses the same representation of local search algorithm instances as a set of points P, given in equation 4, but it selects those instances which correspond to non-dominated points of P, in the sense of maximizing fi and maximizing h(vi) (or, analogously, maximizing fi and minimizing vi). This method is clearly invariant to both monotone transformations of objective function values f → f' and different choices of h, as determining the non-dominated points depends only on their ordering along the axes fi and h(vi), which will always be preserved due to the fact that both f → f' and h are monotone. Moreover, the points which lie on the upper right convex hull of P, and thus maximize the optimistic estimate fi + c·h(vi), are always non-dominated, and thus will always be selected.
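
A sketch of this selection mechanism; a quadratic scan is enough for illustration. Because only order comparisons are used, the result is unchanged by any monotone transformation of the fi values or any valid choice of h, which is exactly the claimed invariance.

def select_non_dominated(points):
    """Indices of non-dominated points of P, maximizing both coordinates.

    points: list of (h(v_i), f_i) pairs. A point is kept unless some other
    point is at least as good in both coordinates and better in one.
    """
    selected = []
    for i, (hi, fi) in enumerate(points):
        dominated = any(hj >= hi and fj >= fi and (hj > hi or fj > fi)
                        for j, (hj, fj) in enumerate(points) if j != i)
        if not dominated:
            selected.append(i)
    return selected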

Figure 2: Example of monotone transformation of f(x)
Displays a 3D mesh plot of a Rastrigin-like function f(x) in the top left, the transformed function f(x)³ in the top right, and their respective contour plots on the bottom. It is clear that the shape of the contours is the same, but their heights are not.
A possible disadvantage of the proposed mechanism is that at each round it selects many more points than the original convex hull mechanism. This might result in selecting instances with a low convergence estimate too often, and not dedicating enough resources to the more promising ones. A visual comparison of the two selection mechanisms, and a demonstration of the influence of the choice of estimate function upon selection, are presented in figure 3.

Figure 3: MetaMax selection mechanisms
Compares the original selection mechanism based on finding the upper convex hull (left sub-figures) with the newly proposed mechanism based on selecting non-dominated points (right sub-figures). Also demonstrates the effects of a monotone transformation of the objective function values on the selection, with f(x) for the upper sub-figures and f(x)³ for those on the bottom. Selected points are marked as red diamonds, connected by a red line. Unselected points are marked as filled black circles.

4 Experimental setup
All of the experiments were conducted using the COCO (Comparing continuous optimizers) framework [Han13a], which is an open-source set of tools for systematic evaluation and comparison of real-parameter optimization strategies. It provides a set of 24 benchmark functions of different types, chosen to thoroughly test the limits and capabilities of optimization algorithms. Also included are tools for running experiments on these functions and for logging, processing and visualising the measured data. The library for running experiments is provided in versions for C, Java, R, Matlab and Python. The post-processing part of the framework is available for Python only.

The benchmark functions are divided into 6 groups according to their properties. They are briefly described in table 1. For a detailed description, please see [Han+13a]. There are also multiple instances defined for each function, which are created by applying various transformations to the base formula.
Name    Functions   Description
separ   1-5         Separable functions
lcond   6-9         Functions with low or moderate conditionality
hcond   10-14       Unimodal functions with high conditionality
multi   15-19       Multimodal structured functions
mult2   20-24       Multimodal functions with weak global structure

Table 1: Benchmark function groups

We shall now briefly explain some of the function properties mentioned in table 1. As already mentioned, the terms unimodal and multimodal refer to functions with a single optimum and multiple local optima respectively.
Conditionality describes how much the function's gradient changes depending on direction. Simply put, functions with high conditionality (also called ill-conditioned functions) at certain points grow rapidly in some directions but slowly in others. This often means that the gradient points away from the local optimum, which presents a difficult problem for some local search algorithms. To give a more visual description, one can imagine that 3D graphs of two-dimensional ill-conditioned functions usually form sharp ridges, while those of well-conditioned functions form gentle round hills.

Separable functions have the following form: f(x1, x2, ..., xd) = f1(x1) + f2(x2) + ... + fd(xd), which means that they can be minimized by minimizing d one-dimensional functions, where d is the number of dimensions of the separable function.

In order to exhaustively evaluate the performance of the selected strategies, we decided to make the following series of measurements for each strategy:

1. Using four different local search algorithms - compass search, the Nelder-Mead method, BFGS and CMA-ES - in order to evaluate the effect of algorithm choice.

2. Using all of the 24 noiseless benchmark functions available in the COCO framework, to measure performance on a wide variety of different problems.

3. Using the following dimensionalities: d = 2, 3, 5, 10, 20, to see how much the performance is affected by the number of dimensions.

4. Using the first fifteen instances of each function. According to [Han+13b], this number is sufficient to provide statistically sound data.

The resource budget for minimizing a single function instance (a single trial) was set to 10^5·d, meaning 100000 times the number of dimensions of the instance.
The reasons for choosing the four local search algorithms are as follows: the compass search algorithm was chosen for its simplicity, in order to allow us to evaluate whether MetaMax can improve the performance of such a basic algorithm. The Nelder-Mead method was chosen as a more sophisticated representative of the group of pattern search algorithms than compass search. BFGS was selected as a typical line search method. Finally, CMA-ES is there to represent population based algorithms. It is also the most advanced of the four algorithms, and thus we expect that it will perform the best of the four selected algorithms. For a more detailed description of these algorithms, please see section A.

4.1 Used multi-start strategies
In this section, we describe the selected MetaMax and restart strategies, which were evaluated using the methods described above. For convenience, we assigned a shorthand name to each used strategy, so that we can write, for example, csa-h-10d instead of "objective function stagnation based restart strategy with history length 10d using the compass search algorithm", which is impractically verbose. The shorthand names have the following form: abbreviation of the used local search algorithm, dash, used multi-start strategy, dash, strategy parameters. A list of all used strategies and their shorthand names is given in table 3.
We chose two commonly used restart strategies to compare with MetaMax: a fixed restart strategy with a set number of resources allocated to each local search algorithm run, and a dynamic restart strategy with a restart condition based on objective function value stagnation.

The performance of these two strategies largely depends on the combination of the problem being solved and the strategy parameters. Therefore, we decided to use six fixed restart strategies and six dynamic function stagnation restart strategies with different parameters:

• Fixed restart strategies
  Run lengths: nf = 100d, 200d, 500d, 1000d, 2000d, 5000d evaluations
  Shorthand names: algorithm-f-nf

• Function value stagnation restart strategies
  Function value history lengths: hf = 2d, 5d, 10d, 20d, 50d, 100d evaluations
  Function value tolerance: tf = 10^−10
  Shorthand names: algorithm-h-hf

Note, that the parameters depend on the number of dimensions of the measured
function d. This is consistent with the fact that the total resource budget of the
strategy also depends on d and that we can expect that for higher dimensionalities,
the used local search algorithms will need longer runs to converge.
The rationale behind choosing the used parameter values is the following: Using
5
the function evaluation budget of 10 d, run lengths longer than 5000d would give us
less than 20 restarts per trial. This would result in a very low chance of nding the
global optimum on most of the benchmark functions, some of which can have up to
10d optima. Also, it is probable that most local search algorithms will converge a
long time before using up all 50000d function evaluations and then the rest of the
allocated resources would be essentially wasted on running an instance which cannot
improve any more. Conversely, run lengths smaller than 100d are probably not long
enough to allow most local search algorithm instances to converge and so there would
be little sense in using them.
The choice of the upper bound of the function value history length

hf

as 100d

is based on a similar idea: For values greater than 100d the restart condition would
trigger too long after the local search algorithm has already converged, and so we
would be needlessly wasting resources on it. The choice of the lower bound of
depends on the used algorithm. For a restart strategy to function properly,

18

hf

hf

has to
be greater, or at least as much, as the number of function evaluations that the used
local search algorithm uses during one step. The above stated value of

hf

= 2d is

the minimal value for which the Nelder-Mead and BFGS algorithms work properly.
For the other two algorithms, the minimal value is

hf

= 5d. We decided to base the

function value history length on number of used function evaluations, rather than on
number of taken steps, because it allows for a more direct comparison of performance
of the same strategy using two dierent algorithms.
Choosing the value of the function stagnation tolerance tf involved a little more guesswork. There is a target function value defined for all of the benchmark functions, which is equal to the function value at their global optimum f(x_opt) plus a tolerance value ftol = 10^-8. That is, the function instance is considered to be solved if we find some point x with f(x) ≤ f(x_opt) + ftol. We based our choice of tf = 10^-10 on ftol. Setting the value of the function stagnation tolerance parameter tf one hundred times lower than ftol should make it large enough to reliably detect convergence, while not being so large as to trigger the reset condition prematurely, when the local search algorithm is still converging.
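To make the mechanism concrete, a minimal sketch of such a stagnation test follows; the helper and its deque-based history are our own illustrative choices, not the thesis implementation:

    from collections import deque

    def make_stagnation_test(hf, tf=1e-10):
        """Returns a test that signals a restart when the best-so-far
        objective value improved by less than tf over the last hf
        function evaluations (hf = history length, tf = tolerance)."""
        history = deque(maxlen=hf + 1)  # best value at each of the last hf+1 evaluations

        def should_restart(best_so_far):
            history.append(best_so_far)
            if len(history) <= hf:
                return False  # history window not filled yet
            # improvement over the window (minimization: best value decreases)
            return history[0] - history[-1] < tf

        return should_restart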
The goal of using multiple strategies with different parameter values is to have, for each measured dimensionality, at least one fixed restart and one objective function value stagnation based strategy that performs well on the set of all functions.
For easier comparison of the results of the fixed restart strategies, we represent them all together by choosing only the results of the best performing strategy for each dimensionality and collecting them into a best of collection of results, which we will refer to by the shorthand name algorithm-f-comb. This represents the results of running a fixed restart strategy which is able to choose the optimal run length (from the set of six used run lengths) based on the dimensionality of the function being solved. The results of the objective function value stagnation strategies are represented in an analogous way, under the name algorithm-h-comb.

Besides the already mentioned restart strategies, we decided to add four more, each based on a restart condition specific to one of the used local search algorithms. Shorthand names for these strategies are algorithm-special. They are described in table 2.
In order to save computing time, and as per the recommendation in [Han+13b], we used an additional termination criterion that halts the execution of a restart strategy after 100 restarts, even if the resource budget has not yet been exhausted and the solution has not been found. This does not impact the accuracy of the measurements, as 100 restarts is enough to provide a statistically significant amount of data, and the metrics which we use (see subsection 4.2) are not biased against results of runs which did not use up the entire resource budget. In fact, the fixed restart strategies f-100d, f-200d and f-500d always reach 100 restarts before they can fully exhaust their resource budgets.
The idea of using the original pure versions of the MetaMax and MetaMax(∞) algorithms, which keep adding local search algorithm instances without limit, proved to be impractical due to its excessive computational resource requirements (for the length of experiments that were planned). Therefore, we performed measurements using only the modified versions of MetaMax and MetaMax(∞) with the added mechanism (described in subsection 3.1) for limiting the maximum number of instances.
Compass search: Restart when the variable a, which affects how far from the current solution the algorithm generates neighbour solutions, decreases below 10^-10. It naturally decreases as the algorithm converges, so checking its value makes for a good restart condition.

Nelder-Mead: We chose a condition similar to the one mentioned above. A restart is triggered when the distance between the two points of the simplex which are the farthest apart from each other decreases below 10^-10. The rationale is similar: the simplex keeps growing smaller as the algorithm converges. It might be more mathematically proper to check the area (or volume, or hyper-volume, depending on the dimensionality) of the simplex, but we discarded this idea out of concern that it might be too computationally intensive.

BFGS: The restart condition is triggered if the norm of the gradient is smaller than 10^-10. Since the algorithm already uses information about the gradient, it makes sense to also use it for detecting convergence.

CMA-ES: The recommended settings for CMA-ES given in [Han11] suggest using 9 different restart conditions. Here we use these recommended settings. Note that when using CMA-ES with the other restart strategies, we use only a single restart condition and the additional ones are disabled. In a sense, we are not using the algorithm to its full potential, but this allows for a more direct comparison with the other local search algorithms.

Table 2: Algorithm specific restart strategies

For all MetaMax strategies, we used the recommended form of the estimate function: h(v_i, v_t) = e^(-v_i/v_t).
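For reference, the estimate function itself is a one-liner; a minimal sketch, with argument names following the symbols above:

    import math

    def estimate(v_i, v_t):
        """Recommended MetaMax estimate function h(v_i, v_t) = exp(-v_i / v_t)."""
        return math.exp(-v_i / v_t)

Measurements were performed using the following MetaMax strategies: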
1. MetaMax(k), with k=20, k=50 and k=100. This gives the same total number of local search algorithm instances as when using fixed restart strategies with run lengths equal to 5000d, 2000d and 1000d respectively (see the worked example below). This makes it possible to evaluate the degree to which the MetaMax mechanism of selecting the most promising instances improves the performance over these corresponding restart strategies. The expectation is that the success rate for MetaMax(k) will not increase, because the number of instances, and thus the ability to explore the search space, stays the same. However, MetaMax(k) should converge faster than the fixed restart strategies, because it should be able to identify the best instances and allocate resources to them appropriately.

2. MetaMax and MetaMax(∞) with the maximum number of instances set to 100. This should allow us to assess the benefits of the mechanism of adding new instances (and deleting old ones), by comparing the results with MetaMax(k), which uses the same number of instances each round, but does not add or delete any. Here we would expect an increase in success rate on multimodal functions, as the additional instances, generated each round, should allow the algorithms to explore the search space more thoroughly. However, the limit of 100 instances will possibly still not be enough to get a good success rate on multimodal problems with high dimensionality.

3. MetaMax and MetaMax(∞) with the maximum number of instances set to 50d. This should allow the algorithms to scale better with the number of dimensions and, hopefully, further improve their performance. The number 50d was chosen as a reasonable compromise between computation time and expected performance. We expect to get the best results here.
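The correspondence between the values of k and the fixed restart run lengths in point 1 follows directly from the 10^5·d evaluation budget:

    k = (10^5·d)/(5000d) = 20,   (10^5·d)/(2000d) = 50,   (10^5·d)/(1000d) = 100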

Shorthand names for the MetaMax variants were chosen as algorithm-k-X for MetaMax(k), algorithm-m-X for MetaMax and algorithm-i-X for MetaMax(∞), where X is the maximum allowed number of instances (or, equivalently, the value of k for MetaMax(k)).

Fixed restart strategies
  f-100d   Run length = 100d evaluations
  f-200d   Run length = 200d evaluations
  f-500d   Run length = 500d evaluations
  f-1000d  Run length = 1000d evaluations
  f-2000d  Run length = 2000d evaluations
  f-5000d  Run length = 5000d evaluations
  f-comb   Combined fixed restart strategy

Function value stagnation restart strategies
  h-2d     History length = 2d evaluations
  h-5d     History length = 5d evaluations
  h-10d    History length = 10d evaluations
  h-20d    History length = 20d evaluations
  h-50d    History length = 50d evaluations
  h-100d   History length = 100d evaluations
  h-comb   Combined function value stagnation restart strategy

Other restart strategies
  special  Special restart strategy specific to each algorithm, see table 2

MetaMax variants
  k-20     MetaMax(k) with k=20
  k-50     MetaMax(k) with k=50
  k-100    MetaMax(k) with k=100
  k-50d    MetaMax(k) with k=50d
  m-100    MetaMax with maximum number of instances = 100
  m-50d    MetaMax with maximum number of instances = 50d
  i-100    MetaMax(∞) with maximum number of instances = 100
  i-50d    MetaMax(∞) with maximum number of instances = 50d

Table 3: Tested multi-start strategies
There is a number of additional interesting aspects of the MetaMax variants which would be worth testing and evaluating. For example:

• Comparison of MetaMax and MetaMax(∞) with the limit on the maximum number of instances and without it.
• Performance of different methods of discarding old instances.
• Influence of different choices of the estimate function on performance.
• Performance of our proposed alternative method for selecting instances.

However, it was not practically possible (mainly time-wise) to perform full sized (10^5·d function evaluation budget) experiments which would test all of these features. Therefore, we decided to make a series of smaller measurements, with the maximum number of function evaluations per trial set to 5000d, using only dimensionalities d=5, d=10 and d=20, and using only the BFGS algorithm. This should allow us to test these features at least in a limited way and see if any of them warrant further attention. More specifically, we made the following series of measurements:

1. MetaMax and MetaMax(∞) without a limit on the maximum number of instances
2. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the most inactive instances
3. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the worst instances
4. MetaMax(k) with k=5d, k=10d and k=20d

These measurements were repeated three times: the first time using the recommended form of the estimate function h_1(v_i, v_t) = e^(-v_i/v_t), the second time with a simplified function h(v_i) = e^(v_i), and the third time using the proposed alternative instance selection method, based on selecting non-dominated points.

4.2 Used metrics

In this section, we describe the various metrics that were used to compare the results of different strategies. The simplest one is the success rate. For a set of trials U (usually of one strategy running on one or more benchmark functions) and a chosen target value t, it can be defined as:

    SR(U, t) = |{u ∈ U : f_best(u) ≤ t}| / |U|    (6)

where |U| is the number of trials and |{u ∈ U : f_best(u) ≤ t}| is the number of trials which have found a solution at least as good as t. In the rest of this text we use a mean success rate, averaged over a set of target values T:

    SR_m(U, T) = (1/|T|) Σ_{t∈T} SR(U, t)    (7)
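A minimal Python sketch of these two metrics (our own illustrative helpers, with each trial represented here by its best found value f_best(u)):

    def success_rate(trials, t):
        """SR(U, t): fraction of trials whose best found value reached target t."""
        return sum(1 for f_best in trials if f_best <= t) / len(trials)

    def mean_success_rate(trials, targets):
        """SR_m(U, T): success rate averaged over a set of target values."""
        return sum(success_rate(trials, t) for t in targets) / len(targets)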
The main metric used in the COCO framework is the expected running time, or ERT. It estimates the expected number of function evaluations that a selected strategy will take to reach a target function value t for the first time, over a set of trials U. It is defined as:

    ERT(U, t) = (1 / |{u ∈ U : f_best(u) ≤ t}|) Σ_{u∈U} evals(u, t)    (8)

where evals(u, t) is the number of function evaluations used by trial u to reach target t, or the total number of evaluations used by u if it never reached t. The expression |{u ∈ U : f_best(u) ≤ t}| is the number of successful trials for target t. If there were no such trials, then ERT(U, t) = ∞. In the rest of this text we will use ERT averaged over a set of target values T, in a similar way to what is described in equation 7. We will also usually compute it using a set of trials obtained by running the same strategy on multiple different functions, usually all functions in one of the function groups described in table 1.
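The definition translates directly into code; a hedged sketch, where each trial is given as a pair (f_best, evals) with evals already equal to evals(u, t) for the chosen target:

    import math

    def ert(trials, t):
        """ERT(U, t): total evaluations spent across all trials, divided by
        the number of successful trials; infinity if no trial reached t.
        `trials` is a list of (f_best, evals) pairs, where evals is the
        evaluations needed to first reach t, or the trial's total spent
        budget if it never did."""
        successes = sum(1 for f_best, _ in trials if f_best <= t)
        if successes == 0:
            return math.inf
        total_evals = sum(evals for _, evals in trials)
        return total_evals / successes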

For comparing two or more strategies in terms of success rates and expected running times, we use graphs of the empirical cumulative distribution function of run lengths, or ECDF. Such a graph displays on the y-axis the percentage of trials for which ERT (averaged over a set of target values T) is lower than the number of evaluations x, where x is the corresponding value on the x-axis. It can also be said that, for each x, it shows the expected average success rate if a function evaluation budget equal to x was used. For easier comparison of ECDF graphs across different dimensionalities, the values on the x-axis are divided by the number of dimensions. The function displayed in the graph can then be defined as:

    y(x) = (1 / (|T||U|)) Σ_{u∈U} |{t ∈ T : ERT(t, u)/d ≤ x}|    (9)

An example ECDF graph, like the ones used throughout the rest of the text, is given in figure 4. It shows ERTs of two sets of trials measured by running two different strategies on the set of all benchmark functions, for d=10 and averaged over a set of 50 target values. The target values are logarithmically distributed in the interval [10^-8; 10^2]. We use this same set of target values in all our ECDF graphs.
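A sketch of how a single point of such a graph could be computed under our reading of equation 9 (the trial representation is our own assumption):

    import numpy as np

    # The 50 logarithmically distributed target values used in all ECDF graphs.
    targets = np.logspace(-8, 2, 50)

    def ecdf_point(trials, x, d):
        """y(x): the proportion of (trial, target) pairs for which the
        trial's running time, divided by the dimension d, is at most x.
        Each trial is a function t -> evaluations needed to reach target t
        (math.inf when the target was never reached)."""
        hits = sum(1 for evals in trials for t in targets if evals(t) / d <= x)
        return hits / (len(targets) * len(trials))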
The marker × denotes the median number of function evaluations of unsuccessful trials, divided by the number of dimensions. Values to the right of this marker are (mostly) estimated using bootstrapping (for details of the bootstrapping method, please refer to [Han+13b]). The fact that we use 15 trials for each strategy-function pair means that the estimate is reliable only up to about fifteen times the number of evaluations marked by ×. This fact should be kept in mind when evaluating the results. The thick orange line in the plot represents the best results obtained during the 2009 BBOB workshop for the same set of problems and is provided for reference.
[Figure 4: Example ECDF graph - comparison of the results of MetaMax(k), with k=100, using the BFGS and Nelder-Mead local search algorithms (bfgs-k-100, nm-k-100), on the set of all benchmark functions, d=10. The strategy using BFGS clearly outperforms the other one, both in terms of success rate and speed of convergence.]

Since we are dealing with a very large amount of measured results, it would be desirable to have a method of comparing them that is even more concise than ECDF graphs. To this end, we use a metric called the aggregate performance index (API), defined by Mr. Pošík in a yet unpublished (at the time of writing this text) article [Poš13]. It is based on the idea that the ECDF graph of the results of an ideal strategy, which solves the given problem instantly, would be a straight horizontal line across the top of the plot. Conversely, for the worst possible strategy imaginable, the graph would be a straight line along the bottom. It is apparent that the area above (or below) the graph makes for quite a natural measure of the effectiveness of different strategies. Given a set of ERTs A, their aggregate performance index can be computed as:

    API(A) = exp((1/|A|) Σ_{a∈A} log10(a))    (10)

For the purposes of computing API, the ERTs of unsuccessful trials, which are by definition ∞, have to be replaced with a value that is higher than the ERT of any successful trial. The choice of this value determines how much the unsuccessful trials are penalized and thus affects the final API score. For our purposes, we chose the value 10^8·d.
Since we compute API from the area above the graph, the lower its value, the better the corresponding strategy performs. Using API essentially allows us to represent the results of a set of trials by a single number and to easily compare the performances of different optimization strategies.
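A sketch of the API computation, including the 10^8·d replacement value for unsuccessful trials:

    import math

    def api(erts, d, penalty=1e8):
        """Aggregate performance index: exp of the mean log10 of the ERTs.
        Infinite ERTs (unsuccessful trials) are replaced by the penalty
        value 10^8 * d before aggregation; lower API is better."""
        finite = [a if math.isfinite(a) else penalty * d for a in erts]
        return math.exp(sum(math.log10(a) for a in finite) / len(finite))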

4.3 Implementation details
The software side of this project was implemented mostly in Python, with parts in C. The original plan was to write the project purely in Python, which was chosen because of its ease of use and the availability of many open-source scientific and mathematical libraries. However, during the project it was found that pure Python code performs too slowly and would not allow us to make all the necessary measurements. Therefore, parts of the program had to be rewritten in C, which improved performance to a reasonable level.
The used implementations of the BFGS and Nelder-Mead algorithms are based on code from the open-source Scipy library. They were modified to allow running the algorithms in single steps, which is necessary in order for them to work with MetaMax. An open-source implementation of CMA-ES was used, available at [Han13b]. The implementation of MetaMax was written based on its description in [GK11]. It was, however, necessary to make several small changes to it, mainly because it is designed with a maximization task in mind but we needed to use it for minimization problems. For finding upper convex hulls we used Andrew's algorithm, with some additional pre- and post-processing to get the exact behaviour described in [GK11].
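For illustration, the upper-hull part of Andrew's monotone chain algorithm is a short routine; this sketch omits the pre- and post-processing mentioned above:

    def upper_convex_hull(points):
        """Upper hull of a set of 2-D points via Andrew's monotone chain:
        sweep the points in order of increasing x, keeping only clockwise
        (right) turns on a stack."""
        def cross(o, a, b):
            # z-component of (a - o) x (b - o); >= 0 means a non-right turn
            return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

        hull = []
        for p in sorted(points):
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
                hull.pop()
            hull.append(p)
        return hull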
For a description of the source code, please see the file source/readme.txt on the attached CD.

5 Results

In this section we will evaluate the results of the selected multi-start strategies. We decided to split the results into four subsections based on the used local search algorithm. We present and compare the results mainly using tables, which list APIs and success rates for different groups of functions and different dimensionalities. For convenience, the best results are highlighted with bold text. We also show ECDF graphs to illustrate particularly interesting results. Results of the smaller experiments described at the end of section 4.1, and results of the timing measurements, are summarized in subsection 5.5.
The values of success rates and APIs shown in this section are computed only using data bootstrapped up to the value of 10^5·d function evaluations. In our opinion, these values represent the real performance of the selected strategies better than if we were to use fully bootstrapped data, which are estimated to a large degree and therefore not so statistically reliable. In ECDF graphs, bootstrapped results are shown up to 10^7·d evaluations. All of the APIs and success rates are averaged over a set of multiple targets, as described in subsection 4.2.
The measured data are provided in their entirety on the attached CD (see section B) in the form of tarballed Python pickle files, which can be processed using the BBOB post-processing framework. It was not possible to provide the data in their original form, as text files, because their total size would be in the order of gigabytes, which would clearly not fit on the attached medium.

5.1 Compass search

Table 4 summarizes which of the used fixed restart and function value stagnation restart strategies were best for each dimensionality and were chosen for the best-of result collections cs-f-comb and cs-h-comb. Table 5 then compares these two sets of results together with the results obtained by the compass search specific restart strategy cs-special.
It is apparent that for the best strategies the values of run length and function value history length increase with the number of dimensions. This is not unexpected, as compass search uses 2d or 2d-1 function evaluations at each step.
Dimensionality   Fixed        Stagnation based
d=2              cs-f-100d    cs-h-5d
d=3              cs-f-100d    cs-h-5d
d=5              cs-f-200d    cs-h-10d
d=10             cs-f-500d    cs-h-10d
d=20             cs-f-500d    cs-h-20d

Table 4: Compass search - best restart strategies for each dimensionality
The comparison of the best restart strategies suggests that all of them have quite similar overall performance, with cs-h-comb being a little better than the others in terms of success rate and cs-f-comb in terms of API. In the subsequent tables, we will provide the results of cs-f-comb for reference, as an example of a well tuned restart strategy.
None of the strategies performs very well on multimodal and highly conditioned functions. This is to be expected, as the compass search algorithm is known to have trouble with ill-conditioned problems, and multimodal problems are difficult to solve for any algorithm.

[Table 5: Compass search - results of restart strategies. log10 API and success rate [%] for cs-f-comb, cs-h-comb and cs-special, per function group (all, separ, lcond, hcond, multi, mult2) and dimensionality (2D, 3D, 5D, 10D, 20D).]
A comparison of the results of the three MetaMax(k) strategies with the corresponding fixed restart strategies, which use the same total number of local search algorithm instances, is given in table 6. The results confirm our expectations and show that, overall, MetaMax(k) converges faster than a comparable fixed restart strategy. The only exception is the group separ. This can be explained by the fact that functions from this group are very simple and can generally be solved by a single, or only very few, runs of the local search algorithm. In this case, the MetaMax mechanism of selecting multiple instances each round is more of a hindrance than a benefit.
In terms of success rate, MetaMax(k) is always as good as or even better than the comparable fixed restart strategy, with the improvement being especially obvious on the groups lcond and mult2. Of the three tested variants of MetaMax(k), cs-k-100 is the best overall. However, it is not better than a well tuned restart strategy like cs-f-comb.

Figure 5 shows a behaviour which was observed across all function groups and dimensionalities when comparing MetaMax(k) with the corresponding fixed restart strategies: at first, MetaMax(k) converges much more slowly than the restart strategy, as it is still in the phase of initialising all of its instances. However, as soon as this is finished, it starts converging quickly and overtakes the restart strategy for a certain interval. After that, its rate of convergence slows down again, and it ends up with a success rate (for 10^5·d function evaluations) similar to that of the restart strategy. This effect seems to get less pronounced with an increasing number of dimensions.

[Figure 5: Compass search - ECDF comparing MetaMax(k) (cs-k-50) with an equivalent fixed restart strategy (cs-f-2000d) on all benchmark functions, d=5.]
The results of comparing cs-k-100, cs-m-100 and cs-i-100 are shown in table 7. It is apparent that, using the same number of instances at a time, MetaMax and MetaMax(∞) clearly outperform MetaMax(k) on all function groups, both in terms of speed of convergence and success rates.
[Table 6: Compass search - results of MetaMax(k) and corresponding fixed restart strategies. log10 API and success rate [%] for cs-f-1000d, cs-f-2000d, cs-f-5000d, cs-k-20, cs-k-50, cs-k-100 and cs-f-comb, per function group (all, separ, lcond, hcond, multi, mult2) and dimensionality (2D-20D).]
In general, they also provide results at least as good as, or better than, the best restart strategies. There is almost no difference between the performance of cs-m-100 and cs-i-100, which corresponds with the results presented in [GK11]. The differences in performance seem to diminish with increasing dimensionality and, for d=10 and d=20, all of the MetaMax strategies which use 100 instances perform almost the same.
The ECDF graph in figure 6 shows an interesting behaviour, where cs-m-100 and cs-i-100 start converging right away and overtake cs-k-100 while it is still in the process of initializing all of its instances. After that, MetaMax(k) catches up and, for a certain interval, all of the strategies perform the same. Then MetaMax(k) stops converging, the other two strategies overtake it again and ultimately achieve better success rates. The sudden stop in MetaMax(k) convergence presumably happens when all of its best instances have already found their local optima, after which there is no possibility of finding better solutions without adding new instances, which MetaMax(k) cannot do.

[Figure 6: Compass search - ECDF of MetaMax variants using 100 instances (cs-m-100, cs-i-100, cs-k-100) on all benchmark functions, d=5.]
In the next set of measurements, using cs-m-50d, cs-i-50d and cs-k-50d, it became apparent that the increased limit on the maximum number of instances does not cause any noticeable increase in performance for MetaMax and MetaMax(∞). The performance of MetaMax(k) was somewhat improved, but overall it is still worse than the other two MetaMax variants and slightly worse than the best restart strategies. These results are also presented in table 7.
The ECDF graph in figure 7 shows the results of cs-k-50d and cs-m-50d compared with the collection of best fixed restart strategy results cs-f-comb. We have omitted cs-i-50d, as its performance is very similar to that of cs-m-50d.
In conclusion, we can say that, using the compass search algorithm, MetaMax and MetaMax(∞) perform better than even well tuned restart strategies, and that increasing the maximum number of allowed instances does not have any significant effect on their performance.

[Table 7: Compass search - results of MetaMax strategies. log10 API and success rate [%] for cs-k-100, cs-m-100, cs-i-100, cs-k-50d, cs-m-50d, cs-i-50d and cs-f-comb, per function group (all, separ, lcond, hcond, multi, mult2) and dimensionality (2D-20D).]

[Figure 7: Compass search - ECDF of MetaMax variants using 50d instances (cs-m-50d, cs-k-50d) compared with cs-f-comb on all benchmark functions, d=10.]

5.2 Nelder-Mead method

The best restart strategies for each dimensionality are listed in table 8 and their results are compared in table 9.
For the fixed restart strategies we see the expected behaviour, where the run lengths of the best strategies increase with the number of dimensions. However, there seem to be only two best objective function stagnation based strategies: nm-h-10d and nm-h-100d. Interestingly enough, the switch between them occurs between d=5 and d=10, which is also the point where the overall performance of the Nelder-Mead algorithm decreases dramatically.
Dimensionality   Fixed         Stagnation based
d=2              nm-f-100d     nm-h-10d
d=3              nm-f-100d     nm-h-10d
d=5              nm-f-500d     nm-h-10d
d=10             nm-f-1000d    nm-h-100d
d=20             nm-f-5000d    nm-h-100d

Table 8: Nelder-Mead - best restart strategies for each dimensionality
The algorithm performs very well for a low number of dimensions (d=2, d=3 and to some extent also d=5), with the results for these dimensionalities approaching those of the best algorithms from the 2009 BBOB conference. On the other hand, the performance for higher dimensionalities is very poor, especially on the group hcond. The three best-of restart strategies, compared in table 9, are all quite evenly matched, with nm-special being the best overall by a small margin and nm-f-comb being the worst.
The comparison of MetaMax(k) with the corresponding fixed restart strategies, given in table 10, shows that MetaMax(k) performs better on multimodal functions and worse on the other function groups.

[Table 9: Nelder-Mead - results of restart strategies. log10 API and success rate [%] for nm-f-comb, nm-h-comb and nm-special, per function group (all, separ, lcond, hcond, multi, mult2) and dimensionality (2D-20D).]
It is also apparent that increasing the number of used instances for MetaMax(k) leads to a higher overall success rate and faster convergence on multimodal problems, but slower convergence on ill-conditioned functions, as is apparent from the ECDF graph in figure 8.

[Figure 8: Nelder-Mead - ECDF comparing the MetaMax(k) strategies nm-k-20, nm-k-50 and nm-k-100 on the highly conditioned function group (f10-14), d=10.]

In fact, the performance of the tested MetaMax(k) strategies on ill-conditioned functions is worse than that of the corresponding restart strategies. This is the opposite of what was observed when using the compass search algorithm, and can be explained by the fact that the Nelder-Mead algorithm, unlike compass search, can handle ill-conditioned problems very quickly and with a high success rate (at least for low dimensionalities). Therefore, there is no need for the MetaMax mechanism of selecting multiple instances each round, as almost any instance is capable of finding the global optimum. Selecting more than one at the same time only serves to decrease the rate of convergence.
Overall, the three tested MetaMax(k) strategies perform only slightly better than the corresponding fixed restart strategies and are clearly worse than the best restart strategies, such as nm-special.

Table 11 shows the results of the other tested MetaMax strategies. Unfortunately, the measurements for all dimensionalities were not finished in time, before the deadline of this thesis; therefore table 11 contains only partial results for some strategies.
For the dimensionalities where the results of all the strategies are available, it is apparent that nm-m-100 and nm-i-100 outperform nm-k-100, both in terms of success rates and API. There are no significant differences in performance between the MetaMax variants using 100 and 50d local search algorithm instances, and no observable differences between the performance of MetaMax and MetaMax(∞).
In comparison with the restart strategy nm-special, MetaMax and MetaMax(∞) have better success rates on the function groups separ, multi and mult2 and, as a result, a better overall success rate. MetaMax and MetaMax(∞) also converge faster on multi and mult2, but are slower on lcond and hcond. The overall result is that they are better than the best restart strategy, in terms of API, for d=2 and d=3, but worse for d=5. Unfortunately, we cannot make comparisons for higher dimensionalities, where the results for MetaMax and MetaMax(∞) are not available. However, based on the fact that the advantage in performance of MetaMax over the restart strategy is lower in d=3 than in d=2, and that the restart strategy is better in d=5, we can extrapolate that MetaMax would likely also perform worse for higher dimensionalities. Even if there were an improvement in performance, the fact remains that the Nelder-Mead method performs so badly in higher dimensionalities that it is unlikely that MetaMax could improve it to a practical level.

[Table 10: Nelder-Mead - results of MetaMax(k) and corresponding fixed restart strategies. log10 API and success rate [%] for nm-f-1000d, nm-f-2000d, nm-f-5000d, nm-k-20, nm-k-50, nm-k-100 and nm-special, per function group (all, separ, lcond, hcond, multi, mult2) and dimensionality (2D-20D).]

[Table 11: Nelder-Mead - results of MetaMax strategies. log10 API and success rate [%] for nm-k-100, nm-m-100, nm-i-100, nm-k-50d, nm-m-50d, nm-i-50d and nm-special, per function group and dimensionality; entries for some strategies at d=10 and d=20 are missing.]
5.3 BFGS

The results of the restart strategies bfgs-f-comb, bfgs-h-comb and bfgs-special are shown in table 13. The best fixed and objective function stagnation based restart strategies for each dimension, which were used to make bfgs-f-comb and bfgs-h-comb, are listed in table 12.
Dimensionality   Fixed          Stagnation based
d=2              bfgs-f-100d    bfgs-h-2d
d=3              bfgs-f-100d    bfgs-h-2d
d=5              bfgs-f-200d    bfgs-h-2d
d=10             bfgs-f-1000d   bfgs-h-2d
d=20             bfgs-f-1000d   bfgs-h-2d

Table 12: BFGS - best restart strategies for each dimensionality

For the selected best fixed restart strategies, we see the ordinary behaviour where run lengths increase with dimensionality. However, for the stagnation based restart strategies, bfgs-h-2d is apparently the best for all dimensionalities. This is quite unusual but, in hindsight, not entirely unexpected. It has to do with the way our implementation of BFGS works: at the beginning of each step, the algorithm estimates the gradient of the objective function using the finite difference method. This involves evaluating the objective function at a set of neighbouring solutions which are very close to the current solution. The number of these neighbour solutions is always 2d: one for each vector in a positive orthonormal basis of the search space. A very quick way to detect convergence is to check whether the objective function values at these points are worse than the function value at the current solution. As it turns out, this is precisely what bfgs-h-2d does, and also the reason why it works so well.
In contrast with the surprisingly good results of bfgs-h-2d, the special restart strategy, which is based on monitoring the value of the norm of the estimated gradient, performs very poorly and is clearly the worst of all the tested restart strategies. The other two strategies, bfgs-h-comb and bfgs-f-comb, have a very similar performance, with bfgs-h-comb being slightly better.
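A sketch of the convergence test just described, under our reading of the implementation (the helper is illustrative; in the actual strategy the test effectively falls out of the recorded evaluation history rather than extra evaluations):

    import numpy as np

    def neighbours_all_worse(f, x, eps=1e-8):
        """Check the 2d finite-difference points around x: if none of the
        points x +/- eps*e_i improves on f(x), the run has likely converged.
        This mirrors the check that bfgs-h-2d performs on the evaluations
        BFGS makes anyway while estimating the gradient."""
        fx = f(x)
        d = len(x)
        for i in range(d):
            for sign in (+1.0, -1.0):
                step = np.zeros(d)
                step[i] = sign * eps
                if f(x + step) < fx:
                    return False  # a neighbour improves; keep running
        return True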
Overall, BFGS has excellent results on ill-conditioned problems, even exceeding the performance of the best algorithms from the BBOB 2009 conference for certain dimensionalities on the group hcond, which is illustrated in figure 9. However, it performs quite poorly on multimodal functions (multi and mult2).
Table 14 sums up the results comparing MetaMax(k) with the corresponding fixed restart strategies. In terms of success rate, both types of strategies perform the same on all function groups. In terms of rate of convergence, expressed by the values of API, the results are similar to those observed when using the Nelder-Mead method: the MetaMax(k) strategies perform better on multi and mult2, but worse on lcond, hcond and separ.
The overall performance of MetaMax(k) across all function groups is worse than that of the corresponding fixed restart strategies, and consequently also worse than the performance of the best restart strategies.
A Critique of the Proposed National Education Policy Reform
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 

Black box optimization of restart strategies for the MetaMax algorithm

  • 14. List of Tables
1. Benchmark function groups
2. Algorithm specific restart strategies
3. Tested multi-start strategies
4. Compass search - best restart strategies for each dimensionality
5. Compass search - results of restart strategies
6. Compass search - results of MetaMax(k) and corresponding fixed restart strategies
7. Compass search - results of MetaMax strategies
8. Nelder-Mead - best restart strategies for each dimensionality
9. Nelder-Mead - results of restart strategies
10. Nelder-Mead - results of MetaMax(k) and corresponding fixed restart strategies
11. Nelder-Mead - results of MetaMax strategies
12. BFGS - best restart strategies for each dimensionality
13. BFGS - results of restart strategies
14. BFGS - results of MetaMax(k) and corresponding fixed restart strategies
15. BFGS - results of MetaMax strategies
16. CMA-ES - best restart strategies for each dimensionality
17. CMA-ES - results of restart strategies
18. CMA-ES - results of MetaMax(k) and corresponding fixed restart strategies
19. CMA-ES - results of MetaMax strategies
20. CD contents
  • 15. List of Figures
1. Restart condition based on function value stagnation
2. Example of monotone transformation of f(x)
3. MetaMax selection mechanisms
4. Example ECDF graph
5. Compass search - ECDF comparing MetaMax(k) with an equivalent fixed restart strategy
6. Compass search - ECDF of MetaMax variants using 100 instances
7. Compass search - ECDF of MetaMax variants using 50d instances
8. Nelder-Mead - ECDF comparing MetaMax(k) strategies
9. BFGS - ECDF of the best restart strategies
10. BFGS - ECDF of MetaMax variants using 50d instances
11. CMA-ES - ECDF of function value stagnation based restart strategies
12. CMA-ES - ECDF comparison of MetaMax variants using 50d instances
13. MetaMax timing measurements
14. ECDF comparing MetaMax strategies using different instance selection methods
15. Nelder-Mead algorithm in 2D

List of Algorithms
1. Typical structure of a local search algorithm
2. Variable neighbourhood search
3. MetaMax(k)
4. MetaMax(∞)
5. MetaMax
6. Compass search
7. Nelder-Mead method
8. BFGS algorithm
9. CMA-ES algorithm
  • 17. 1 Introduction

The goal of this thesis is to implement and evaluate the performance of the MetaMax optimization algorithm, particularly in comparison with other commonly used optimization strategies. MetaMax was proposed by György and Kocsis in [GK11], and the results they present seem very interesting and suggest that MetaMax might be a very competitive algorithm. Our goal is to evaluate its performance more closely on problems from the area of black-box continuous optimization, by performing a series of exhaustive measurements and comparing the results with those of several commonly used restart strategies.

This text is organized as follows: first, there is a short overview of the subjects of mathematical, continuous and black-box optimization, local search algorithms and multi-start strategies. This is meant as an introduction for readers who might not be familiar with these topics. Readers who already have knowledge of these fields might wish to skip forward to the following sections, which describe the MetaMax algorithm, the experimental setup, the used optimization strategies and the software implementation. In the last two sections, the measured results are summed up and evaluated.

The mathematical optimization problem is defined as selecting the best element, according to some criteria, from a set of feasible elements. The most common form of the problem is finding a set of parameters x_{1,opt}, ..., x_{d,opt}, where d is the problem dimension, for which the value of a given objective function f(x_1, ..., x_d) is minimal, that is, f(x_{1,opt}, ..., x_{d,opt}) ≤ f(x_1, ..., x_d) for all possible values of x_1, ..., x_d. Within the field of mathematical optimization, it is possible to define several subfields, based on the properties of the parameters x_1, ..., x_d and the amount of information available about the objective function f.

Combinatorial optimization: The set of all possible solutions (possible combinations of the parameter values) is finite; usually some subset of N^d.

Integer programming: All of the parameters are restricted to be integers: x_1, ..., x_d ∈ N. Can be considered a subset of combinatorial optimization.

Mixed integer programming: Some parameters are real-valued and some are integers.

Continuous optimization: The set of all possible solutions is infinite. Usually x_1, ..., x_d ∈ R.

Black-box optimization: Assumes that only a bare minimum of information about f is given. It can be evaluated at an arbitrary point x, returning the function value f(x), but besides that, no other properties of f are known. In order to solve this kind of problem, we have to resort to searching (the exact techniques are described in more detail later in this text). Furthermore, we are almost never guaranteed to find the exact solution, just one that is sufficiently close to it, and there is almost always a non-zero probability that even an approximate solution might not be found at all.
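To make the black-box setting concrete, the following minimal Python sketch (ours, not part of the thesis; the Rastrigin-like test function and the budget value are illustrative assumptions) wraps an objective so that a solver can only evaluate it, with the evaluation counting that resource budgets are based on:

    import math

    class BlackBox:
        """Wraps an objective so that a solver can only evaluate it.
        The wrapper also counts evaluations, which is how resource
        budgets are accounted for throughout this text."""

        def __init__(self, func, budget):
            self.func = func
            self.budget = budget
            self.evaluations = 0

        def __call__(self, x):
            if self.evaluations >= self.budget:
                raise RuntimeError("evaluation budget exhausted")
            self.evaluations += 1
            return self.func(x)

    # Illustrative multimodal objective (a Rastrigin-like function).
    def rastrigin(x):
        return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)

    f = BlackBox(rastrigin, budget=10 ** 5 * 2)  # 1e5 * d evaluations for d = 2
    print(f([0.5, -0.3]), f.evaluations)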
  • 18. White-box optimization deals with problems where we have some additional knowledge about f, for example its gradient, which can obviously be very useful when looking for its minimum. In this text we will deal almost exclusively with black-box continuous optimization problems.

For a practical example of a black-box optimization problem, imagine the process of trying to design an airfoil which should have certain desired properties. It is possible to describe the airfoil by a vector of variables representing its various parameters (length, thickness, shape, etc.). This will be the parameter vector x. Then, we can run an aerodynamic simulation with the airfoil described by x, evaluate how closely it matches the desired properties, and based on that, assign a function value f(x) to the parameter vector. In this way, the simulator becomes the black-box function f, and the problem is transformed into the task of minimizing the objective function f. We can then use black-box optimization methods to find the parameter vector x_opt which will give us an airfoil with the desired properties. This example hopefully illustrates sufficiently that black-box optimization can be a very powerful tool, as it allows us to find reasonably good solutions even for problems which we might not be able to, or would not know how to, solve otherwise.

As already mentioned, the usual method for finding optima (the best possible set of parameters x_opt) in continuous mathematical optimization is searching. The structure of a typical local search algorithm is as follows:

Algorithm 1: Typical structure of a local search algorithm
 1  Select a starting solution x0 somehow (most commonly randomly) from the set of feasible solutions.
 2  Set the current solution: xc ← x0
 3  Get the function value f(xc)
 4  while stop condition not met do
 5      Generate a set of neighbour solutions Xn similar to xc
 6      Evaluate f at each xn ∈ Xn
 7      Find the best neighbour solution x* = argmin_{xn ∈ Xn} f(xn)
 8      if f(x*) < f(xc) then
 9          Update the current solution: xc ← x*
10      else
11          Modify the way of generating neighbour solutions
12  return xc

In the case of continuous optimization, a solution is represented simply by a point in R^d. There are various ways of generating neighbour solutions. In general, two neighbouring solutions should be different from each other, but in some sense also similar. In continuous optimization, this usually means that the solutions are close in terms of Euclidean distance, but not identical.
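As an illustration, Algorithm 1 can be rendered directly in Python as the following deliberately naive sketch; the Gaussian neighbourhood and the step-halving rule on unsuccessful iterations are assumptions made for this sketch, not prescriptions of the thesis:

    import random

    def local_search(f, x0, step=1.0, n_neighbours=8, max_iters=1000, min_step=1e-9):
        """Greedy local search following the structure of Algorithm 1."""
        xc = list(x0)
        fc = f(xc)
        for _ in range(max_iters):
            if step < min_step:        # stop condition: neighbourhood has shrunk away
                break
            # Generate neighbour solutions close to xc (Gaussian perturbation).
            neighbours = [[xi + random.gauss(0.0, step) for xi in xc]
                          for _ in range(n_neighbours)]
            values = [f(xn) for xn in neighbours]
            best = min(range(n_neighbours), key=values.__getitem__)
            if values[best] < fc:      # accept the improving move
                xc, fc = neighbours[best], values[best]
            else:                      # modify neighbour generation: contract the step
                step *= 0.5
        return xc, fc

Started anywhere, this loop converges to the nearest local optimum and then stalls, which is exactly the greedy behaviour discussed next.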
  • 19. The algorithm described above has the property that it always creates neighbour solutions close to the current solution and moves the current solution in the direction of decreasing f(x). This makes it a greedy algorithm, which works well in cases where the objective function is unimodal (has only one optimum), but for multimodal functions (functions with multiple local optima), the resulting behaviour will not be ideal. The algorithm will move in the direction of the nearest optimum (the optimum whose basin of attraction contains x0), but when it gets there, it will not move any further, as at this point all the neighbour solutions will be worse than the current solution. Such an algorithm can therefore be relied on to find the nearest local optimum, but there is no guarantee that it will also be the global one. The global optimum will be found only when x0 happens to land in its basin of attraction.

The method most commonly used to overcome this problem is to run multiple instances of the local search algorithm from different starting positions x0. Then it is probable that at least one of them will start in the basin of attraction of the global optimum and will be able to find it. There are various different multi-start strategies which implement this basic idea, with MetaMax, the main subject of this thesis, being one of them.

A more thorough description of local search algorithms and the problem of getting stuck in a local optimum is given in section 2. A detailed description of the MetaMax algorithm and its variations is given in section 3. The structure of the performed experiments is described in section 4. Finally, the measured results are presented and evaluated in section 5.

2 Problem description and related work

As mentioned in the previous section, local search algorithms have problems finding the global optimum of functions with multiple optima (also called multimodal functions). In this section we focus on this problem more thoroughly. We describe several common types of local search algorithms in more detail and discuss their susceptibility to getting stuck in a local optimum. Next, we describe several methods to overcome this problem.

2.1 Local search algorithms

Following are descriptions of four commonly used kinds of local search algorithms, which we hope will give the reader a more concrete idea about the functioning of local search algorithms than the very basic example described in algorithm 1.

Line search algorithms try to solve the problem of minimizing a d-dimensional function f by using a series of one-dimensional minimization tasks, called line searches. During each step of the algorithm, an imaginary line is created, starting at the current solution xc and going in a suitably chosen direction σ. Then, the line is searched for a point x with the minimal value of f(x), and the current solution is updated: xc ← x. In this way, the algorithm will eventually converge to the nearest local optimum of f.
  • 20. The question remains: how to choose the search direction σ? The simplest algorithms just use a preselected set of directions (usually vectors in an orthonormal positive d-dimensional base) and loop through them on successive iterations. This method is quite simple to implement, but it has trouble coping with ill-conditioned functions. An obvious idea might be to use information about the function's gradient for determining the search direction. However, this turns out not to be much more effective than the simple alternating algorithms. The best results are achieved when information about both the function's gradient and its Hessian is used. Then, it is possible to get quite robust and well performing algorithms. Note that for black-box optimization problems, it is necessary to obtain the gradient by estimation, as it is not explicitly available. Examples of this kind of algorithm are: the symmetric rank one method, the gradient descent algorithm and the Broyden-Fletcher-Goldfarb-Shanno algorithm.

Pattern search algorithms closely fit the description given in algorithm 1. They generate the neighbour solutions xn ∈ Xn in defined positions (a pattern) relative to the current solution xc. If any of the neighbour solutions is found to be better than the current one, it then becomes the new current solution, the next set of neighbour solutions is generated around it, and so on. If none of the neighbour solutions is found to be better (an unsuccessful iteration), then the pattern is contracted, so that in the next step the neighbour solutions are generated closer to xc. In this way, the algorithm will converge to the nearest local optimum (for a proof, please see [KLT03]). Advanced pattern search algorithms use patterns which change size and shape according to various rules, both on successful and unsuccessful iterations. Typical algorithms of this type are: compass search (or coordinate search), the Nelder-Mead simplex algorithm and the Luus-Jaakola algorithm.

Population based algorithms keep track of a number of solutions, also called individuals, at one time, which together constitute a population. A new generation of solutions is generated each step, based on the properties of a set of selected (usually the best) individuals from the previous generation. Different algorithms vary in the exact implementation of this process. For example, in the family of genetic algorithms, this process is designed to emulate natural evolution: the properties of each individual (in the case of continuous optimization, this means its position) are encoded into a genome, and new individuals are created by combining parts of the genomes of successful individuals from the previous generation, or by random mutation. Unsuccessful individuals are discarded, in an analogy with the natural principle of survival of the fittest. Other population based algorithms, such as CMA-ES, take a somewhat more mathematical approach: new generations are populated by sampling a multivariate normal distribution, which is in turn updated every step, based on the properties of a number of the best individuals from the previous generation.
  • 21. Swarm intelligence algorithms are based on the observation that it is possible to get quite well performing optimization algorithms by trying to emulate natural behaviours, such as the flocking of birds or fish schools. Each solution represents one member of a swarm and moves around the search space according to a simple set of rules. For example, it might try to keep a certain minimal distance from other flock members, while also heading in the direction with the best values of f(x). The specific rules vary a great deal between different algorithms, but in general, even a simple individual behaviour is often enough to result in quite complex collective emergent behaviour. Because swarm intelligence algorithms keep track of multiple individuals/solutions during each step, they can also be considered to be a subset of population based algorithms. Some examples of this class of algorithms are the particle swarm optimization algorithm and the fish school search algorithm.

Pattern search and line search algorithms have the property that they always choose neighbour solutions close to the current solution, and they move in the direction of decreasing f(x). Thus, as was already described in the previous section, they are able to find only the local optimum which is nearest to their starting position x0. Population based and swarm intelligence algorithms might be somewhat less susceptible to this behaviour in the case where the initial population is spread over a large area of the search space. Then there is a chance that some individuals might land near the global optimum and eventually pull the others towards it.

There are several modifications of local search algorithms specifically designed to overcome the problem of getting stuck in a local optimum. We shall now describe two basic ones: simulated annealing and tabu search. The main idea behind them is to limit the local search algorithm's greedy behaviour by sometimes taking steps other than those which lead to the greatest decrease of f(x).

Simulated annealing implements the above mentioned idea in a very straightforward way: during each step, the local search algorithm may select any of the generated neighbour solutions with a non-zero probability, thus possibly not selecting the best one. The probability P of choosing a particular neighbour solution xn is a function of f(xc), f(xn) and s, where s is the number of steps already taken by the algorithm. Usually, it increases with the value of Δf = f(xc) − f(xn), so that the best neighbour solutions are still likely to be picked the most often. The probability of choosing a neighbour solution other than the best one also usually decreases as s increases, so that the algorithm behaves more randomly in the beginning and then, as time goes on, settles down to a more predictable behaviour and converges to the nearest optimum. This is somewhat similar to the metallurgical process of annealing, from which the algorithm takes its name. It is possible to apply this method to almost any of the previously mentioned local search algorithms, simply by adding the possibility of choosing neighbour solutions which are not the best.
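One common choice of P(f(xc), f(xn), s), assumed here purely for illustration (the thesis does not prescribe a particular form), is a Metropolis-style rule with a temperature that decays with the step count:

    import math
    import random

    def metropolis_accept(f_c, f_n, s, t0=1.0, decay=0.01):
        """Decide whether to move from the current solution (value f_c)
        to a neighbour (value f_n) at step s, minimization convention.
        Improving moves are always accepted; worsening moves are accepted
        with a probability that shrinks both with how much worse the
        neighbour is and with the number of steps taken so far."""
        delta = f_c - f_n                     # > 0 means the neighbour is better
        if delta >= 0:
            return True
        temperature = t0 / (1.0 + decay * s)  # cools down as s grows
        return random.random() < math.exp(delta / temperature)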
  • 22. In practice, the exact form of P(f(xc), f(xn), s) has to be fine-tuned for a given problem in order to get good results. Therefore, this algorithm is of limited usefulness in the area of black-box optimization.

Tabu search works by keeping a list of previously visited solutions, which is called the tabu list. It selects potential moves only from the set of neighbour solutions which are not on this list, even if it means choosing a solution which is worse than the current one. The selected solution is then added to the tabu list and the oldest entry in the tabu list is deleted. The list therefore works in a way similar to a cyclic buffer. This method was originally designed for solving combinatorial optimization problems, and it requires certain modifications in order to be useful in the area of continuous parameter optimization. At the very least, it is necessary to modify the method to discard not only neighbour solutions which are on the tabu list, but also solutions which are close to them. Without this, the algorithm would not work very well, as the probability of generating the exact same solution twice in R^d is quite small. There is a multitude of advanced variations of this basic method; for example, it is possible to add aspiration rules, which override the tabu status of solutions that would lead to a large decrease in f(x). For a detailed description of tabu search adapted for continuous optimization, please see [CS00].

2.2 Multi-start strategies

Multi-start strategies allow using local search algorithms effectively on functions with multiple local optima, without making any modification to the way they work. The basic idea is that if we run a search algorithm multiple times, each time from a different starting position x0, then it is probable that at least one of the starting positions will be in the basin of attraction of the global optimum, and thus the corresponding local search algorithm will be able to find it. Of course, the probability of this depends on the number of algorithm instances that are run, relative to the number and properties of the function's optima. It is possible to think about multi-start strategies as meta-heuristics, running above, and controlling, multiple instances of local search algorithm sub-heuristics.

Restart strategies are a subset of multi-start strategies, where multiple instances are run one at a time, in succession. The most basic implementation of a restart strategy is to take the total allowed resource budget (usually a set number of objective function evaluations), evenly divide it into multiple slots, and use each of them to run one instance of a local search algorithm. A very important choice is deciding the length of a single slot. The optimal length largely depends on the specific problem and on the type of the used algorithm. If the length is set too low, then the algorithm might not have enough time to converge to its nearest optimum. If it is too long, then there is a possibility that resources will be wasted on running instances which are stuck in local optima and can no longer improve. Of course, all of the time slots do not have to be of the same length. A good strategy for black-box optimization is to start with a low length and keep increasing
  • 23. it for each subsequent slot. In this way, a reasonable performance can be achieved even if we are unable to choose the most suitable slot length for a given problem in advance.

A different restart strategy is to keep each instance going for as long as it needs until it converges to an optimum. The most universal way to detect convergence is to look for stagnation of the values of the objective function over a number of past function evaluations (or past local search algorithm iterations). If the best objective function value found so far does not improve by at least the limit tf over the last hf function evaluations, then the current algorithm instance is terminated and a new one is started. For convenience, in the subsequent text we will call hf the function value history length and tf the function value history tolerance. An example of this restart condition is given in figure 1: the best solution found after v function evaluations is marked as x*_v and its corresponding function value as f(x*_v). In the figure, we see that the restart condition is triggered because at the last function evaluation m, the following is true: f(x*_{m−hf}) ≤ f(x*_m) + tf.

[Figure 1 appears here in the original; only its caption is reproduced.]

Figure 1: Restart condition based on function value stagnation. Displays the objective function value f(x_v) (dashed black line) of evaluation v, and the best objective function value reached after v function evaluations, f(x*_v) (solid black line), over the interval of 0 to m function evaluations. The values f(x*_m), f(x*_m) + tf and m − hf are highlighted.

It is, of course, necessary to choose specific values of hf and tf, but usually it is not overly difficult to find a combination which works well for a large set of problems. Various different ways of detecting convergence and corresponding restart conditions can be used.
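A sketch of this stagnation test (ours; the deque-based history is an implementation choice of the sketch, not of the thesis): after every function evaluation, update() is called with the new objective value and returns True once a restart should be triggered.

    from collections import deque

    class StagnationRestart:
        """Signals a restart when the best-so-far value has not improved
        by more than t_f over the last h_f function evaluations."""

        def __init__(self, h_f, t_f=1e-10):
            self.t_f = t_f
            self.best_history = deque(maxlen=h_f + 1)  # best-so-far, one entry per evaluation

        def update(self, f_value):
            best = min(self.best_history[-1], f_value) if self.best_history else f_value
            self.best_history.append(best)
            # Restart once the best value h_f evaluations ago was already
            # within t_f of the current best: f(x*_{m-h_f}) <= f(x*_m) + t_f.
            return (len(self.best_history) == self.best_history.maxlen
                    and self.best_history[0] <= self.best_history[-1] + self.t_f)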
  • 24. Convergence can, for example, be detected by reaching a zero gradient in line search algorithms, by reaching a minimal pattern size in pattern search algorithms, etc.

There are also various ways of choosing the starting position x0 for new local search algorithm instances. The simplest one is to choose x0 by sampling a random uniform distribution over the set of all feasible solutions. This is very easy to implement and often gives good results. However, it is also possible to use information gained by the previous instances when choosing x0 for a new one.

A simple algorithm which utilizes this idea is iterated search: the first instance i1 is started from an arbitrary position, is run until it converges (or until it exhausts a certain amount of resources), and returns the best solution it has found, x*_{i1}. Then, the starting position for the next instance is selected from the neighbourhood N of x*_{i1}. Note that N is a qualitatively different neighbourhood than the one the instance i1 might be using to generate neighbour solutions each step. It is usually much larger, the goal being to generate the new starting point for instance i2 by perturbing the best solution of i1 enough to move it to a different basin of attraction. If the new instance finds a solution x*_{i2} better than x*_{i1}, then the next instance is started from the neighbourhood N(x*_{i2}). If f(x*_{i2}) ≥ f(x*_{i1}), and a better solution is thus not found, then the next instance is started from the neighbourhood N(x*_{i1}) again. This is repeated until a stop condition is triggered. An obvious assumption that this method makes is that the minima of the objective function are grouped close together. If this is not the case, then it might be better to use uniform random sampling. The big question is how to choose the size of the neighbourhood N. Too small, and the new instance might fall into the same basin of attraction as the previous one. Too big, and the results will be similar to choosing the starting position uniformly at random.

Another method, called variable neighbourhood search, which can, in a way, be considered an improved version of iterated search, tackles this problem by using multiple neighbourhood structures N1, ..., Nk of varying sizes, where N1 is the smallest and the following neighbourhoods are successively larger, with Nk being the largest. The restarting procedure is the same as with iterated search, with the following modification: if a local search algorithm instance ik, started from the neighbourhood N1(x*_{i_{k−1}}), does not improve the current best solution, then the algorithm tries starting the next instance from N2(x*_{i_{k−1}}), then N3(x*_{i_{k−1}}), and so on. The structure of a basic variable neighbourhood search, as given in [HM03], page 10, is described in algorithm 2. This algorithm can also be used as a description of iterated search, if the set of neighbourhood structures contains only one element.

Algorithm 2: Variable neighbourhood search
input: initial position x0, set of neighbourhood structures N1, ..., Nk of increasing size
 1  x* ← local_search(x0)
 2  k ← 1
 3  while stop condition not met do
 4      Generate a random point y from Nk(x*)
 5      y* ← local_search(y)
 6      if f(y*) < f(x*) then
 7          x* ← y*
 8          k ← 1
 9      else
10          k ← k + 1
11  return x*

Yet another group of methods which aim to prevent local search algorithms from getting stuck in local optima is based on the idea that it is not necessary to run multiple local search algorithm instances one after another; they can be run at the same time. Then, it is possible to evaluate the expected performance of each instance based on the results it has obtained so far, and to allocate the resources to the best (or most promising) ones. This is somewhat similar to the well known multi-armed bandit problem. The basic implementation of this idea is called the explore and exploit strategy.
It involves initially running all of its k algorithm instances until a certain fraction of the resource budget is expended; this is the exploration phase. Then, the best instance is selected and run until the rest of the resource budget is used up: the exploitation phase.
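A compact sketch of the explore and exploit strategy (the instance interface with step() and best_f, and the 20 % exploration split, are assumptions of this sketch):

    def explore_and_exploit(instances, budget, explore_fraction=0.2):
        """Run all instances round-robin for a fraction of the budget,
        then spend the remainder on the single best instance.
        Each instance is assumed to expose step() -> evaluations used,
        and best_f, the best objective value found (minimization)."""
        used = 0
        # Exploration phase: cycle through all instances evenly.
        while used < explore_fraction * budget:
            for inst in instances:
                used += inst.step()
        # Exploitation phase: commit the rest to the most promising instance.
        best = min(instances, key=lambda inst: inst.best_f)
        while used < budget:
            used += best.step()
        return min(instances, key=lambda inst: inst.best_f)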
  • 25. There is, again, an obvious trade-off between the amounts of resources allocated to each phase. The exploration phase should be long enough so that, when it ends, it is possible to reliably identify the best instance. On the other hand, it is necessary to have enough resources left in the exploitation phase in order for the selected best instance to converge to the optimum. In practice, it is actually not that difficult to find a balance between these two phases that gives good results for a wide range of problems.

Methods like this, which run multiple local search algorithm instances at the same time, belong to the group of portfolio algorithms. We should, however, note that portfolio algorithms are usually used in a somewhat different way than described here. Most commonly, they run multiple instances of different local search algorithms, each of which is well suited for a different kind of problem. This allows the portfolio algorithm to select instances of the algorithm which is able to solve the given problem the most efficiently, even without knowing its properties a priori. The MetaMax algorithm, which is the main subject of this thesis, is also a portfolio algorithm. However, we use it with only one kind of local search algorithm at a time, to allow for a more fair and direct comparison with restart strategies, which typically use only one kind of local search algorithm.

3 MetaMax algorithm and its variants

The MetaMax algorithm is a multi-start portfolio strategy presented by György and Kocsis in [GK11]. There are, actually, three versions of the algorithm, which differ in certain details. They are called MetaMax(k), MetaMax(∞) and MetaMax, and they will be described in detail in this section. Please note that, while in this text we usually presume all optimization
  • 26. problems to be minimization problems, the text in [GK11] assumes a maximization task. Therefore, while describing the workings of the MetaMax algorithm in this section, we will keep to the convention of [GK11], but in the rest of the text we will refer to minimization tasks as usual. Our implementation of MetaMax was modified to work with minimization tasks.

György and Kocsis demonstrate ([GK11], page 413, equation 2) that the convergence of an instance of a local search algorithm, after s steps, can be optimistically estimated with large probability as:

    lim_{s→∞} f(x*_s) ≤ f(x*_s) + g_σ(s)    (1)

where f(x*_s) is the best function value obtained by the local search algorithm instance up until the step s, and g_σ(s) is a non-increasing, non-negative function with lim_{s→∞} g_σ(s) = 0. Note that the notation used here is a little different than in [GK11], but the meaning is the same. In practice, the exact form of g_σ(s) is not known, so the right side of equation 1 has to be approximated as:

    f(x*_s) + c·h(s)    (2)

where c is an unknown constant and h(s) is a positive, monotone, decreasing function with the following properties:

    h(0) = 1,  lim_{s→∞} h(s) = 0    (3)

One possible simple form of this function is h(s) = e^(−s). In the subsequent text, we shall call this function the estimate function. György and Kocsis do not use this name in their work; in fact, they do not use any name for this function at all and refer to it simply as "h function". However, we think that this is not very convenient, hence we picked a suitable name.

Based on equations 1 and 2, it is possible to create a strategy that allocates resources only to those instances which are estimated to converge the most quickly, i.e. which maximize the value of expression 2 for a certain range of the constant c. The problem of finding these instances can be solved effectively by transforming it into the problem of finding the upper right convex hull of a set of points, in the following way: we assume that there are k instances in total, and that each instance Ai keeps track of the number of steps si it has taken, the position x_{i,si} of the best solution it has found so far, and its corresponding function value f(x*_{i,si}). We represent the set of the local search algorithm instances Ai, i = 1, ..., k by a set of points:

    P : {(h(si), f(x*_{i,si})), i = 1, ..., k}    (4)

Then the instances which maximize the value of expression 2 for a certain range of c correspond to those points which lie on the upper right convex hull of the set P. Because the term upper right convex hull is not quite standard, we should clarify that we understand it to mean the intersection of the upper convex hull and the right convex hull.
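For illustration, this selection step can be implemented with a standard monotone-chain convex hull computation. The following sketch is ours, written in the maximization convention of this section; it returns the points of P that lie on the upper right convex hull:

    def upper_right_hull(points):
        """Return the subset of points (h_i, f_i) lying on the upper right
        convex hull, i.e. those maximizing f_i + c*h_i for some c > 0."""
        pts = sorted(set(points))                     # by h, then f
        cross = lambda o, a, b: ((a[0] - o[0]) * (b[1] - o[1])
                                 - (a[1] - o[1]) * (b[0] - o[0]))
        hull = []
        for p in pts:                                 # Andrew's monotone chain, upper hull
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
                hull.pop()
            hull.append(p)
        # Keep only the part to the right of (and including) the highest point:
        # the intersection of the upper hull and the right hull.
        top = max(range(len(hull)), key=lambda i: hull[i][1])
        return hull[top:]

    print(upper_right_hull([(0.05, 6.0), (0.1, 5.0), (0.3, 4.0), (0.4, 1.0)]))
    # -> [(0.05, 6.0), (0.3, 4.0), (0.4, 1.0)]

In this example the point (0.1, 5.0) lies below the hull and is therefore never selected, although no other point beats it in both coordinates; this distinction becomes relevant for the modified selection mechanism proposed in section 3.1.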
  • 27. Note that, presumably for simplicity, the authors of [GK11] assumed only local search algorithms which use the same number of function evaluations every step. For algorithms where this is not true, it makes more sense to set si equal to the number of function evaluations used by the instance i so far instead. We believe that this is a better way to measure the use of resources by individual instances, which is also confirmed in [PG13].

György and Kocsis suggest using a form of the estimate function which changes based on the amount of resources used by all the local search algorithm instances, in order to encourage more exploratory behaviour as the MetaMax algorithm progresses. Therefore, in our implementation, we use the following estimate function, which is recommended in [GK11]:

    h(vi, vt) = e^(−vi/vt)    (5)

where vi is the number of function evaluations used by instance i, and vt is the total number of function evaluations used by all of the instances combined.

The simplest of the three MetaMax variants is MetaMax(k). It uses k local search algorithm instances and is described in algorithm 3. For convenience and improved readability, we will use a simplified notation when describing the MetaMax variants:

- vi for the number of function evaluations used by local search algorithm instance i so far
- xi for the position of the best solution found by instance i so far
- fi for the function value of xi

In the descriptions, we also assume that the estimate function h is a function of only one variable.

Algorithm 3: MetaMax(k)
input: function to be optimized f, number of algorithm instances k, and a monotone decreasing function h with properties as given in equation 3
 1  Step each of the k local search algorithm instances Ai and update their variables vi, xi and fi
 2  while stop conditions not met do
 3      For i = 1, ..., k, select algorithm Ai if there exists c > 0 so that:
            fi + c·h(vi) ≥ fj + c·h(vj) for all j = 1, ..., k such that (vi, fi) ≠ (vj, fj).
        If there are multiple algorithms with identical v and f, then select only one of them at random.
 4      Step each selected Ai and update its variables vi, xi and fi.
 5      Find the best instance: b = argmin_{i=1,...,k}(fi).
 6      Update the best solution: x* ← xb.
 7  return x*

As with a priori scheduled restart strategies, there is the question of choosing the right number of instances (parameter k) to use.
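Combining the hull computation sketched above with the estimate function of equation 5, one round of MetaMax(k) could look roughly as follows. This is again our own sketch under assumed interfaces: the best_f values are negated so that the maximization-convention hull code can be reused for our minimization task, and collapsing duplicate points into one dictionary entry stands in for the random tie-break of algorithm 3:

    import math

    def metamax_k_round(instances):
        """Perform one round of MetaMax(k) over instance objects exposing
        v (evaluations used so far), best_f (minimization) and step()."""
        v_total = sum(inst.v for inst in instances) or 1
        h = lambda v: math.exp(-v / v_total)          # estimate function, eq. (5)
        # Build P with negated f so that the 'upper right hull' of the
        # maximization convention matches our minimization task.
        points = {(h(inst.v), -inst.best_f): inst for inst in instances}
        selected = {points[p] for p in upper_right_hull(list(points))}
        for inst in selected:                         # step every selected instance
            inst.step()
        return min(instances, key=lambda inst: inst.best_f)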
  • 28. The other two versions of the algorithm, MetaMax and MetaMax(∞), get around this problem by gradually increasing the number of instances, starting with a single one and adding a new one every round. Thus, the number of instances tends to infinity as the algorithm keeps running. This makes it possible to prove that the algorithm is consistent; that is, it will almost surely find the global optimum if kept running for an infinite amount of time. Please note that in some literature, such as [Neu11], the term asymptotically complete is used instead of consistent, but both of them mean the same thing. Also note that we use the word round to refer to a step of the MetaMax algorithm, in order to avoid confusion with the steps of the local search algorithms. MetaMax and MetaMax(∞) are described in algorithms 5 and 4 respectively, also using the simplified notation.

Algorithm 4: MetaMax(∞)
input: function to be optimized f, monotone decreasing function h with properties as given in equation 3
 1  r ← 1
 2  while stop conditions not met do
 3      Add a new local search algorithm instance Ar, step it once and initialize its variables vr, xr and fr
 4      For i = 1, ..., r, select algorithm Ai if there exists c > 0 so that:
            fi + c·h(vi) ≥ fj + c·h(vj) for all j = 1, ..., r such that (vi, fi) ≠ (vj, fj).
        If there are multiple algorithms with identical v and f, then select only one of them at random.
 5      Step each selected Ai and update its variables vi, xi and fi.
 6      Find the best instance: b = argmin_{i=1,...,r}(fi).
 7      Update the best solution: x* ← xb.
 8      r ← r + 1
 9  return x*
  • 29. Algorithm 5: MetaMax
input: function to be optimized f, monotone decreasing function h with properties as given in equation 3
 1  r ← 1
 2  while stop conditions not met do
 3      Add a new local search algorithm instance Ar, step it once and initialize its variables vr, xr and fr
 4      For i = 1, ..., r, select algorithm Ai if there exists c > 0 so that:
            fi + c·h(vi) ≥ fj + c·h(vj) for all j = 1, ..., r such that (vi, fi) ≠ (vj, fj).
        If there are multiple algorithms with identical v and f, then select only one of them at random.
 5      Step each selected Ai and update its variables vi, xi and fi.
 6      Find the best instance: br = argmin_{i=1,...,r}(fi).
 7      If br ≠ br−1, step instance A_{br} until v_{br} ≥ v_{br−1}.
 8      Update the best solution: x* ← x_{br}.
 9      r ← r + 1
10  return x*

MetaMax and MetaMax(∞) differ only in one point (lines 6 and 7 in algorithm 5): if, after stepping all selected instances, the best instance is a different one than in the previous round, MetaMax will step it until it overtakes the old best instance in terms of used resources. In [GK11] it is shown that MetaMax asymptotically approaches the performance of its best local search algorithm instance as the number of rounds increases. Theoretical analysis suggests that the number of instances increases at a rate of Ω(√vt), where vt is the total number of used function evaluations; however, practical results give a rate of growth of only Ω(vt/log vt). Based on this, it can also be estimated ([GK11], page 439) that to find the global optimum x_opt, MetaMax needs only a logarithmic factor more function evaluations than a local search algorithm instance which would start in the basin of attraction of x_opt.

Note a small difference in the way MetaMax and MetaMax(∞) are described in algorithms 5 and 4 from their descriptions in [GK11]. There, a new algorithm instance Ar is added with fr = 0 and sr = 0, and takes at most one step during the round in which it is added. This is possible because in [GK11] a non-negative objective function f and a maximization task are assumed. Therefore, an algorithm instance
  • 30. can be added without taking any steps first, and assigned a function value fr = 0, which is guaranteed not to be better than any of the function values of the other instances. We are, however, dealing with a minimization problem with a known target value (see [Han+13b]) but no upper bound on f and, consequently, no worst possible value of f. Therefore, we made a little change and step the new instance Ar immediately after it is added. It can then also be stepped a second time, when the selected instances are stepped (step 5 in algorithms 4 and 5). We believe that this has no significant impact on performance.

3.1 Suggested modifications

MetaMax and MetaMax(∞) will add a new instance each round for as long as they are running, with no limit on the maximum number of instances. The authors of [GK11] state that the worst-case computational overhead of MetaMax and MetaMax(∞) is O(r²), where r is the number of rounds. For the purpose of optimizing functions where each function evaluation uses up a large amount of computational time (for which MetaMax was primarily designed), the overhead will be negligible compared to the time spent calculating function values, and will not present a significant problem. However, in comparison with restart strategies, which typically have almost no overhead, this is still a disadvantage for MetaMax. Therefore, it would be desirable to come up with some mechanism that would improve its computational complexity.

An obvious solution would be to limit the total number of instances which can be added, or to slow down the rate at which they are added, so that there will never be too many of them. However, this would make MetaMax and MetaMax(∞) behave basically in the same way as MetaMax(k), and they would lose their main property, which is the
  • 31. consistency based on always generating new instances. A better solution would be to add a mechanism which would discard one of the already existing instances every time a new one is added, and therefore keep the total number of instances at any given time constant. The important question is: which one of the existing instances should be discarded? We propose the following approach: discard the instance which has not been selected for the longest time. If there are multiple instances which qualify, discard the one with the worst function value. The rationale behind this discarding mechanism is that MetaMax most often selects (allocates the most resources to) those instances which have the best optimistic estimate of convergence. Therefore, the instances which are selected the least often will likely not give very good results in the future, and so make good candidates for deletion. An alternative method may also be to discard the absolute worst instance (in terms of the best objective function value found so far), which is even simpler, but we feel that it does not follow so naturally from the principles behind MetaMax. Therefore, for most of our experiments we will use the discarding of the least selected instances.

Another area where we think it might be beneficial to modify the workings of MetaMax is the mechanism of selecting the instances to be stepped in each round. The original mechanism has two possible disadvantages. Firstly, it is not invariant to monotone transformations of the objective function values. By this we mean a mapping f(x) → f'(x), where f'(x) is only a function of the value of f(x) and not of the parameter vector x, the monotone property meaning that if f(x1) < f(x2) then f'(x1) < f'(x2) for all possible x1 and x2. Such a monotone transformation will not change the location of the optima of f(x). It will also not change the direction of the gradient of f(x) for any x, but not necessarily its magnitude. An example of such a transformation is given in figure 2. Logically, it would not make much sense to require an optimization algorithm to be invariant to an objective function value transformation which is not monotone, as it could change the position of the function's optima.

The second possible disadvantage of the convex hull based instance selection mechanism is that it also behaves differently based on the choice of the estimate function h. This is not such a great disadvantage as the first one, because f(x) is given, while h can be chosen freely. However, it would still be beneficial if we could entirely remove the need to choose h.

To overcome these problems, we propose a new instance selection mechanism. It uses the same representation of local search algorithm instances as a set of points P, given in equation 4, but it selects those instances which correspond to the non-dominated points of P, in the sense of maximizing fi and maximizing h(vi) (or, analogically, maximizing fi and minimizing vi). This method is clearly invariant to both monotone transformations of the objective function values f → f' and different choices of h, as determining the non-dominated points depends only on their ordering along the axes fi and h(vi), which will always be preserved, due to the fact that both f → f' and h are monotone. Moreover, the points which lie on the upper right convex hull of P, and thus maximize the optimistic estimate fi + c·h(vi), are always non-dominated, and thus will always be selected.
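A sketch of the proposed non-dominated selection rule (ours, again written in the maximization convention of this section):

    def non_dominated(points):
        """Return the points (h_i, f_i) not dominated in the sense of
        maximizing both coordinates (the Pareto frontier of P)."""
        frontier = []
        for h_i, f_i in points:
            if not any(h_j >= h_i and f_j >= f_i and (h_j, f_j) != (h_i, f_i)
                       for h_j, f_j in points):
                frontier.append((h_i, f_i))
        return frontier

    print(non_dominated([(0.05, 6.0), (0.1, 5.0), (0.3, 4.0), (0.4, 1.0)]))
    # -> all four points; the hull mechanism above selects only three of them.

On the example points used in the hull sketch earlier, this rule additionally selects (0.1, 5.0), which is non-dominated but lies below the convex hull. This illustrates both the invariance of the rule (only the orderings along the two axes matter) and the fact that it tends to select more instances per round.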
  • 32. [Figure 2 appears here in the original; only its caption is reproduced.]

Figure 2: Example of a monotone transformation of f(x). Displays a 3D mesh plot of a Rastrigin-like function f(x) in the top left, the transformed function f(x)^3 in the top right, and their respective contour plots on the bottom. It is clear that the shape of the contours is the same, but their heights are not.

A possible disadvantage of the proposed algorithm is that at each round it selects many more points than the original convex hull mechanism. This might result in selecting instances with a low convergence estimate too often, and in not dedicating enough resources to the more promising ones. A visual comparison of the two selection mechanisms, and a demonstration of the influence of the choice of estimate function upon the selection, are presented in figure 3.
  • 33. [Figure 3 appears here in the original; only its caption is reproduced.]

Figure 3: MetaMax selection mechanisms. Compares the original selection mechanism, based on finding the upper convex hull (left sub-figures), with the newly proposed mechanism, based on selecting non-dominated points (right sub-figures). Also demonstrates the effects of a monotone transformation of the objective function values on the selection, with f(x) for the upper sub-figures and f(x)^3 for those on the bottom. Selected points are marked as red diamonds, connected by a red line. Unselected points are marked as filled black circles.

4 Experimental setup

All of the experiments were conducted using the COCO (Comparing Continuous Optimizers) framework [Han13a], which is an open-source set of tools for the systematic evaluation and comparison of real-parameter optimization strategies. It provides a set of 24 benchmark functions of different types, chosen to thoroughly test the limits and capabilities of optimization algorithms. Also included are tools for running experiments on these functions and for logging, processing and visualising the measured data. The library for running experiments is provided in versions for C, Java, R, Matlab and Python. The post-processing part of the framework is available for Python only.

The benchmark functions are divided into 6 groups, according to their properties. They are briefly described in table 1. For a detailed description, please see [Han+13a]. There are also multiple instances defined for each function, which are created by applying various transformations to the base formula.

We shall now briefly explain some of the function properties mentioned in table 1. As already mentioned, the terms unimodal and multimodal refer to functions with a single optimum and with multiple local optima, respectively.
4.1 Used multi-start strategies

In this section, we describe the selected MetaMax and restart strategies, which were evaluated using the methods described above. For convenience, we assigned a shorthand name to each used strategy, so that we can write, for example, cs-h-10d instead of "objective function stagnation based restart strategy with history length 10d, using the compass search algorithm", which is impractically verbose. The shorthand names have the following form: abbreviation of the used local search algorithm, dash, used multi-start strategy, dash, strategy parameters. A list of all used strategies and their shorthand names is given in table 3.

We chose two commonly used restart strategies to compare with MetaMax: a fixed restart strategy with a set number of resources allocated to each local search algorithm run, and a dynamic restart strategy with a restart condition based on objective function value stagnation. The performance of these two strategies largely depends on the combination of the problem being solved and the strategy parameters. Therefore, we decided to use six fixed restart strategies and six function value stagnation restart strategies with different parameters:

• Fixed restart strategies
  Run lengths: nf = 100d, 200d, 500d, 1000d, 2000d, 5000d evaluations
  Shorthand names: algorithm-f-nf

• Function value stagnation restart strategies
  Function value history lengths: hf = 2d, 5d, 10d, 20d, 50d, 100d evaluations
  Function value tolerance: tf = 10^-10
  Shorthand names: algorithm-h-hf

Note that the parameters depend on the number of dimensions d of the measured function. This is consistent with the fact that the total resource budget of the strategy also depends on d, and with the expectation that for higher dimensionalities the used local search algorithms will need longer runs to converge.

The rationale behind the chosen parameter values is the following. With the function evaluation budget of 10^5 d, run lengths longer than 5000d would give us fewer than 20 restarts per trial. This would result in a very low chance of finding the global optimum on most of the benchmark functions, some of which can have up to 10^d optima. Also, it is probable that most local search algorithms will converge long before using up all 5000d function evaluations, and the rest of the allocated resources would then essentially be wasted on running an instance which cannot improve any more. Conversely, run lengths smaller than 100d are probably not long enough to allow most local search algorithm instances to converge, so there would be little sense in using them.

The choice of the upper bound of the function value history length hf as 100d is based on a similar idea: for values greater than 100d, the restart condition would trigger too long after the local search algorithm has already converged, and we would be needlessly wasting resources on it. The choice of the lower bound of hf depends on the used algorithm.
For a restart strategy to function properly, hf has to be at least as large as the number of function evaluations that the used local search algorithm uses during one step. The above stated value of hf = 2d is the minimal value for which the Nelder-Mead and BFGS algorithms work properly; for the other two algorithms, the minimal value is hf = 5d. We decided to base the function value history length on the number of used function evaluations, rather than on the number of taken steps, because it allows a more direct comparison of the performance of the same strategy using two different algorithms.

Choosing the value of the function stagnation tolerance tf involved a little more guesswork. There is a target function value defined for all of the benchmark functions, which is equal to the function value at their global optimum f(x_opt) plus a tolerance value ftol = 10^-8. That is, a function instance is considered to be solved if we find some point x with f(x) ≤ f(x_opt) + ftol. We based our choice of tf on ftol, setting tf = 10^-10, one hundred times lower than ftol. This should make it large enough to reliably detect convergence, while not being so large as to trigger the restart condition prematurely, when the local search algorithm is still converging.

The goal of using multiple strategies with different parameter values is to have, for each measured dimensionality, at least one fixed restart strategy and one objective function value stagnation based strategy that performs well on the set of all functions. For easier comparison of the results of the fixed restart strategies, we represent them all together by choosing only the results of the best performing strategy for each dimensionality and collecting them into a best-of collection of results, which we will refer to by the shorthand name algorithm-f-comb. This represents the results of running a fixed restart strategy which is able to choose the optimal run length (from the set of six used run lengths) based on the dimensionality of the function being solved. The results of the objective function value stagnation strategies are represented in an analogous way, under the name algorithm-h-comb.

Besides the already mentioned restart strategies, we decided to add four more, each based on a restart condition specific to one of the used local search algorithms. The shorthand names for these strategies are algorithm-special; they are described in table 2.

In order to save computing time, and as per the recommendation in [Han+13b], we used an additional termination criterion that halts the execution of a restart strategy after 100 restarts, even if the resource budget has not yet been exhausted and the solution has not been found. This does not impact the accuracy of the measurements, as 100 restarts is enough to provide a statistically significant amount of data, and the metrics which we use (see subsection 4.2) are not biased against results of runs which did not use up the entire resource budget. In fact, the fixed restart strategies f-100d, f-200d and f-500d always reach 100 restarts before they can fully exhaust their resource budgets.
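The following is a minimal sketch of the stagnation-based restart condition described in this subsection. The step-wise interface (make_search and search.step()) is a hypothetical stand-in for our actual local search implementations; only the restart logic itself follows the strategy definition above.

```python
def run_h_strategy(make_search, budget, hf, tf=1e-10):
    """Restart whenever the best value improved by less than tf
    over the last hf function evaluations."""
    evals = 0
    best_overall = float("inf")
    while evals < budget:
        search = make_search()           # new instance, random start point
        trace = []                       # (total evals, best value so far)
        best = float("inf")
        while evals < budget:
            value, used = search.step()  # one step; returns its eval cost
            evals += used
            best = min(best, value)
            trace.append((evals, best))
            # best value as recorded at least hf evaluations ago
            old = next((b for e, b in reversed(trace)
                        if e <= evals - hf), None)
            if old is not None and old - best < tf:
                break                    # stagnation detected, restart
        best_overall = min(best_overall, best)
    # (a fixed restart strategy would instead simply break once the
    #  current run has used nf evaluations)
    return best_overall
```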
Compass search: Restart when the variable a, which affects how far from the current solution the algorithm generates neighbour solutions, decreases below 10^-10. It naturally decreases as the algorithm converges, so checking its value makes for a good restart condition.

Nelder-Mead: We chose a condition similar to the one mentioned above. A restart is triggered when the distance between the two points of the simplex which are farthest apart from each other decreases below 10^-10. The rationale is similar: the simplex keeps growing smaller as the algorithm converges. It might be more mathematically proper to check the area (or volume, or hyper-volume, depending on the dimensionality) of the simplex, but we discarded this idea out of concern that it might be too computationally intensive.

BFGS: The restart condition is triggered if the norm of the gradient is smaller than 10^-10. Since the algorithm already uses information about the gradient, it makes sense to also use it for detecting convergence.

CMA-ES: The recommended settings for CMA-ES given in [Han11] suggest using 9 different restart conditions; here we use these recommended settings. Note that when using CMA-ES with the other restart strategies, we use only a single restart condition and the additional ones are disabled. In a sense, we are not using the algorithm to its full potential, but this allows for a more direct comparison with the other local search algorithms.

Table 2: Algorithm specific restart strategies

The idea of using the original pure versions of the MetaMax and MetaMax(∞) algorithms, which keep adding local search algorithm instances without limit, proved to be impractical due to their excessive computational resource requirements (for the length of experiments that were planned). Therefore, we performed measurements using only the modified versions of MetaMax and MetaMax(∞) with the added mechanism (described in subsection 3.1) for limiting the maximum number of instances.

For all MetaMax strategies, we used the recommended form of the estimate function: h(v_i) = e^(-v_i/v_t). Measurements were performed using the following MetaMax strategies:

1. MetaMax(k), with k=20, k=50 and k=100. This gives the same total number of local search algorithm instances as when using fixed restart strategies with run lengths equal to 5000d, 2000d and 1000d respectively. This makes it possible to evaluate the degree to which the MetaMax mechanism of selecting the most promising instances improves performance over these corresponding restart strategies. The expectation is that the success rate of MetaMax(k) will not increase, because the number of instances, and thus the ability to explore the search space, stays the same. However, MetaMax(k) should converge faster than the fixed restart strategies, because it should be able to identify the best instances and allocate resources to them appropriately.

2. MetaMax and MetaMax(∞) with the maximum number of instances set to 100. This should allow us to assess the benefits of the mechanism of adding new instances (and deleting old ones) by comparing the results with MetaMax(k), which uses the same number of instances each round but does not add or delete any. Here, we would expect an increase in success rate on multimodal
functions, as the additional instances generated each round should allow the algorithms to explore the search space more thoroughly. However, the limit of 100 instances will possibly still not be enough to achieve a good success rate on multimodal problems with high dimensionality.

3. MetaMax and MetaMax(∞) with the maximum number of instances set to 50d. This should allow the algorithms to scale better with the number of dimensions and, hopefully, further improve their performance. The number 50d was chosen as a reasonable compromise between computation time and expected performance. We expect to get the best results here.

Shorthand names for the MetaMax variants were chosen as algorithm-k-X for MetaMax(k), algorithm-m-X for MetaMax and algorithm-i-X for MetaMax(∞), where X is the maximum allowed number of instances (or, equivalently, the value of k for MetaMax(k)).

Fixed restart strategies
f-100d    Run length = 100d evaluations
f-200d    Run length = 200d evaluations
f-500d    Run length = 500d evaluations
f-1000d   Run length = 1000d evaluations
f-2000d   Run length = 2000d evaluations
f-5000d   Run length = 5000d evaluations
f-comb    Combined fixed restart strategy

Function value stagnation restart strategies
h-2d      History length = 2d evaluations
h-5d      History length = 5d evaluations
h-10d     History length = 10d evaluations
h-20d     History length = 20d evaluations
h-50d     History length = 50d evaluations
h-100d    History length = 100d evaluations
h-comb    Combined function value stagnation restart strategy

Other restart strategies
special   Special restart strategy specific to each algorithm, see table 2

MetaMax variants
k-20      MetaMax(k) with k=20
k-50      MetaMax(k) with k=50
k-100     MetaMax(k) with k=100
k-50d     MetaMax(k) with k=50d
m-100     MetaMax with maximum number of instances = 100
m-50d     MetaMax with maximum number of instances = 50d
i-100     MetaMax(∞) with maximum number of instances = 100
i-50d     MetaMax(∞) with maximum number of instances = 50d

Table 3: Tested multi-start strategies
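Since the MetaMax variants in table 3 differ mainly in how instances are selected each round, we include a schematic sketch of the selection step. It maps each instance i, with v_i resources used (as in the estimate function above) and best value f_i, to the point (h(v_i), f_i), and keeps the instances lying on the upper convex hull, found with Andrew's monotone chain algorithm (see subsection 4.3). Negating the objective values to adapt the maximization-oriented rule of [GK11] to minimization is our reading of that adaptation; the helper below is a sketch, not our exact implementation.

```python
import math

def select_instances(instances, vt):
    """Schematic MetaMax selection step. `instances` is a list of
    (v_i, f_i) pairs: evaluations used and best value found so far."""
    # Map instance i to the point (h(v_i), -f_i); the negation adapts
    # the maximization-oriented hull rule to minimization (assumption).
    pts = sorted((math.exp(-v / vt), -f, i)
                 for i, (v, f) in enumerate(instances))
    hull = []                  # Andrew's monotone chain, upper hull only
    for p in pts:
        while len(hull) >= 2:
            (x1, y1, _), (x2, y2, _) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:     # middle point not above the segment: drop
                hull.pop()     # (collinear points are dropped here; the
            else:              # exact treatment differs, see subsec. 4.3)
                break
        hull.append(p)
    return sorted(i for _, _, i in hull)   # indices of selected instances
```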
There is a number of additional interesting aspects of the MetaMax variants which would be worth testing and evaluating. For example:

• Comparison of MetaMax and MetaMax(∞) with and without the limit on the maximum number of instances.
• Performance of different methods of discarding old instances.
• Influence of different choices of the estimate function on performance.
• Performance of our proposed alternative method for selecting instances.

However, it was not practically possible (mainly time-wise) to perform full-sized (10^5 d function evaluation budget) experiments which would test all of these features. Therefore, we decided to make a series of smaller measurements, with the maximum number of function evaluations per trial set to 5000d, using only the dimensionalities d=5, d=10 and d=20 and using only the BFGS algorithm. This should allow us to test these features at least in a limited way and see if any of them warrant further attention. More specifically, we made the following series of measurements:

1. MetaMax and MetaMax(∞) without a limit on the maximum number of instances
2. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the most inactive instances
3. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the worst instances
4. MetaMax(k) with k=5d, k=10d and k=20d

These measurements were repeated three times: first using the recommended form of the estimate function h_1(v_i, v_t) = e^(-v_i/v_t), second with a simplified function h(v_i) = e^(v_i), and third using the proposed alternative instance selection method, based on selecting non-dominated points.

4.2 Used metrics

In this section, we describe the various metrics that were used to compare the results of different strategies. The simplest one is the success rate. For a set of trials U (usually of one strategy running on one or more benchmark functions) and a chosen target value t, it can be defined as

$$SR(U, t) = \frac{|\{u \in U : f_{best}(u) \le t\}|}{|U|} \qquad (6)$$

where |U| is the number of trials and |{u ∈ U : f_best(u) ≤ t}| is the number of trials which have found a solution at least as good as t. In the rest of this text we use a mean success rate, averaged over a set of target values T:

$$SR_m(U, T) = \frac{1}{|T|} \sum_{t \in T} SR(U, t) \qquad (7)$$
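As a quick illustration, equations (6) and (7) translate directly into a few lines of Python; the numbers below are made up for the example.

```python
def success_rate(bests, t):
    # bests: best objective value reached in each trial of U
    return sum(b <= t for b in bests) / len(bests)

def mean_success_rate(bests, targets):
    return sum(success_rate(bests, t) for t in targets) / len(targets)

bests = [1e-9, 3e-4, 0.02] * 5                       # 15 example trials
print(mean_success_rate(bests, [1e-2, 1e-5, 1e-8]))  # 0.444... (= 4/9)
```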
The main metric used in the COCO framework is the expected running time, or ERT. It estimates the expected number of function evaluations that a selected strategy will take to reach a target function value t for the first time, over a set of trials U. It is defined as

$$ERT(U, t) = \frac{1}{|\{u \in U : f_{best}(u) \le t\}|} \sum_{u \in U} evals(u, t) \qquad (8)$$

where evals(u, t) is the number of function evaluations used by trial u to reach target t, or the total number of evaluations used by u if it never reached t. The expression |{u ∈ U : f_best(u) ≤ t}| is the number of successful trials for target t. If there were no such trials, then ERT(U, t) = ∞. In the rest of this text we will use ERT averaged over a set of target values T, in a way similar to equation 7. We will also usually compute it using a set of trials obtained by running the same strategy on multiple different functions, usually all functions in one of the function groups described in table 1.

For comparing two or more strategies in terms of success rates and expected running times, we use graphs of the empirical cumulative distribution function of run lengths, or ECDF. Such a graph displays on the y-axis the percentage of trials for which ERT (averaged over a set of target values T) is lower than the number of evaluations x, where x is the corresponding value on the x-axis. It can also be said that, for each x, it shows the expected average success rate if a function evaluation budget equal to x was used. For easier comparison of ECDF graphs across different dimensionalities, the values on the x-axis are divided by the number of dimensions. The function displayed in the graph can then be defined as

$$y(x) = \frac{1}{|T||U|} \sum_{u \in U} |\{t \in T : ERT(t, u) \le xd\}| \qquad (9)$$

An example ECDF graph, like the ones used throughout the rest of the text, is given in figure 4. It shows the ERTs of two sets of trials measured by running two different strategies on the set of all benchmark functions, for d=10 and averaged over a set of 50 target values. The target values are logarithmically distributed in the interval [10^-8; 10^2]; we use this same set of target values in all our ECDF graphs. The marker × denotes the median number of function evaluations of unsuccessful trials, divided by the number of dimensions. Values to the right of this marker are (mostly) estimated using bootstrapping (for details of the bootstrapping method, please refer to [Han+13b]). The fact that we use 15 trials for each strategy-function pair means that the estimate is reliable only up to about fifteen times the number of evaluations marked by ×; this should be kept in mind when evaluating the results. The thick orange line in the plot represents the best results obtained during the 2009 BBOB workshop for the same set of problems and is provided for reference.
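Equation (8) above can likewise be transcribed directly; representing a trial as a (best value, evaluations) pair is our own convention for the sketch.

```python
def ert(trials, t):
    """trials: list of (f_best(u), evals(u, t)) pairs, where evals(u, t)
    follows the definition above: evaluations needed to reach t, or the
    total number of evaluations if t was never reached."""
    n_success = sum(best <= t for best, _ in trials)
    if n_success == 0:
        return float("inf")
    return sum(evals for _, evals in trials) / n_success
```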
Figure 4: Example ECDF graph (axes: proportion of trials against log10 of evaluations/D). Comparison of the results of MetaMax(k), with k=100, using the BFGS and Nelder-Mead local search algorithms (bfgs-k-100, nm-k-100) on the set of all benchmark functions, d=10. The strategy using BFGS clearly outperforms the other one, both in terms of success rate and speed of convergence.

Since we are dealing with a very large amount of measured results, it would be desirable to have a method of comparing them that is even more concise than ECDF graphs. To this end, we use a metric called the aggregate performance index (API), defined by Mr. Pošík in an article which was not yet published at the time of writing this text [Poš13]. It is based on the idea that the ECDF graph of the results of an ideal strategy, which solves the given problem instantly, would be a straight horizontal line across the top of the plot. Conversely, for the worst possible strategy imaginable, the graph would be a straight line along the bottom. It is apparent that the area above (or below) the graph makes for quite a natural measure of the effectiveness of different strategies. Given a set of ERTs A, their aggregate performance index can be computed as

$$API(A) = \exp\left(\frac{1}{|A|} \sum_{a \in A} \log_{10} a\right) \qquad (10)$$

For the purposes of computing API, the ERTs of unsuccessful trials, which are by definition ∞, have to be replaced with a value that is higher than the ERT of any successful trial. The choice of this value determines how much the unsuccessful trials are penalized, and thus affects the final API score. For our purposes, we chose the value 10^8 d. Since we compute API from the area above the graph, the lower its value, the better the corresponding strategy performs. Using API essentially allows us to represent the results of a set of trials by a single number and to easily compare the performances of different optimization strategies.
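Again as a sketch, equation (10) combined with the penalty value for unsuccessful trials described above:

```python
import math

def api(erts, d):
    # Replace infinite ERTs of unsuccessful trials by the penalty
    # 10^8 * d (this also caps any finite ERT above the penalty).
    penalty = 1e8 * d
    vals = [min(a, penalty) for a in erts]
    return math.exp(sum(math.log10(a) for a in vals) / len(vals))
```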
4.3 Implementation details

The software side of this project was implemented mostly in Python, with parts in C. The original plan was to write the project purely in Python, which was chosen because of its ease of use and the availability of many open-source scientific and mathematical libraries. However, during the project it was found that pure Python code performs too slowly and would not allow us to make all the necessary measurements. Therefore, parts of the program had to be rewritten in C, which improved performance to a reasonable level.

The used implementations of the BFGS and Nelder-Mead algorithms are based on code from the open-source SciPy library. They were modified to allow running the algorithms in single steps, which is necessary in order for them to work with MetaMax. An open-source implementation of CMA-ES was used, available at [Han13b]. The implementation of MetaMax was written based on its description in [GK11]. It was, however, necessary to make several small changes to it, mainly because it is designed with a maximization task in mind, while we needed to use it for minimization problems. For finding upper convex hulls we used Andrew's algorithm, with some additional pre- and post-processing to get the exact behaviour described in [GK11]. For a description of the source code, please see the file source/readme.txt on the attached CD.

5 Results

In this section we evaluate the results of the selected multi-start strategies. We decided to split the results into four subsections based on the used local search algorithm. We present and compare the results mainly using tables, which list APIs and success rates for different groups of functions and different dimensionalities. For convenience, the best results are highlighted with bold text. We also show ECDF graphs to illustrate particularly interesting results. Results of the smaller experiments described at the end of subsection 4.1, as well as results of timing measurements, are summarized in subsection 5.5.

The values of success rates and APIs shown in this section are computed only using data bootstrapped up to 10^5 d function evaluations. In our opinion, these values represent the real performance of the selected strategies better than if we were to use fully bootstrapped data, which are estimated to a large degree and therefore not as statistically reliable. In ECDF graphs, bootstrapped results are shown up to 10^7 d evaluations. All of the APIs and success rates are averaged over the set of multiple targets described in subsection 4.2.

The measured data are provided in their entirety on the attached CD (see appendix B), in the form of tarballed Python pickle files, which can be processed using the BBOB post-processing framework. It was not possible to provide the data in their original form, as text files, because their total size would be in the order of gigabytes, which would clearly not fit on the attached medium.

5.1 Compass search

Table 4 summarizes which of the used fixed restart and function value stagnation restart strategies were best for each dimensionality and were chosen for the best-of result collections cs-f-comb and cs-h-comb. Table 5 then compares these two sets of results together with the results obtained by the compass search specific restart strategy cs-special.
It is apparent that for the best strategies, the values of run length and function value history length increase with the number of dimensions. This is not unexpected, as compass search uses 2d or 2d-1 function evaluations at each step.

Dimensionality     d=2        d=3        d=5        d=10       d=20
Fixed              cs-f-100d  cs-f-100d  cs-f-200d  cs-f-500d  cs-f-500d
Stagnation based   cs-h-5d    cs-h-5d    cs-h-10d   cs-h-10d   cs-h-20d

Table 4: Compass search - best restart strategies for each dimensionality

The comparison of the best restart strategies suggests that all of them have quite similar overall performance, with cs-h-comb being a little better than the others in terms of success rate and cs-f-comb in terms of API. In the subsequent tables, we will provide the results of cs-f-comb for reference, as an example of a well-tuned restart strategy.

None of the strategies performs very well on multimodal and highly conditioned functions. This is to be expected, as the compass search algorithm is known to have trouble with ill-conditioned problems, and multimodal problems are difficult to solve for any algorithm.

Table 5: Compass search - results of restart strategies (log10 API and success rates [%] for cs-f-comb, cs-h-comb and cs-special, per function group and dimensionality; numeric body not reproduced)
A comparison of the results of the three MetaMax(k) strategies with the corresponding fixed restart strategies, which use the same total number of local search algorithm instances, is given in table 6. The results confirm our expectations and show that, overall, MetaMax(k) converges faster than a comparable fixed restart strategy, the only exception being the group separ. This can be explained by the fact that functions from this group are very simple and can generally be solved by a single run, or only very few runs, of the local search algorithm. In this case, the MetaMax mechanism of selecting multiple instances each round is more of a hindrance than a benefit. In terms of success rate, MetaMax(k) is always as good as or better than the comparable fixed restart strategy, with the improvement being especially obvious on the groups lcond and mult2. Of the three tested variants of MetaMax(k), cs-k-100 is the best overall. However, it is not better than a well-tuned restart strategy like cs-f-comb.

Figure 5 shows a behaviour which was observed across all function groups and dimensionalities when comparing MetaMax(k) with corresponding fixed restart strategies: at first, MetaMax(k) converges much more slowly than the restart strategy, as it is still in the phase of initialising all of its instances. However, as soon as this is finished, it starts converging quickly and overtakes the restart strategy for a certain interval. After that, its rate of convergence slows down again and it ends up with a success rate (for 10^5 d function evaluations) similar to that of the restart strategy. This effect seems to get less pronounced with an increasing number of dimensions.

Figure 5: Compass search - ECDF comparing MetaMax(k) (cs-k-50) with an equivalent fixed restart strategy (cs-f-2000d) on all benchmark functions, d=5

Results of comparing cs-k-100, cs-m-100 and cs-i-100 are shown in table 7. It is apparent that, using the same number of instances at a time, MetaMax and MetaMax(∞) clearly outperform MetaMax(k) on all function groups, both in terms of speed of convergence and in success rates. In general, they also provide results at least as good as, or better than, the best restart strategies.
Table 6: Compass search - results of MetaMax(k) and corresponding fixed restart strategies (log10 API and success rates [%] for cs-f-1000d, cs-f-2000d, cs-f-5000d, cs-k-20, cs-k-50, cs-k-100 and cs-f-comb, per function group and dimensionality; numeric body not reproduced)
There is almost no difference between the performance of cs-m-100 and cs-i-100, which corresponds with the results presented in [GK11]. The differences in performance seem to diminish with increasing dimensionality and, for d=10 and d=20, all of the MetaMax strategies which use 100 instances perform almost the same.

The ECDF graph in figure 6 shows an interesting behaviour, where cs-m-100 and cs-i-100 start converging right away and overtake cs-k-100 while it is still in the process of initializing all of its instances. After that, MetaMax(k) catches up, and for a certain interval all of the strategies perform the same. Then MetaMax(k) stops converging, the other two strategies overtake it again and ultimately achieve better success rates. The sudden stop in MetaMax(k) convergence presumably happens when all of its best instances have already found their local optima, after which there is no possibility of finding better solutions without adding new instances, which MetaMax(k) cannot do.

Figure 6: Compass search - ECDF of MetaMax variants using 100 instances (cs-k-100, cs-m-100, cs-i-100) on all benchmark functions, d=5

In the next set of measurements, using cs-m-50d, cs-i-50d and cs-k-50d, it became apparent that the increased limit on the maximum number of instances does not cause any noticeable increase in performance for MetaMax and MetaMax(∞). The performance of MetaMax(k) was somewhat improved, but overall it is still worse than the other two MetaMax variants and slightly worse than the best restart strategies. These results are also presented in table 7. The ECDF graph in figure 7 shows the results of cs-k-50d and cs-m-50d compared with the collection of best fixed restart strategy results cs-f-comb; we have omitted cs-i-50d, as its performance is very similar to that of cs-m-50d.

In conclusion, we can say that when using the compass search algorithm, MetaMax and MetaMax(∞) perform better than even well-tuned restart strategies, and that increasing the maximum number of allowed instances does not have any significant effect on their performance.
Table 7: Compass search - results of MetaMax strategies (log10 API and success rates [%] for cs-k-100, cs-m-100, cs-i-100, cs-k-50d, cs-m-50d, cs-i-50d and cs-f-comb, per function group and dimensionality; numeric body not reproduced)
Figure 7: Compass search - ECDF of MetaMax variants using 50d instances (cs-m-50d, cs-k-50d) compared with cs-f-comb on all benchmark functions, d=10

5.2 Nelder-Mead method

The best restart strategies for each dimensionality are listed in table 8 and their results are compared in table 9. For the fixed restart strategies, we see the expected behaviour, where the run lengths of the best strategies increase with the number of dimensions. However, there seem to be only two best objective function stagnation based strategies: nm-h-10d and nm-h-100d. Interestingly enough, the switch between them occurs between d=5 and d=10, which is also the point where the overall performance of the Nelder-Mead algorithm decreases dramatically.

Dimensionality     d=2        d=3        d=5        d=10        d=20
Fixed              nm-f-100d  nm-f-100d  nm-f-500d  nm-f-1000d  nm-f-5000d
Stagnation based   nm-h-10d   nm-h-10d   nm-h-10d   nm-h-100d   nm-h-100d

Table 8: Nelder-Mead - best restart strategies for each dimensionality

The algorithm performs very well for a low number of dimensions (d=2, d=3 and to some extent also d=5), with results for these dimensionalities approaching those of the best algorithms from the 2009 BBOB conference. On the other hand, its performance for higher dimensionalities is very poor, especially on the group hcond.

The three best-of restart strategies, compared in table 9, are all quite evenly matched, with nm-special being the best overall by a small margin and nm-f-comb being the worst.

The comparison of MetaMax(k) with the corresponding fixed restart strategies, given in table 10, shows that MetaMax(k) performs better on multimodal functions and worse on the other function groups.
Table 9: Nelder-Mead - results of restart strategies (log10 API and success rates [%] for nm-f-comb, nm-h-comb and nm-special, per function group and dimensionality; numeric body not reproduced)

It is also apparent that increasing the number of used instances for MetaMax(k) leads to a higher overall success rate and faster convergence on multimodal problems, but slower convergence on ill-conditioned functions, as is apparent from the ECDF graph in figure 8.

Figure 8: Nelder-Mead - ECDF comparing MetaMax(k) strategies (nm-k-20, nm-k-50, nm-k-100) on functions f10-14, d=10

In fact, the performance of the tested MetaMax(k) strategies on ill-conditioned
functions is worse than that of the corresponding restart strategies. This is the opposite of what was observed when using the compass search algorithm, and can be explained by the fact that the Nelder-Mead algorithm, unlike compass search, can handle ill-conditioned problems very quickly and with a high success rate (at least for low dimensionalities). Therefore, there is no need for the MetaMax mechanism of selecting multiple instances each round, as almost any instance is capable of finding the global optimum; selecting more than one at the same time only serves to decrease the rate of convergence. Overall, the three tested MetaMax(k) strategies perform only slightly better than the corresponding fixed restart strategies and are clearly worse than the best restart strategies, such as nm-special.

Table 11 shows the results of the other tested MetaMax strategies. Unfortunately, the measurements for all dimensionalities were not finished in time before the deadline of this thesis; table 11 therefore contains only partial results for some strategies. For the dimensionalities where the results of all the strategies are available, it is apparent that nm-m-100 and nm-i-100 outperform nm-k-100, both in terms of success rates and API. There are no significant differences in performance between the MetaMax variants using 100 and 50d local search algorithm instances, and no observable differences between the performance of MetaMax and MetaMax(∞).

In comparison with the restart strategy nm-special, MetaMax and MetaMax(∞) have better success rates on the function groups separ, multi and mult2 and, as a result, a better overall success rate. MetaMax and MetaMax(∞) also converge faster on multi and mult2, but are slower on lcond and hcond. The overall result is that they are better than the best restart strategy in terms of API for d=2 and d=3, but worse for d=5. Unfortunately, we cannot make comparisons for higher dimensionalities, where the results for MetaMax and MetaMax(∞) are not available. However, based on the fact that the advantage in performance of MetaMax over the restart strategy is smaller for d=3 than for d=2, and that the restart strategy is better for d=5, we can extrapolate that MetaMax would likely also perform worse for higher dimensionalities. Even if there were an improvement in performance, the fact remains that the Nelder-Mead method performs so poorly in higher dimensionalities that it is unlikely that MetaMax could improve it to a practical level.
Table 10: Nelder-Mead - results of MetaMax(k) and corresponding fixed restart strategies (log10 API and success rates [%] for nm-f-1000d, nm-f-2000d, nm-f-5000d, nm-k-20, nm-k-50, nm-k-100 and nm-special, per function group and dimensionality; numeric body not reproduced)
Table 11: Nelder-Mead - results of MetaMax strategies (log10 API and success rates [%] for nm-k-100, nm-m-100, nm-i-100, nm-k-50d, nm-m-50d, nm-i-50d and nm-special, per function group and dimensionality; numeric body not reproduced, and some entries were not measured)
5.3 BFGS

Results of the restart strategies bfgs-f-comb, bfgs-h-comb and bfgs-special are shown in table 13. The best fixed and objective function stagnation based restart strategies for each dimensionality, which were used to make bfgs-f-comb and bfgs-h-comb, are listed in table 12.

Dimensionality     d=2          d=3          d=5          d=10          d=20
Fixed              bfgs-f-100d  bfgs-f-100d  bfgs-f-200d  bfgs-f-1000d  bfgs-f-1000d
Stagnation based   bfgs-h-2d    bfgs-h-2d    bfgs-h-2d    bfgs-h-2d     bfgs-h-2d

Table 12: BFGS - best restart strategies for each dimensionality

For the selected best fixed restart strategies, we see the ordinary behaviour where run lengths increase with dimensionality. However, among the stagnation based restart strategies, bfgs-h-2d is apparently the best for all dimensionalities. This is quite unusual but, in hindsight, not entirely unexpected. It has to do with the way our implementation of BFGS works: at the beginning of each step, the algorithm estimates the gradient of the objective function using the finite difference method. This involves evaluating the objective function at a set of neighbouring solutions which are very close to the current solution. The number of these neighbour solutions is always 2d, one for each vector in a positive orthonormal basis of the search space. A very quick way to detect convergence is to check whether the objective function values at these points are worse than the function value at the current solution. As it turns out, this is precisely what bfgs-h-2d does, and also the reason why it works so well.

In contrast with the surprisingly good results of bfgs-h-2d, the special restart strategy, which is based on monitoring the norm of the estimated gradient, performs very poorly and is clearly the worst of all the tested restart strategies. The other two strategies, bfgs-f-comb and bfgs-h-comb, have very similar performance, with bfgs-h-comb being slightly better.

Overall, BFGS has excellent results on ill-conditioned problems, even exceeding the performance of the best algorithms from the BBOB 2009 conference for certain dimensionalities on the group hcond, as illustrated in figure 9. However, it performs quite poorly on multimodal functions (multi and mult2).

Table 14 sums up the results comparing MetaMax(k) with the corresponding fixed restart strategies. In terms of success rate, both types of strategies perform the same on all function groups. In terms of rate of convergence, expressed by the values of API, the results are similar to those observed when using the Nelder-Mead method: the MetaMax(k) strategies perform better on multi and mult2, but worse on lcond, hcond and separ. The overall performance of MetaMax(k) across all function groups is worse than that of the corresponding fixed restart strategies, and consequently also worse than the performance of the best restart strategies.
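The effect exploited by bfgs-h-2d is easy to demonstrate. The sketch below probes the 2d neighbouring points used by a finite-difference gradient estimate; near a local optimum, every probe evaluates worse than the current point, so a stagnation window of hf = 2d evaluations triggers almost immediately. This mimics the mechanism only; it is not the SciPy BFGS code we actually used.

```python
import numpy as np

def fd_probes(f, x, eps=1e-8):
    """Objective values at the 2d points probed by a finite-difference
    gradient estimate: x + eps*e_i and x - eps*e_i for each coordinate."""
    d = len(x)
    values = []
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        values.append(f(x + e))
        values.append(f(x - e))
    return values

sphere = lambda z: float(np.dot(z, z))
x = np.zeros(3)                       # the minimum of the sphere function
print(all(v > sphere(x) for v in fd_probes(sphere, x)))  # True
```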