Modern software systems are increasingly built for dynamic environments, using configuration capabilities to adapt to changes and external uncertainties. In a self-adaptation context, we are often interested in reasoning about the performance of a system under different configurations. Usually, we learn a black-box model based on real measurements to predict the performance of the system given a specific configuration. However, as modern systems grow more complex, many configuration parameters may interact, so we end up having to learn over an exponentially large configuration space. Naturally, this does not scale when relying on real measurements in the actual changing environment. We propose a different solution: instead of taking the measurements from the real system, we learn the model using samples from other sources, such as simulators that approximate the performance of the real system at low cost.
4. Motivation
[Figure: histograms of observation counts over average read latency (µs) for (a) cass-20 (latencies up to 5 x 10^4 µs) and (b) cass-10 (latencies between 1000 and 2000 µs).]
Best configurations vs. worst configurations.
Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)
Look at these outliers! Large statistical dispersion.
5. Problem-solution overview
• Robotic software is considered a highly configurable system.
• The configuration of the robot influences its performance (as well as its energy usage).
• Problem: there are many different parameters, making the configuration space highly dimensional and the influence of configuration on system performance difficult to understand.
• Solution: learn a black-box performance model using measurements from the robot: performance_value = f(configuration) (a minimal sketch follows this list).
• Challenge: measurements on the real robot are expensive (time-consuming, demanding of human resources, and risky on failure).
• Our contribution: performing most measurements on a simulator (Gazebo) and taking only a few samples from the real robot to learn a reliable and accurate performance model within a given experimental budget.
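To make the black-box learning step concrete, here is a minimal sketch in Python. The parameter names, sample values, and the choice of a random-forest regressor are illustrative assumptions; the slide only fixes the interface performance_value = f(configuration).

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical robot configurations: (particle count, sensor rate in Hz,
# planner frequency in Hz), with measured performance (e.g., task time in s).
X = np.array([[100, 5, 10],
              [200, 5, 20],
              [400, 10, 20],
              [800, 10, 40]])
y = np.array([2.1, 1.7, 1.4, 1.3])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)                        # learn performance_value = f(configuration)
print(model.predict([[300, 5, 20]]))   # predict an unmeasured configuration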
9. Goal
Let $X_i$ indicate the $i$-th configuration parameter, which ranges in a finite domain $Dom(X_i)$. In general, $X_i$ may either indicate (i) an integer variable such as the level of parallelism, or (ii) a categorical variable such as the messaging framework, or a Boolean variable such as enabling timeout. We use the terms parameter and factor interchangeably; also, with the term option we refer to possible values that can be assigned to a parameter.

We assume that each configuration $x \in \mathbb{X}$ in the configuration space $\mathbb{X} = Dom(X_1) \times \cdots \times Dom(X_d)$ is valid, i.e., the system accepts this configuration and the corresponding test results in a stable performance behavior. The response with configuration $x$ is denoted by $f(x)$. Throughout, we assume that $f(\cdot)$ is latency; however, other response metrics may be used. We here consider the problem of finding an optimal configuration $x^*$ that globally minimizes $f(\cdot)$ over $\mathbb{X}$:

$$x^* = \arg\min_{x \in \mathbb{X}} f(x)$$
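As a worked illustration of this optimization problem: when the configuration space is small enough to enumerate and every configuration can be measured, $x^*$ is a brute-force argmin. The domains and the stand-in response function below are assumptions for illustration only; the surrounding text explains why this exhaustive approach is infeasible with real measurements.

import itertools

# Illustrative domains: X = Dom(X1) x Dom(X2) x Dom(X3).
domains = {
    "parallelism": [1, 2, 4, 8],           # integer option
    "framework": ["kafka", "zeromq"],      # categorical option
    "timeout_enabled": [False, True],      # Boolean option
}

def f(cfg):
    # Stand-in for measuring the latency of configuration cfg.
    return 100.0 / cfg["parallelism"] + (5.0 if cfg["timeout_enabled"] else 0.0)

configs = [dict(zip(domains, vals)) for vals in itertools.product(*domains.values())]
x_star = min(configs, key=f)               # x* = argmin over X of f(x)
print(x_star, f(x_star))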
In fact, the response function $f(\cdot)$ is usually unknown or only partially known, i.e., $y_i = f(x_i)$ for $x_i \subset \mathbb{X}$. In practice, such measurements may contain noise, i.e., $y_i = f(x_i) + \epsilon$. The determination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. A promising solution is sequential sampling, which starts with a number of initial configurations. The performance of the system under these initial samples can deliver information about the shape of the response function and guide the generation of the next samples. If properly guided, the process of sample generation-feedback-regeneration will eventually converge and the optimal configuration will be located.
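A compact sketch of that sample generation-feedback-regeneration loop, using the variance across a random forest's trees as an assumed uncertainty signal for picking the next sample (the papers excerpted below use GP-based machinery instead; this is only a stand-in):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
pool = rng.uniform(0, 10, size=(200, 1))     # candidate configurations

def measure(x):                              # noisy response: f(x) + eps
    return np.sin(x[:, 0]) + rng.normal(0, 0.1, len(x))

X = pool[rng.choice(len(pool), 5, replace=False)]   # initial design
y = measure(X)
for _ in range(10):                          # feedback-regeneration loop
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    spread = np.stack([t.predict(pool) for t in model.estimators_]).var(axis=0)
    nxt = pool[[int(np.argmax(spread))]]     # most uncertain candidate next
    X = np.vstack([X, nxt])
    y = np.concatenate([y, measure(nxt)])
print(pool[int(np.argmin(model.predict(pool)))])    # estimated optimum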
Rather than starting the search from scratch, we propose to transfer the learned knowledge from similar versions of the software to accelerate the search in the current version. This idea is driven by real software engineering practice: (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and on similar platforms (e.g., cloud clusters); and (iii) different versions of a system often share a similar business logic. To the best of our knowledge, only one study [9] explored the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a model that has been learned previously but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.
2.4 Motivation
A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream of data.
Callouts: partially known; taking measurements; configuration space; response function to learn; learn model.
A black-box response function $f: \mathbb{X} \to \mathbb{R}$ is used to build a performance model given some observations of the system performance under different settings. In practice, though, the observation data may contain noise, i.e., $y_i = f(x_i) + \epsilon_i$, $x_i \in \mathbb{X}$, where $\epsilon_i \sim \mathcal{N}(0, \sigma_i)$, and we only partially know the response function through the observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{d}$, $|\mathcal{D}| \ll |\mathbb{X}|$. In other words, a response function is simply a mapping from the configuration space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers). Note that we can learn a model for all measurable attributes (including accuracy, and safety if suitably operationalized), but here we mostly train predictive models on performance attributes (e.g., response time, throughput, CPU utilization).

Our main goal is to learn a reliable regression model, $\hat{f}(\cdot)$, that can predict the performance of the system, $f(\cdot)$, given a limited number of observations $\mathcal{D}$. More specifically, we aim to minimize the prediction error over the configuration space:

$$\arg\min_{x \in \mathbb{X}} \; pe = |\hat{f}(x) - f(x)| \quad (1)$$

In order to solve the problem above, we assume $f(\cdot)$ is reasonably smooth, but otherwise little is known about the response function.
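A small sketch of evaluating the objective in Eq. (1), assuming access to held-out measurements of the true response (the functions below are stand-ins, not the paper's subject systems):

import numpy as np

def true_f(x):                      # stand-in for the real response function
    return 50.0 + 10.0 * np.sin(x)

def f_hat(x):                       # stand-in for the learned regression model
    return 50.0 + 9.0 * np.sin(x)

holdout = np.linspace(0, 10, 100)   # validation configurations
pe = np.abs(f_hat(holdout) - true_f(holdout))
print(pe.max(), pe.mean())          # worst-case and mean prediction error

In practice $f(x)$ is only available through a handful of noisy measurements, so the prediction error can only be estimated on held-out samples.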
Quick Optimization via Guessing (QOG) [26] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations.

Machine learning. Standard machine-learning techniques, such as support-vector machines, decision trees, and evolutionary algorithms, have also been tried [20], [38]. These approaches trade simplicity and understandability of the learned models for predictive power. For example, recent work [39] exploited a characteristic of the response surface of configurable software to learn Fourier-sparse functions from only a small sample size. Another approach [31] also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.

2) Positioning in the self-adaptive community: Performance reasoning is a key activity for decision making at runtime. Time-series techniques [10] have been shown to be effective in predicting response time and uncovering performance anomalies ahead of time. FUSION [12], [11] exploited inter-feature relationships (e.g., feature dependencies) to reduce the dimensions of the configuration space, making performance reasoning feasible.
10. Cost model
III. COST-AWARE TRANSFER LEARNING
In this section, we explain our cost-aware transfer learning solution to improve learning an accurate performance model.

A. Solution overview
We use cheap observations from a source (e.g., a simulator), $\mathcal{D}_s = \{(x_s, y_s) \mid y_s = g(x_s) + \epsilon_s, \; x_s \in \mathbb{X}\}$, and transfer these observations to learn a performance model for the real system using only few observations from that system, $\mathcal{D}_t = \{(x_t, y_t) \mid y_t = f(x_t) + \epsilon_t, \; x_t \in \mathbb{X}\}$. The cost of obtaining each observation on the target system is $c_t$, and we assume that $c_s \ll c_t$ and that the source response function, $g(\cdot)$, is related (correlated) to the target function; this relatedness contributes to the learning of the target function using only sparse samples from the target system, i.e., $|\mathcal{D}_t| \ll |\mathcal{D}_s|$. Intuitively, rather than starting from scratch, we transfer the learned knowledge from data collected using cheap methods, such as simulators or previous system versions, to learn a more accurate model of the target configurable system and use this model to reason about its performance at runtime.

In Figure 2, we demonstrate the idea of transfer learning with a synthetic example. Fig. 2: (a) 9 samples have been taken from a source and, respectively, 3 from a target; (b) [without transfer learning] a regression model is constructed using only the target samples; (c) [with transfer learning] a regression model is constructed using the target and the source samples; the new model is based on only 3 target measurements while providing accurate predictions; (d) [negative transfer] a regression model is constructed with transfer from a misleading response function and therefore the model prediction is inaccurate.

For the choice of the number of samples from the source and target, we use a simple cost model (cf. Figure 1) based on the number of source and target samples, as follows:

$$C(\mathcal{D}_s, \mathcal{D}_t) = c_s \cdot |\mathcal{D}_s| + c_t \cdot |\mathcal{D}_t| + C_{Tr}(|\mathcal{D}_s|, |\mathcal{D}_t|), \quad (2)$$
$$C(\mathcal{D}_s, \mathcal{D}_t) \le C_{max}, \quad (3)$$

where $C_{max}$ is the experimental budget and $C_{Tr}(|\mathcal{D}_s|, |\mathcal{D}_t|)$ is the cost of model training, which is a function of the source and target sample sizes.
Callouts: $\mathcal{D}_s$ (source observations); $\mathcal{D}_t$ (target observations); $c_s \ll c_t$ (assumption); $C_{max}$ (budget); $C(\mathcal{D}_s, \mathcal{D}_t)$ (total cost).
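The cost model in Eq. (2)-(3) is directly executable. In this sketch the per-sample costs, the cubic training-cost term, and the budget are all assumed values for illustration:

def training_cost(n_source, n_target):
    # Assumed stand-in: GP training scales cubically with sample count.
    return 1e-9 * (n_source + n_target) ** 3

def total_cost(n_source, n_target, c_s=0.1, c_t=10.0):
    # C(Ds, Dt) = cs*|Ds| + ct*|Dt| + CTr(|Ds|, |Dt|)    (Eq. 2)
    return c_s * n_source + c_t * n_target + training_cost(n_source, n_target)

C_max = 500.0                                            # budget (Eq. 3)
feasible = [(ns, nt)
            for ns in range(0, 2001, 100)
            for nt in range(0, 51, 5)
            if total_cost(ns, nt) <= C_max]
print(max(feasible, key=lambda p: p[0] + p[1]))          # a largest feasible design

Because c_s is much smaller than c_t, the budget admits many simulator samples but only a few robot samples, which is exactly the regime the approach targets.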
12. Assumption: Source and target are correlated
[Figure: four synthetic response functions, panels (a)-(d), over the configuration domain (x in [0, 10], responses on the order of 10^4), corresponding to the four source-target relationships listed below.]
Relevant to the software engineering community, several approaches have tried different experimental designs for highly configurable software [14], and some even consider cost as an explicit factor to determine optimal sampling [26]. Recently, researchers have tried novel ways of sampling with feedback embedded inside the process, where new samples are derived based on the information gained from the previous set of samples. Recursive Random Sampling (RRS) [33] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [32] integrates importance sampling with Latin Hypercube Design (LHD); SHC estimates a local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [35] forms a simplex in the parameter space from a number of samples and iteratively updates the simplex through a number of well-defined operations, including reflection, expansion, and contraction, to guide the sample generation. Quick Optimization via Guessing (QOG) [23] speeds up the optimization process by exploiting some heuristics to filter out sub-optimal configurations. Some recent work [34] exploited a characteristic of the response surface of configurable software to learn Fourier-sparse functions from only a small sample size. Another approach also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.
In some time-constrained environments (e.g., runtime decision making in a feedback loop for robots), it is important to select the sources purposefully.

C. Problem formulation: Model learning
In order to introduce the concepts in our approach precisely and concisely, we define the model learning problem using mathematical notation. Let $X_i$ indicate the $i$-th configuration parameter, which ranges in a finite domain $Dom(X_i)$. In general, $X_i$ may either indicate (i) an integer variable such as the number of iterative refinements in a localization algorithm or (ii) a categorical variable such as sensor names or binary options (e.g., local vs. global localization method). Therefore, the configuration space is mathematically a Cartesian product of all of the domains of the parameters of interest, $\mathbb{X} = Dom(X_1) \times \cdots \times Dom(X_d)$.

A configuration $x$ resides in the design parameter space, $x \in \mathbb{X}$. A black-box response function $f: \mathbb{X} \to \mathbb{R}$ is used to build a performance model given some observations of the system performance under different settings, $\mathcal{D} \subseteq \mathbb{X}$. In practice, though, such measurements may contain noise, i.e., $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma_i)$. In other words, a performance model is simply a function (mapping) from the configuration space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers).
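A sketch of this configuration space construction, using the parameter kinds named in the text (the concrete domains are assumptions):

import itertools

# An integer parameter and two categorical/binary parameters, as in the text.
Dom = {
    "num_refinements": [1, 2, 5, 10],      # iterative refinements (integer)
    "sensor": ["lidar", "rgbd"],           # sensor names (assumed options)
    "localization": ["local", "global"],   # local vs. global method (binary)
}

# X = Dom(X1) x ... x Dom(Xd): Cartesian product of the parameter domains.
X = [dict(zip(Dom, vals)) for vals in itertools.product(*Dom.values())]
print(len(X))                              # |X| = 4 * 2 * 2 = 16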
a) Source may be a shifted version of the target
b) Source may be a noisy version of the target
c) Source may differ from the target in some extreme regions of the configuration space
d) Source may be irrelevant to the target -> negative transfer will happen!
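These four relationships are easy to synthesize. The functions below are assumed shapes chosen to mimic the figure, not the actual functions behind it:

import numpy as np

x = np.linspace(0, 10, 100)
rng = np.random.default_rng(1)
target = 1e4 * (1.5 + np.sin(x))                     # stand-in target response

source_a = target + 5e3                              # (a) shifted version
source_b = target + rng.normal(0, 2e3, x.size)       # (b) noisy version
source_c = target.copy()                             # (c) differs at extremes
source_c[x > 8] *= 0.5
source_d = 1e4 * rng.uniform(0, 3, x.size)           # (d) unrelated source

for name, s in [("a", source_a), ("b", source_b), ("c", source_c), ("d", source_d)]:
    print(name, round(float(np.corrcoef(s, target)[0, 1]), 2))

The printed correlations are high for (a)-(c) and near zero for (d), which is the case where transfer hurts rather than helps.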
16. GP for modeling the black-box response function
[Figure: a GP fit to a 1-D response function, showing the true function, GP mean, GP variance, observations, the selected next point, and the true minimum.]
In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for predicting the posterior distribution of response functions; all computations in this framework are based on tractable linear algebra. A GP model is composed of its prior mean ($\mu(\cdot): \mathbb{X} \to \mathbb{R}$) and a covariance function ($k(\cdot, \cdot): \mathbb{X} \times \mathbb{X} \to \mathbb{R}$) [41]:

$$y = f(x) \sim \mathcal{GP}(\mu(x), k(x, x')),$$

where the covariance $k(x, x')$ defines the distance between $x$ and $x'$. Let $S_{1:t} = \{(x_{1:t}, y_{1:t}) \mid y_i := f(x_i)\}$ be the collection of $t$ experimental observations. In this framework, we treat $f(x)$ as a random variable conditioned on the observations $S_{1:t}$, which is normally distributed with the following posterior mean and variance functions [41]:

$$\mu_t(x) = \mu(x) + \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu})$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 \mathbf{I} - \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}(x)$$

where $\mathbf{y} := y_{1:t}$, $\mathbf{k}(x)^\intercal = [k(x, x_1) \; k(x, x_2) \; \ldots \; k(x, x_t)]$, $\boldsymbol{\mu} := \mu(x_{1:t})$, $\mathbf{K} := k(x_i, x_j)$, and $\mathbf{I}$ is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.

3.2 TL4CO: an extension to multi-tasks
TL4CO uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration to measure.
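The posterior equations above are a few lines of linear algebra. This sketch assumes an RBF kernel and a constant prior mean, neither of which is fixed by the excerpt:

import numpy as np

def rbf(A, B, ell=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xstar, sigma=0.1, mu=0.0):
    # mu_t(x)      = mu + k(x)^T (K + sigma^2 I)^(-1) (y - mu)
    # sigma_t^2(x) = k(x, x) + sigma^2 - k(x)^T (K + sigma^2 I)^(-1) k(x)
    K = rbf(X, X) + sigma**2 * np.eye(len(X))
    Ks = rbf(Xstar, X)
    mean = mu + Ks @ np.linalg.solve(K, y - mu)
    var = (rbf(Xstar, Xstar).diagonal() + sigma**2
           - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T)))
    return mean, var

X = np.array([[1.0], [3.0], [6.0]])        # observed configurations
y = np.sin(X[:, 0])                        # observed responses
mean, var = gp_posterior(X, y, np.array([[2.0], [5.0]]))
print(mean, var)                           # predictions with uncertainty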
Motivations:
1- mean estimates + variance
2- all computations are linear algebra
3- good estimates even when data are few
GP models can also incorporate domain knowledge as prior, if available, which can enhance the model accuracy [20]. In order to describe the technical details of our transfer learning methodology, let us briefly give an overview of GP model regression; a more detailed description can be found elsewhere [35]. GP models assume that the function $\hat{f}(x)$ can be interpreted as a probability distribution over functions:

$$y = \hat{f}(x) \sim \mathcal{GP}(\mu(x), k(x, x')), \quad (4)$$

where $\mu: \mathbb{X} \to \mathbb{R}$ is the mean function and $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ is the covariance function (kernel function), which describes the relationship between response values, $y$, according to the distance of the input values $x, x'$. The mean and variance of the GP model predictions can be derived analytically [35]:

$$\mu_t(x) = \mu(x) + \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu}), \quad (5)$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 \mathbf{I} - \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}(x), \quad (6)$$

where $\mathbf{k}(x)^\intercal = [k(x, x_1) \; k(x, x_2) \; \ldots \; k(x, x_t)]$, $\mathbf{I}$ is the identity matrix, and

$$\mathbf{K} := \begin{bmatrix} k(x_1, x_1) & \ldots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \ldots & k(x_t, x_t) \end{bmatrix} \quad (7)$$

GP models have shown to be effective for performance predictions in data-scarce domains [20]. However, as we have demonstrated in Figure 2, they may become inaccurate when the samples do not cover the space uniformly. For highly configurable systems, we require a large number of observations to cover the space uniformly, making GP models ineffective in such situations.

C. Model prediction using transfer learning
In transfer learning, the key question is how to make accurate predictions for the target environment using observations from other sources, $\mathcal{D}_s$. We need a measure of relatedness not only between input configurations but also between the sources. The relationships between input configurations are captured in the GP models using the covariance matrix defined by the kernel function in Eq. (7). More specifically, a kernel is a function that computes a dot product (a measure of "similarity") between two input configurations, so the kernel helps to get accurate predictions for similar configurations. We now need to exploit the relationship between the source and target functions, $g, f$, using the current observations $\mathcal{D}_s, \mathcal{D}_t$ to build the predictive model $\hat{f}$. To capture the relationship, we define the following kernel function:

$$k(f, g, x, x') = k_t(f, g) \times k_{xx}(x, x'), \quad (8)$$

where $k_t$ captures the relatedness between the source and target functions and $k_{xx}$ is the covariance function over input configurations.
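A sketch of how Eq. (8) turns into a prediction: stack source and target samples, scale cross-task covariance entries by an assumed correlation rho (a stand-in for $k_t(f, g)$), and reuse the standard GP mean of Eq. (5) with the enlarged K. The data and rho are synthetic:

import numpy as np

def rbf(A, B, ell=1.0):                    # kxx(x, x'): input kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def mtgp_mean(Xs, ys, Xt, yt, Xstar, rho=0.9, sigma=0.1):
    # k(f, g, x, x') = kt(f, g) * kxx(x, x')            (Eq. 8)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    task = np.array([0] * len(Xs) + [1] * len(Xt))
    Kt = np.where(task[:, None] == task[None, :], 1.0, rho)
    K = Kt * rbf(X, X) + sigma**2 * np.eye(len(X))
    kt_star = np.where(task == 1, 1.0, rho)  # test points live in the target task
    Ks = kt_star * rbf(Xstar, X)
    return Ks @ np.linalg.solve(K, y)        # posterior mean as in Eq. (5)

Xs = np.linspace(0, 10, 9)[:, None]          # 9 cheap source samples
ys = np.sin(Xs[:, 0])                        # source response g
Xt = np.array([[1.0], [5.0], [9.0]])         # only 3 expensive target samples
yt = np.sin(Xt[:, 0]) + 0.2                  # target response f (shifted source)
print(mtgp_mean(Xs, ys, Xt, yt, np.array([[3.0], [7.0]])))

Setting rho = 0 recovers the target-only GP of Figure 2(b); rho close to 1 lets the 9 source points shape the prediction between the 3 target points.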
The mean and variance predictions are then given by Eq. (5), (6) with the new $\mathbf{K}$. The main essence of transfer learning is, therefore, the kernel that captures the source and target relationship and provides more accurate predictions using the additional knowledge we can gain via the relationship between source and target.

D. Transfer learning in a self-adaptation loop
Now that we have described the idea of transfer learning for providing more accurate predictions, the question is whether such an idea can be applied at runtime and how self-adaptive systems can benefit from it. More specifically, we now describe the idea of model learning and transfer learning in the context of self-optimization, where the system adapts its configuration to meet performance requirements at runtime. The difference to traditional configurable systems is that we learn the performance model online in a feedback loop under time and resource constraints. Such performance reasoning is done more frequently for self-adaptation purposes.

An overview of a self-optimization solution is depicted in Figure 3, following the well-known MAPE-K framework [9], [23]. We consider the GP model as the K (knowledge) component of this framework, which acts as an interface to which other components can query the performance under specific configurations or update the model given a new observation. We use transfer learning to make the knowledge more accurate using observations that are taken from a simulator or any other cheap source. For deciding how many observations and from what source to transfer, we use the cost model that we introduced earlier. At runtime, the managed system is Monitored by pulling the end-to-end performance metrics (e.g., latency, throughput) from the corresponding sensors. Then, the retrieved performance data is Analysed, and the mean performance associated with a specific setting of the system is stored in a data repository. Next, the GP model is updated taking into account the new performance observation. Having updated the GP model, a new configuration may be Planned to replace the current configuration. Finally, the new configuration is enacted by Executing appropriate platform-specific operations. This enables model-based knowledge evolution using machine learning [2], [21]. The underlying GP model can now be updated not only when a new observation is available but also by transferring the learning from other related sources. So at each adaptation cycle, we can update our belief about the correct response given data from the managed system and other related sources, accelerating the learning process.

IV. EXPERIMENTAL RESULTS
We evaluate the effectiveness and applicability of our transfer learning approach for learning models for highly-configurable systems.
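A skeleton of this MAPE-K cycle with the GP model as the shared K component. Everything here is a stub to show the control flow; component internals and names are assumptions:

import random

class Knowledge:                         # K: the (transfer-learned) GP model
    def __init__(self):
        self.observations = []
    def update(self, config, latency):   # fed by new or transferred samples
        self.observations.append((config, latency))
    def predict(self, config):
        # Stand-in for the GP posterior mean of Eq. (5).
        return sum(config.values())

def monitor():                           # M: pull metrics from sensors
    return {"latency_ms": random.uniform(10.0, 50.0)}

def analyse(metrics):                    # A: aggregate the mean performance
    return metrics["latency_ms"]

def plan(knowledge, candidates):         # P: pick the configuration K favors
    return min(candidates, key=knowledge.predict)

def execute(config):                     # E: platform-specific enactment
    print("switching to", config)

knowledge = Knowledge()
candidates = [{"cache_mb": 64}, {"cache_mb": 128}]
current = candidates[0]
for _ in range(3):                       # adaptation cycles
    knowledge.update(dict(current), analyse(monitor()))
    current = plan(knowledge, candidates)
    execute(current)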
17. The case where we learn from correlated responses
[Figure: (a) three sample response functions (1), (2), (3) over the configuration domain, with observations; (b) GP fit for (1) ignoring the observations for (2), (3): the LCB acquisition is not informative; (c) multi-task GP fit for (1) by transfer learning from (2), (3): highly informative. Legend: GP prediction mean, GP prediction variance, probability distribution of the minimizers.]