Modern software systems are increasingly built for dynamic environments, using configuration capabilities to adapt to changes and external uncertainties. In a self-adaptation context, we are often interested in reasoning about the performance of a system under different configurations. Usually, we learn a black-box model based on real measurements to predict the performance of the system given a specific configuration. However, as modern systems grow more complex, many configuration parameters may interact, so we end up having to learn over an exponentially large configuration space. Naturally, this does not scale when relying on real measurements in the actual changing environment. We propose a different solution: instead of taking the measurements from the real system, we learn the model using samples from other sources, such as simulators that approximate the performance of the real system at low cost.
4. Motivation
[Figure: histograms of observation counts over average read latency (µs) for (a) cass-20 (latencies up to 5 x 10^4 µs) and (b) cass-10 (latencies between 1000 and 2000 µs).]
Best configurations vs. worst configurations.
Experiments on Apache Cassandra:
- 6 parameters, 1024 configurations
- Average read latency
- 10 million records (cass-10)
- 20 million records (cass-20)
Look at these outliers! Large statistical dispersion.
5. Problem-solution overview
• Robotic software is considered a highly configurable system.
• The configuration of the robot influences its performance (as well as its energy usage).
• Problem: there are many different parameters, making the configuration space highly dimensional and the influence of configuration on system performance difficult to understand.
• Solution: learn a black-box performance model using measurements from the robot: performance_value = f(configuration) (a minimal sketch follows this list).
• Challenge: measurements on the real robot are expensive (time-consuming, demanding of human resources, and risky on failure).
• Our contribution: performing most measurements on a simulator (Gazebo) and taking only a few samples from the real robot to learn a reliable and accurate performance model within a given experimental budget.
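To make the black-box learning step concrete, here is a minimal sketch in Python. The parameter names, sample values, and the choice of a random-forest regressor are illustrative assumptions; the slide only fixes the interface performance_value = f(configuration).

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical robot configurations: (particle count, sensor rate in Hz,
# planner frequency in Hz), with measured performance (e.g., task time in s).
X = np.array([[100, 5, 10],
              [200, 5, 20],
              [400, 10, 20],
              [800, 10, 40]])
y = np.array([2.1, 1.7, 1.4, 1.3])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)                        # learn performance_value = f(configuration)
print(model.predict([[300, 5, 20]]))   # predict an unmeasured configuration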
9. Goal
Let $X_i$ indicate the $i$-th configuration parameter, which ranges in a finite domain $Dom(X_i)$. In general, $X_i$ may either indicate (i) an integer variable such as the level of parallelism, or (ii) a categorical variable such as the messaging framework, or a Boolean variable such as enabling timeout. We use the terms parameter and factor interchangeably; also, with the term option we refer to possible values that can be assigned to a parameter.

We assume that each configuration $x \in \mathbb{X}$ in the configuration space $\mathbb{X} = Dom(X_1) \times \cdots \times Dom(X_d)$ is valid, i.e., the system accepts this configuration and the corresponding test results in a stable performance behavior. The response with configuration $x$ is denoted by $f(x)$. Throughout, we assume that $f(\cdot)$ is latency; however, other response metrics may be used. We here consider the problem of finding an optimal configuration $x^*$ that globally minimizes $f(\cdot)$ over $\mathbb{X}$:

$$x^* = \arg\min_{x \in \mathbb{X}} f(x)$$
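As a worked illustration of this optimization problem: when the configuration space is small enough to enumerate and every configuration can be measured, $x^*$ is a brute-force argmin. The domains and the stand-in response function below are assumptions for illustration only; the surrounding text explains why this exhaustive approach is infeasible with real measurements.

import itertools

# Illustrative domains: X = Dom(X1) x Dom(X2) x Dom(X3).
domains = {
    "parallelism": [1, 2, 4, 8],           # integer option
    "framework": ["kafka", "zeromq"],      # categorical option
    "timeout_enabled": [False, True],      # Boolean option
}

def f(cfg):
    # Stand-in for measuring the latency of configuration cfg.
    return 100.0 / cfg["parallelism"] + (5.0 if cfg["timeout_enabled"] else 0.0)

configs = [dict(zip(domains, vals)) for vals in itertools.product(*domains.values())]
x_star = min(configs, key=f)               # x* = argmin over X of f(x)
print(x_star, f(x_star))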
In fact, the response function $f(\cdot)$ is usually unknown or only partially known, i.e., $y_i = f(x_i)$ for $x_i \subset \mathbb{X}$. In practice, such measurements may contain noise, i.e., $y_i = f(x_i) + \epsilon$. The determination of the optimal configuration is thus a black-box optimization program subject to noise [27, 33], which is considerably harder than deterministic optimization. A promising solution is sequential sampling, which starts with a number of initial configurations. The performance of the system under these initial samples can deliver information about the shape of the response function and guide the generation of the next samples. If properly guided, the process of sample generation-feedback-regeneration will eventually converge and the optimal configuration will be located.
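A compact sketch of that sample generation-feedback-regeneration loop, using the variance across a random forest's trees as an assumed uncertainty signal for picking the next sample (the papers excerpted below use GP-based machinery instead; this is only a stand-in):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
pool = rng.uniform(0, 10, size=(200, 1))     # candidate configurations

def measure(x):                              # noisy response: f(x) + eps
    return np.sin(x[:, 0]) + rng.normal(0, 0.1, len(x))

X = pool[rng.choice(len(pool), 5, replace=False)]   # initial design
y = measure(X)
for _ in range(10):                          # feedback-regeneration loop
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    spread = np.stack([t.predict(pool) for t in model.estimators_]).var(axis=0)
    nxt = pool[[int(np.argmax(spread))]]     # most uncertain candidate next
    X = np.vstack([X, nxt])
    y = np.concatenate([y, measure(nxt)])
print(pool[int(np.argmin(model.predict(pool)))])    # estimated optimum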
Rather than starting the search from scratch, we propose to transfer the learned knowledge from similar versions of the software to accelerate the search in the current version. This idea is driven by real software engineering practice: (i) in DevOps, different versions of a system are delivered continuously; (ii) Big Data systems are developed using similar frameworks (e.g., Apache Hadoop, Spark, Kafka) and on similar platforms (e.g., cloud clusters); and (iii) different versions of a system often share a similar business logic. To the best of our knowledge, only one study [9] explored the possibility of transfer learning in system configuration. The authors learn a Bayesian network in the tuning process of a system and reuse this model for tuning other similar systems. However, the learning is limited to the structure of the Bayesian network. In this paper, we introduce a method that reuses not only a model that has been learned previously but also the valuable raw data. Therefore, we are not limited to the accuracy of the learned model. Moreover, we do not consider Bayesian networks and instead focus on MTGPs.
2.4 Motivation
A motivating example. We now illustrate the previous points on an example. WordCount (cf. Figure 1) is a popular benchmark [12]. WordCount features a three-layer architecture that counts the number of words in the incoming stream of data.
Callouts: partially known; taking measurements; configuration space; response function to learn; learn model.
A black-box response function $f: \mathbb{X} \to \mathbb{R}$ is used to build a performance model given some observations of the system performance under different settings. In practice, though, the observation data may contain noise, i.e., $y_i = f(x_i) + \epsilon_i$, $x_i \in \mathbb{X}$, where $\epsilon_i \sim \mathcal{N}(0, \sigma_i)$, and we only partially know the response function through the observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{d}$, $|\mathcal{D}| \ll |\mathbb{X}|$. In other words, a response function is simply a mapping from the configuration space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers). Note that we can learn a model for all measurable attributes (including accuracy, and safety if suitably operationalized), but here we mostly train predictive models on performance attributes (e.g., response time, throughput, CPU utilization).

Our main goal is to learn a reliable regression model, $\hat{f}(\cdot)$, that can predict the performance of the system, $f(\cdot)$, given a limited number of observations $\mathcal{D}$. More specifically, we aim to minimize the prediction error over the configuration space:

$$\arg\min_{x \in \mathbb{X}} \; pe = |\hat{f}(x) - f(x)| \quad (1)$$

In order to solve the problem above, we assume $f(\cdot)$ is reasonably smooth, but otherwise little is known about the response function.
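A small sketch of evaluating the objective in Eq. (1), assuming access to held-out measurements of the true response (the functions below are stand-ins, not the paper's subject systems):

import numpy as np

def true_f(x):                      # stand-in for the real response function
    return 50.0 + 10.0 * np.sin(x)

def f_hat(x):                       # stand-in for the learned regression model
    return 50.0 + 9.0 * np.sin(x)

holdout = np.linspace(0, 10, 100)   # validation configurations
pe = np.abs(f_hat(holdout) - true_f(holdout))
print(pe.max(), pe.mean())          # worst-case and mean prediction error

In practice $f(x)$ is only available through a handful of noisy measurements, so the prediction error can only be estimated on held-out samples.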
Quick Optimization via Guessing (QOG) [26] speeds up the optimization process by exploiting heuristics to filter out sub-optimal configurations.

Machine learning. Standard machine-learning techniques, such as support-vector machines, decision trees, and evolutionary algorithms, have also been tried [20], [38]. These approaches trade simplicity and understandability of the learned models for predictive power. For example, recent work [39] exploited a characteristic of the response surface of configurable software to learn Fourier-sparse functions from only a small sample size. Another approach [31] also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.

2) Positioning in the self-adaptive community: Performance reasoning is a key activity for decision making at runtime. Time-series techniques [10] have been shown to be effective in predicting response time and uncovering performance anomalies ahead of time. FUSION [12], [11] exploited inter-feature relationships (e.g., feature dependencies) to reduce the dimensions of the configuration space, making performance reasoning feasible.
10. Cost model
III. COST-AWARE TRANSFER LEARNING
In this section, we explain our cost-aware transfer learning solution to improve learning an accurate performance model.

A. Solution overview
We use cheap observations from a source (e.g., a simulator), $\mathcal{D}_s = \{(x_s, y_s) \mid y_s = g(x_s) + \epsilon_s, \; x_s \in \mathbb{X}\}$, and transfer these observations to learn a performance model for the real system using only few observations from that system, $\mathcal{D}_t = \{(x_t, y_t) \mid y_t = f(x_t) + \epsilon_t, \; x_t \in \mathbb{X}\}$. The cost of obtaining each observation on the target system is $c_t$, and we assume that $c_s \ll c_t$ and that the source response function, $g(\cdot)$, is related (correlated) to the target function; this relatedness contributes to the learning of the target function using only sparse samples from the target system, i.e., $|\mathcal{D}_t| \ll |\mathcal{D}_s|$. Intuitively, rather than starting from scratch, we transfer the learned knowledge from data collected using cheap methods, such as simulators or previous system versions, to learn a more accurate model of the target configurable system and use this model to reason about its performance at runtime.

In Figure 2, we demonstrate the idea of transfer learning with a synthetic example. Fig. 2: (a) 9 samples have been taken from a source and, respectively, 3 from a target; (b) [without transfer learning] a regression model is constructed using only the target samples; (c) [with transfer learning] a regression model is constructed using the target and the source samples; the new model is based on only 3 target measurements while providing accurate predictions; (d) [negative transfer] a regression model is constructed with transfer from a misleading response function and therefore the model prediction is inaccurate.

For the choice of the number of samples from the source and target, we use a simple cost model (cf. Figure 1) based on the number of source and target samples, as follows:

$$C(\mathcal{D}_s, \mathcal{D}_t) = c_s \cdot |\mathcal{D}_s| + c_t \cdot |\mathcal{D}_t| + C_{Tr}(|\mathcal{D}_s|, |\mathcal{D}_t|), \quad (2)$$
$$C(\mathcal{D}_s, \mathcal{D}_t) \le C_{max}, \quad (3)$$

where $C_{max}$ is the experimental budget and $C_{Tr}(|\mathcal{D}_s|, |\mathcal{D}_t|)$ is the cost of model training, which is a function of the source and target sample sizes.
Callouts: $\mathcal{D}_s$ (source observations); $\mathcal{D}_t$ (target observations); $c_s \ll c_t$ (assumption); $C_{max}$ (budget); $C(\mathcal{D}_s, \mathcal{D}_t)$ (total cost).
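The cost model in Eq. (2)-(3) is directly executable. In this sketch the per-sample costs, the cubic training-cost term, and the budget are all assumed values for illustration:

def training_cost(n_source, n_target):
    # Assumed stand-in: GP training scales cubically with sample count.
    return 1e-9 * (n_source + n_target) ** 3

def total_cost(n_source, n_target, c_s=0.1, c_t=10.0):
    # C(Ds, Dt) = cs*|Ds| + ct*|Dt| + CTr(|Ds|, |Dt|)    (Eq. 2)
    return c_s * n_source + c_t * n_target + training_cost(n_source, n_target)

C_max = 500.0                                            # budget (Eq. 3)
feasible = [(ns, nt)
            for ns in range(0, 2001, 100)
            for nt in range(0, 51, 5)
            if total_cost(ns, nt) <= C_max]
print(max(feasible, key=lambda p: p[0] + p[1]))          # a largest feasible design

Because c_s is much smaller than c_t, the budget admits many simulator samples but only a few robot samples, which is exactly the regime the approach targets.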
12. Assumption: Source and target are correlated
[Figure: four synthetic response functions, panels (a)-(d), over the configuration domain (x in [0, 10], responses on the order of 10^4), corresponding to the four source-target relationships listed below.]
Relevant to the software engineering community, several approaches have tried different experimental designs for highly configurable software [14], and some even consider cost as an explicit factor to determine optimal sampling [26]. Recently, researchers have tried novel ways of sampling with feedback embedded inside the process, where new samples are derived based on the information gained from the previous set of samples. Recursive Random Sampling (RRS) [33] integrates a restarting mechanism into random sampling to achieve high search efficiency. Smart Hill Climbing (SHC) [32] integrates importance sampling with Latin Hypercube Design (LHD); SHC estimates a local regression at each potential region, then searches toward the steepest descent direction. An approach based on direct search [35] forms a simplex in the parameter space from a number of samples and iteratively updates the simplex through a number of well-defined operations, including reflection, expansion, and contraction, to guide the sample generation. Quick Optimization via Guessing (QOG) [23] speeds up the optimization process by exploiting some heuristics to filter out sub-optimal configurations. Some recent work [34] exploited a characteristic of the response surface of configurable software to learn Fourier-sparse functions from only a small sample size. Another approach also exploited this fact, but iteratively constructs a regression model representing performance influences in an active learning process.
In some time-constrained environments (e.g., runtime decision making in a feedback loop for robots), it is important to select the sources purposefully.

C. Problem formulation: Model learning
In order to introduce the concepts in our approach precisely and concisely, we define the model learning problem using mathematical notation. Let $X_i$ indicate the $i$-th configuration parameter, which ranges in a finite domain $Dom(X_i)$. In general, $X_i$ may either indicate (i) an integer variable such as the number of iterative refinements in a localization algorithm or (ii) a categorical variable such as sensor names or binary options (e.g., local vs. global localization method). Therefore, the configuration space is mathematically a Cartesian product of all of the domains of the parameters of interest, $\mathbb{X} = Dom(X_1) \times \cdots \times Dom(X_d)$.

A configuration $x$ resides in the design parameter space, $x \in \mathbb{X}$. A black-box response function $f: \mathbb{X} \to \mathbb{R}$ is used to build a performance model given some observations of the system performance under different settings, $\mathcal{D} \subseteq \mathbb{X}$. In practice, though, such measurements may contain noise, i.e., $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma_i)$. In other words, a performance model is simply a function (mapping) from the configuration space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers).
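A sketch of this configuration space construction, using the parameter kinds named in the text (the concrete domains are assumptions):

import itertools

# An integer parameter and two categorical/binary parameters, as in the text.
Dom = {
    "num_refinements": [1, 2, 5, 10],      # iterative refinements (integer)
    "sensor": ["lidar", "rgbd"],           # sensor names (assumed options)
    "localization": ["local", "global"],   # local vs. global method (binary)
}

# X = Dom(X1) x ... x Dom(Xd): Cartesian product of the parameter domains.
X = [dict(zip(Dom, vals)) for vals in itertools.product(*Dom.values())]
print(len(X))                              # |X| = 4 * 2 * 2 = 16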
a) Source may be a shifted version of the target
b) Source may be a noisy version of the target
c) Source may differ from the target in some extreme regions of the configuration space
d) Source may be irrelevant to the target -> negative transfer will happen!
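These four relationships are easy to synthesize. The functions below are assumed shapes chosen to mimic the figure, not the actual functions behind it:

import numpy as np

x = np.linspace(0, 10, 100)
rng = np.random.default_rng(1)
target = 1e4 * (1.5 + np.sin(x))                     # stand-in target response

source_a = target + 5e3                              # (a) shifted version
source_b = target + rng.normal(0, 2e3, x.size)       # (b) noisy version
source_c = target.copy()                             # (c) differs at extremes
source_c[x > 8] *= 0.5
source_d = 1e4 * rng.uniform(0, 3, x.size)           # (d) unrelated source

for name, s in [("a", source_a), ("b", source_b), ("c", source_c), ("d", source_d)]:
    print(name, round(float(np.corrcoef(s, target)[0, 1]), 2))

The printed correlations are high for (a)-(c) and near zero for (d), which is the case where transfer hurts rather than helps.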
16. GP for modeling the black-box response function
[Figure: a GP fit to a 1-D response function, showing the true function, GP mean, GP variance, observations, the selected next point, and the true minimum.]
In our previous work [21], we proposed BO4CO, which exploits single-task GPs (no transfer learning) for predicting the posterior distribution of response functions; all computations in this framework are based on tractable linear algebra. A GP model is composed of its prior mean ($\mu(\cdot): \mathbb{X} \to \mathbb{R}$) and a covariance function ($k(\cdot, \cdot): \mathbb{X} \times \mathbb{X} \to \mathbb{R}$) [41]:

$$y = f(x) \sim \mathcal{GP}(\mu(x), k(x, x')),$$

where the covariance $k(x, x')$ defines the distance between $x$ and $x'$. Let $S_{1:t} = \{(x_{1:t}, y_{1:t}) \mid y_i := f(x_i)\}$ be the collection of $t$ experimental observations. In this framework, we treat $f(x)$ as a random variable conditioned on the observations $S_{1:t}$, which is normally distributed with the following posterior mean and variance functions [41]:

$$\mu_t(x) = \mu(x) + \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu})$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 \mathbf{I} - \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}(x)$$

where $\mathbf{y} := y_{1:t}$, $\mathbf{k}(x)^\intercal = [k(x, x_1) \; k(x, x_2) \; \ldots \; k(x, x_t)]$, $\boldsymbol{\mu} := \mu(x_{1:t})$, $\mathbf{K} := k(x_i, x_j)$, and $\mathbf{I}$ is the identity matrix. The shortcoming of BO4CO is that it cannot exploit observations regarding other versions of the system and therefore cannot be applied in DevOps.

3.2 TL4CO: an extension to multi-tasks
TL4CO uses MTGPs that exploit observations from previous versions of the system under test. Algorithm 1 defines the internal details of TL4CO. As Figure 4 shows, TL4CO is an iterative algorithm that uses the learning from other system versions. In a high-level overview, TL4CO: (i) selects the most informative past observations (details in Section 3.3); (ii) fits a model to existing data based on kernel learning (details in Section 3.4); and (iii) selects the next configuration to measure.
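The posterior equations above are a few lines of linear algebra. This sketch assumes an RBF kernel and a constant prior mean, neither of which is fixed by the excerpt:

import numpy as np

def rbf(A, B, ell=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xstar, sigma=0.1, mu=0.0):
    # mu_t(x)      = mu + k(x)^T (K + sigma^2 I)^(-1) (y - mu)
    # sigma_t^2(x) = k(x, x) + sigma^2 - k(x)^T (K + sigma^2 I)^(-1) k(x)
    K = rbf(X, X) + sigma**2 * np.eye(len(X))
    Ks = rbf(Xstar, X)
    mean = mu + Ks @ np.linalg.solve(K, y - mu)
    var = (rbf(Xstar, Xstar).diagonal() + sigma**2
           - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T)))
    return mean, var

X = np.array([[1.0], [3.0], [6.0]])        # observed configurations
y = np.sin(X[:, 0])                        # observed responses
mean, var = gp_posterior(X, y, np.array([[2.0], [5.0]]))
print(mean, var)                           # predictions with uncertainty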
Motivations:
1- mean estimates + variance
2- all computations are linear algebra
3- good estimates even when data are few
GP models can also incorporate domain knowledge as prior, if available, which can enhance the model accuracy [20]. In order to describe the technical details of our transfer learning methodology, let us briefly give an overview of GP model regression; a more detailed description can be found elsewhere [35]. GP models assume that the function $\hat{f}(x)$ can be interpreted as a probability distribution over functions:

$$y = \hat{f}(x) \sim \mathcal{GP}(\mu(x), k(x, x')), \quad (4)$$

where $\mu: \mathbb{X} \to \mathbb{R}$ is the mean function and $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ is the covariance function (kernel function), which describes the relationship between response values, $y$, according to the distance of the input values $x, x'$. The mean and variance of the GP model predictions can be derived analytically [35]:

$$\mu_t(x) = \mu(x) + \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \boldsymbol{\mu}), \quad (5)$$
$$\sigma_t^2(x) = k(x, x) + \sigma^2 \mathbf{I} - \mathbf{k}(x)^\intercal (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}(x), \quad (6)$$

where $\mathbf{k}(x)^\intercal = [k(x, x_1) \; k(x, x_2) \; \ldots \; k(x, x_t)]$, $\mathbf{I}$ is the identity matrix, and

$$\mathbf{K} := \begin{bmatrix} k(x_1, x_1) & \ldots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \ldots & k(x_t, x_t) \end{bmatrix} \quad (7)$$

GP models have shown to be effective for performance predictions in data-scarce domains [20]. However, as we have demonstrated in Figure 2, they may become inaccurate when the samples do not cover the space uniformly. For highly configurable systems, we require a large number of observations to cover the space uniformly, making GP models ineffective in such situations.

C. Model prediction using transfer learning
In transfer learning, the key question is how to make accurate predictions for the target environment using observations from other sources, $\mathcal{D}_s$. We need a measure of relatedness not only between input configurations but also between the sources. The relationships between input configurations are captured in the GP models using the covariance matrix defined by the kernel function in Eq. (7). More specifically, a kernel is a function that computes a dot product (a measure of "similarity") between two input configurations, so the kernel helps to get accurate predictions for similar configurations. We now need to exploit the relationship between the source and target functions, $g, f$, using the current observations $\mathcal{D}_s, \mathcal{D}_t$ to build the predictive model $\hat{f}$. To capture the relationship, we define the following kernel function:

$$k(f, g, x, x') = k_t(f, g) \times k_{xx}(x, x'), \quad (8)$$

where $k_t$ captures the relatedness between the source and target functions and $k_{xx}$ is the covariance function over input configurations.
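A sketch of how Eq. (8) turns into a prediction: stack source and target samples, scale cross-task covariance entries by an assumed correlation rho (a stand-in for $k_t(f, g)$), and reuse the standard GP mean of Eq. (5) with the enlarged K. The data and rho are synthetic:

import numpy as np

def rbf(A, B, ell=1.0):                    # kxx(x, x'): input kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def mtgp_mean(Xs, ys, Xt, yt, Xstar, rho=0.9, sigma=0.1):
    # k(f, g, x, x') = kt(f, g) * kxx(x, x')            (Eq. 8)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    task = np.array([0] * len(Xs) + [1] * len(Xt))
    Kt = np.where(task[:, None] == task[None, :], 1.0, rho)
    K = Kt * rbf(X, X) + sigma**2 * np.eye(len(X))
    kt_star = np.where(task == 1, 1.0, rho)  # test points live in the target task
    Ks = kt_star * rbf(Xstar, X)
    return Ks @ np.linalg.solve(K, y)        # posterior mean as in Eq. (5)

Xs = np.linspace(0, 10, 9)[:, None]          # 9 cheap source samples
ys = np.sin(Xs[:, 0])                        # source response g
Xt = np.array([[1.0], [5.0], [9.0]])         # only 3 expensive target samples
yt = np.sin(Xt[:, 0]) + 0.2                  # target response f (shifted source)
print(mtgp_mean(Xs, ys, Xt, yt, np.array([[3.0], [7.0]])))

Setting rho = 0 recovers the target-only GP of Figure 2(b); rho close to 1 lets the 9 source points shape the prediction between the 3 target points.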
The mean and variance predictions are then given by Eq. (5), (6) with the new $\mathbf{K}$. The main essence of transfer learning is, therefore, the kernel that captures the source and target relationship and provides more accurate predictions using the additional knowledge we can gain via the relationship between source and target.

D. Transfer learning in a self-adaptation loop
Now that we have described the idea of transfer learning for providing more accurate predictions, the question is whether such an idea can be applied at runtime and how self-adaptive systems can benefit from it. More specifically, we now describe the idea of model learning and transfer learning in the context of self-optimization, where the system adapts its configuration to meet performance requirements at runtime. The difference to traditional configurable systems is that we learn the performance model online in a feedback loop under time and resource constraints. Such performance reasoning is done more frequently for self-adaptation purposes.

An overview of a self-optimization solution is depicted in Figure 3, following the well-known MAPE-K framework [9], [23]. We consider the GP model as the K (knowledge) component of this framework, which acts as an interface to which other components can query the performance under specific configurations or update the model given a new observation. We use transfer learning to make the knowledge more accurate using observations that are taken from a simulator or any other cheap source. For deciding how many observations and from what source to transfer, we use the cost model that we introduced earlier. At runtime, the managed system is Monitored by pulling the end-to-end performance metrics (e.g., latency, throughput) from the corresponding sensors. Then, the retrieved performance data is Analysed, and the mean performance associated with a specific setting of the system is stored in a data repository. Next, the GP model is updated taking into account the new performance observation. Having updated the GP model, a new configuration may be Planned to replace the current configuration. Finally, the new configuration is enacted by Executing appropriate platform-specific operations. This enables model-based knowledge evolution using machine learning [2], [21]. The underlying GP model can now be updated not only when a new observation is available but also by transferring the learning from other related sources. So at each adaptation cycle, we can update our belief about the correct response given data from the managed system and other related sources, accelerating the learning process.

IV. EXPERIMENTAL RESULTS
We evaluate the effectiveness and applicability of our transfer learning approach for learning models for highly-configurable systems.
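A skeleton of this MAPE-K cycle with the GP model as the shared K component. Everything here is a stub to show the control flow; component internals and names are assumptions:

import random

class Knowledge:                         # K: the (transfer-learned) GP model
    def __init__(self):
        self.observations = []
    def update(self, config, latency):   # fed by new or transferred samples
        self.observations.append((config, latency))
    def predict(self, config):
        # Stand-in for the GP posterior mean of Eq. (5).
        return sum(config.values())

def monitor():                           # M: pull metrics from sensors
    return {"latency_ms": random.uniform(10.0, 50.0)}

def analyse(metrics):                    # A: aggregate the mean performance
    return metrics["latency_ms"]

def plan(knowledge, candidates):         # P: pick the configuration K favors
    return min(candidates, key=knowledge.predict)

def execute(config):                     # E: platform-specific enactment
    print("switching to", config)

knowledge = Knowledge()
candidates = [{"cache_mb": 64}, {"cache_mb": 128}]
current = candidates[0]
for _ in range(3):                       # adaptation cycles
    knowledge.update(dict(current), analyse(monitor()))
    current = plan(knowledge, candidates)
    execute(current)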
17. The case where we learn from correlated responses
[Figure: (a) three sample response functions (1), (2), (3) over the configuration domain, with observations; (b) GP fit for (1) ignoring the observations for (2), (3): the LCB acquisition is not informative; (c) multi-task GP fit for (1) by transfer learning from (2), (3): highly informative. Legend: GP prediction mean, GP prediction variance, probability distribution of the minimizers.]