Autonomous control in Big
Data platforms: an
experience with Cassandra
Emiliano Casalicchio (emc@bth.se)
Joint research with:
Lars Lundberg and Sogand Shirinbab
Computer Science Dep.
Blekinge Institute of Technology
ACROSS - Rome Meeting
Research framework
• Scalable resource-efficient systems for big data analytics
• Awarded by the Knowledge Foundation, Sweden (20140032)
Agenda
• Big Data Platforms
• Main properties
• Why autonomous control is important
• Challenges
• The Cassandra case study
• Conclusions
The NIST BDRA: Big Data Framework Providers
• Big Data Applications
• BD Processing Frameworks (batch, interactive, streaming)
  • e.g. MapReduce, Flink, Mahout, Storm, pbdR, Tez, Spark, Esper, WSO2-CEP
• BD Platforms (logical data organization and distribution, access API)
  • e.g. HDFS, Cassandra, HBase, Dynamo, PNUTS, …
  • File systems
    • Google File System
    • Apache Hadoop File System (HDFS)
  • NoSQL data stores
    • HBase, BigTable
    • Cassandra
    • Dynamo, DynamoDB
    • Sherpa, PNUTS
• Infrastructures (networking, computing, storage)
Properties of NoSQL data stores
• Scalability
  • Throughput / dataset size
• Availability
  • Data replication
• Eventual consistency
  • Consistency level for R/W, to trade off availability and latency
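The availability/latency trade-off is tunable per operation: a read is guaranteed to see the latest write when the read and write consistency levels overlap on at least one replica. A minimal sketch of that standard Cassandra/Dynamo quorum rule (not code from the talk):

```python
def is_strongly_consistent(read_cl: int, write_cl: int, rf: int) -> bool:
    """True when any read quorum overlaps any write quorum: R + W > RF."""
    return read_cl + write_cl > rf

# QUORUM reads + QUORUM writes on RF=3: 2 + 2 > 3, reads see the latest write
print(is_strongly_consistent(2, 2, 3))   # True
# ONE + ONE on RF=3: lower latency and higher availability, but only eventual consistency
print(is_strongly_consistent(1, 1, 3))   # False
```

Lowering either level trades consistency for latency and availability, which is exactly the knob the slide refers to.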
Autonomic control is a must
• System complexity
  • Human-assisted management is unrealistic
• Security
  • Complete automation of procedures
  • Self-configuration, self-healing and self-protection
• Optimization
  • Self-optimization
Two approaches
• Single-layer adaptation: e.g. auto-scaling of DB nodes, self-configuration of DB parameters
• Multi-layer adaptation: e.g. orchestration of DB-node auto-scaling and VM placement on top of the physical infrastructure

The layered stack both approaches act on:
• Big Data Applications + processing framework
• Platforms (logical data organization and distribution, access API), e.g. HDFS, Cassandra, HBase, Dynamo, PNUTS, …
• Virtual infrastructure
• Physical infrastructure
Issues in single-layer adaptation
• Interference between infrastructure adaptation and platform adaptation
• Platform properties can limit infrastructure-level adaptation actions
  • E.g. the effect of auto-scaling can be limited by serialization constraints
  • Geographical distribution (and network configuration) can conflict with the latency/availability trade-off at the platform layer
• Infrastructure adaptation can hurt NoSQL data store properties
  • E.g. two or more replicas on the same PM impact node reliability and consistency-level reliability
An example
With RF=3, each node stores a replica of the data set:
• 3 DB nodes (VMs), each on a different PM: reliability = 1-(1-r)^3; if r=0.9, reliability is 0.999
• 3 DB nodes spread over 2 PMs: reliability = 0.99
• all 3 DB nodes on a single PM: reliability = 0.9
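The three placements above can be checked numerically. A small sketch, assuming independent PM failures and that replicas hosted on the same PM fail together:

```python
def vdc_reliability(placement, r=0.9):
    """Probability that at least one replica survives, given a mapping of
    the replicas to PM ids: 1 - (1 - r)^(number of distinct PMs used)."""
    return 1 - (1 - r) ** len(set(placement))

print(round(vdc_reliability([1, 2, 3]), 6))  # three PMs: 0.999
print(round(vdc_reliability([1, 1, 2]), 6))  # two PMs: 0.99
print(round(vdc_reliability([1, 1, 1]), 6))  # one PM: 0.9
```

This is why consolidating all replicas onto one PM, although energy-efficient, degrades the platform-level reliability the slide warns about.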
Multi-layer adaptation
• It means coordinating, at run time:
  • self-configuration of the BD platform
  • deployment of the platform on the virtual infrastructure
  • allocation/placement of the virtual infrastructure on the physical infrastructure
• The challenges are:
  • to formulate an optimization model that accounts for all the dependencies and constraints imposed by the system architecture
  • to formulate multi-objective functions that account for conflicting objectives at the infrastructure level and the application level
    • e.g. minimizing power consumption and maximizing platform reliability
The Cassandra case study
• E. Casalicchio, L. Lundberg, S. Shirinbab, "Energy-aware adaptation in managed Cassandra datacenters", IEEE International Conference on Cloud and Autonomic Computing (ICCAC 2016), Augsburg, Germany, September 12-16, 2016
• E. Casalicchio, L. Lundberg, S. Shirinbab, "Energy-aware Auto-scaling Algorithms for Cassandra Virtual Data Centers", Cluster Computing, Elsevier (to appear, June 2017)
Motivations
• Data storage (or data serving) systems play an important role in the cloud and big data industry
• Their management is a challenging task
  • The complexity increases when multi-tenancy is considered
  • Human-assisted control is unrealistic
  • There is a growing demand for autonomic solutions
• Our industrial partner Ericsson AB is interested in the autonomic management of Cassandra
Problem description
• We consider a provider of a managed Apache Cassandra service
• The applications, or tenants, of the service are independent and each uses its own Cassandra Virtual Data Center (VDC)
• The service provider wants to maintain SLAs, which requires:
  • properly planning the capacity and the configuration of each Cassandra VDC
  • dynamically adapting the infrastructure and VDC configuration without disrupting performance
  • minimizing power consumption
Solution proposed (1000 ft view)
• An energy-aware adaptation model specifically designed for Cassandra VDCs running on a cloud infrastructure
  • Architectural constraints imposed by Cassandra (minimum number of nodes, homogeneity of nodes, replication factor and heap size)
  • Constraints on throughput and replication factor imposed by the SLA
  • Power consumption model based on CPU utilization
• An adaptation policy for the Cassandra VDC configuration and the cloud infrastructure configuration, which orchestrates three strategies:
  • Vertical scaling
  • Horizontal scaling
  • Optimal placement
Solution proposed (deep details)
• A workload and SLA model
• A system architecture model
• A throughput model
• The utility function and problem formulation
• Drawbacks of the optimal solution, and alternatives
• Experimental results
Workload and SLA model
• Workload features
  • the type of requests, e.g. read only, write only, read & write, scan, or a combination of those
  • the rate of the operation requests
  • the size of the dataset
• Workload types
  • CPU bound, memory bound
• The data replication factor
• SLA: the tuple ⟨l_i, T_i^min, D_i, r_i⟩, where l_i is the request type (R, W, or RW with a 75/25 mix), T_i^min the minimum throughput, D_i the replication factor, and r_i the dataset size
Architecture model
• Homogeneous physical machines (H)
• VMs of different type and size (V)
• A VDC is composed of n_i homogeneous VMs
  • n_i ≥ D_i
  • At least D_i vnodes out of n_i must run on different PMs
• The datacenter configuration is described by a vector x = [x_{i,j,h}], where x_{i,j,h} is the number of vnodes serving application i with VM configuration j allocated on PM h

TABLE I. Baseline throughput t0_{l_i,j} as a function of c_j (virtual CPUs), m_j (GB), heapSize_j (GB) and l_i. The throughput is measured in operations/second (ops/sec).

  j   c_j   m_j   heapSize_j   R          W         RW
  1   8     32    8            16.6×10^3  8.3×10^3  13.3×10^3
  2   4     16    4            8.3×10^3   8.3×10^3  8.3×10^3
  3   2     16    4            3.3×10^3   3.3×10^3  3.3×10^3

TABLE II. Memory available for the dataset in a Cassandra vnode (JVM heap) as a function of the VM memory size.

  m_j (RAM size in GB)               1     2   4   8   16   32
  heapSize_j (max heap size in GB)   0.5   1   1   2   4    8

T_i^min is the minimum throughput the service provider must guarantee to process the requests from application i. The SLA parameters D_i and r_i are used to determine the number of vnodes to be instantiated, as discussed in the next section.
Architecture model (cont'd)
• To make a VDC CPU bound we need n_i ≥ D_i · r_i / heapSize_j, if r_i > heapSize_j (Eq. 1)
• y_{i,j} = 1 if application i uses VM configuration j to run Cassandra vnodes; otherwise y_{i,j} = 0
• s_{i,h} = 1 if a Cassandra vnode serving application i runs on PM h; otherwise s_{i,h} = 0
• Vertical scaling is modelled assuming it is possible to switch from configuration j1 to j2 at runtime

If r_i > heapSize_j, Eq. 1 holds; otherwise, the constraint n_i ≥ D_i holds. The number n_i of vnodes is defined as

  n_i = Σ_{j∈J, h∈H} x_{i,j,h}   ∀i ∈ I   (2)

Considering that in our industrial case r_i ≥ heapSize_j for all configurations j, the constraints introduced above are modelled by the following equations:

  Σ_{j∈J, h∈H} x_{i,j,h} ≥ D_i · r_i / heapSize_j   ∀i ∈ I   (3)
  Σ_{j∈J} y_{i,j} = 1   ∀i ∈ I   (4)
  Σ_{h∈H} s_{i,h} ≥ D_i   ∀i ∈ I   (5)
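The sizing rule behind Eqs. 1 and 3 is easy to compute. A sketch with illustrative SLA values (the VM parameters come from Tables I and II):

```python
from math import ceil

def min_vnodes(D_i, r_i, heap_size_j):
    """Minimum number of Cassandra vnodes for tenant i: the replication
    factor D_i, raised to D_i * r_i / heapSize_j when the replicated
    dataset does not fit in a single vnode's JVM heap (CPU-bound regime)."""
    if r_i > heap_size_j:
        return max(D_i, ceil(D_i * r_i / heap_size_j))
    return D_i

# Replication factor 3, 50 GB dataset, 8 GB heap per vnode (VM type 1)
print(min_vnodes(3, 50, 8))   # ceil(150 / 8) = 19 vnodes
# Small dataset: the replication-factor bound dominates
print(min_vnodes(3, 4, 8))    # 3 vnodes
```

The heap bound usually dominates in the industrial case, since r_i ≥ heapSize_j for all configurations.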
Throughput model
• The actual throughput T_i is a function of x_{i,j,h}

Fig. 2. A real example of Cassandra throughput as a function of the number of Cassandra vnodes allocated, for the different types of requests (R, W, RW). The plot shows how realistic the proposed model is: e.g., for the segment 5 < n_i ≤ 8, t(n_i) = t0_{l_i,j} · δ^k_{l_i,j} · (n_i − 4) + t(n_i = 4).

The throughput is piecewise linear in the number of vnodes: δ^k_{l_i,j} is the slope of the k-th segment, valid for a number of Cassandra vnodes n_i between n_{k−1} and n_k. Therefore, for n_{k−1} ≤ n_i ≤ n_k, we can write the following expression:

  t(n_i) = t(n_{k−1}) + t0_{l_i,j} · δ^k_{l_i,j} · (n_i − n_{k−1})   (6)

where k ≥ 1, n_0 = 1 and t_{i,j,h}(1) = t0_{l_i,j}.

Finally, for a configuration x of a VDC, and considering Equation 2, we define the overall throughput T_i as:

  T_i(x) = t(n_i)   ∀i ∈ I   (7)
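Equation 6 can be coded directly. The sketch below uses the slopes from Table III (δ¹ = 1, δ² = 0.8, δ³ = 0.66); the segment boundaries at 2 and 7 vnodes are taken from the experiment parameters, an assumption rather than a value read off the figure:

```python
# (upper bound n_k, slope delta^k) per Table III; boundaries are assumed
SEGMENTS = [(2, 1.0), (7, 0.8), (float("inf"), 0.66)]

def throughput(n, t0):
    """Eq. 6: piecewise-linear throughput t(n) with n_0 = 1 and t(1) = t0."""
    t, prev = t0, 1
    for n_k, slope in SEGMENTS:
        hi = min(n, n_k)
        if hi > prev:                  # add this segment's contribution
            t += t0 * slope * (hi - prev)
            prev = hi
        if n <= n_k:
            break
    return t

# VM type 1, read workload: t0 = 16.6e3 ops/sec (Table I)
print(throughput(1, 16600))            # one vnode is the baseline: 16600
print(round(throughput(7, 16600)))     # 2*t0 + 0.8*t0*5 = 6*t0 = 99600
```

The diminishing slopes capture that each added vnode contributes less than the previous one, which is what makes the placement problem non-trivial.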
Energy consumption model
• Based on physical node utilization
• P_h^max = 500 W and k_h = 0.7

We chose a linear model [3] where the power P_h consumed by a physical machine h is a function of the CPU utilization, and hence of the system configuration x:

  P_h(x) = k_h · P_h^max + (1 − k_h) · P_h^max · U_h(x)   (8)

where P_h^max is the maximum power consumed when PM h is fully utilised (e.g. 500 W), k_h is the fraction of power consumed by the idle PM h (e.g. 70%), and the CPU utilisation of PM h is defined by

  U_h(x) = (1 / C_h) · Σ_{i∈I, j∈J} x_{i,j,h} · c_j   (9)
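A direct transcription of Eqs. 8-9 with the slide's example values (Pmax = 500 W, k = 0.7) and a 16-core PM as in Table III:

```python
P_MAX = 500.0   # Watt when the PM is fully loaded
K_IDLE = 0.7    # fraction of P_MAX drawn by an idle PM
C_H = 16        # cores of PM h

def utilization(vnode_cores):
    """Eq. 9: U_h = (1/C_h) * sum of the vcores of the vnodes on PM h."""
    return sum(vnode_cores) / C_H

def pm_power(u):
    """Eq. 8: P_h = k*Pmax + (1-k)*Pmax*U_h."""
    return K_IDLE * P_MAX + (1 - K_IDLE) * P_MAX * u

# Two 4-vcore vnodes on one PM: U = 0.5, i.e. 350 W idle share + 75 W dynamic
print(round(pm_power(utilization([4, 4])), 6))   # 425.0
```

The large idle fraction (70%) is the reason consolidation pays off: an almost-empty PM still burns 350 W, so packing vnodes and switching PMs off dominates the savings.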
Problem formulation
• Linear model
  • Linear objective function
  • Linear constraints
• Constraints imposed by
  • SLA on throughput and replication factor
  • Replication factor (and number of distinct PMs)
  • Homogeneity of vnode configuration
  • Heap size (we want a CPU-bound configuration)

In cloud management systems the techniques typically used are scheduling, placement, migration, and reconfiguration of virtual machines; the ultimate goal is to optimise the use of resources to reduce power consumption. Optimisation depends on the context: it could mean minimising PM utilisation, or balancing the utilisation level of physical machines with the use of network devices for data transfer and storage. Independently of the configuration or adaptation policy adopted, all these techniques are based on power and/or energy consumption models, which usually define a linear relationship between the power used by a system and its CPU utilisation (e.g. [3]-[5]), processor frequency (e.g. [6]) or number of cores used (e.g. [7]).

The service provider aims to minimise the overall power consumption P(x) (ω denotes an extremely large positive constant):

  min f(x) = P(x)
  subject to:
    Σ_{J,H} t(x_{i,j,h}) ≥ T_i^min,                     ∀i ∈ I                (11)
    Σ_H x_{i,j,h} · y_{i,j} ≥ D_i · r_i / heapSize_j,   ∀i ∈ I, j ∈ J         (12)
    x_{i,j,h} ≤ ω · y_{i,j},                            ∀i ∈ I, j ∈ J, h ∈ H  (13)
    Σ_J y_{i,j} = 1,                                    ∀i ∈ I                (14)
    Σ_{I,J} x_{i,j,h} · c_j ≤ C_h,                      ∀h ∈ H                (15)
    Σ_{I,J} x_{i,j,h} · m_j ≤ M_h,                      ∀h ∈ H                (16)
    Σ_H s_{i,h} ≥ D_i,                                  ∀i ∈ I                (17)
    Σ_J x_{i,j,h} − s_{i,h} · ω ≤ 0,                    ∀h ∈ H                (18)
    −Σ_J x_{i,j,h} + s_{i,h} ≤ 0,                       ∀h ∈ H                (19)
    Σ_I s_{i,h} − r_h · ω ≤ 0,                          ∀h ∈ H                (20)
    −Σ_I s_{i,h} + r_h ≤ 0,                             ∀h ∈ H                (21)
    y_{i,j}, s_{i,h}, r_h ∈ {0, 1},                     ∀i ∈ I, j ∈ J, h ∈ H  (22)
    x_{i,j,h} ∈ ℕ,                                      ∀i ∈ I, j ∈ J, h ∈ H  (23)

Eq. 12 introduces the constraint on the number of vnodes and the portion of dataset handled by each vnode, and guarantees that the replication factor can be implemented. Eq. 14 states that for each tenant a single VM configuration must be chosen. Eqs. 15 and 16 guarantee that the CPU and memory demand of the vnodes does not exceed the capacity of the physical nodes.
Sub-optimal adaptation
• LocalOpt
• LocalOpt-H
• BestFit
• BestFit-H

Fig. 3. Number of iterations for different values of N (tenants), V (VM types) and H (PMs): Opt, BestFit and LocalOpt compared for V = 3 and V = 10, with H = 100 and H = 1000.
Scenarios
• New service subscriptions
• Dataset size increase
• Throughput increase
• Surge in the throughput
• Physical node failures
Lines 19-30 of the algorithm place the vnodes on the PMs, minimising the number of PMs used by packing as many vnodes as possible per PM while respecting the D_i constraint; this also minimises the energy consumption. The function any(c_j* ≤ C^a) compares c_j* with all the elements of C^a and returns 1 if there exists at least one element of C^a that is greater than or equal to c_j*; otherwise, if no PM satisfies the constraint, it returns 0. The same behaviour holds for any(m_j* ≤ M^a). The function sortDescendent(H^a) sorts H^a in descending order. The function popRR(H^a, D_i) extracts, in round-robin order, a PM from the first D_i in H^a. At line 28, if there is more room in the selected PMs, the set H^a is sorted again (this allows trying the allocation on the PMs that now have capacity available). At line 32, if not all the n*_{i,j*} vnodes are allocated, the empty set is returned because there is no feasible solution for the allocation. Otherwise, the suboptimal solution is returned.

VI. PERFORMANCE EVALUATION METHODOLOGY
TABLE III. MODEL PARAMETERS USED IN THE EXPERIMENTS

  Parameter    Value                  Description
  N            1-10                   Number of tenants
  V            3                      Number of VM types
  H            8                      Number of PMs
  D_i          1-4                    Replication factor for App. i
  r_i          5-50                   Dataset size for App. i
  L            {R, W, RW}             Set of request types
  T_i^min      10000-70000 ops/sec    Minimum throughput agreed in the SLA
  C_h          16                     Number of cores of PM h
  c_j          2-8                    Number of vcores used by VM type j
  M_h          128 GB                 Memory size of PM h
  m_j          16-32 GB               Total memory used by VM type j
  heapSize_j   4-8 GB                 Max heap size used by VM type j
  δ^k_{l_i}    ∀l_i: δ^1 = 1 for 1 ≤ x_{i,j,h} ≤ 2; δ^2 = 0.8 for 3 ≤ x_{i,j,h} ≤ 7; δ^3 = 0.66 for x_{i,j,h} ≥ 8   Slopes of the throughput model
  P_h^max      500 Watt               Maximum power consumed by PM h if fully loaded
  k_h          0.7                    Fraction of P_h^max consumed by PM h if idle
Performance metrics
• Total power consumption
• Scaling Index
  • Counts horizontal and vertical scaling actions
• Migration Index
  • Counts the number of migrations
• Delayed requests
  • Assuming requests do not time out
• Consistency-level reliability
  • Defined as the probability that the number of healthy replicas in the Cassandra VDC is enough to guarantee a specific level of consistency over a fixed time interval
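The last metric can be made concrete. Assuming independent replica failures with per-replica availability p over the observation interval (an illustrative assumption; the paper's exact model may differ), the consistency-level reliability is a binomial tail:

```python
from math import comb

def cl_reliability(D, needed, p):
    """P(at least `needed` of the D replicas are healthy) =
    sum over k = needed..D of C(D, k) * p^k * (1-p)^(D-k)."""
    return sum(comb(D, k) * p**k * (1 - p) ** (D - k)
               for k in range(needed, D + 1))

# RF = 3, each replica healthy with probability 0.9 over the interval
print(round(cl_reliability(3, 1, 0.9), 4))   # consistency level ONE
print(round(cl_reliability(3, 2, 0.9), 4))   # consistency level QUORUM
```

Stronger consistency levels need more healthy replicas, so placements that co-locate replicas on one PM hurt this metric twice: they lower both node reliability and the chance that a quorum survives.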
New service subscriptions
Workload
• 75% R, 15% W, 10% RW (uniform)
• Tmin = 10-18 Kops/sec (uniform)
• Di = 2, 3 (uniform)
• ri = 8 GB
Monte Carlo simulation is used for the New service subscriptions and small dataset size variation scenarios, while a different evaluation method is used for the SLA variation scenario. Experiments have been carried out using Matlab R2015b 64-bit running on a single Intel Core i5 processor. The model parameters used for the experiments are reported in Table III.

The following experiments compare the performance of the optimal policy with the LocalOpt and BestFit heuristics. We start with one tenant and increase the number of subscriptions up to 8. Because the SLA parameters are randomly drawn, the following results summarise the data collected over repeated runs for each new tenant subscription.

Fig. 3 shows how the system scales when new tenants subscribe to the service. The number of vnodes allocated grows with the number of tenants. The differences are that the optimal policy and LocalOpt tend to allocate small machines (VM types 1 and 2) and rarely select VM type 3; on the contrary, the BestFit algorithm often uses VM type 3, which explains why the minimum number of vnodes it allocates is lower than with the other policies.

Fig. 3. New service subscription: number of vnodes allocated by the three adaptation policies.
Fig. 4. New service subscription: total power consumed by the three adaptation policies (1-8 tenants).
Results
• Optimal policy and LocalOpt
  • allocate small VMs (types 1 and 2)
  • rarely select VM type 3
• BestFit algorithm
  • uses large VMs (VM type 3)
  • the minimum number of vnodes allocated is lower than with the other policies
Dataset size increase
Workload
• 3 tenants (R, W, RW)
• ri = 10-50 GB
• Tmin = [14, 10, 18] Kops/sec
• Di = [3, 2, 3]

Fig. 11. Dataset size increase: the bar plot represents the Scaling Index (per VM type) and the line the number of virtual nodes allocated during each experiment, for the optimal policy, LocalOpt and BestFit (ri = 10-50 GB).
Fig. 12. Dataset size increase: box plot of the power consumed P(x) by the optimal policy, LocalOpt and BestFit.
Throughput increase
Workload
• 3 tenants (R, W, RW)
• Tmin = 10-70 Kops/sec
• Di = 3
• ri = 8 GB

[Figure residue: Scaling Index for the three applications and the three policies, with the line representing the number of VMs; power consumed P(x) by the optimal policy, LocalOpt and BestFit for Tmin = 10-70 ×10^3 ops/sec.]
Throughput increase (cont'd)

Fig. 5. Throughput increase: box plots represent the Scaling Index for the three applications (R, W, RW) and the three policies (optimal, LocalOpt, BestFit); the line represents the number of vnodes in each experiment.

The heuristics scale to VM type 2 but return immediately to VM type 1, at the price of a higher energy consumption.

From the optimal policy behaviour we learned that it is better to always allocate smaller virtual machines. That choice usually allows satisfying both the dataset and throughput constraints while minimising the actual throughput provided, which for large datasets can be higher than T_i^min.

The power consumption is plotted in Figure 7. LocalOpt outperforms BestFit, specifically for low throughput. The penalty paid is between 15 and 25% if LocalOpt is used and between 13 and 50% for BestFit; the higher loss is observed for low values of the throughput.

Figure 8 shows the Migration Index. Each box plot is computed over the data collected for the three tenants. We can observe that each application experiences between 0 and 3 vnode migrations, depending on the infrastructure load state. The case T_i^min = 50,000 ops/sec is an exception because it corresponds to the vertical scaling action taken for App. 1: the decrease in the number of vnodes used also impacts the Migration Index.

Fig. 7. Throughput increase: the power consumed P(x) by the three adaptation policies.
Fig. 8. Throughput increase: Migration Index for the optimal policy. LocalOpt and BestFit have an MI equal to zero by definition.

C. SLA variation: small dataset size variations
The dataset size can heavily impact the number of vnodes used in a VDC (see Eq. 1). In this experiment we assess if…
Throughput increase (contā€™d)Auto-scaling Algorithms for Cassandra Virtual Data Centers 13
20 30 40 50 60 70
-10
-5
0
5
10
ScalingIndex(SI)
Opt
20 30 40 50 60 70
-10
-5
0
5
10
LocalOpt
20 30 40 50 60 70
Throughput Ti
min
(Ɨ 103
tps)
-10
-5
0
5
10
LocalOpt-H
20 30 40 50 60 70
-10
-5
0
5
10
BestFit
20 30 40 50 60 70
-10
-5
0
5
10
BestFit-H
VM type 1 VM type 2 VM type 3
ghput increase: The box represent the scaling Index actions for the RW workload and for the ļ¬ve policies
he adaptation policy switches between two
ations (vertical scaling). The negative bar
M type dismissed and the positive for the
e allocated. Observations with only pos-
respond to horizontal scaling adaptation
example, for the Optimal policy there is
m VM type 3 (yellow bar) to VM type 2
r the observation Tmin
i = 30.000 ops/sec.
between 0 and 3 vnode migrations depending on the
infrastructure load state.
Considering the low values for the migration index
for the Opt allocation and the high saving in the en-
ergy consumed compared with the other algorithms, it
makes sense to perform periodic VDC consolidation us-
ing the Opt policy, as recommended in Section 7.
is for the VM type dismissed and the positive for the
new VM type allocated. Observations with only pos-
itive bars correspond to horizontal scaling adaptation
actions. For example, for the Optimal policy there is
a change from VM type 3 (yellow bar) to VM type 2
(green bar) for the observation Tmin
i = 30.000 ops/sec.
The number of new allocated VMs is smaller because
each new VM oā†µers a higher throughput. The optimal
adaptation policy always starts allocating VMs of Type
3 (cf. Tab. 1) and, if needed progressively moves to more
powerful VM types. The Opt policy performs only one
vertical scaling and when the VM type if changed from
type 3 to type 2; after that it always does horizon-
tal scaling actions (this is a particularly lucky case).
The two heuristics LocalOpt and BestFit show a very
unstable behaviour performing both vertical and hor-
izontal scaling. Both ļ¬rst scale to VM type 1 from
VM type 3 and then they scale back to VM type 2.
When the variant of the above algorithm is used, that is
LocalOpt-H and BestFit-H respectively, the VM type
is ļ¬xed to type 1 and the only action taken is horizontal
scaling.
The power consumption is plotted in Figure 6. For throughput higher than 40Ɨ10³ ops/sec, the optimal scaling makes it possible to save about 50% of the energy consumed by the heuristic allocations. For low values of the throughput (10-20Ɨ10³ ops/sec) the BestFit and BestFit-H show a very high energy consumption compared with the other policies.
Fig. 6 Throughput increase: the power consumed P(x) by the five adaptation policies (Opt, LocalOpt, LocalOpt-H, BestFit, BestFit-H) when increasing the throughput for Application 3 (RW workload).
9.2 Throughput surge
In this set of experiments we analyse how fast the scaling is, with respect to the throughput variation rate, and what the number of delayed requests is. We assume ā€¦
Auto-scaling Algorithms for Cassandra Virtual Data Centers

Surge in the throughput (RW)
Fig. 8 Auto-scaling actions in case of a throughput surge: Case A and Case B (throughput in Ɨ10³ ops/sec vs time in minutes; curves: ThrRW_min, and the actual throughput of Opt and BestFit for Case A and Case B).
ā€¦ Then, at time t = 10, twelve vnodes are allocated. Considering the serialization of the horizontal scaling actions (cf. Section 7), the seven Cassandra vnodes are added in 14 minutes. The LocalOpt behaves as the Opt in terms of scaling decisions. The BestFit auto-scaling starts allocating 4 vnodes of Type 3, then scales up to seven vnodes (at time t = 8) and finally performs two vertical scaling actions: the first from vnodes of Type 3 to Type 2, and the second from Type 2 to Type 1.
The number of delayed requests Qi and the percentage with respect to the total number of received requests (tot.req.) are reported in Table 5. Qi and tot.req. are computed over the time interval in which the requested throughput Tmin_RW exceeds the actual throughput.
Intuitively, with Cassandra vnodes capable of handling a higher workload it should be possible to better absorb the surge in the throughput. Hence, we have analysed Case B, where we configure three new types of Cassandra vnodes capable of handling the following RW throughput: type 4, 20Ɨ10³ ops/sec; type 5, 15Ɨ10³ ops/sec; type 6, 7Ɨ10³ ops/sec. ā€¦ vnode types capable of handling from low to very high throughput allow managing throughput surges.
Table 5 The number of delayed requests Qi and the percentage with respect to the total number of received requests (tot.req.). Qi and tot.req. are computed over the time interval in which the requested throughput (Tmin_RW) exceeds the actual throughput.

Case A: Qi (Ɨ10³) / Qi/tot.req. (%)
Opt: 191.84 / 22.78
LocalOpt: 191.84 / 22.78
BestFit: 70.89 / 46.33
Case B:
Opt: 7.66 / 4
LocalOpt: 7.66 / 4
BestFit: 70.58 / 30.29
Case B
T4: 20Ɨ10³ ops/sec
T5: 15Ɨ10³ ops/sec
T6: 7Ɨ10³ ops/sec

Case A
T1: 13.3Ɨ10³ ops/sec
T2: 8.3Ɨ10³ ops/sec
T3: 3.3Ɨ10³ ops/sec
Physical node failure
Consistency level reliability R is defined as the probability that the number of healthy replicas in the Cassandra VDC is enough to guarantee a specific level of consistency over a fixed time interval.
Consistency levels ONE and QUORUM (Q = ⌊D/2⌋ + 1)
ā€¦ algorithms are applied.
In case the Cassandra VDC has a number of physical nodes H equal to the number of vnodes n, and there is a one-to-one mapping between vnodes and physical nodes, the consistency level of ONE is guaranteed if one replica is up. Hence, the consistency reliability is the probability that at least one vnode is up and a replica is on that node:

R_O = 1 āˆ’ (D/n) Ā· (1 āˆ’ ρ)^n    (24)

where ρ is the resiliency of a physical node, and D/n is the probability that a replica is on a Cassandra vnode when the data replication strategy used is the SimpleStrategy (cf. the Datastax documentation for Cassandra). In the same way, we can define the reliability of the Cassandra VDC to guarantee a consistency level of QUORUM as the probability that at least Q vnodes are up and that Q replicas are on them:

R_Q = 1 āˆ’ (D/n) Ā· (1 āˆ’ ρ)^(n āˆ’ Q + 1)    (25)
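To make equations (24) and (25) concrete, here is a small numerical check in Python (function names are ours). With D/n = 0.5 as in the slides, e.g. D = 3 replicas over n = 6 vnodes, and node resiliency ρ = 0.9, it reproduces the one-to-one values reported in Table 6.

```python
def quorum(D):
    """QUORUM size for replication factor D: Q = floor(D/2) + 1."""
    return D // 2 + 1

def r_one(D, n, rho):
    """Eq. (24): reliability of consistency level ONE, one-to-one mapping."""
    return 1 - (D / n) * (1 - rho) ** n

def r_quorum(D, n, rho):
    """Eq. (25): reliability of consistency level QUORUM, one-to-one mapping."""
    return 1 - (D / n) * (1 - rho) ** (n - quorum(D) + 1)

print(r_one(3, 6, 0.9))      # ā‰ˆ 0.9999995 (six 9s)
print(r_quorum(3, 6, 0.9))   # ā‰ˆ 0.999995 (five 9s)
print(r_one(5, 10, 0.9))     # ā‰ˆ 0.99999999995 (ten 9s)
```

The n = 10 case for D = 5 follows from keeping D/n = 0.5, as the table caption assumes.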
When the mapping of vnodes on physical nodes is unknown, we computed the values of K_O and K_Q as follows: the value for K_O is equal to the number of physical nodes used, while the values for K_Q depend on the allocation. With more than one possible allocation, we could have a range of values, where max{K¹_Q, K²_Q, ā€¦} represents the best case and min{K¹_Q, K²_Q, ā€¦} represents the worst case. For example, if 8 vnodes are distributed on 5 physical nodes in the following way {1, 1, 2, 1, 3}, then n āˆ’ Q + 1 = 7, K_O = 5 and K_Q = 4.
Table 6 reports the values obtained for D = 3 and D = 5 with ρ = 0.9, computed in the following way. We consider the allocations obtained for a randomly generated set of tenants (workload RW, 15% W and 75% R; Tmin in the interval [10,000, 18,000] ops/sec); ri is constant (8 GB). The number of vnodes for each tenant is n = 6 for the case D = 3 and n = 10 for the case D = 5. We run 10 experiments and report the best and worst case.
In the first set of experiments ā€¦
Table 6 shows the values of R_O and R_Q for D = 3 and 5 and for ρ = 0.9 and ρ = 0.8.
In a managed Cassandra data center, a Cassandra VDC is rarely allocated using a one-to-one mapping of vnodes on physical nodes. The resource management policies adopted by the provider usually end up with a many-to-one mapping, that is, h physical nodes run n Cassandra vnodes: D ≤ h < n.
The one-to-one mapping guarantees consistency levels ONE and QUORUM with a reliability of six 9s and five 9s respectively; if the replication factor increases to 5, the reliability grows to ten 9s and eight 9s for consistency ONE and QUORUM respectively. Unfortunately, when a many-to-one mapping is used, the reliability of the consistency level drops by orders of magnitude.
[Diagram: one-to-one mapping, each Cassandra DB/VM on its own physical machine, versus many-to-one mapping, three DB/VMs packed on two physical machines]
ā€¦ of the Opt and BestFit auto-scaling algorithms. LocalOpt behaves as the Opt. From the plot it is evident that with more powerful vnodes the auto-scaling algorithms are capable of satisfying the requested throughput with a delay of only 2 minutes. The Opt starts allocating 3 vnodes of type 6; at time t = 4 one more vnode of type 6 is added, and at time t = 6 the policy performs a vertical scaling allocating 4 vnodes of ā€¦
Cassandra offers three main levels of consistency (both for Read and Write): ONE, QUORUM and ALL. Consistency level ONE means that only one replica node is required to reply correctly, that is, it contains the replica of the portion of the dataset needed to answer the query. Consistency level QUORUM means that Q = ⌊D/2⌋ + 1 replica nodes are available to reply correctly
In that case we can generalise equations (24) and (25) to the following:

R_O = 1 āˆ’ (D/n) Ā· (1 āˆ’ ρ)^K_O    (26)
and

R_Q = 1 āˆ’ (D/n) Ā· (1 āˆ’ ρ)^K_Q    (27)

where K_O is the number of failed physical nodes that causes a failure of n vnodes, and K_Q is the number of failed physical nodes that causes a failure of (n āˆ’ Q + 1) vnodes.
K_O is the number of failed physical nodes that causes a failure of n vnodes;
K_Q is the number of failed physical nodes that causes a failure of (n āˆ’ Q + 1) vnodes.
Physical node failure (cont'd)

Table 6 Consistency reliability R for the consistency level of ONE and QUORUM. The probability that a data replica is on a vnode is 0.5 for both D = 3 and D = 5. We assume the reliability of a physical node is ρ = 0.9 (ρ = 0.8 in the lower rows). Columns: one-to-one, Opt, LocalOpt, LocalOpt-H, BestFit, BestFit-H.

ρ = 0.9
R_O|D=3: 0.9999995 (one-to-one), 0.9995, 0.9995
R_Q|D=3: 0.999995 (one-to-one), 0.995, ā€“, 0.9995, 0.9995
R_O|D=5: 0.99999999995 (one-to-one), 0.999995, 0.999995
R_Q|D=5: 0.999999995 (one-to-one), 0.9995, ā€“, 0.999995, 0.99995
ρ = 0.8
R_O|D=3: 0.99996 (one-to-one), 0.996, 0.996
R_Q|D=3: 0.99984 (one-to-one), 0.98, ā€“, 0.996, 0.996
R_O|D=5: 0.999999948 (one-to-one), 0.99984, 0.99984
R_Q|D=5: 0.9999987 (one-to-one), 0.996, ā€“, 0.99984, 0.9992
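Equations (26) and (27) require K_O and K_Q, which depend on how vnodes are packed onto physical nodes. A minimal sketch, assuming the {1, 1, 2, 1, 3} placement of 8 vnodes on 5 physical nodes used as the example in the text (helper names are ours):

```python
def k_o(placement):
    """K_O: all physical nodes must fail to lose all n vnodes."""
    return len(placement)

def k_q(placement, D):
    """K_Q: fewest physical-node failures that kill at least n - Q + 1 vnodes
    (worst case: the most loaded nodes fail first)."""
    n = sum(placement)
    need = n - (D // 2 + 1) + 1  # n - Q + 1, with Q = floor(D/2) + 1
    lost = 0
    for k, v in enumerate(sorted(placement, reverse=True), start=1):
        lost += v
        if lost >= need:
            return k
    return len(placement)

def reliability(D, n, rho, K):
    """Eqs. (26)/(27): R = 1 - (D/n) * (1 - rho)**K."""
    return 1 - (D / n) * (1 - rho) ** K

placement = [1, 1, 2, 1, 3]   # 8 vnodes on 5 physical nodes
print(k_o(placement))          # 5
print(k_q(placement, D=3))     # 4: nodes holding 3+2+1+1 vnodes cover n-Q+1 = 7
```

Plugging the resulting K values into reliability() reproduces the many-to-one entries in Table 6 once the actual placements produced by each auto-scaling policy are known.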
(where D is the replication factor). Consistency level ALL means that all the replicas are available.
Lesson learned
ā€¢ When you have to deal with a specific technology there are many constraints to be considered
ā€¢ Multi-layer adaptation is a must
ā€¢ No single policy fits all workloads
ā€¢ Not all policies fit all stages of the application life cycle
Table 3 Use of the auto-scaling algorithms

Use case (columns: Opt, LocalOpt, BestFit, LocalOpt-H, BestFit-H)
Capacity planning: X
Data center consolidation: X
VDC consolidation: X X
Run-time adaptation: X X X X
Questions ?
Emiliano Casalicchio
http://www.bth.se/people/emc
emc@bth.se

Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
Ā 
young call girls in Rajiv ChowkšŸ” 9953056974 šŸ” Delhi escort Service
young call girls in Rajiv ChowkšŸ” 9953056974 šŸ” Delhi escort Serviceyoung call girls in Rajiv ChowkšŸ” 9953056974 šŸ” Delhi escort Service
young call girls in Rajiv ChowkšŸ” 9953056974 šŸ” Delhi escort Service
Ā 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
Ā 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
Ā 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
Ā 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
Ā 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
Ā 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
Ā 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
Ā 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
Ā 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
Ā 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
Ā 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
Ā 
TechTACĀ® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTACĀ® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTACĀ® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTACĀ® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
Ā 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
Ā 
Call Us ā‰½ 8377877756 ā‰¼ Call Girls In Shastri Nagar (Delhi)
Call Us ā‰½ 8377877756 ā‰¼ Call Girls In Shastri Nagar (Delhi)Call Us ā‰½ 8377877756 ā‰¼ Call Girls In Shastri Nagar (Delhi)
Call Us ā‰½ 8377877756 ā‰¼ Call Girls In Shastri Nagar (Delhi)
Ā 

Autonomous control in Big Data platforms: an experience with Cassandra

  • 1. Autonomous control in Big Data platforms: an experience with Cassandra. Emiliano Casalicchio (emc@bth.se). Joint research with Lars Lundberg and Sogand Shirinbab, Computer Science Dep., Blekinge Institute of Technology. ACROSS - Rome Meeting
  • 2. Research framework
    • Scalable resource-efficient systems for big data analytics
    • Awarded by the Knowledge Foundation, Sweden (20140032)
  • 3. Agenda
    • Big Data platforms
      • Main properties
      • Why autonomous control is important
      • Challenges
    • The Cassandra case study
    • Conclusions
  • 4. The NIST BDRA: Big Data Framework Providers
    • Big Data Applications
    • BD Processing Frameworks (batch, interactive, streaming), e.g. MapReduce, Flink, Mahout, Storm, pbdR, Tez, Spark, Esper, WSO2-CEP
    • BD Platforms (logical data organization and distribution, access API), e.g. HDFS, Cassandra, HBase, Dynamo, PNUTS, ...
      • File systems: Google File System, Apache Hadoop File System (HDFS)
      • NoSQL data stores: HBase, BigTable, Cassandra, Dynamo, DynamoDB, Sherpa, PNUTS
    • Infrastructures (networking, computing, storage)
  • 5. Properties of NoSQL data stores
    • Scalability: throughput / dataset size
    • Availability: data replication
    • Eventual consistency: consistency level for R/W, to trade off availability and latency
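The availability/latency trade-off on this slide follows the standard quorum-overlap rule used by Dynamo-style stores such as Cassandra: a read is guaranteed to observe the latest write when the read and write replica counts overlap, i.e. R + W > RF. A minimal sketch (the helper name is mine, not from the slides):

```python
def strongly_consistent(read_cl, write_cl, rf):
    """Quorum-overlap rule: reads see the latest write when the
    read and write replica sets must intersect (R + W > RF)."""
    return read_cl + write_cl > rf

# RF = 3: QUORUM reads (2) + QUORUM writes (2) -> overlap guaranteed
# RF = 3: ONE reads + ONE writes -> lower latency, weaker consistency
```

Lower consistency levels reduce latency and tolerate more node failures, which is exactly the trade-off the autonomic controller has to respect.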
  • 6. Autonomic control is a must
    • System complexity: human-assisted management is unrealistic
    • Security: complete automation of procedures; self-configuration, self-healing and self-protection
    • Optimization: self-optimization
  • 7. Two approaches
    • Single-layer adaptation: e.g. auto-scaling of DB nodes, self-configuration of DB parameters
    • Multi-layer adaptation: e.g. orchestration of DB-node auto-scaling and VM placement on top of the physical infrastructure
    • Layers involved: Big Data applications + processing framework; platforms (logical data organization and distribution, access API), e.g. HDFS, Cassandra, HBase, Dynamo, PNUTS, ...; virtual infrastructure; physical infrastructure
  • 8. Issues in single-layer adaptation
    • Interference between infrastructure adaptation and platform adaptation
    • Platform properties can limit infrastructure-level adaptation actions
      • E.g. the effect of auto-scaling can be limited by serialization constraints
      • Geographical distribution (and network configuration) can conflict with the latency/availability trade-off at the platform layer
    • Infrastructure adaptation can hurt NoSQL data store properties
      • E.g. placing 2 or more replicas on the same PM impacts node reliability and consistency-level reliability
  • 9. An example. With replication factor RF = 3, each node stores a replica of the data set.
    • 3 DB nodes on VMs spread over 3 distinct PMs: reliability = 1 - (1 - r)^3; if r = 0.9, reliability is 0.999
    • 3 DB nodes on VMs over 2 PMs: reliability = 0.99
    • 3 DB nodes on VMs on a single PM: reliability = 0.9
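The three reliability figures above can be reproduced with a small sketch. The function name is illustrative, and it assumes independent PM failures, each PM up with probability r, and a data item lost only when every PM hosting one of its replicas is down:

```python
def placement_reliability(replicas_per_pm, r):
    """Probability that at least one replica of the data set survives.

    replicas_per_pm[k] = number of replicas hosted on PM k; each PM is
    up with probability r, and all replicas on a failed PM are lost,
    so the data is unavailable only if every hosting PM is down.
    """
    p_all_hosting_pms_down = 1.0
    for count in replicas_per_pm:
        if count > 0:
            p_all_hosting_pms_down *= (1.0 - r)
    return 1.0 - p_all_hosting_pms_down

# RF = 3 spread over 3 PMs (r = 0.9): 1 - (1 - 0.9)^3 = 0.999
# RF = 3 packed on 1 PM: reliability collapses to r = 0.9
```

This is why infrastructure-level consolidation (packing replicas onto fewer PMs to save power) directly degrades the platform-level reliability the replication factor was meant to buy.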
  • 10. Multi-layer adaptation
    • It means coordinating at run time:
      • Self-configuration of the BD platform
      • Deployment of the platform on the virtual infrastructure
      • Allocation/placement of the virtual infrastructure on the physical infrastructure
    • The challenges are:
      • To formulate an optimization model that accounts for all the dependencies and constraints imposed by the system architecture
      • To formulate multi-objective functions that capture conflicting objectives at the infrastructure level and the application level, e.g. minimizing power consumption while maximizing platform reliability
  • 11. The Cassandra case study
    • E. Casalicchio, L. Lundberg, S. Shirinbab, "Energy-aware adaptation in managed Cassandra datacenters", IEEE International Conference on Cloud and Autonomic Computing (ICCAC 2016), Augsburg, Germany, September 12-16, 2016
    • E. Casalicchio, L. Lundberg, S. Shirinbab, "Energy-aware auto-scaling algorithms for Cassandra virtual data centers", Cluster Computing, Springer (to appear, June 2017)
  • 12. Motivations
    • Data storage (or serving) systems play an important role in the cloud and big data industry
    • Their management is a challenging task, and the complexity increases when multi-tenancy is considered
    • Human-assisted control is unrealistic, so there is a growing demand for autonomic solutions
    • Our industrial partner Ericsson AB is interested in autonomic management of Cassandra
  • 13. Problem description
    • We consider a provider of a managed Apache Cassandra service
    • The applications, or tenants, of the service are independent and each uses its own Cassandra Virtual Data Center (VDC)
    • The service provider wants to maintain SLAs, which requires:
      • Properly planning the capacity and configuration of each Cassandra VDC
      • Dynamically adapting the infrastructure and VDC configuration without disrupting performance
      • Minimizing power consumption
  • 14. Solution proposed (1,000 ft view)
    • An energy-aware adaptation model specifically designed for Cassandra VDCs running on a cloud infrastructure
      • Architectural constraints imposed by Cassandra (minimum number of nodes, homogeneity of nodes, replication factor and heap size)
      • Constraints on throughput and replication factor imposed by the SLA
      • A power consumption model based on CPU utilization
    • An adaptation policy for the Cassandra VDC configuration and the cloud infrastructure configuration that orchestrates three strategies:
      • Vertical scaling
      • Horizontal scaling
      • Optimal placement
  • 15. Solution proposed (deep details)
    • A workload and SLA model
    • A system architecture model
    • A throughput model
    • The utility function and problem formulation
    • Drawbacks of the optimal solution, and alternatives
    • Experimental results
  • 16. Workload and SLA model
    • Workload features:
      • The type of requests l_i, e.g. read only (R), write only (W), read & write (RW, 75/25), scan, or a combination of those
      • The rate of the operation requests
      • The size of the dataset r_i
    • Workload types: CPU bound, memory bound
    • The data replication factor D_i
    • SLA tuple: <l_i, T_i^min, D_i, r_i>, i.e. request type, minimum throughput, replication factor and dataset size
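The SLA tuple <l_i, T_i^min, D_i, r_i> can be sketched as a small record type; the class and field names are illustrative, only the four symbols come from the model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLA:
    """SLA tuple <l_i, T_i^min, D_i, r_i> for tenant i."""
    workload_type: str       # l_i: 'R', 'W' or 'RW'
    min_throughput: float    # T_i^min, in ops/sec
    replication_factor: int  # D_i
    dataset_size_gb: float   # r_i

# Tenant with a 75/25 read-write mix, 18 Kops/sec, RF = 3, 8 GB dataset
sla = SLA('RW', 18_000.0, 3, 8.0)
```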
  • 17. Architecture model
    • Homogeneous physical machines (H)
    • VMs of different type and size (V)
    • A VDC is composed of n_i homogeneous VMs, with n_i >= D_i
    • At least D_i vnodes out of n_i must run on different PMs
    • The datacenter configuration is described by a vector x = [x_{i,j,h}], where x_{i,j,h} is the number of vnodes serving application i with VM configuration j allocated on PM h

    Table I. Baseline throughput t0_{l_i,j} as a function of c_j (virtual CPUs), m_j (GB), heapSize_j (GB) and l_i, measured in operations/second (ops/sec):

      j   c_j   m_j   heapSize_j   R           W          RW
      1   8     32    8            16.6x10^3   8.3x10^3   13.3x10^3
      2   4     16    4            8.3x10^3    8.3x10^3   8.3x10^3
      3   2     16    4            3.3x10^3    3.3x10^3   3.3x10^3

    Table II. Memory available for the dataset in a Cassandra vnode (JVM heap) as a function of the VM memory size:

      m_j (RAM size in GB)               1     2   4   8   16   32
      heapSize_j (max heap size in GB)   0.5   1   1   2   4    8

    The SLA parameters D_i and r_i determine the number of vnodes to be instantiated, and T_i^min is the minimum throughput the service provider must guarantee to process the requests from application i.
  • 18. Architecture model (cont'd)
    • To make a VDC CPU bound we need n_i >= D_i * r_i / heapSize_j when r_i > heapSize_j; otherwise the constraint n_i >= D_i holds. The number of vnodes is n_i = sum_{j,h} x_{i,j,h} (Eq. 2), and since in our industrial case r_i >= heapSize_j for all configurations j, the constraints are modelled as:
        sum_{j,h} x_{i,j,h} >= D_i * r_i / heapSize_j, for all i   (Eq. 3)
        sum_j y_{i,j} = 1, for all i                               (Eq. 4)
        sum_h s_{i,h} >= D_i, for all i                            (Eq. 5)
    • y_{i,j} = 1 if application i uses VM configuration j to run its Cassandra vnodes, otherwise y_{i,j} = 0
    • s_{i,h} = 1 if a Cassandra vnode serving application i runs on PM h, otherwise s_{i,h} = 0
    • Vertical scaling is modelled by assuming it is possible to switch from configuration j1 to j2 at runtime, i.e. by replacing a VM of type j1 with a VM of type j2
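The vnode-count constraint above (Eqs. 1 and 3) can be sketched as a helper; the function name is mine, and it simply takes the larger of the heap-driven bound and the replication factor:

```python
import math

def min_vnodes(D_i, r_i, heap_size_j):
    """Minimum number of vnodes for tenant i on VM type j:
    enough total heap to hold D_i copies of the r_i-GB dataset
    (a CPU-bound VDC), and never fewer than the replication factor."""
    if r_i > heap_size_j:
        return max(D_i, math.ceil(D_i * r_i / heap_size_j))
    return D_i

# r_i = 8 GB, heapSize_j = 4 GB (VM type 2 or 3), D_i = 3 -> 6 vnodes
# dataset fits in one heap -> the replication factor D_i dominates
```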
  • 19. Throughput model
    • The actual throughput T_i is a function of x_{i,j,h}
    • The throughput grows piecewise linearly with the number of vnodes n_i: gamma^k_{l_i,j} is the slope of the k-th segment, valid for n_{k-1} <= n_i <= n_k, so that
        t(n_i) = t(n_{k-1}) + t0_{l_i,j} * gamma^k_{l_i,j} * (n_i - n_{k-1})   (Eq. 6)
      where k >= 1, n_0 = 1 and t(1) = t0_{l_i,j}
    • For a configuration x of a VDC, the overall throughput is defined as T_i(x) = t(n_i), for all i (Eq. 7)
    • (Fig. 2: measured Cassandra throughput as a function of the number of vnodes for R, W and RW request types; the plot shows that the proposed model is realistic)
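Eq. 6 can be sketched as follows. The default segment bounds and slopes (1.0 up to 2 vnodes, 0.8 up to 7, 0.66 beyond) follow Table III of the experiments; the function name and the exact breakpoints are illustrative:

```python
def throughput(n_i, t0, segments=((2, 1.0), (7, 0.8), (float('inf'), 0.66))):
    """Piecewise-linear throughput t(n_i) of Eq. 6: starting from
    t(1) = t0, each additional vnode adds t0 * slope of the segment
    it falls in, so scaling sub-linearly beyond the first segments."""
    t, prev_n = t0, 1
    for upper, slope in segments:
        step = min(n_i, upper) - prev_n
        if step > 0:
            t += t0 * slope * step
            prev_n += step
        if prev_n >= n_i:
            break
    return t

# VM type 3, read-only workload: t0 = 3.3 Kops/sec
# 4 vnodes -> t0 * (1 + 1 + 0.8 + 0.8), i.e. less than 4 * t0
```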
  • 20. Energy consumption model
    • Based on physical node utilization: a linear model where the power P_h consumed by physical machine h is a function of the CPU utilization, and hence of the system configuration x:
        P_h(x) = k_h * P_h^max + (1 - k_h) * P_h^max * U_h(x)   (Eq. 8)
      where P_h^max is the maximum power consumed when PM h is fully utilised, k_h is the fraction of power consumed by the idle PM h, and the CPU utilisation of PM h is
        U_h(x) = (1 / C_h) * sum_{i,j} x_{i,j,h} * c_j          (Eq. 9)
    • In the experiments, P_h^max = 500 W and k_h = 0.7
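Eqs. 8-9 with the experiment values (P_h^max = 500 W, k_h = 0.7, C_h = 16 cores) can be sketched directly; the function name is illustrative:

```python
def pm_power(used_vcores, C_h=16, p_max=500.0, k_h=0.7):
    """Linear power model of Eqs. 8-9: a fixed idle share k_h * p_max
    plus a dynamic share proportional to CPU utilisation U_h."""
    u_h = used_vcores / C_h  # Eq. 9: utilisation of PM h
    return k_h * p_max + (1 - k_h) * p_max * u_h

# idle PM: 350 W; fully loaded PM: 500 W; half-loaded PM: 425 W
```

The large idle share (70%) is what makes packing vnodes on few PMs attractive for the provider, in tension with the replica-spreading constraint of slide 9.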
  • 21. Problem formulation
    • Linear model: linear objective function, linear constraints
    • Constraints imposed by:
      • The SLA on throughput and replication factor
      • The replication factor (and number of distinct PMs)
      • Homogeneity of the vnode configuration
      • Heap size (we want a CPU-bound configuration)
    • The optimisation problem (Lambda denotes a suitably large constant):
        min f(x) = P(x)
        subject to:
          sum_{j,h} t(x_{i,j,h}) >= T_i^min,                    for all i        (11)
          sum_h x_{i,j,h} * y_{i,j} >= D_i * r_i / heapSize_j,  for all i, j     (12)
          x_{i,j,h} <= Lambda * y_{i,j},                        for all i, j, h  (13)
          sum_j y_{i,j} = 1,                                    for all i        (14)
          sum_{i,j} x_{i,j,h} * c_j <= C_h,                     for all h        (15)
          sum_{i,j} x_{i,j,h} * m_j <= M_h,                     for all h        (16)
          sum_h s_{i,h} >= D_i,                                 for all i        (17)
          sum_j x_{i,j,h} - s_{i,h} * Lambda <= 0,              for all i, h     (18)
          -sum_j x_{i,j,h} + s_{i,h} <= 0,                      for all i, h     (19)
          sum_i s_{i,h} - r_h * Lambda <= 0,                    for all h        (20)
          -sum_i s_{i,h} + r_h <= 0,                            for all h        (21)
          y_{i,j}, s_{i,h}, r_h in {0, 1},                      for all i, j, h  (22)
          x_{i,j,h} in N,                                       for all i, j, h  (23)
    • Eq. 11 guarantees that the SLA throughput is satisfied for all tenants (non-linear in principle, but linearisable with standard operational-research techniques using Eq. 6); Eq. 12 sets the number of vnodes needed to store the dataset with replication factor D_i; Eqs. 15-16 bound the CPU and memory demand of the vnodes by the capacity of the physical nodes; Eq. 17 spreads the replicas over at least D_i PMs
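For intuition, the structure of the problem can be sketched as a brute-force search over a drastically simplified single-tenant instance: one VM type and vnode count are chosen under throughput, heap, replication and CPU-capacity constraints, minimising power. All names are illustrative; linear throughput scaling is assumed instead of Eq. 6, and Table I values for the R workload are used:

```python
from itertools import product

# VM types from Table I: j -> (vcores c_j, heapSize_j GB, t0 for R in ops/sec)
VM_TYPES = {1: (8, 8.0, 16600.0), 2: (4, 4.0, 8300.0), 3: (2, 4.0, 3300.0)}

def best_config(T_min, D_i, r_i, n_pms=4, cores_per_pm=16,
                p_max=500.0, k=0.7, max_n=8):
    """Exhaustive search, single tenant: pick VM type j and vnode
    count n satisfying Eqs. 3, 5, 11 and 15 at minimum power."""
    best = None
    for j, n in product(VM_TYPES, range(1, max_n + 1)):
        cores, heap, t0 = VM_TYPES[j]
        if n < D_i or n * heap < D_i * r_i or n * t0 < T_min:
            continue  # replication, heap or throughput violated
        if n * cores > n_pms * cores_per_pm:
            continue  # total CPU capacity exceeded
        pms_used = -(-n * cores // cores_per_pm)  # ceil division
        power = (pms_used * k * p_max
                 + (1 - k) * p_max * (n * cores) / cores_per_pm)
        if best is None or power < best[0]:
            best = (power, j, n)
    return best

power, j, n = best_config(T_min=10000.0, D_i=3, r_i=2.0)
```

Even this toy version shows the shape of the trade-off: small VMs (type 3) win here because they let the load fit on one PM, avoiding a second PM's 350 W idle cost.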
  • 22. Sub-optimal adaptation
    • LocalOpt
    • LocalOpt-H
    • BestFit
    • BestFit-H
    • (Fig. 3: number of iterations for different values of N, V and H, comparing Opt, BestFit and LocalOpt for H = 100 and H = 1000)
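The slides name the heuristics without detailing them, so the following is only a generic best-fit bin-packing sketch of the placement step: each vnode goes to the feasible PM with the least spare cores, packing PMs tightly to cut idle power. The replication-factor anti-affinity of the real BestFit (at least D_i distinct PMs) is omitted for brevity:

```python
def best_fit(vnode_cores, pm_capacity):
    """Generic best-fit placement: assign each vnode (given its core
    demand) to the feasible PM with the smallest remaining capacity.
    Returns the list of PM indices, or None if no placement exists."""
    free = list(pm_capacity)
    placement = []
    for c in vnode_cores:
        candidates = [h for h, f in enumerate(free) if f >= c]
        if not candidates:
            return None  # no PM can host this vnode
        h = min(candidates, key=lambda h: free[h])
        free[h] -= c
        placement.append(h)
    return placement

# four 4-core vnodes on two 16-core PMs pack onto a single PM
```

Such heuristics trade some power optimality for a number of iterations that, as Fig. 3 shows, stays tractable as N and H grow.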
  • 23. Scenarios
    • New service subscriptions
    • Dataset size increase
    • Throughput increase
    • Surge in the throughput
    • Physical node failures

    (Placement detail: the allocation procedure places the vnodes on the PMs minimising the number of PMs used, packing as many vnodes as possible per PM while respecting the D_i constraint, which also minimises energy consumption; PMs are selected in round-robin order among the first D_i feasible ones, and if not all vnodes can be allocated the empty set is returned because no feasible allocation exists.)

    Table III. Model parameters used in the experiments:

      Parameter     Value                    Description
      N             1-10                     Number of tenants
      V             3                        Number of VM types
      H             8                        Number of PMs
      D_i           1-4                      Replication factor for app. i
      r_i           5-50                     Dataset size for app. i
      L             {R, W, RW}               Set of request types
      T_i^min       10,000-70,000 ops/sec    Minimum throughput agreed in the SLA
      C_h           16                       Number of cores for PM h
      c_j           2-8                      Number of vcores used by VM type j
      M_h           128 GB                   Memory size of PM h
      m_j           16-32 GB                 Total memory used by VM type j
      heapSize_j    4-8 GB                   Max heap size used by VM type j
      gamma_{l_i}   1 (1 <= x_{i,j,h} <= 2); 0.8 (3 <= x_{i,j,h} <= 7); 0.66 (x_{i,j,h} >= 8)
      P_h^max       500 W                    Maximum power consumed by PM h if fully loaded
      k_h           0.7                      Fraction of P_h^max consumed by PM h if idle
  • 24. Performance metrics
    • Total power consumption
    • Scaling Index: counts horizontal and vertical scaling actions
    • Migration Index: counts the number of migrations
    • Delayed Requests: assumes no request timeouts
    • Consistency-level reliability: defined as the probability that the number of healthy replicas in the Cassandra VDC is enough to guarantee a specific level of consistency over a fixed time interval
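The consistency-level reliability metric can be sketched with a binomial model. The function name is mine, and independent, identically available replicas over the time interval are an assumption for illustration:

```python
from math import comb

def consistency_level_reliability(D, needed, p_up):
    """P(at least `needed` of the D replicas are healthy), assuming
    each replica is independently up with probability p_up over the
    fixed time interval of the metric's definition."""
    return sum(comb(D, k) * p_up**k * (1 - p_up)**(D - k)
               for k in range(needed, D + 1))

# QUORUM over RF = 3 needs 2 healthy replicas; with p_up = 0.9 the
# consistency level is met with probability 3*0.81*0.1 + 0.729 = 0.972
```

Placement decisions change p_up per replica (slide 9), which is how infrastructure adaptation feeds into this platform-level metric.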
  • 25. New service subscriptions
    • Workload: 75% R, 15% W, 10% RW (uniform); T^min = 10-18 Kops/sec (uniform); D_i = 2, 3 (uniform); r_i = 8 GB
    • Setup: Monte Carlo simulation (Matlab R2015b 64-bit, single Intel Core i5); starting with one tenant, the number of subscribers is increased up to 8; because the SLA parameters are randomly chosen, the results summarise the data collected over repeated runs for each new tenant subscription
    • Results:
      • The number of vnodes allocated grows with the number of tenants
      • The optimal policy and LocalOpt tend to allocate small VMs (types 1 and 2) and rarely select VM type 3
      • The BestFit algorithm often uses large VMs (VM type 3), which explains why its minimum number of vnodes allocated is lower than with the other policies
    • (Fig. 3: number of vnodes allocated by the three adaptation policies; Fig. 4: total power consumed by the three adaptation policies)
  • 26. Dataset size increase
    • Workload: 3 tenants (R, W, RW); r_i = 10-50 GB; T^min = [14, 10, 18] Kops/sec; D_i = [3, 2, 3]
    • (Fig. 11: bar plot of the Scaling Index, with a line for the number of virtual nodes allocated during each experiment; Fig. 12: box plot of the power consumed P(x) for the optimal policy, LocalOpt and BestFit)
  • 27. Throughput increase
    • Workload: 3 tenants (R, W, RW); T^min = 10-70 Kops/sec; D_i = 3; r_i = 8 GB
    • (Figures: Scaling Index for the three applications under the three policies, with the number of VMs allocated, and power consumed P(x) by the optimal policy, LocalOpt and BestFit as T_i^min grows)
  • 28. Throughput increase (cont'd)
  [Figure: box plots of the Scaling Index for the three applications (R, W, RW) and the three policies (Optimal, LocalOpt, BestFit); the line represents the number of vnodes allocated in each experiment, Tmin_i = 10-70 Ɨ 10³ ops/sec.]
  • From the behaviour of the optimal policy we learned that it is better to always allocate smaller virtual machines: this choice usually satisfies both the dataset and the throughput constraints while minimising the throughput actually provided, which for large datasets can be higher than Tmin_i.
  • Power consumption (Fig. 7): LocalOpt outperforms BestFit, specifically for low throughput. The penalty paid is between 15 and 25% if LocalOpt is used and between 13 and 50% for BestFit; the higher loss is observed for low throughput values.
  • Migration Index (Fig. 8): each box plot is computed over the data collected for the three tenants. Each application experiences between 0 and 3 vnode migrations, depending on the infrastructure load state. The case Tmin_i = 50,000 ops/sec is an exception because it corresponds to the vertical scaling action taken for App. 1. LocalOpt and BestFit have a MI equal to zero by definition.
  • 29. Throughput increase (cont'd)
  [Figure: box plots of the Scaling Index actions for the RW workload and the five policies (Opt, LocalOpt, LocalOpt-H, BestFit, BestFit-H), Tmin_i = 20-70 Ɨ 10³ tps, VM types 1-3.]
  • In the Scaling Index plots a negative bar marks the VM type dismissed and a positive bar the new VM type allocated (vertical scaling); observations with only positive bars correspond to horizontal scaling actions. The number of newly allocated VMs is smaller when each new VM offers a higher throughput.
  • The Opt policy always starts by allocating VMs of type 3 (cf. Tab. 1) and, if needed, progressively moves to more powerful VM types. It performs only one vertical scaling, from type 3 to type 2 (at Tmin_i = 30,000 ops/sec); after that it always takes horizontal scaling actions (a particularly lucky case).
  • The two heuristics LocalOpt and BestFit show a very unstable behaviour, performing both vertical and horizontal scaling: both first scale from VM type 3 to VM type 1 and then scale back to VM type 2.
  • In the variants LocalOpt-H and BestFit-H the VM type is fixed to type 1 and the only action taken is horizontal scaling.
  • Power consumption (Fig. 6): for throughput higher than 40 Ɨ 10³ ops/sec, optimal scaling saves about 50% of the energy consumed by the heuristic allocations. For low throughput values (10-20 Ɨ 10³ ops/sec) BestFit and BestFit-H show a very high energy consumption.
  • Considering the low Migration Index of the Opt allocation and its high energy savings compared with the other algorithms, it makes sense to perform periodic VDC consolidation using the Opt policy, as recommended in Section 7.

  Use of the auto-scaling algorithms:
  Use case                  | Opt | LocalOpt | BestFit | LocalOpt-H | BestFit-H
  Capacity planning         |  X  |          |         |            |
  Data center consolidation |  X  |          |         |            |
  VDC consolidation         |  X  |    X     |         |            |
  run-time adaptation       |     |    X     |    X    |     X      |     X
  • 30. Surge in the throughput (RW)
  [Fig. 8: auto-scaling actions in case of a throughput surge, Cases A and B: the requested throughput Tmin_RW and the actual throughput achieved by Opt and BestFit over a 22-minute interval.]
  • Case A vnode types: T1 = 13.3Ɨ10³ ops/sec, T2 = 8.3Ɨ10³ ops/sec, T3 = 3.3Ɨ10³ ops/sec
  • Case B vnode types: T4 = 20Ɨ10³ ops/sec, T5 = 15Ɨ10³ ops/sec, T6 = 7Ɨ10³ ops/sec
  • Case A: at time t = 10 twelve vnodes are allocated; considering the serialisation of the horizontal actions (cf. Section 7), the seven additional Cassandra vnodes are added in 14 minutes. LocalOpt behaves as Opt in terms of scaling decisions. BestFit starts by allocating 4 vnodes of type 3, then scales up to seven vnodes (at time t = 8), and finally performs two vertical scaling actions: first from type 3 to type 2, and then from type 2 to type 1.
  • Case B: intuitively, with Cassandra vnodes capable of handling a higher workload it should be possible to better absorb the surge. With the more powerful vnode types the algorithms satisfy the requested throughput with a delay of only 2 minutes; vnode types spanning from low to very high throughput make it possible to manage throughput surges.
  • Table 5: the number of delayed requests Qi and the percentage with respect to the total number of received requests (tot.req.), computed over the time interval in which the requested throughput Tmin_RW exceeds the actual throughput.

    Case A: Opt Qi = 191.84Ɨ10³ (22.78%), LocalOpt Qi = 191.84Ɨ10³ (22.78%), BestFit Qi = 70.89Ɨ10³ (46.33%)
    Case B: Opt Qi = 7.66Ɨ10³ (4%), LocalOpt Qi = 7.66Ɨ10³ (4%), BestFit Qi = 70.58Ɨ10³ (30.29%)
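Table 5's Qi can be reproduced from a throughput trace by accumulating the shortfall while the requested rate exceeds the actual one. A minimal sketch; the one-minute sampling interval and the trace values below are illustrative assumptions, not the paper's data:

```python
def delayed_requests(requested, actual, dt_sec=60):
    """Accumulate the requests that cannot be served while the requested
    throughput exceeds the actual one (Q_i in Table 5 of the slides).

    requested, actual: throughput samples in ops/sec, one per interval;
    dt_sec: length of each sampling interval in seconds (assumed here).
    """
    return sum((want - have) * dt_sec
               for want, have in zip(requested, actual)
               if want > have)

# Hypothetical 4-minute trace: the surge outruns the actual capacity
# for two minutes until the new vnodes come online.
surge = delayed_requests([10_000, 40_000, 40_000, 40_000],
                         [10_000, 20_000, 30_000, 40_000])
print(surge)  # -> 1800000
```

Dividing by the total requests received over the same interval gives the Qi/tot.req. percentage reported in the table.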
  • 31. Physical node failure
  • Consistency level reliability R: the probability that the number of healthy replicas in the Cassandra VDC is enough to guarantee a specific level of consistency over a fixed time interval
  • Cassandra offers three main levels of consistency, both for read and write: ONE, QUORUM and ALL. Consistency level ONE means that only one replica node is required to reply correctly, i.e. it contains the replica of the portion of the dataset needed to answer the query. Consistency level QUORUM means that Q = ⌊D/2⌋ + 1 replica nodes are available to reply correctly, where D is the replication factor. Consistency level ALL means that all the replicas are available.
  • One-to-one mapping: the Cassandra VDC has a number of physical nodes H equal to the number of vnodes n, with one vnode per physical node. Consistency level ONE is guaranteed if at least one vnode is up and a replica is on that node, so the consistency reliability is

    R_O = 1 - (D/n) Ɨ (1 - ρ)^n    (24)

  where ρ is the resiliency of a physical node and D/n is the probability that a replica is on a given Cassandra vnode when the SimpleStrategy data replication strategy is used (cf. the DataStax documentation for Cassandra). In the same way, the reliability with which the Cassandra VDC guarantees a consistency level of QUORUM is the probability that at least Q vnodes are up and Q replicas are on them:

    R_Q = 1 - (D/n) Ɨ (1 - ρ)^(n - Q + 1)    (25)

  • Many-to-one mapping: in a managed Cassandra data center a VDC is rarely allocated with a one-to-one mapping of vnodes on physical nodes; the provider's resource management policies usually end up running n Cassandra vnodes on h physical nodes, with D ≤ h < n. In that case Eqs. 24 and 25 generalise to

    R_O = 1 - (D/n) Ɨ (1 - ρ)^K_O    (26)
    R_Q = 1 - (D/n) Ɨ (1 - ρ)^K_Q    (27)

  where K_O is the number of failed physical nodes that causes a failure of n vnodes, and K_Q is the number of failed physical nodes that causes a failure of (n āˆ’ Q + 1) vnodes. K_O equals the number of physical nodes used, while K_Q ranges between a best and a worst case depending on how the vnodes are distributed over the physical nodes: for example, if 8 vnodes are distributed over 5 physical nodes as {1, 1, 2, 1, 3} and n āˆ’ Q + 1 = 7, then K_O = 5.
  • Experimental setting: Table 6 reports values for D = 3 and D = 5 with ρ = 0.9, computed as follows. We consider 5 tenants with randomly generated SLAs (10% RW, 15% W and 75% R workloads; Tmin_i in the interval [10,000, 18,000] ops/sec; r_i constant at 8 GB). The number of vnodes allocated for each tenant is n = 6 for the case D = 3. We run 10 experiments and report the best and worst case.
  • Results: with a one-to-one mapping, consistency levels ONE and QUORUM are guaranteed with reliabilities of 0.9999995 and 0.999995 (about six 9s and five 9s) for D = 3; if the replication factor increases to 5, about ten 9s and eight 9s. Unfortunately, with a many-to-one mapping the reliability of the consistency level drops by orders of magnitude.
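Equations 24-27 are straightforward to evaluate numerically. A small sketch for the one-to-one mapping (the function name and the string-valued `level` parameter are mine); plugging in D = 3, n = 6, ρ = 0.9 reproduces the one-to-one column of Table 6:

```python
def consistency_reliability(D, n, rho, level="ONE"):
    """Consistency-level reliability for a one-to-one mapping (Eqs. 24-25):
    R = 1 - (D/n) * (1 - rho)^k, where k = n for consistency level ONE and
    k = n - Q + 1 for QUORUM, with quorum size Q = floor(D/2) + 1.
    rho is the resiliency of a physical node."""
    Q = D // 2 + 1
    k = n if level == "ONE" else n - Q + 1
    return 1 - (D / n) * (1 - rho) ** k

# D = 3, n = 6, rho = 0.9: six 9s for ONE, five 9s for QUORUM.
print(round(consistency_reliability(3, 6, 0.9, "ONE"), 10))     # -> 0.9999995
print(round(consistency_reliability(3, 6, 0.9, "QUORUM"), 10))  # -> 0.999995
```

For the many-to-one mapping (Eqs. 26-27), replace the exponent k with K_O or K_Q, which depend on how the vnodes are packed onto the physical nodes.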
  • 32. Physical node failure (cont'd)
  Table 6: consistency reliability R for the consistency levels ONE and QUORUM. The probability that a data replica is on a vnode is 0.5 for both D = 3 and D = 5. Columns: one-to-one mapping, then the Opt, LocalOpt, LocalOpt-H, BestFit and BestFit-H policies (some cells could not be recovered and are marked "–").

  ρ = 0.9
  RO|D=3: 0.9999995 | 0.9995 | 0.9995
  RQ|D=3: 0.999995 | 0.995 | – | 0.9995 | 0.9995
  RO|D=5: 0.99999999995 | 0.999995 | 0.999995
  RQ|D=5: 0.999999995 | 0.9995 | – | 0.999995 | 0.99995

  ρ = 0.8
  RO|D=3: 0.99996 | 0.996 | 0.996
  RQ|D=3: 0.99984 | 0.98 | – | 0.996 | 0.996
  RO|D=5: 0.999999948 | 0.99984 | 0.99984
  RQ|D=5: 0.9999987 | 0.996 | – | 0.99984 | 0.9992
  • 33. Lesson learned
  • When dealing with a specific technology there are many constraints to be considered
  • Multi-layer adaptation is a must
  • No single policy fits all workloads
  • Not all policies fit every stage of the application life cycle

  Use of the auto-scaling algorithms:
  Use case                  | Opt | LocalOpt | BestFit | LocalOpt-H | BestFit-H
  Capacity planning         |  X  |          |         |            |
  Data center consolidation |  X  |          |         |            |
  VDC consolidation         |  X  |    X     |         |            |
  run-time adaptation       |     |    X     |    X    |     X      |     X