These slides discuss the challenges of current rule-based approaches to elasticity management in cloud applications and propose decentralised, autonomous alternatives. Rule-based systems require optimal thresholds to be defined upfront and do not scale to large applications. The proposed approach uses reinforcement learning to let instances autonomously share load during critical events without a centralised controller, enabling better placement of applications across instances and more efficient scaling decisions in dynamic cloud environments.
1. Towards a Unified View of Elasticity
Srikumar Venugopal & Team
School of Computer Science and Engineering,
University of New South Wales, Sydney, Australia
srikumarv@cse.unsw.edu.au
5. Elasticity
The ability of a system to change its capacity in direct response to workload demand
6. Different Views of Elasticity
• Performance View
– When to scale, and by how much?
• Application View
– Does the architecture accommodate scaling?
– How is state managed?
• Configuration View
– Are there changes in configuration due to scaling?
10. State-of-the-art in Auto-scaling

Product/Project | Trigger | Controller | Actions
Amazon Autoscaling | CloudWatch metrics / Threshold | Rule-based / Schedule-based | Add/Remove Capacity
WASABi | Azure Diagnostics / Threshold | Rule-based | Add/Remove Capacity, Custom
RightScale/Scalr | Load monitoring | Rule-based / Schedule-based | Add/Remove Capacity, Custom
Google Compute Engine | CPU Load, etc. | Rule-based | Add/Remove Capacity

Academic:
CloudScale | Demand Prediction | Control theory | Voltage-scaling
Cataclysm | Threshold-based | Queueing model | Admission Control
IBM Unity | Application Utility | Utility functions / RL | Add/Remove Capacity
11. Summary
• The most popular auto-scaling mechanisms today are rule-based
• The effectiveness of rule-based autoscaling is determined by its trigger conditions
• So, how do we know how to set up the right triggers?
13. Elasticity (Auto-Scaling) Rules
Examples:
• If CPU Utilization ≥ 85% for 7 min., add 1 server (Scale Out)
• If RespTimeSLA ≥ 95% for 10 min., remove 1 server (Scale In)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
14. Performance of Different Elasticity Rules
• How well do elasticity rules perform in terms of SLA satisfaction, CPU utilization, costs, and % of served requests?
Rule | Elasticity Rules
CPU75 | If CPU Util. > 75% for 5 min, add 1 server; If CPU Util. ≤ 30% for 5 min, remove 1 server
CPU80 | If CPU Util. > 80% for 5 min, add 1 server; If CPU Util. ≤ 30% for 5 min, remove 1 server
CPU85 | If CPU Util. > 85% for 5 min, add 1 server; If CPU Util. ≤ 30% for 5 min, remove 1 server
SLA90 | If SLA < 90% for 5 min, add 1 server; If SLA ≥ 90% for 5 min, remove 1 server
SLA95 | If SLA < 95% for 5 min, add 1 server; If SLA ≥ 95% for 5 min, remove 1 server
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
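The rules above are simple predicates over a window of recent metric samples. As a rough illustration of how such a rule table might be encoded (a sketch in Python; the class and metric names are assumptions, not from the papers):

```python
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str       # "cpu" or "sla"
    op: str           # ">", "<", ">=", "<="
    threshold: float  # percent
    minutes: int      # how long the condition must hold
    action: int       # +1 = add a server, -1 = remove a server

# The CPU75 and SLA90 rows of the table, as data
CPU75 = [Rule("cpu", ">", 75, 5, +1), Rule("cpu", "<=", 30, 5, -1)]
SLA90 = [Rule("sla", "<", 90, 5, +1), Rule("sla", ">=", 90, 5, -1)]

OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b,
       ">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}

def fired(rule, samples):
    """samples: per-minute readings of rule.metric, most recent last."""
    window = samples[-rule.minutes:]
    return (len(window) == rule.minutes and
            all(OPS[rule.op](s, rule.threshold) for s in window))
```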
15. Cloud Testbed for Collecting Metrics
[Figure: TPC-W application servers (EC2) behind an Elastic Load Balancer, backed by a TPC-W database (EC2). Measured: response time, % SLA satisfaction, avg. CPU utilization, server costs, and % served requests.]
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
16. Performance Evaluation of Different Elasticity Rules
[Figure: box plots (min, Q1, median, mean, Q3, max) of Costs ($0.00–$2.50) and CPU Utilization (0%–90%) for the rules CPU75, CPU80, CPU85, SLA90 and SLA95.]
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
17. The Challenges of Thresholds
"You must be at least this tall to scale up!"
• Threshold values determine both performance and cost
• E.g., a low CPU-utilization threshold scales out earlier: better performance, but higher cost
• The right thresholds vary from one application to another
• Determining thresholds empirically is expensive
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
18. Can we construct a model that allows us to establish the right thresholds?
19. Queue Model of a 3-Tier Application
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
20. Establishing Rule Thresholds
• Developed a model based on the M/M/m queueing model, capturing:
– Simultaneous session initiations on one server
– The provider's provisioning lag time
– The cool-down interval after an elasticity action
– Algorithms to model scale-in and scale-out
– The request mix
• Compared the model's fidelity against actual cloud executions of the TPC-W workload (a queueing sketch follows below)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
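For intuition, the mean response time of a plain M/M/m queue follows from the Erlang C formula. A minimal sketch (textbook M/M/m only; the paper's model additionally handles provisioning lag, cool-down and request mix):

```python
from math import factorial

def mm_m_response_time(lam, mu, m):
    """Mean response time of an M/M/m queue.
    lam = arrival rate, mu = per-server service rate, m = servers."""
    a = lam / mu                  # offered load (Erlangs)
    rho = a / m                   # per-server utilisation
    assert rho < 1, "unstable queue: add servers"
    # Erlang C: probability that an arriving request must wait
    tail = a**m / factorial(m)
    p_wait = tail / ((1 - rho) * sum(a**k / factorial(k)
                                     for k in range(m)) + tail)
    wq = p_wait / (m * mu - lam)  # mean queueing delay
    return wq + 1 / mu            # plus mean service time

# e.g. 40 req/s spread over 3 servers that each serve 20 req/s
print(mm_m_response_time(lam=40.0, mu=20.0, m=3))
```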
21. Experiments: Methodology
• Run the TPC-W workload on Amazon cloud resources using the chosen thresholds
• Simulate the model in MATLAB with the same thresholds
• Compare the simulation results to the results from the actual execution
– If both are equivalent, then we are good :) (one possible error metric is sketched below)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
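One simple way to quantify "equivalent" is an error metric between the measured and simulated series; the paper's exact comparison methodology may differ, so treat this as an assumed stand-in:

```python
def mape(measured, simulated):
    """Mean absolute percentage error between two equal-length series."""
    pairs = [(m, s) for m, s in zip(measured, simulated) if m != 0]
    return 100.0 * sum(abs(m - s) / abs(m) for m, s in pairs) / len(pairs)

# e.g. response times (ms) from the EC2 runs vs. the MATLAB simulation
print(mape([120, 180, 240], [130, 170, 250]))  # ~6% error
```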
22. Experiments: Testbed
[Figure: TPC-W user emulation on an extra-large Linux EC2 instance drives an Elastic Load Balancer in front of TPC-W application servers (small/medium Linux EC2 instances running JBoss/JSDK), backed by a TPC-W database on an extra-large Linux EC2 instance running MySQL.]
24. Experiments: Elasticity Rules
Rule | Rule Expansion
CPU75 | If CPU Util. > 75% for 5 min, add 1 server; If CPU Util. < 30% for 5 min, remove 1 server
CPU80 | If CPU Util. > 80% for 5 min, add 1 server; If CPU Util. < 30% for 5 min, remove 1 server
Common parameters:
• Waiting time – 10 min.; Measuring interval – 1 min. (see the loop sketch below)
Metrics captured:
• Average CPU utilization across all the servers
• Average response time in a time interval
• Number of servers in operation at any point in time
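A minimal control loop wiring these parameters together (a sketch, not the experiment harness; read_avg_cpu, add_server and remove_server are hypothetical callbacks):

```python
import time
from collections import deque

def run_rule(read_avg_cpu, add_server, remove_server,
             up=75.0, down=30.0, hold=5, interval=60, waiting=600):
    """CPU75-style loop: `hold` consecutive per-`interval` samples beyond a
    threshold trigger one action; `waiting` seconds (the slide's 10-min
    waiting time) must pass between consecutive actions."""
    samples = deque(maxlen=hold)
    last_action = float("-inf")
    while True:
        samples.append(read_avg_cpu())   # one sample per measuring interval
        if len(samples) == hold and time.time() - last_action >= waiting:
            if all(s > up for s in samples):
                add_server()             # scale out
                samples.clear(); last_action = time.time()
            elif all(s < down for s in samples):
                remove_server()          # scale in
                samples.clear(); last_action = time.time()
        time.sleep(interval)
```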
28. Summary
• Developed a queueing model that can be used to reason about elasticity
• The model captures the effects of thresholds and can be used to test different rules
• Evaluations show that the model approximates real-world conditions closely
• Future work: handling initial bursts in the workload
30. Cons of Rule-based Autoscaling
• Commercial products are rule-based
– Gives users an "illusion of control"
– Leads to the problem of defining the "right" thresholds
• Centralised controllers
– Communication overhead increases with size
– Processing overhead also increases (Big Data!)
• One application/VM at a time
31. Challenges of large-scale elasticity
• Large numbers of instances and apps
– Deriving solutions takes time
• Dynamic conditions
– Apps go critical all the time
• Shifting bottlenecks
– Greedy solutions may create bottlenecks elsewhere
• Network partitions, fault tolerance…
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
37. Problems for Automatic Placement
• Provisioning
– Find the smallest number of servers required to satisfy the resource requirements of all the applications
• Dynamic Placement
– Distribute applications so as to maximise utilisation while meeting each app's response-time and availability requirements
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
38. Co-ordinated Control of Elasticity
• Instances control their own utilisation
– Monitoring, management and feedback
• Local controllers are learning agents
– Reinforcement Learning (see the sketch below)
• Controllers learn from each other
– Share their knowledge and update their own
• Servers are linked by a DHT
– Agility, flexibility, co-ordination
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
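As a rough sketch of such a local learning agent (tabular Q-learning; the state, action and knowledge-sharing design here is an illustrative assumption, not the paper's exact formulation):

```python
import random
from collections import defaultdict

ACTIONS = ["absorb_load", "move_app", "merge_terminate"]  # assumed action set

class LocalController:
    """One instance's controller: learns action values locally and can
    merge knowledge gossiped from peers over the DHT."""
    def __init__(self, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def choose(self, state):
        if random.random() < self.eps:                      # explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, s, a, reward, s_next):
        best_next = max(self.q[(s_next, b)] for b in ACTIONS)
        self.q[(s, a)] += self.alpha * (reward + self.gamma * best_next
                                        - self.q[(s, a)])

    def learn_from(self, peer):
        # "controllers learn from each other": naive averaging of entries
        for k in peer.q:
            self.q[k] = (self.q[k] + peer.q[k]) / 2
```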
39. Abstract View of the Control Scheme
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
40. Fuzzy Thresholds
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
42. Co-ordination using find!
• A server looks up the other servers with the least load
– DHT lookup
• It sends a move message to the selected server
• The selected server replies with accept or reject
– accept carries a positive reward (a toy exchange follows below)
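A toy version of this find/move exchange (all names are assumptions; the least-loaded `min()` stands in for the real DHT lookup):

```python
class Server:
    def __init__(self, name, load):
        self.name, self.load = name, load

    def handle_move(self, app_load, capacity=100):
        # accept only if the extra load still fits under capacity
        if self.load + app_load <= capacity:
            self.load += app_load
            return "accept"
        return "reject"

def offload(servers, app_load):
    target = min(servers, key=lambda s: s.load)  # stand-in for the DHT lookup
    reply = target.handle_move(app_load)
    return +1 if reply == "accept" else -1       # accept earns a positive reward

servers = [Server("A", 80), Server("B", 40), Server("C", 60)]
print(offload(servers, app_load=30))             # -> 1 (server B accepts)
```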
43. Shrinking
• The controller is always reward-maximising
– The highest reward is for merge+terminate
• A controller initiates its own shutdown
– When the load on its applications is low
• It takes an exclusive lock on termination
– Only one instance can terminate at a time
• It transfers state before shutting down (see the lock sketch below)
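The "one terminator at a time" rule is essentially an exclusive lock around shutdown. An in-process stand-in (the real system would need a distributed lock, e.g. co-ordinated over the DHT):

```python
import threading

_termination_lock = threading.Lock()    # stand-in for a distributed lock

def try_shutdown(instance, transfer_state):
    """Terminate `instance` only if no other termination is in progress."""
    if not _termination_lock.acquire(blocking=False):
        return False                    # another instance is already terminating
    try:
        transfer_state(instance)        # hand state off before shutdown
        instance.terminate()
        return True
    finally:
        _termination_lock.release()
```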
44. Experiments
• Six web applications
– Test application: Hotel Management
– Search → Book → Confirm
• Five were subjected to a uniform-random background load
• One was subjected to the test load
• Application thresholds: 200 ms and 500 ms
• Metrics
– Average response time, drop rate, number of servers
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
48. Key-value Stores
• The standard component for cloud data management
• Increasing workload → node bootstrapping
– Incorporate a new, empty node as a member of the KVS
• Decreasing workload → node decommissioning
– Take an existing member, whose data is held redundantly elsewhere, off the KVS
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
49. Research Questions
• As the system scales, how do we efficiently incorporate or remove data nodes?
– Load balancing, migration overheads, etc.
• How do we partition and place the data replicas while the system is elastic?
– Data consistency, durability, availability, etc.
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
50. Elasticity in Key-Value Stores
• Minimise the overhead of data movement
– How to partition/store data?
• Balance the load at node bootstrapping
– Both data volume and workload
– How to place/allocate data?
• Maintain data consistency and availability
– How to execute data movement?
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
51. Split-Move Approach
[Figure: a key space A–I with master and slave replicas spread over Nodes 1–4. Partitioning happens at node bootstrapping: ① the hot partition B is split into B1 and B2; ② B2 and its replicas are moved to the new node; ③ the superseded copies are deleted.]
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
52. Virtual-Node Approach
[Figure: the key space A–I is partitioned at system startup into many virtual nodes distributed across Nodes 1–4; a new node bootstraps by taking over existing virtual nodes (e.g. B, A, E, F, H) from its peers.]
Partition at system startup.
Data skew: e.g., the majority of data is stored in a minority of partitions. Moving giant partitions around is not a good idea. (A generic virtual-node ring sketch follows below.)
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
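For readers unfamiliar with virtual nodes, here is a generic consistent-hashing ring with virtual nodes (a textbook sketch of the idea; ElasCass's actual partitioning scheme differs in detail):

```python
import bisect
import hashlib

class Ring:
    """Consistent-hashing ring: each physical node owns `vnodes` small
    slices of the key space, so a joining node can take over many small
    partitions instead of one giant one."""
    def __init__(self, vnodes=8):
        self.vnodes, self.keys, self.owner = vnodes, [], {}

    def _h(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        for v in range(self.vnodes):
            h = self._h(f"{node}#{v}")
            bisect.insort(self.keys, h)
            self.owner[h] = node

    def lookup(self, key):
        # the key's owner is the next virtual node clockwise on the ring
        i = bisect.bisect(self.keys, self._h(key)) % len(self.keys)
        return self.owner[self.keys[i]]

ring = Ring()
ring.add_node("Node1"); ring.add_node("Node2")
print(ring.lookup("some-key"))   # one of Node1/Node2
```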
53. Our Solution
• Virtual-node based movement
– Each data partition is stored in separate files
– Reduced overhead of data movement
– Many existing nodes can participate in bootstrapping
• Automatic sharding
– Split and merge partitions at runtime
– Each partition stores a bounded volume of data
– Easy to reallocate data
– Easy to balance the load
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
54. The Timing of Data Partitioning
• Shard partitions at writes (inserts and deletes)
– Split when Size(Pi) > Θmax
– Merge when Size(Pi) + Size(Pi+1) < Θmin
[Figure: an insert that grows partition B past Θmax triggers a split into B1 and B2; deletes that shrink adjacent partitions below Θmin trigger a merge (M).]
• Require Θmax ≥ 2Θmin to avoid oscillation: the two halves of a fresh split together still exceed Θmin, so they can never immediately re-merge (a decision sketch follows below)
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
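A minimal split/merge decision in code (the Θ values are invented for illustration; after a split the halves sum to more than Θmax ≥ 2Θmin > Θmin, so the merge condition cannot fire straight away):

```python
THETA_MAX = 64 * 2**20             # 64 MB, an assumed bound
THETA_MIN = 16 * 2**20             # 16 MB, an assumed bound
assert THETA_MAX >= 2 * THETA_MIN  # the anti-oscillation constraint

def maybe_shard(sizes, i):
    """Given partition sizes (bytes), decide what to do with partition i
    after a write. Returns (action, new_sizes)."""
    if sizes[i] > THETA_MAX:                      # grew past the bound: split
        half = sizes[i] // 2
        return "split", sizes[:i] + [half, sizes[i] - half] + sizes[i+1:]
    if i + 1 < len(sizes) and sizes[i] + sizes[i+1] < THETA_MIN:
        return "merge", sizes[:i] + [sizes[i] + sizes[i+1]] + sizes[i+2:]
    return "none", sizes
```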
55. Sharding Coordination
• Solution: election-based coordination
[Figure, Steps 1–4:]
• Step 1: Election — the nodes agree on a sorted candidate list (e.g. C, E, …, A, …, B) and elect a coordinator (Node-A); a minimal election sketch follows below
• Step 2: The coordinator enforces the split/merge through the data/node mapping
• Step 3: The nodes finish the split/merge in turn (1st, 2nd, 3rd, 4th) and update the data/node mapping
• Step 4: The coordinator announces the result to all nodes
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
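The election step can be as simple as every node deterministically picking the first live entry of the shared sorted list, so all nodes converge on the same coordinator without extra messages (the sorting key is an assumption; the paper's candidates appear to be ordered by some criterion such as load or token):

```python
def elect(candidates, alive):
    """candidates: list already sorted by the agreed key (e.g. C, E, ..., B).
    Every node runs the same function on the same inputs, so all nodes
    independently pick the same coordinator."""
    for node in candidates:
        if node in alive:
            return node
    raise RuntimeError("no live candidates")

print(elect(["Node-C", "Node-E", "Node-A", "Node-B"],
            alive={"Node-E", "Node-A", "Node-B"}))   # -> Node-E
```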
56. Node Failover During Sharding
[Figure: failure handling around the four coordination steps (election; notification "Shard Pi"; replica replacement; announce). Failures are detected via gossip. A non-coordinator that fails before or during execution is removed from the candidate list, and re-appended if it resurrects. If the coordinator itself fails during execution, the remaining nodes continue without it; on timeout they invalidate Pi on the failed node and elect a new coordinator.]
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
57. Evaluation Setup
• ElasCass: an implementation of auto-sharding, built on Apache Cassandra (version 1.0.5); stock Cassandra uses the Split-Move approach
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Testbed: Amazon EC2, m1.large instances (2 CPU cores, 8 GB RAM)
• Benchmark: YCSB
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
58. Evaluation – Bootstrap Time
• Start from 1 node holding 100 GB of data with replication factor R = 2; scale up to 10 nodes
• With Split-Move, the data volume transferred reduces by half from 3 nodes onwards
• With ElasCass, the data volume transferred stays below 10 GB from 2 nodes onwards
• Bootstrap time is determined by the data volume transferred; ElasCass exhibits consistent performance at all scales
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
59. Conclusions
• We have designed and implemented a decentralised auto-sharding scheme that
– consolidates each partition replica into a single transferable unit, for efficient data movement;
– automatically shards partitions into bounded ranges, to address data skew;
– reduces node bootstrap time, balances load better, and improves query-processing performance.
61. Final Thoughts
• Elasticising application logic is done
– How do we eliminate thresholds?
– Should it be more autonomic?
• Application view of elasticity
– Managing state is the big challenge
– Decoupling of different components (service-oriented model)
– How would you scale interconnected components?