Everyone knows about the eventual consistency properties of the cloud, but do you know how long it takes for a piece of data to become consistent (fresh)? Despite the aim of providing infinite scalability, are there any hard limits on some of the leading cloud platform services? We know cloud platforms aim to provide auto-scaling, but is it really all magic?
We at the University of NSW and National ICT Australia (NICTA) have been evaluating cloud platforms over the last 18 months. In this session, we will share with the audience some of the (often surprising) evaluation findings, which should be of interest to application architects and developers designing and building solutions using the cloud.
By Anna Liu, Hiroshi Wada, Kevin Lee, National ICT Australia, UNSW
2. 10 Things You Didn’t Know About Cloud Platforms: Azure, GAE and AWS
Dr. Anna Liu, Dr. Hiroshi Wada, Kevin Lee
National ICT Australia
3. The 10 Things are...
1. How long does it take for data in the cloud to become consistent?
2. Limitations and quotas
3. How unpredictable/variable is the cloud?
4. Distributed transaction support in the cloud
5. Pricing variations over time and space
6. Sticky session support
7. The new matrix of roles and responsibilities for cloud providers, consumers and system integrators
8. Secure connections to the cloud
9. Time to get a new instance
10. Auto-scaling is not all magic
5. The Reality of Eventual Consistency in Amazon SimpleDB
• The probability of reading updated data in SimpleDB in US West
– An application reads data X ms after it has written the data
• SimpleDB has two read operations
– Eventually consistent read
– Consistent read
• This pattern is consistent regardless of the time of day
[Figure: two panels, Eventually Consistent Read and Consistent Read]
6. Consistent vs. Eventually Consistent Read
• SimpleDB’s consistent read guarantees that updated data is read
• What is the cost you pay for consistency?
– Round-trip time (RTT) is the same as that of an eventually consistent read
– Monetary cost (usage fee) is exactly the same as that of an eventually consistent read
The trade-off is not clear! We suspect consistent read is less scalable and slower under datacenter failures. However, we have not observed any differences.
7. Other Commercial NoSQL Databases
• Google App Engine
– Offers eventually consistent read and consistent read
– Behavior of eventually consistent read is completely different from Amazon’s
– In GAE, both types of read behave exactly the same unless a data center fails
• Windows Azure
– Offers no options for read
– Always consistent
9. Limitations and Quotas
Amazon Web Services
Limitations:
• Manually set up all applications
• Maximum 5 GB per file in S3
• Maximum 5 seconds query execution time in SimpleDB
Quotas:
• 20 On-Demand or Reserved Instances and 100 Spot Instances by default
• 1 GB free outgoing bandwidth per month in SimpleDB, S3 and EC2
Microsoft Windows Azure
Limitations:
• 2 deployments per service (production and staging)
• .NET, PHP or Java programming language
• Up to 50 GB for a SQL Azure database
Quotas:
• 20 concurrent small compute instances or equivalent per month
• 10 TB of total data transfers per month
Google App Engine
Limitations:
• Java or Python programming language
• Maximum 30 seconds for each request
• 1 MB for each Datastore entity
• Maximum 2 GB per file in Blobstore (each API call can manipulate <1 MB)
Quotas:
• 10 web applications per user
• 43,200,000 requests per day
• 1 GB (1,046 GB maximum if billing enabled) incoming/outgoing bandwidth per day
• 6.5 CPU-hours (1,729 CPU-hours maximum if billing enabled) per day
11. Performance Unpredictability in Cloud
• Performance unpredictability is one of the major obstacles
– Performance variance of a MapReduce job on a 50-node EC2 cluster and on a 50-node local cluster
– Examples (time as the performance metric)
• Repeatability of results for researchers
• Time-critical tasks for enterprises
12. Benchmark Details
Metric / Measurement / Benchmark tool
• Instance Startup / elapsed time from the moment a request for an instance is sent to the moment that the requested instance is available
• CPU / a single score from executing various concurrent integer and floating-point calculations / Ubench
• Memory Speed / a single score from executing random memory allocations as well as memory-to-memory copying / Ubench
• Disk I/O / sequential reads/writes and random reads, block I/O / Bonnie++
• Network Bandwidth / bandwidth, delay jitter and datagram loss / Iperf
• S3 Access / uploading a 100 MB file from one unused node of the physical cluster at Saarland University to a newly created bucket on S3
13. Benchmark Results in EC2
Coefficient of variation (COV) per benchmark:
• Physical cluster: CPU 0.1%, Memory 0.3%, Sequential Read 0.6%, Random Read 1.9%, Network 0.2%
• Small EC2: CPU 21%, Memory 8%, Sequential Read 17%, Random Read 9%
• Large EC2: CPU 24%, Memory 10%, Sequential Read 20%, Random Read 13%
• EC2 (one value shared by the small and large rows): Network 19%, S3 Access 54%
The COV of the large instance is higher than that of the small instance. However, both are at least an order of magnitude less stable than a physical cluster.
The COV of S3 Access may be influenced by other traffic on the network; this experiment is shown just for completeness.
Reference - Schad, Jörg, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. In Proceedings of the 36th International Conference on Very Large Data Bases. Vol. 3, No. 1. Singapore, Singapore: VLDB Endowment.
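The COV figures above can be reproduced for your own measurements with a few lines. The sketch below is a minimal illustration, assuming COV means the sample standard deviation divided by the mean; the function name `coefficient_of_variation` and the sample numbers are made up for this example and are not taken from the cited paper.

```python
import statistics


def coefficient_of_variation(samples):
    """Return stdev/mean for a list of benchmark results (e.g. runtimes).

    Assumption: uses the sample standard deviation (N-1 denominator).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean is zero, COV is undefined")
    return statistics.stdev(samples) / mean


# Hypothetical runtimes in seconds for the same job repeated four times.
stable_runs = [100.0, 100.1, 99.9, 100.0]
noisy_runs = [80.0, 120.0, 95.0, 105.0]
print(f"stable: {coefficient_of_variation(stable_runs):.2%}")
print(f"noisy:  {coefficient_of_variation(noisy_runs):.2%}")
```

A low percentage means repeated runs are close to each other; the slide's point is that the same benchmark shows a far larger percentage on EC2 than on a physical cluster.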
15. Distributed Transactions in Cloud
• There is now a range of cloud database types
• NoSQL (Azure Table, GAE Datastore, Amazon SimpleDB...)
– Much more ‘shardable’ architecture; no joins, no full ACID support
• SQL (Azure SQL, Amazon RDS, Oracle on EC2...)
– Variable distributed transaction support compared to their traditional RDBMS counterparts
• Experience with porting PetShop
• Challenge with porting the data access layer
– Some JDO interfaces are not supported by App Engine, e.g. ‘join query’
– No distributed transaction support in Azure SQL at the moment
17. Pricing Fluctuates over Space and Time
• On-demand pricing (hourly, per GB, per 1,000 requests)
• Reserved instances (1 or 3 year term + unit cost)
• Spot pricing (typically cheaper in US-East!)
• Similar pricing schemes observed for GAE and Azure
19. Sticky Session Support
• Autoscaling alone does not guarantee that clients of the same session will always contact the same instance
• Without that guarantee, clients cannot perform a series of connected operations
• Amazon ELB supports session affinity
– Session affinity allows the client-to-instance mapping to be created at the ELB
– Limitations
• Session affinity cannot handle HTTPS
• Scaling down an instance that has a live session
• MS Azure advocates stateless sessions
– If you must keep state, store session state in, e.g., table storage
• Design issue: should the server remember the conversation context, or should the client remind it every time? How long should a session ‘stick’? Too long compromises the server’s ability to distribute load
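The stateless option mentioned above can be sketched as follows. This is a minimal illustration, not any provider's API: `SessionStore` is a hypothetical in-memory stand-in for shared storage such as a table service, and `handle_add_to_cart` is an invented request handler. The point is only that the handler keeps nothing in instance memory, so any instance behind the load balancer can serve the next request of the same session.

```python
import uuid


class SessionStore:
    """Hypothetical stand-in for shared storage (e.g. a table service).

    A real deployment would replace the dict with calls to external storage.
    """

    def __init__(self):
        self._rows = {}

    def load(self, session_id):
        return dict(self._rows.get(session_id, {}))

    def save(self, session_id, state):
        self._rows[session_id] = dict(state)


def handle_add_to_cart(store, session_id, item):
    """Handle one request; all conversation state lives in the store."""
    if session_id is None:
        session_id = uuid.uuid4().hex  # new session, id goes back to the client
    state = store.load(session_id)
    cart = list(state.get("cart", []))
    cart.append(item)
    state["cart"] = cart
    store.save(session_id, state)
    return session_id, list(cart)


# Two requests of one session; in production they could hit different instances.
shared = SessionStore()
sid, cart = handle_add_to_cart(shared, None, "fish")
sid, cart = handle_add_to_cart(shared, sid, "cat")
print(cart)
```

The trade-off is an extra storage round trip per request in exchange for freedom to add or remove instances without breaking live sessions.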
21. Customers’ Responsibility in IaaS Cloud
[Diagram: split of responsibilities between the IaaS provider and the customer]
Amazon EC2 (IaaS providers):
• Infrastructure Monitoring (CPU, Disk, Net, …)
• Usage Report and Basic Billing
Customers’ Responsibility:
• Access Control to IaaS
• Infrastructure Configuration (VPN, VMs, Disk, …)
• OS/Application Security (e.g., Active Directory)
• OS/Middleware Installation/Configuration
• OS Patching
• Application Installation/Configuration
• Application Patching
• Billing (Cost Center Charging)
• Antivirus
• OS Backup
• OS Monitoring
• App Data Backup
• Application Monitoring
24. Performance Implications
• Low security option: maximum throughput is 5.6 MB/sec
• High security option: connection throughput is 4 MB/sec
– Performance hit due to encryption, decryption and firewall
• Other interesting observations:
– VPC is only available in US East-1 and EU-west1
– In a single availability zone only
– S3 does not work well with VPC yet (very slow); EBS is a workaround
– MS Azure VPN support is expected next year
– Google Secure Connector
26. Time to Get a New Instance
• It typically takes minutes to create an instance from its image on EC2
• Trick to “create” instances quicker
– Create a pool of instances in advance, and stop (hibernate) them all
• Pay no instance cost, but pay the storage cost (for stopped instances)
– Revive stopped instances when new instances are needed
Operating System / Method / Time
• Windows / Create from image / 10-15 minutes
• Linux / Create from image / 5-10 minutes
• Windows / Revive stopped instance / 30 seconds
• Linux / Revive stopped instance / 30 seconds
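The pool trick can be modelled in a few lines to see how it changes waiting time. The sketch below is a pure simulation under stated assumptions: `WarmPool` is a hypothetical class, it makes no cloud API calls, and the wait times are simply the upper bounds from the table above.

```python
from dataclasses import dataclass

# Upper bounds taken from the table above, in seconds.
CREATE_SECONDS = {"windows": 15 * 60, "linux": 10 * 60}
REVIVE_SECONDS = 30


@dataclass
class WarmPool:
    """Toy model of a pool of pre-created, stopped instances."""

    os_name: str
    stopped: int = 0  # instances created in advance and stopped
    running: int = 0

    def acquire(self):
        """Bring one more instance online and return the estimated wait."""
        if self.stopped > 0:
            self.stopped -= 1
            self.running += 1
            return REVIVE_SECONDS  # revive a stopped instance
        self.running += 1
        return CREATE_SECONDS[self.os_name]  # fall back to create-from-image


pool = WarmPool("linux", stopped=2)
waits = [pool.acquire() for _ in range(3)]
print(waits)  # the third request exhausts the pool and pays the full creation time
```

Sizing the pool is then a cost question: each stopped instance costs storage, while each miss costs minutes of start-up delay.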
28. Autoscaling is Not All Magic
• Amazon EC2
“… your application can automatically scale itself up and down depending on its
needs.”
• Windows Azure
“Optimized for scale-out applications - designed so that developers can easily build
scale-out applications…”
• Google App Engine
“No matter how many users you have or how much data your application stores,
App Engine can scale to meet your needs”
29. Autoscaling is Not All Magic (contd)
Amazon EC2
How to scale:
• Load balancing with Elastic Load Balancer (ELB)
• Event processing with Autoscaling API
• Monitoring through CloudWatch
Limitations:
• Load balancer is the bottleneck, hence limited throughput
• Limited load balancing options (e.g., no hardware load balancer)
• Limited rule support (e.g. no conjunctions allowed in rules)
• Limited monitoring support (e.g. limited to minute granularity)
Windows Azure
How to scale:
• Load balancing with Azure Queue Storage
• Event processing with WF rules engine
• Monitoring through Azure Diagnostics
• Create/delete instances with Management API
Limitations:
• Throughput limited by Azure Queue
• Limited monitoring support (e.g. billing information not monitored)
Google App Engine
How to scale:
• Built-in with App Engine
Limitations:
• No control over how it scales
• Number of simultaneous sessions limited by per-minute (burst) quota (500 requests per sec by default), server request time-out (30 secs), etc.
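Where the platform leaves the scaling logic to you (as with the queue-based approach listed for Windows Azure), the decision itself is small; the work is in monitoring and in calling the management API. The sketch below shows only the decision step. It is a hypothetical rule, not a provider API: `desired_instances`, the per-instance capacity, and the bounds are all assumptions, with the upper bound of 20 borrowed from the instance quota listed earlier in the deck.

```python
import math


def desired_instances(queue_length, per_instance_capacity,
                      min_instances=1, max_instances=20):
    """Return how many worker instances a backlog of queued messages calls for.

    Assumption: each instance can drain `per_instance_capacity` messages in
    one scaling interval; the result is clamped to the quota bounds.
    """
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    needed = math.ceil(queue_length / per_instance_capacity)
    return max(min_instances, min(max_instances, needed))


for backlog in (0, 250, 5000):
    print(backlog, desired_instances(backlog, per_instance_capacity=100))
```

Note the clamp: once the backlog needs more than the quota allows, the rule silently stops scaling, which is exactly the kind of limit the table above warns about.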
31. Getting Involved
• Linkage with National ICT Australia
•Contract Research, Expert Advisory Services,
Architecture Reviews
•Public and In-house Training Courses
•Market Surveys, Case Studies
•Professional in Research Residence
Anna.Liu@nicta.com.au, @annaliu
http://blogs.unsw.edu.au/annaliu/
33. Virtual Machine ‘Stolen Time’
• Using traditional system resource monitoring tools in cloud
– Measuring system performance within a virtual instance (using tools
such as vmstat and top) can give misleading information
– Example: An EC2 instance (e.g. m1.small with 1 EC2 compute unit)
does not go above around 40% CPU load as observed from vmstat
• Certain percentage (around 50-60%) appears on vmstat as ‘st’
“st – Time stolen from a virtual machine” (from vmstat manpage)
• Does it mean I am not getting what I paid for? No, not really
– Amazon instances are measured by EC2 compute units
– “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor”
• Monitoring system performance in cloud
– Use Cloud monitoring tools such as CloudWatch and RightScale
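To see where the ‘st’ column comes from, the steal share can be computed directly from two readings of the aggregate `cpu` line in Linux `/proc/stat`. The sketch below is an illustration under assumptions: `steal_percent` is a name invented here, the two sample lines are made-up counter values, and the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the one documented for `/proc/stat`.

```python
def steal_percent(before, after):
    """Return the percentage of CPU time stolen between two /proc/stat readings.

    `before` and `after` are the aggregate "cpu ..." lines. Only the first
    eight counters are summed, because guest time is already counted in user.
    """
    def counters(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in parts[1:9]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("no CPU time elapsed between readings")
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


# Made-up readings: 150 ticks elapsed, 60 of them stolen by the hypervisor.
t0 = "cpu 100 0 50 800 10 0 0 40 0 0"
t1 = "cpu 140 0 60 840 10 0 0 100 0 0"
print(steal_percent(t0, t1))
```

On a real instance the two lines would be read from `/proc/stat` a few seconds apart; a consistently high result matches the behaviour described above for small instances.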
34. Limitation of Virtual Private Cloud (VPC)
• VPC hosts are logically detached from (but physically
attached to) the Amazon network
– No direct connection to and from S3 via the Amazon local network
– Connection via internet only
• What happens if we need to transfer data from S3 to a VPC host?
– E.g. if we ship removable media to Amazon, it is uploaded to S3. How do we transfer the data to a VPC host?
– Option 1: Direct transfer from S3 to VPC host
• Traffic routes through the remote side and comes back (High latency)
– Option 2: Transfer to EBS and mount EBS to VPC host
• Traffic routes through local network (Low latency)
35. How Long Do You Need to Wait to Get Updated Data with Eventually Consistent Read?
• Result of the “5 minutes run” for one week
• t1: the first time updated data is read
• t2: the first time 100% of reads return updated data
• t3: the last time stale data is read
Mostly updated after 600 ms, but no guarantee
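The three milestones can be computed from raw probe results. The sketch below is one possible reading of the definitions, not the authors' measurement code: it assumes each probe is recorded as a pair of (delay in ms after the write, whether the read returned the updated value), groups probes by delay, and uses the hypothetical name `freshness_milestones`.

```python
from collections import defaultdict


def freshness_milestones(samples):
    """Return (t1, t2, t3) in ms from (delay_ms, is_fresh) probe results.

    t1: smallest delay at which any read returned updated data
    t2: smallest delay at which all reads returned updated data
    t3: largest delay at which any read still returned stale data
    A milestone is None if the samples never reach it.
    """
    by_delay = defaultdict(list)
    for delay_ms, is_fresh in samples:
        by_delay[delay_ms].append(is_fresh)

    t1 = t2 = t3 = None
    for delay in sorted(by_delay):
        reads = by_delay[delay]
        if t1 is None and any(reads):
            t1 = delay
        if t2 is None and all(reads):
            t2 = delay
        if not all(reads):
            t3 = delay
    return t1, t2, t3


# Made-up probe results: two reads at each delay.
probes = [(0, False), (0, False), (100, True), (100, False),
          (200, True), (200, True), (300, True), (300, False),
          (400, True), (400, True)]
print(freshness_milestones(probes))
```

In this made-up data t3 is later than t2, which mirrors the slide's warning: reaching 100% fresh reads once does not guarantee that a later read cannot be stale.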
Editor’s Notes
Reduce cost, reduce complexity
Quotas are resource constraints configured by the vendors. You can probably contact the vendors for more resources beyond the quotas, but communication takes time and brings an opportunity cost. Limitations are mostly functional restrictions; you probably cannot go beyond them by making a phone call.
Amazon Web Services
Manually set up all applications – large maintenance and operation costs, including upgrading systems, installing applications and configuration.
Maximum 5 GB per file in S3 – e.g. terabyte-scale files cannot be put into S3 directly. Extra effort is needed: the file has to be divided into smaller chunks (at most 5 GB each) before storing. The same effort is required during retrieval: all retrieved chunks have to be merged manually.
Maximum 5 seconds query execution time in SimpleDB – no long-running queries in SimpleDB. If thousands of items are queried in SimpleDB, the query can fail due to a timeout. Developers need to estimate the query time beforehand, separate a large query into small queries, and combine/merge the query results on the client side.
20 On-Demand or Reserved Instances and 100 Spot Instances by default – You can get more instances by contacting Amazon, but that will increase your opportunity cost if you need to scale out immediately.
1 GB free outgoing bandwidth per month in SimpleDB, S3 and EC2 – Yes, you need to pay for extra usage.
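The split-and-merge work described for the S3 file size limit can be sketched locally. This is a minimal illustration that only handles files on disk; uploading and downloading the parts is left out, and the function names `split_file` and `merge_files`, the `.partNNNNN` naming, and the buffer size are assumptions made for this example.

```python
import shutil


def split_file(path, chunk_bytes, buffer_bytes=1024 * 1024):
    """Split `path` into numbered part files of at most `chunk_bytes` each.

    Data is streamed in `buffer_bytes` pieces so a multi-GB chunk never has
    to fit in memory. Returns the part paths in order.
    """
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(min(buffer_bytes, chunk_bytes))
            if not data:
                break  # source exhausted, no empty part is created
            part_path = f"{path}.part{index:05d}"
            with open(part_path, "wb") as dst:
                written = 0
                while data:
                    dst.write(data)
                    written += len(data)
                    if written >= chunk_bytes:
                        break
                    data = src.read(min(buffer_bytes, chunk_bytes - written))
            parts.append(part_path)
            index += 1
    return parts


def merge_files(parts, out_path):
    """Concatenate part files, in the given order, into `out_path`."""
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
```

For the limit discussed above, `chunk_bytes` would be set to 5 GB or less; the zero-padded index keeps the parts in the right order when they are listed and merged after retrieval.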
Microsoft Windows Azure
2 deployments per service (production and staging) – The two deployments are used for the production version and the staging version separately, targeting end users and test users respectively. But this is not enough to run multiple test versions at the same time.
.NET, PHP or Java programming language – limited to .NET, PHP and Java developers
Up to 50 GB for SQL Azure – The maximum size of a single SQL Azure database is 50 GB. If you have more than 50 GB of data, you probably have to consider data partitioning to scale out across multiple databases.
20 concurrent small compute instances or equivalent per month – 1 clock hour to an extra large instance equates to 8 small instance hours. Therefore, you can only have
10 TB of total data transfers per month – You can probably get more if you send a request to Microsoft
Up to 750 GB SQL Azure databases per month – For SQL Azure, the offer originally states 150 Web Edition databases (not sure whether it is or/and, see http://www.microsoft.com/windowsazure/offers/popup/popup.aspx?lang=en&locale=en-us&offer=MS-AZR-0013P) 15 Business Edition databases. Since the maximum size of each Web Edition database is 5 GB and of each Business Edition database is 50 GB, simple math (150*5 or 15*50) gives 750 GB.
Google App Engine
Java or Python programming language – PHP developers can do nothing on Google App Engine.
Maximum 30 seconds for each request – Each request has to be answered within 30 seconds; otherwise an exception is returned instead of a result. Computationally heavy tasks are therefore not suitable for GAE. The alternative is again to split the task. GAE has made an early experimental release of MapReduce to support this, but only the Mapper is implemented at this stage.
1 MB for each Datastore entity – Only 1 MB for each data item. You will probably find it hard to store a photo in GAE. Also, due to the 30-second limit, your query must be processed within 30 seconds.
Maximum 2 GB per file in Blobstore – The same reasoning as for AWS. In addition, the maximum size of Blobstore data that the app can read with one API call is only 1 MB. So even if you store 2 GB in Blobstore, it is still difficult to manipulate the data in GAE.
10 web applications per user – since the case of bush fire in 2009. I think all the following parameters can be adjusted by Google.
43,200,000 requests per day
1 GB (1,046 GB maximum if billing enabled) incoming/outgoing bandwidth per day
6.5 CPU-hours (1,729 CPU-hours maximum if billing enabled) per day
Reference – Saarland University paper at VLDB (Schad et al., 2010)
IaaS provides
Basic infrastructure monitoring: such as CPU/RAM/disk/network usage. The data needs to be converted and fed into a dashboard system in AMP
Usage report and basic billing: usually one account = one bill
Customers’ responsibility
Access control to IaaS: password and secret key management. Change passwords and keys regularly to tighten security. Access log of the IaaS console (not available in EC2)
Infrastructure configuration: establish the VPN. Choose appropriate machine images, or upload machine images. Add disks to virtual machines
OS/middleware installation/configuration: depends on the machine images. Pre-configured machine images reduce the workload
OS patching: must be performed by customers
Antivirus: must be installed by customers if not included in a machine image
OS backup: IaaS usually allows taking snapshots of virtual machines
OS monitoring: if the monitoring facility provided by IaaS is not enough, you need to run your own. Feed the data into a dashboard system in AMP
Application installation/configuration: as usual
Application patching:
App data backup: take snapshots using the IaaS functionality, or do it yourself, e.g. by running rsync.
Application monitoring: feed the data into a dashboard system in AMP
OS/application security: such as access control by Active Directory
Billing: need to translate the IaaS bill into a cost center-based bill.
Figure 1 shows a typical setup of the Amazon VPC. This VPC setup allows a company’s infrastructure to be connected with the Amazon EC2 infrastructure via a VPN connection. It requires setting up two VPN gateways (one on each of the local and remote sides). A secure VPN connection is established between the two gateways via the IPsec protocol. EC2 instances on the remote side (Amazon side) are operated within subnets behind the remote VPN gateway. That is, these EC2 instances are isolated from the rest of the EC2 network, and only these instances can access the hosts on the local side. Similarly, hosts can be added on the local side behind the customer gateway (local VPN gateway), and only these hosts have access to the remote EC2 instances.
A typical VPC connection meets the following security requirements:
Utilise the AES 128-bit encryption function
Utilise the SHA-1 hashing function
An example business report query took 16 min 30 sec
The same query takes less than 1 min in the existing on-premise dev environment
Data transfer over SSIS takes 14 min (only 42 KB/sec of throughput)
No bottleneck observed on CPU (3-10%), memory (6 GB free), disk (low activity) or network (0.03% usage of 1 Gbps)
SSIS protocol?
-----------------
Done. It works! I did the following:
1. Start an EC2 micro instance outside the VPC and attach an EBS volume to it
2. Copy the file from S3 to the EBS volume attached to the micro instance
3. Detach the EBS volume from the micro instance
4. Attach the EBS volume to an instance inside the VPC
Note that we did NOT route through NICTA here at all. The file I used for this experiment is ~700 MB in size. Step 2 took 130 s (i.e. 5.39 MB/s).
An article (with link to his paper) by Huan Liu discussing limitations of load balancers and autoscaling:
http://huanliu.wordpress.com/tag/auto-scaling/
http://codecrafter.wordpress.com/2008/10/03/google-app-engine-scalability-that-doesnt-just-work/
An example on scaling in Azure:
http://code.msdn.microsoft.com/azurescale/Release/ProjectReleases.aspx?ReleaseId=4167