This document discusses the importance of performance testing cloud applications and outlines best practices for defining performance requirements, testing methodology, and identifying issues. It provides examples of performance problems found in databases, applications, operating systems, and networks. The key goals of performance testing are to understand system behavior under load, find bottlenecks and hidden bugs, and verify that requirements are met.
Adding Value in the Cloud with Performance Test
1. MySQL, NoSQL & Cloud 2014
Adding Value in the Cloud
with Performance Test
Rodolfo Kohn
Intel Software Argentina
2. Cloud Applications are Complex
11/24/2014 2
[Diagram components: DNS server, .com root; Datacenter-1 and Datacenter-2, each with GLB and Auth; Service; four caches; DNS; disk; network; SMTP; CDN; NoSQL; SQL; monitoring; logs; configuration management]
Multiple opportunities for unexpected failures
Load bursts and response time deterioration
3. Bad Performance affects User Experience
Consumer
Competitor
Intel® Web Service
Intel® Web Service
If the backend system has poor performance or poor scalability, it will fail miserably.
Impact on the company's business and reputation
4. Performance Requirements
System has to accomplish performance targets
•Response time
–Under average load
–Under heavy load
•Throughput
•Concurrent operations
System has to deal with traffic peaks
•Acceptable response time up to maximum load burst
•Maximum load supported per capacity unit
•No failure under stress
System has to scale horizontally
•All layers
•Linear throughput increase keeping response time upon capacity increase
•Scalability profile: how to deal with sequential bottlenecks as load increases
5. Availability and Resilience Requirements
System has to be designed for failures
•Server failures (Web Servers, DBs, Security gateways, etc)
•Hardware failures (disk, networking)
•Datacenter failures
•Replication failures
Availability: 99.9%, 99.99%
•Service Level Agreement
•Monitoring
•Self-Healing
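To make the availability targets above concrete, the following minimal sketch converts a percentage into the downtime it still permits per year. It is illustrative arithmetic only; the function name and the non-leap-year assumption are not from the slides.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, that a target still permits."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.0f} min/year)")
```

Each additional "nine" cuts the budget by a factor of ten, which is why monitoring and self-healing become necessary at the higher targets.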
Achieving these quality attributes is not straightforward
6. Problem I-A: Performance Requirements
Developer, Product Manager
Performance Engineer
Any performance and scalability requirements?
Yes, it has to be fast and very scalable
7. Problem I-B: Performance Requirements
Manager, Technical Leader, etc.
Performance Engineer
This request has a response time of 5 minutes
Where is the requirement saying it should be less?
8. Good Performance Requirements
Identify your business events and business entities
Understand the order of magnitude you have to deal with
Estimate your workload
•Based on current workloads if possible
•Educated guess
Set performance targets for business events based on expected business entities
For static datacenters, performance targets will change as load changes
For dynamic datacenters and scalable systems, performance targets should be set per server or group of servers
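As a rough illustration of estimating a workload from business entities and events, the sketch below turns a daily event count into average and peak throughput targets. The user count, events per user, and burst factor are hypothetical inputs chosen for the example, not figures from the talk.

```python
def average_tps(events_per_day: float) -> float:
    """Average transactions per second if events were spread evenly over a day."""
    return events_per_day / (24 * 60 * 60)


def peak_tps(events_per_day: float, peak_factor: float) -> float:
    """Peak target: average rate multiplied by an assumed burst factor."""
    return average_tps(events_per_day) * peak_factor


# Hypothetical inputs: 1,000,000 users, 20 business events per user per day,
# and bursts assumed to reach 5x the daily average.
events = 1_000_000 * 20
print(f"average: {average_tps(events):.0f} TPS, peak: {peak_tps(events, 5):.0f} TPS")
```

Even an educated guess like this fixes the order of magnitude, which is usually enough to set a first performance target per server or group of servers.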
9. Good Scalability Requirements
Horizontal Scalability
•At all layers
•At all layers but DB
Linear scalability: if capacity is doubled, throughput doubles with the same response time
•Scalability can be linear until bottleneck in DB
•Usually DB is the most difficult layer to scale out
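The linear scalability requirement can be checked numerically from two measurements. This is a minimal sketch: the 10% tolerance and the sample numbers are assumptions for illustration, not values from the slides.

```python
def scaling_efficiency(base_servers, base_tps, scaled_servers, scaled_tps):
    """Ratio of measured throughput gain to capacity gain (1.0 means linear)."""
    capacity_gain = scaled_servers / base_servers
    throughput_gain = scaled_tps / base_tps
    return throughput_gain / capacity_gain


def is_linear(efficiency, response_base_ms, response_scaled_ms, tolerance=0.10):
    """Linear if throughput kept pace and response time stayed within tolerance."""
    same_response = response_scaled_ms <= response_base_ms * (1 + tolerance)
    return efficiency >= 1 - tolerance and same_response


# Hypothetical measurements: 2 servers at 400 TPS / 120 ms, 4 servers at 760 TPS / 125 ms.
eff = scaling_efficiency(2, 400, 4, 760)
print(f"efficiency={eff:.2f}, linear={is_linear(eff, 120, 125)}")
```

An efficiency well below 1.0 after doubling a layer suggests the bottleneck has moved to another layer, often the DB.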
10. How do we verify requirements
Requirement: Performance
•Test types: performance tests, stress tests, longevity tests
•Tools: SCAP Management Tool, Apache JMeter, PAL, New Relic, Microsoft Performance Monitor, iostat, Ganglia, Wireshark, tcpdump
Requirement: Scalability
•Test types: scalability tests
Requirement: Availability and Designed for Failure
•Test types: longevity tests, stress tests, chaos monkey tests
•Tools: Apache JMeter, Nagios, New Relic, SOAP UI
11. Performance and stress testing - Goals
Understand actual system behavior under load.
Determine actual system performance
•Load supported.
•Concurrent clients supported.
•Response time.
Find hidden bugs
•Memory leaks, deadlocks, race conditions, unhealthy resource consumption, logs filling up disks, system exceptions in logs
Find bottlenecks
•Overly long DB queries, missing indexes, resource consumption by component, etc.
12. Performance test tools in action
[Diagram: the two-datacenter architecture (DNS server, .com root, GLB, Auth, Service, caches, DNS, NoSQL, SQL) with Performance Monitor instances and agents attached, producing PAL reports, a New Relic dashboard, and a performance baseline]
14. Performance test executions
We use internal and external JMeter instances to execute the tests.
We increase the number of threads until the system cannot handle them.
Before executing, we create the DB with preloaded data.
•Results change when you execute with a populated db
During each execution we collect:
•TPS, Response time
•Performance counter results, using templates exported from PAL.
•New Relic Transaction breakdown.
•Slow queries with New Relic
•DB CPU utilization and memory.
Establish baselines
•Create/update the baselines for each scenario
•Compare with existing baselines.
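Comparing a run against its baseline can be automated. The following is a minimal sketch, assuming results are kept per scenario as TPS and response time; the scenario names, the numbers, and the 10% tolerance are hypothetical.

```python
def compare_with_baseline(baseline, current, tolerance=0.10):
    """Return scenarios whose TPS fell or response time rose beyond tolerance."""
    regressions = {}
    for scenario, base in baseline.items():
        run = current.get(scenario)
        if run is None:
            continue  # scenario not executed in this run
        problems = []
        if run["tps"] < base["tps"] * (1 - tolerance):
            problems.append("tps")
        if run["response_ms"] > base["response_ms"] * (1 + tolerance):
            problems.append("response_ms")
        if problems:
            regressions[scenario] = problems
    return regressions


# Hypothetical numbers, for illustration only.
baseline = {
    "login": {"tps": 200, "response_ms": 150},
    "create_account": {"tps": 80, "response_ms": 400},
}
current = {
    "login": {"tps": 205, "response_ms": 148},
    "create_account": {"tps": 60, "response_ms": 520},
}
print(compare_with_baseline(baseline, current))
```

A tolerance band avoids flagging normal run-to-run noise; the right width depends on how stable the test environment is.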
15. Problem II: How to test
Developer, Performance Engineer
From where are you generating load?
From the same datacenter or from a different datacenter?
16. Same Datacenter
[Diagram: one JMeter client and four JMeter servers generating load against the Intel® Web Service from inside the same datacenter]
Easier to stress target system
Easier to target specific layer or server
17. Different Datacenter
[Diagram: one JMeter client and four JMeter servers generating load against the Intel® Web Service from a different datacenter]
More realistic
It is possible to detect issues in firewalls and external load balancers
It is possible to understand effect of latency (still not end user experience)
It is possible to use IaaS (AWS, Rackspace, etc.)
It is possible to use external performance test services
18. Problem III: Test Environment
Performance Engineer
Manager
The test environment is not exactly the same as production
Test is not valid!
Enemy
19. Performance Test Environment
Ideally, the performance test environment should be identical to production
Often this is not possible because of the high cost
•Findings on software performance and scalability are still valid most of the time
•It is still possible to obtain comparable results between different software versions
•It is not possible to find infrastructure issues
IaaS or PaaS
•Generate identical environments on demand
•Pay for what you use
Running performance tests on the production environment is not a good idea
20. Problem IV: Simulating users
Performance Engineer
Manager
How will you generate the load for 1 million users?
21. Virtual Users and Load Generation
There are two options to simulate users
Simulate the expected number of users including thinking time
•Closer to reality
•It is costly for tools that use one thread per virtual user (JMeter)
•Tools that work asynchronously (Tsung) are more efficient
Generate the load the expected number of users would generate
•Determine if the system can reach certain throughput
•Thinking time is eliminated
•Easier to stress the system
•Possible with tools that use one thread per virtual user
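The two options can be related with Little's Law for a closed system, users = throughput × (response time + thinking time). The sketch below uses it to estimate how few threads the second option needs; the workload numbers are hypothetical.

```python
import math


def target_throughput(users: int, response_s: float, think_s: float) -> float:
    """Requests per second generated by `users` virtual users (Little's Law)."""
    return users / (response_s + think_s)


def threads_without_think_time(throughput: float, response_s: float) -> int:
    """Threads needed to produce the same throughput with thinking time removed."""
    return math.ceil(throughput * response_s)


# Hypothetical workload: 10,000 users, 0.3 s response time, 30 s thinking time.
x = target_throughput(10_000, 0.3, 30.0)
print(f"{x:.0f} req/s, {threads_without_think_time(x, 0.3)} threads without thinking time")
```

Under these assumptions a thread-per-user tool needs 10,000 threads for the first option but only about a hundred for the second, at the cost of not reproducing per-user connection behavior.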
22. Problem V: Load Test Time
Performance Engineer
Manager
For how long are you running each performance test?
Minimum 15 minutes
24. Problem VI: Issues found in performance test
Performance Engineer
Manager
Did you find any issue?
25. Where issues can be found
HW: Network, Disk, CPU, Memory
OS
Middleware
Application
DB
Load Balancer
Firewall
Internet
26. Database Performance Issues
Complex Data Model prioritizing Maintainability over Performance
•Large joins with execution time growing exponentially as number of entities grows
•Detected by
–Evaluating execution plan
–Pre-populating DB with large number of rows and measuring DB query execution time
Performance vs. Maintainability: break normalization if necessary
Missing Index
•Easy to find with Execution Plan and when DB is pre-populated
Predicates evaluating two conditions that occur with different frequency (99-1)
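A missing index shows up directly in the execution plan once the table is pre-populated. The sketch below demonstrates the idea with Python's built-in SQLite rather than MySQL (a substitution made so the example is self-contained); the table, column, and index names are hypothetical, and the exact plan wording varies by SQLite version.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
# Pre-populate with synthetic rows so the test resembles a loaded database.
conn.executemany(
    "INSERT INTO accounts (email, name) VALUES (?, ?)",
    ((f"user{i}@example.com", f"User {i}") for i in range(10_000)),
)

lookup = "SELECT * FROM accounts WHERE email = ?"
plan_before = query_plan(conn, lookup, ("user42@example.com",))
conn.execute("CREATE INDEX idx_accounts_email ON accounts (email)")
plan_after = query_plan(conn, lookup, ("user42@example.com",))

print("before:", plan_before)  # expected to report a full table scan
print("after: ", plan_after)   # expected to report a search using the index
```

In MySQL the equivalent check is to run EXPLAIN on the query and look for full scans on large tables.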
27. Application Issues
Performance vs. Maintainability
Design prioritizing Maintainability over Performance
•Request generating N requests between components multiplying total request execution time
–If response time per request is 300 ms, final response time will be 300 * N ms.
•ORM sometimes turns a simple request into N DB queries multiplying DB access time
Access to time consuming third-party services
No use of cache
•Developers are mostly focused on functionality and little on data access frequency
Bad use of cache
•Didn’t think about data invalidation
•Creation of normalized data in cache
•Access time always should be O(1)
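The cache points above can be sketched as a small dictionary-backed cache: values are stored ready to serve under a single key (O(1) average lookup), and invalidation is explicit. This is an illustrative sketch, not the cache used in the project; `TtlCache`, `load_profile`, and the 60-second expiry are hypothetical.

```python
import time


class TtlCache:
    """Dictionary-backed cache: O(1) average lookups, expiry, explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling `loader` only on a miss or after expiry."""
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key when the underlying data changes."""
        self._entries.pop(key, None)


calls = []


def load_profile(user_id):  # hypothetical slow lookup (DB or remote service)
    calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = TtlCache(ttl_seconds=60)
cache.get(7, load_profile)
cache.get(7, load_profile)  # served from cache, loader not called again
cache.invalidate(7)         # e.g. after the profile is updated
cache.get(7, load_profile)  # reloaded from the source
print(len(calls))
```

Calling `invalidate` from every write path is the part that is easy to forget; the expiry only bounds how long stale data can survive if it is missed.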
28. Performance - Real life example
•During the tests we executed:
–10, 20, 30, 40, 50, 60 threads
–TPS went down after 50 threads
–Response time increased strongly after 50 threads.
•Database was populated with 300K accounts
[Diagram: WS servers (IIS), AppFabric servers, service layer, MySQL DB, and an external service]
[Chart: TPS and response time (ms) per number of threads; logarithmic scale from 1 to 100000; threads from 10 to 70]
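The point where adding threads stops helping can be read from the collected results. Below is a minimal sketch; the measurements are hypothetical values shaped like the run described on this slide, not the real data.

```python
def saturation_point(results):
    """Return the thread count with the highest TPS.

    `results` is a list of (threads, tps, response_ms) tuples sorted by threads.
    Beyond the returned thread count, more threads no longer raise throughput.
    """
    best = max(results, key=lambda r: r[1])
    return best[0]


# Hypothetical measurements (not the real data from the test).
runs = [
    (10, 95, 105), (20, 180, 110), (30, 260, 115),
    (40, 330, 120), (50, 390, 128), (60, 310, 190),
]
print(saturation_point(runs))
```

Once the saturation point is known, the counters and transaction breakdown collected at that load level are the ones worth inspecting for the bottleneck.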
29. Performance - Real life example
•We executed the PAL report.
•CPU and memory were healthy on the servers.
•There was a bottleneck that was causing slow responses
30. Performance - Real life example
•We found the problem by looking at the New Relic dashboards
•A call to an external system was taking 90% of the time
31. OS Issues: Real Life Example
Configuration issues: TCP configuration
While stressing our system we noticed the following exception in our application:
Error: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
32. User Ports
By default, Windows Server limits ephemeral TCP ports to port numbers up to 5000 (this default changed in Windows Server 2008).
If the application tries to reserve an additional port (beyond the limit) it receives the error:
•An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
This behavior can be changed, and was in our case, by adding a new value to the registry:
•Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters we added the value MaxUserPort
•This value was set to 65534 (decimal)
34. Socket TIME WAIT
When we ran netstat -b we found many finished TCP connections in state TIME_WAIT.
This is a state of a TCP connection after a machine’s TCP has sent the ACK segment in response to a FIN segment received from its peer.
During this time resources are not released.
According to the documentation, the default value for this timeout is 240 seconds; on our servers we found it was actually between 60 and 120 seconds (Windows Server 2008 has different defaults).
We changed the value to 30 seconds in the registry:
•Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters we added the value TcpTimedWaitDelay
•This value was set to 30
More about this: http://msdn.microsoft.com/en-us/library/ee377084%28v=bts.10%29.aspx
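Why both settings matter together can be seen with a rough bound: every closed connection holds its local port for the TIME_WAIT period, so the sustainable rate of new outbound connections is about the number of ephemeral ports divided by that period. This is a simplified sketch, assuming the machine initiates and actively closes each connection and that ephemeral ports start at 1025; the function name is hypothetical.

```python
def max_connection_rate(ephemeral_ports: int, time_wait_s: float) -> float:
    """Rough upper bound on new outbound connections per second.

    Each closed connection keeps its local port in TIME_WAIT for `time_wait_s`,
    so at most `ephemeral_ports` connections can be opened per TIME_WAIT window.
    """
    return ephemeral_ports / time_wait_s


# Port counts are approximations: ports up to 1024 are assumed to be reserved.
before = max_connection_rate(5000 - 1024, 240)  # default limit, documented timeout
after = max_connection_rate(65534 - 1024, 30)   # values set in the registry
print(f"before: ~{before:.0f}/s, after: ~{after:.0f}/s")
```

Under these assumptions the defaults allow only a few tens of new connections per second, which a stress test exceeds easily; reusing connections avoids the limit altogether.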
35. Networking Issues: Real Life Example
We had two replicated MySQL instances behind an F5 load balancer in active/standby mode.
Query response time was 200 ms under no load.
[Diagram: the application sends a query through the load balancer to two DB instances linked by replication; the response takes 200 ms]
37. Data from Wireshark
[Sequence diagram between the LB and the application server: full TCP segment (1460 TCP data bytes), ACK after 200 ms, last TCP segment (<1460 TCP data bytes), FIN. Annotations: Nagle's algorithm enabled; delayed ACK]
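The slides do not say which fix was applied. One common mitigation for the interaction between Nagle's algorithm and delayed ACK is to disable Nagle's algorithm on the sending side with the TCP_NODELAY socket option; on a load balancer or database server this is normally a configuration setting rather than code. As an illustration only, this is how the option is set on a socket in Python:

```python
import socket

# Create a TCP socket and disable Nagle's algorithm on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect (non-zero means enabled).
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))
sock.close()
```

With the option set, a small final segment is sent immediately instead of waiting for the ACK of the previous segment, so the receiver's delayed ACK timer no longer adds to the response time.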
38. Performance vs. Security
Usually security wins
But there are still some possible tweaks
Example: SSL handshake protocol
•Increase initial congestion window in OS (sometimes eliminates 1 RTT on server certificate)
•Keep connection open between components
•Reduce latency with datacenter closer to clients
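The value of keeping connections open and moving closer to clients can be estimated with a simplified round-trip model. The sketch assumes 1 RTT for the TCP handshake, 2 RTTs for a full handshake (as in TLS 1.2), and 1 RTT for the request itself, and it ignores server processing time; the RTT values are hypothetical.

```python
def request_latency_ms(rtt_ms: float, reuse_connection: bool,
                       tls_handshake_rtts: int = 2) -> float:
    """Network latency of one request under a simplified round-trip model.

    A new connection pays 1 RTT for the TCP handshake plus the TLS handshake
    round trips before the request itself, which costs 1 RTT.
    """
    setup_rtts = 0 if reuse_connection else 1 + tls_handshake_rtts
    return (setup_rtts + 1) * rtt_ms


for rtt in (5, 80):  # hypothetical RTTs: nearby datacenter vs distant client
    new_conn = request_latency_ms(rtt, reuse_connection=False)
    reused = request_latency_ms(rtt, reuse_connection=True)
    print(f"RTT {rtt} ms: new connection {new_conn} ms, reused {reused} ms")
```

In this model the handshake cost scales with RTT, so connection reuse and a closer datacenter reduce latency without weakening security.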
40. SCAP E2E Cloud Performance Analysis
[Diagram: total operation time (end user experience) broken down into UI time (browser), round-trip time (RTT) with client network latency, and cloud service processing across Service 1, Service 2, and DB (T1, T2, T3)]
41. Scalability tests
• Double the capacity and prove linear scalability.
– Measure with a set of servers, then double the servers and measure again.
– Scalability profile
– How do bottlenecks affect the scalability of the system?
[Diagram: scalability profile over time as load and infrastructure cost grow; applications, API manager, load balancers, cache server, and data layer are added step by step. Labels: Replicable, Eventual Consistency]