We face the challenge of monitoring and managing performance in clouds every other day. Not only is application performance management different in a cloud, but not all clouds are equal either. This lessons-learned session will show how to do APM in several different clouds (Azure, EC2, VMware private clouds) and how it differs from more traditional environments. The session will also cover performance monitoring, troubleshooting and tuning in environments where resources are virtually infinite, but application performance is not.
2. Application Performance Management
[Diagram: DB servers, app servers, and web servers; personas shown: Maria, Carl, Tenant]
Manage End User Satisfaction
Manage Application Performance
Ensure SLA compliance
Ensure optimal Performance and Resource Utilization
3. Application Performance is a Business Issue
Improving performance lowers cost and increases revenue
Cost: REDUCED…
• Reduced problem resolution time
• Reduced hardware and other operational cost
• Fewer production issues
Revenue: IMPROVED…
• Improved conversion rate
• Improved capacity
• Improved productivity
Sources varied, including Compuware ROI studies and actual observed user behavior over 180M+ page views
4. Lesson #1
Private and Public Clouds are not alike
5. Private Cloud
Applications in a private Cloud
Community portal Web Server Application Server Backend Database
wiki Web Server Application Server Backend Database
7. Identifying hidden Application Impact
[Charts: response time and throughput over time; annotations: "Another Application?", "Infrastructure?"]
8. Private Cloud
Overcoming Organisational Barriers
[Diagram: Application Team One and Application Team Two with no line of communication; Operations Team One, Operations Team Two, and the Virtualization Team]
10. Private Cloud
Application Balance and Resource Optimization
How healthy is my application? Do we have any problems?
Is this due to application failures?
Or do we have an infrastructure issue?
What is the resource utilization of my virtualized hosts and guests?
CPU overcommit?
Memory overcommit?
Which other application is impacted or impacts ours?
11. Public Cloud
Hidden Impact in the Public Cloud
[Charts: response time and throughput over time]
No resource utilization
No visibility into the underlying layer
Goal!
12. Public Cloud
Your Application is your only concern
Infrastructure issue?
Did we reach capacity?
Steal time?
Application is King!
Infrastructure is a black box and commodity
Scale up or recycle!
13. Lesson #2
Cloud Monitoring must be
Application (Performance) Monitoring
14. Application Performance starts with the End User
[Diagram: the application delivery chain from the public cloud (load balancer, application, database, CDN) to the users (customers and employees)]
▪ Cloud: load balancers, web, application logic, database, CDNs, third-party services
▪ Internet backbone: network, ISPs, mobile carriers
▪ Browser & device: browsers, devices, AJAX, JavaScript, mobile apps
[Labels: Infrastructure, Cloud Monitoring, Performance]
15. Public Cloud
Managing End User Experience (EUE)
What did the user do?
Where do my users come from?
Which devices do they use?
Which users suffer from bad user experience?
Is the problem in the browser? In the AJAX call? The web server? In the application? Or is it a 3rd party service?
Any bandwidth issues? Mobile carrier?
If the problem is my 3rd party content: who was it? Does Facebook, LinkedIn or Google Ads have a negative impact?
16. Real End-to-End Application Performance
[Diagram: end-to-end view spanning the end user, our application, third-party services, and external services]
Identify Fault Domain
Impact of Cloud Resources
17. Lesson #3
Time and resources are relative in the Clouds
19. …when your clock is skewed?
• Use virtualization-aware timing
• and/or exclude heavily suspended transactions from analysis
[Diagram labels: tier response times, VM suspension, inter-tier latency]
20. Public Cloud
Utilization of Cloud Resources?
Real instance CPU time
Over 60% steal time?
Know how to interpret resource metrics!
EC2 shows real CPU, but you cannot use more than you bought
VMware shows allocated CPU, but you might not get all of it
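The steal-time question on this slide can be made concrete with a small sketch. On a Linux guest, the aggregate `cpu` line of `/proc/stat` carries cumulative counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The Python below is only an illustration under that assumption; the two sample lines are made up and the helper names are hypothetical, not part of any APM product.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of Linux /proc/stat into named counters."""
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Guest time is already included in user/nice, so the trailing fields are skipped.
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))


def steal_percent(before, after):
    """Share of elapsed CPU time between two samples that was stolen from this guest."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0


# Two made-up samples, for illustration only.
sample_1 = parse_cpu_line("cpu 1000 0 500 8000 100 0 0 400 0 0")
sample_2 = parse_cpu_line("cpu 1300 0 600 8100 100 0 0 1000 0 0")
print(f"steal: {steal_percent(sample_1, sample_2):.1f}%")
```

A high steal share on its own does not say whether the instance is starved or simply capped at the capacity that was bought, which is why the slide asks for careful interpretation.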
21. Impact on Business Transactions!
Use latency and transaction impact instead of utilization
Latency can be easily correlated
Latency on specific transaction
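As a rough sketch of what "latency on a specific transaction" could look like in practice, the Python below groups latency samples by business transaction and flags those whose median exceeds a baseline. The transaction names, the baseline values, and the 1.5x factor are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median


def degraded_transactions(samples, baseline_ms, factor=1.5):
    """Return {transaction: median latency} for transactions slower than factor x baseline."""
    by_name = defaultdict(list)
    for name, latency_ms in samples:
        by_name[name].append(latency_ms)
    flagged = {}
    for name, latencies in by_name.items():
        current = median(latencies)
        if name in baseline_ms and current > factor * baseline_ms[name]:
            flagged[name] = current
    return flagged


# Hypothetical measurements: (business transaction, latency in ms).
samples = [("search", 100), ("search", 420), ("search", 380),
           ("purchase", 210), ("purchase", 190)]
baseline = {"search": 120, "purchase": 200}
print(degraded_transactions(samples, baseline))
```

Unlike a utilization figure, the result names the affected transaction directly, so it can be correlated with whatever else changed at the same time.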
22. Lesson #4
Rapid elasticity is painful, but a requirement for success!
23. Why do we want elasticity?
[Diagram: enterprise data center with statically provisioned database, CPU, and storage resources]
Static provisioning is easy and safe, and performance can be guaranteed.
Elastic scaling has no advantage for the application!
24. Why do we want elasticity?
[Diagram: enterprise data center (database, CPU, storage); unplanned and unhandled load against the available capacity]
25. Why do we want elasticity?
[Diagram: enterprise data center (database, CPU, storage) under unplanned and unhandled load]
Slow application
Unsatisfied users
Loss of revenue
26. The Cloud Reason
Enterprise data center: unplanned and unhandled load, slow application, unsatisfied users, loss of revenue
Cloud: on-demand provisioning, no capacity barrier
[Diagram: database, CPU, and storage resources in both environments]
27. The Cloud Reason
Enterprise data center: purchased capacity, unplanned and unhandled load, slow application, unsatisfied users, loss of revenue
Cloud: easy scaling, no capacity barrier
Public Cloud is about on-demand provisioning
Scaling up is easy!
Scaling down is hard!
So why do we want to scale down again?
28. Why we scale down again
Enterprise data center: overprovisioned; saving resources, but already paid
Cloud: additional runtime and costs! $$
[Diagram: database, CPU, and storage resources in both environments]
29. Why we scale down again
[Diagram: enterprise data center and cloud, each with database, CPU, and storage resources]
Elastic scaling is a business requirement
Not a technical one!
31. Capacity Planning and Resource Optimization
[Chart: load and resources over time; series: available resources, load, resource usage]
Capacity planning: estimate future load and buy infrastructure
Next investment: estimate and buy…
Performance optimization: use existing infrastructure as long as possible… and postpone new investments
32. On Demand Provisioning and Resource Optimization
[Chart: load and resources over time; series: available resources, load, resource usage]
Performance optimization: lower the cost structure
33. On Demand Provisioning
[Chart: load and resources over time; series: available resources, load, resource usage]
Performance optimization: lower the cost structure
Optimization in the Cloud is about cost savings!
34. Cloud Resources
Where does the cost come from?
Resource usage: how our application consumes these resources
35. Cloud Resources
[Diagram: user behavior, developer decisions, and implementation feed into resource usage]
Resource usage: how our application consumes these resources
36. Manage by planning?
[Diagram: user behavior, developer decisions, and implementation]
Enterprise data center: purchased capacity
Cloud: planned capacity, cost estimation
37. Manage by monitoring!
[Diagram: user behavior, developer decisions, and implementation; CPU, database, and storage resources]
Enterprise data center: purchased capacity
Cloud: planned capacity, cost estimation
APM: business transactions
UEM: application-centric
38. Managing Cost
Cost functions of our resources
[Charts: cost ($) against amount of resource, one per cloud resource: Compute, Database, Storage, Billing, …]
Resource usage: how our application consumes these resources
How these resources are used by our application:
Search .2% 18
Purchase .4% 3 1
Identify costly transactions
Identify costly features
39. Managing Cost
Cost functions of our resources
[Charts: cost ($) against amount of resource, one per cloud resource: Compute, Database, Storage, Billing, …]
How these resources are used by our application:
Search .2% 18
Purchase .4% 3 1
Identify costly transactions
Identify costly features
Identify costly user behavior
Identify costly tenants
40. Managing Cost
[Charts: cost ($) against amount of resource for Compute, Database, Storage]
End-user visibility
Understand how our application drives cost
Search .2% 18
Purchase .4% 3
41. Managing Cost
[Charts: cost ($) against amount of resource for Compute, Database, Storage]
APM lets you optimize the cost structure of your business transactions
Search .2% 18
Purchase .4% 3
42. THANK YOU
Michael Kopp, Technology Strategist
michael.kopp@compuware.com
@mikopp
blog.dynatrace.com
Editor's notes
Last updated or created: April '11
Key themes: You improve application performance to improve your business. It is a business issue more than a technical one.
Talk track: Why worry about application performance? Because it improves your business. Numerous studies show that improving application performance can reduce cost and increase revenue.
Reduce cost:
- One study demonstrated that improving application performance lowered the effort, and cost, needed to resolve problems by 83%. That not only saves money and effort, but also delivers results more quickly.
- Another study determined that improving application performance reduced calls to the call center by 61%. If those calls are customer calls, that will also directly increase revenue.
Improve revenue:
- There is a direct and clear correlation between website performance and customer conversion rates. Time and time again customers prove with their actions that the faster the site is, the more likely they are to stay on it and move through a conversion process. We've seen, on average, that conversion rates can increase by over 70% if page load times decrease from 8 seconds to 2 seconds.
- There is also a direct correlation between abandonment rates and website performance. Using another set of observed data, we've seen a 39% decrease in abandonment rates when page load times drop from 8 seconds to 2 seconds.
Bottom line: improving app performance improves your business.
This means that two or more applications can impact each other. This impact is hidden from your application; all it sees is that it slows down or that it doesn't get 100% CPU. What is more, things can now affect each other that couldn't before: network and I/O.
In a sense, virtualization is the opposite of DevOps, which was my last topic for this group. Ops and app people: a problem, but a direct relation. Now app people don't see when they have an ops problem, and ops says everything is fine. This is not only a problem in production. Think about testing in a virtualized environment: you either have to make sure you get a "dedicated" environment or you have to filter out the noise. This abstraction makes problem solving even harder than today. It makes it even more obvious that the total separation of app and ops is no longer feasible.
Correlation even harder
Now, especially in a public cloud, we have less visibility and control than in a private cloud. At the same time, more of what we use is third party (internet backbone, CDN, load balancers, databases). To know what is going on we need that visibility back, and we need to start where it matters, in our case at the user.
End user
Application (impact of IT)
Services like DB and web service
Azure? EC2?
Last updated or created: April '11
Key themes: major change #3: the Cloud has arrived
Talk track: If it wasn't complicated enough to have the data center and the web be more complex, now we also have the cloud as part of the equation. More and more companies are moving some or all of their applications to a private or public cloud. And that certainly changes the way you do APM: the cloud is opaque, so you can't monitor its inner workings, and the cloud is shared, so you need to be careful that someone else's app is not making yours slow. THIS is today's app delivery chain. Far more complex than just a few years ago.
From virtualization we already knew that timing is sometimes a problem. However, to do proper fault domain resolution we need accurate timing at least at the tier and service level.
The timing problem. There is more: the timing issue means that guest measurements are skewed. This is a problem for APM, as we need to know how utilized things are. In a private cloud we can use the VM and vHost metrics to make up for that. We can correlate them on a time basis, so we ignore the guest metrics for the most part. But for performance analysis we need more detailed CPU breakdowns for our application. Luckily, vendors like VMware ensure to a large degree that the CPU time accounted to threads works out. In addition, we correlate the steal time so that we know which transactions we must simply ignore in the analysis because they are skewed beyond repair.
In a public cloud things are a little more difficult: we get less insight into the metrics. But there are other caveats. Let's take EC2: we found that CPU…
Azure?
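The idea of ignoring transactions that are skewed beyond repair can be sketched as a simple filter. The Python below assumes each transaction record already carries the time the VM was suspended or had CPU stolen while it ran; that field, the `Transaction` type, and the 20% threshold are hypothetical choices for illustration, not the behaviour of a specific tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Transaction:
    name: str
    response_ms: float
    suspended_ms: float  # hypothetical: suspension/steal time attributed to this transaction


def usable(transactions, max_suspended_share=0.2):
    """Keep only transactions whose suspended share of the response time is acceptable."""
    return [
        t for t in transactions
        if t.response_ms > 0 and t.suspended_ms / t.response_ms <= max_suspended_share
    ]


# Made-up measurements: the second one was mostly spent waiting on the hypervisor.
measured = [
    Transaction("search", 120.0, 5.0),
    Transaction("search", 900.0, 700.0),
    Transaction("search", 140.0, 10.0),
]
clean = usable(measured)
print(mean(t.response_ms for t in clean))
```

Without the filter the 900 ms outlier would dominate the average even though the application code did nothing different during that request.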
A common misconception is that scalability takes care of performance. That is not true. Performance is about the speed of a single transaction or the throughput at a given size. Scalability is about getting the same speed with more transactions and more nodes, that is, doubling throughput when doubling the size. This actually means that an application needs to perform in order to scale!
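The "doubling throughput when doubling the size" definition can be written as a ratio. The sketch below is one possible way to express it; the function name and the sample numbers are assumptions for illustration.

```python
def scaling_efficiency(nodes_before, tps_before, nodes_after, tps_after):
    """Throughput growth divided by node growth; 1.0 means linear scaling."""
    return (tps_after / tps_before) / (nodes_after / nodes_before)


# Doubling nodes and doubling throughput is linear scaling.
print(scaling_efficiency(2, 400.0, 4, 800.0))  # 1.0
# Doubling nodes but gaining only 50% throughput is not.
print(scaling_efficiency(2, 400.0, 4, 600.0))  # 0.75
```

An efficiency well below 1.0 means each added node buys less throughput than the previous one, so in a cloud the cost per transaction rises as the system grows.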
Add Load Balancer and RDS Dashboard
Correlate as much of the Cloud and guest metrics as we can.
And finally, that brought us to the most important lesson learned: we don't really care about resource usage in a public cloud at all. We care about application SLAs and about cost effectiveness, and in a public cloud cost effectiveness is not the same as resource effectiveness. … So again we need to monitor the right things. We need to know the cost structure of a transaction and what kind of revenue it brings in order to set priorities. For example, optimizing the search function so that it (A) delivers better results and is not executed 5 times by every user and (B) uses fewer database calls saves us money, even if it uses a little more CPU and is no faster for the end user. Remember, the end user's performance depends on more than the server anyway.
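The cost structure of a transaction can be approximated by multiplying what one execution consumes by the unit price of each resource and by how often a user triggers it. The Python below is a minimal sketch under that assumption; the prices, the usage figures, and the executions per user are invented for illustration and are not taken from the slides or from any provider's price list.

```python
# Hypothetical unit prices in dollars; real cloud pricing is more involved.
PRICE_PER_UNIT = {"cpu_seconds": 0.00004, "db_calls": 0.00001}

# Hypothetical per-execution usage and executions per user visit.
TRANSACTIONS = {
    "search": {"usage": {"cpu_seconds": 0.2, "db_calls": 18}, "per_user": 5},
    "purchase": {"usage": {"cpu_seconds": 0.4, "db_calls": 3}, "per_user": 1},
}


def cost_per_execution(usage):
    """Cost of one execution: sum of amount x unit price over all resources."""
    return sum(PRICE_PER_UNIT[resource] * amount for resource, amount in usage.items())


def cost_per_user(transactions):
    """Rank transactions by the cost one user causes, most expensive first."""
    costs = {
        name: cost_per_execution(spec["usage"]) * spec["per_user"]
        for name, spec in transactions.items()
    }
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


for name, cost in cost_per_user(TRANSACTIONS):
    print(f"{name}: ${cost:.6f} per user")
```

With these made-up numbers, search dominates because of its database calls and its repetition per user, which mirrors the argument in the note: fewer executions and fewer database calls save money even if CPU use goes up slightly.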
… that's why we still do some planning. In the cloud, we plan in order to get a cost estimation, not for the purpose of purchasing infrastructure. Since the cloud is agile, we don't need to over-provision our resources, because we can easily acquire them. In this case, we are lucky and save resources and thus cost.