This presentation was given at StarWest 2013 in Anaheim, CA, and was also broadcast through the Virtual Conference.
It shows how important it is to focus on performance throughout continuous delivery in order to avoid the most common performance problem patterns that still cause applications to crash and force engineers to spend their weekends and nights in a firefighting/war-room situation.
24.
What are the real questions?
Individual Users? ALL users?
Is it the APP? Or Delivery Chain?
Code problem? Infrastructure?
One transaction? ALL transactions?
In AppServer? In Virtual Machine?
26.
Problem: What Devs would like to have
Top contributor is related to String handling
99% of that time comes from RegEx pattern matching
Page Rendering is the main component
29.
Problem: Attitudes like this don’t help either
Image taken from https://www.scriptrock.com/blog/devops-whats-hype-about/
Shopzilla CIO (in 2010): "... when they get in the war room, the developers and ops teams describe the problem as the enemy, not each other"
30.
Problem: Very “expensive” to work on these issues
~80% of problems
caused by ~20% patterns
YES we know this
80% Dev Time in Bug Fixing
$60B Defect Costs
BUT
55.
How? Performance Focus in Test Automation

Test Framework Results + Architectural Data:

Build #   Test Case      Status    # SQL   # Excep   CPU
17        testPurchase   OK        12      0         120ms
17        testSearch     OK        3       1         68ms
18        testPurchase   FAILED    12      5         60ms
18        testSearch     OK        3       1         68ms
19        testPurchase   OK        75      0         230ms
19        testSearch     OK        3       1         68ms
20        testPurchase   OK        12      0         120ms
20        testSearch     OK        3       1         68ms

Build 18: We identified a regression. The exceptions are probably the reason for the failed test.
Build 19: Problem solved? Let's look behind the scenes: the functional problem is fixed, but now we have an architectural regression.
Build 20: Now we have both functional and architectural confidence.
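The build-over-build comparison above can be sketched as a simple regression check. This is a minimal sketch, not any particular tool's API; the metric names, the 20% tolerance, and the numbers (taken from the testPurchase example) are illustrative assumptions:

```python
# Hypothetical per-build architectural metrics for one test case
# (numbers taken from the testPurchase example above).
BASELINE = {"sql_count": 12, "exception_count": 0, "cpu_ms": 120}

def find_regressions(current, baseline, tolerance=0.2):
    """Flag every metric that grew more than `tolerance` (20%) over baseline."""
    regressions = {}
    for metric, base in baseline.items():
        value = current.get(metric, 0)
        if value > base * (1 + tolerance):
            regressions[metric] = (base, value)
    return regressions

# Build 19: the functional test passes, but SQL count and CPU time jumped
build_19 = {"sql_count": 75, "exception_count": 0, "cpu_ms": 230}
assert find_regressions(build_19, BASELINE) == {
    "sql_count": (12, 75),
    "cpu_ms": (120, 230),
}
```

A check like this turns "ALL GREEN" functional results into an architectural gate: the build fails even though every assertion in the functional tests passed.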
56.
How? Performance Focus in Test Automation
Analyzing All Unit / Performance Tests -> Analyze Perf Metrics -> Identify Regressions
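One way to analyze perf metrics inside a unit test is to count the SQL statements a code path executes. A minimal sketch using Python's sqlite3; the counting wrapper is a hand-rolled illustration, not a real APM agent:

```python
import sqlite3

class CountingConnection:
    """Wrap a sqlite3 connection and count execute() calls -- a toy
    stand-in for the per-test architectural metrics an APM tool collects."""
    def __init__(self, conn):
        self._conn = conn
        self.sql_count = 0

    def execute(self, sql, params=()):
        self.sql_count += 1
        return self._conn.execute(sql, params)

conn = CountingConnection(sqlite3.connect(":memory:"))
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'book')")
rows = conn.execute("SELECT * FROM items").fetchall()

# The test can now assert on the architectural metric, not just the result:
assert rows == [(1, "book")]
assert conn.sql_count == 3
```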
66.
DevOps: Actionable Data to Ops
• Input for Capacity and Deployment Planning
• Number of requests the App Server will need to handle
• Might need to tune GC settings to reduce GC overhead
• CPU is going to be tight with these machines – also impacted by GC activity
• Input on thread pool configuration
• Memory usage for expected load still provides enough "headroom"
67.
IF WE DO ALL THAT
80% Dev Time for Bug Fixing
$60B Costs by Defects
68.
Want MORE of these and more details?
http://apmblog.compuware.com
Who knows what that is? It's the FIFA World Cup Trophy.
Teams are currently competing in the qualifications to compete in Brazil 2014.
This is "my" Austrian national soccer team. Their GOAL is to qualify for Brazil 2014. After the many failed attempts in the past, we hired a new coach whose goal is to form a new team that performs well enough to qualify.
In order to get there, the team competed in many test games, which gave them a lot of confidence because they played against teams that were "easier" to beat. At the end of these tests we even started into the qualification with some wins against teams we expected to win against. So, at the end of these "test and easy qualification games" we thought: "ALL GOOD – THE ROAD IS OPEN FOR 2014 – NOT ONLY WILL WE QUALIFY, BUT WE ALSO BELIEVE WE HAVE SUCH A STRONG TEAM THAT IT WILL ALSO DO WELL AT THE WORLD CUP."
Then reality kicked in when we faced our first "real competitor": the first qualification game against a team whose quality is at the level we have to expect at the World Cup. The competing team was Germany, and based on these images you can see how the game went.
The coach is responsible for watching the game and seeing how things are going. Like other sports, soccer has a couple of Key Performance Indicators, such as ball possession, fouls and the actual score. The first 5 minutes actually didn't look too bad.
After the first 5 minutes the game changed, with Germany taking over the game in their typical way. The KPIs make this very clear. The coach is responsible for reacting based on these values and how the game goes.
The coach should use more data for detailed analysis of what is going wrong in the game.
One of his options is to substitute players, or even change tactics. Does this succeed, based on the KPIs we have seen before?
Well, not always. Just replacing players, putting in some who are faster at chasing the ball, doesn't always help.
Story:
• New build deployed on Thursday evening
• Everything runs smoothly on Friday daytime
• An ad campaign hits the air Friday night
• The site crashes under load -> ALERTS GO OFF
• Restarting the server -> SERVER DOESN'T START
• Adding more servers -> PROBLEM REMAINS
• Calling in the "App Experts" and pizza delivery!
They get Ops' problem description: "App Server crashed", "Out of file handles". Users' problem description: "It is slow", "It crashed".
They get high CPU, memory or bandwidth issues, and log files: GBs of log files with 99.9% "useless" information.
There is lots of data, but does high CPU utilization really mean that this machine has a problem and needs to be restarted? What could be the problem if your user experience tool tells you that people have bad response times? And what do we do with all this disconnected data?
They need:
• Application data: executed transactions, load, CPU, memory, disk usage, ...
• Impacted transactions with context information: user actions, call stack, thread overview, method parameters, SQL calls, invoked service calls
• Involved application components: web server, app servers, database
• Impact of service calls: performance, availability, response codes
• Error details: HTTP errors, exceptions, warning/severe log messages
Well – I guess there is just not much more to say about this. The attitude between these teams doesn't help solve issues any faster.
We all know this statistic in one form or another, so it is clear that the problems handled in war rooms are VERY EXPENSIVE.
BUT
What is interesting is that these problems are typically not detected earlier, because the focus of engineering is on building new features instead of on performance and scalable architecture. Many of these problems could easily be found earlier on. LET'S have a look at the common problems we constantly run into ...
Depending on the audience you want to show or hide some of the following slides
Resource Pool Exhaustion
• Misconfiguration or failed deployment, e.g. default config from dev
• Actual resource leak -> can be identified with unit/integration tests
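A resource leak of the kind described above can be surfaced by a plain unit test, provided the pool exposes how many resources are checked out. A toy sketch (hypothetical pool class, not a real connection-pool API):

```python
class TrackedPool:
    """Toy resource pool that tracks checkouts -- a sketch of how a
    unit/integration test can catch a resource leak before production."""
    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1

    def release(self):
        self.in_use -= 1

def leaky_handler(pool):
    pool.acquire()
    # BUG: missing pool.release() -- the leak the test should expose

pool = TrackedPool(size=10)
for _ in range(5):
    leaky_handler(pool)

# 5 resources are still checked out; a test asserting `pool.in_use == 0`
# after the work is done would fail here and expose the leak.
assert pool.in_use == 5
```

Under production load the same bug runs until the pool is exhausted; in the test it is a one-line assertion.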
Resource Pool Exhaustion (same as before – just a different pool)
• Using the same deployment tools in Test and Ops can prevent this
• Testing with real load can detect this
Deployment issues leading to heavy logging, resulting in high I/O and CPU
• Using the same deployment tools in Test and Ops can prevent this
• Analyzing log output per component in Dev prevents this problem
Too many and too slow database queries
• Dev and Test need a "production-like" database – updates on a "sample database" won't show slow updates
• Access patterns such as N+1 can be identified with unit tests
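The N+1 access pattern called out above can be demonstrated, and caught, by counting queries in a test. A minimal sketch with sqlite3; the counting wrapper and table layout are illustrative assumptions, not a real ORM hook:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items (order_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO items VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

query_count = 0
def run(sql, params=()):
    """Execute a query while counting it -- a toy stand-in for tracing."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

# N+1 anti-pattern: one query for the orders, then one more per order
orders = run("SELECT id FROM orders")
for (order_id,) in orders:
    run("SELECT name FROM items WHERE order_id = ?", (order_id,))
n_plus_one = query_count            # 1 + 3 = 4 queries

# Set-based alternative: a single JOIN does the same work in one round trip
query_count = 0
run("SELECT o.id, i.name FROM orders o JOIN items i ON i.order_id = o.id")
assert n_plus_one == 4 and query_count == 1
```

With 3 orders the difference looks harmless; with a production-like 10,000 orders the N+1 version issues 10,001 queries, which is exactly why sample databases hide the problem.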
Too much data requested from the database
• Dev and Test need a "production-like" database – otherwise these problem patterns can only be found in prod
• Educate developers on "the power of SQL" – instead of loading everything into memory and performing filters/aggregations/... in the app
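The "power of SQL" point can be made concrete by comparing in-app filtering with a WHERE/SUM pushed into the database. A sketch with sqlite3 and made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", i) for i in range(1000)] +
                 [("US", i) for i in range(10)])

# Anti-pattern: pull everything into the app, then filter/aggregate in memory
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
total_in_app = sum(amount for region, amount in rows if region == "US")
rows_transferred_in_app = len(rows)       # 1010 rows over the wire

# Let the database do the work: only one row comes back
(total_in_db,) = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'US'").fetchone()

assert total_in_app == total_in_db == 45
assert rows_transferred_in_app == 1010
```

Both versions compute the same result, but the first ships 1010 rows to the application to keep 10 of them; against a production-size table that difference is the problem pattern.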
Memory leaks: too much data in cache
• Can be found in test with "production-like" data sets and tests that do not only run the same "search" query -> get feedback from Prod
• Educate developers on memory and cache strategies
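The "too much data in cache" pattern usually comes from an unbounded cache. A minimal LRU sketch showing how a size cap keeps memory in check (toy class for illustration, not a production cache):

```python
from collections import OrderedDict

class BoundedCache:
    """Minimal LRU sketch: unbounded caches are a classic memory-leak
    pattern; capping the size keeps production-like data sets in check."""
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict least recently used

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)
            return self._data[key]
        return None

cache = BoundedCache(max_entries=3)
for query in ["a", "b", "c", "d"]:    # a test should use varied queries,
    cache.put(query, query.upper())   # not the same "search" over and over

assert cache.get("a") is None         # oldest entry was evicted
assert len(cache._data) == 3          # size stays bounded
```

A test that only ever repeats the same query would never grow the cache and never see the leak, which is why the slide stresses varied, production-like inputs.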
Synchronization issues caused by deadlocks
• Can be found with small-scale performance unit tests by developers
• Educate developers on synchronization/multi-threading strategies
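One classic deadlock remedy developers can verify in a small-scale unit test is a single global lock-acquisition order. A sketch in Python; the `transfer` helper and ordering-by-id rule are illustrative assumptions:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def transfer(first, second, work):
    """Acquire both locks in one global order (here: by object id) so the
    classic A->B vs. B->A deadlock cannot occur regardless of caller order."""
    first, second = sorted((first, second), key=id)
    with first:
        with second:
            work()

results = []
t1 = threading.Thread(target=transfer,
                      args=(lock_a, lock_b, lambda: results.append(1)))
t2 = threading.Thread(target=transfer,
                      args=(lock_b, lock_a, lambda: results.append(2)))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)

assert sorted(results) == [1, 2]   # both threads complete; no deadlock
```

Without the `sorted` line, the two threads request the locks in opposite orders and can hang forever; a unit test with a join timeout turns that hang into a fast, reproducible failure.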
Not following WPO (Web Performance Optimization) rules
• Non-optimized content, e.g. compression, merging, ...
• Educate developers and automate WPO checks
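An automated WPO check can be as simple as a function that inspects captured response headers. A sketch, assuming a hypothetical dict shape for the captured resource (a real check would parse HAR files or proxy logs):

```python
def wpo_violations(resource):
    """Return a list of basic WPO rule violations for one captured
    response. The `resource` dict shape is an illustrative assumption."""
    violations = []
    headers = {k.lower(): v for k, v in resource["headers"].items()}
    if headers.get("content-encoding") not in ("gzip", "br"):
        violations.append("not compressed")
    if "cache-control" not in headers:
        violations.append("no cache policy")
    return violations

bad = {
    "url": "https://example.com/app.js",
    "headers": {"Content-Type": "application/javascript"},
}
good = {
    "url": "https://example.com/app.min.js",
    "headers": {"Content-Encoding": "gzip",
                "Cache-Control": "max-age=3600"},
}
assert wpo_violations(bad) == ["not compressed", "no cache policy"]
assert wpo_violations(good) == []
```

Wired into CI, a check like this fails the build the moment a deployment ships uncompressed or uncacheable content, instead of waiting for slow page loads in production.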
Not leveraging browser-side caching
• Misconfigured CDNs or missing cache settings -> automate cache configuration deployment
• Educate developers; educate testers to do "real life" testing (CDN, ...)
Slow or failing 3rd-party content
• Impacts page load time; Ops is required to monitor 3rd-party services
• Educate devs to optimize loading; educate test to include 3rd-party testing
Why is this a problem?
Biz pushes features. In order to deliver more features in a more agile way, development adopted agile methodologies to deliver more releases with more features in a shorter timeframe.
To save costs we outsource. Some companies also grew organically by acquisition, leaving us with dev teams that are distributed across the globe.
To be faster we use 3rd-party code, as we do not want to re-invent the wheel. However, not every 3rd-party component or service is really fit for the requirements we have in our production environment. It may work well on the workstation for a single user, but often fails in a larger environment.
3rd-Party Services or Content: the average US sports website loads content from 29(!) domains.
3rd-Party Components in Application Code: Hibernate, Spring, .NET Enterprise Blocks, ...; GWT, ExtJS, jQuery; Amazon Web Services, Google API, ...
Feature-richness vs. NO CHANGE
It is not well communicated what change is ahead. There is no "integration" of Ops teams in the agile process.
CAMS is taken from the OpsCode (creators of Chef) blog: http://www.opscode.com/blog/2010/07/16/what-devops-means-to-me/
Culture: People and process first. If you don't have culture, all automation attempts will be fruitless.
Automation: This is one of the places you start once you understand your culture. At this point, the tools can start to stitch together an automation fabric for DevOps. Tools for release management, provisioning, configuration management, systems integration, monitoring and control, and orchestration become important pieces in building a DevOps fabric.
Measurement: If you can't measure, you can't improve. A successful DevOps implementation will measure everything it can as often as it can: performance metrics, process metrics, and even people metrics.
Sharing: Sharing is the loopback in the CAMS cycle. Creating a culture where people share ideas and problems is critical. Jody Mulkey, the CIO at Shopzilla, told me that when they get in the war room, the developers and operations teams describe the problem as the enemy, not each other. Another interesting motivation in the DevOps movement is the way sharing DevOps success stories helps others. First, it attracts talent, and second, there is a belief that by exposing ideas you can create great open feedback that in the end helps them improve.
The change that is required is already well understood in the DevOps movement that has been going on for years. BUT it is important to add Performance as a key requirement to Culture, Automation, Measurement and Sharing.
Culture: PERFORMANCE is a key requirement for everything that is done throughout the delivery chain. We have heard that a lot of the problems that lead to a war-room scenario could be found earlier if there were a focus on performance and quality throughout the organization.
Automation: Automation is key for DevOps and agile development. What needs to change is that performance and architectural problems are automatically detected in the development and delivery process. This can be achieved by focusing automated testing on exactly these problems, whether in CI or in the "traditional" test area.
Measurement: We can only measure success if we have Key Performance Indicators for each team, e.g. test coverage %, number of tests executed, throughput, response time, number of deployments, ... An additional focus must be on measures that allow us to track performance and architectural issues. This allows us to identify and prevent performance regressions as soon as they are introduced.
Sharing:
When we look at the results of our testing framework from build to build, we can easily spot functional regressions. In our example we see that testPurchase fails in Build 18. We notify the developer, the problem gets fixed, and with Build 19 we are back to functional correctness.
Looking behind the scenes: the problem is that functional testing only verifies the functionality to the caller of the tested function. Using dynaTrace we are able to analyze the internals of the tested code. We analyze metrics such as number of executed SQL statements, number of exceptions thrown, time spent on CPU, memory consumption, number of remoting calls, transferred bytes, ...
In Build 18 we see a nice correlation of exceptions with the failed functional test. We can assume that one of these exceptions caused the problem. For a developer it would be very helpful to get the exception information, which helps to quickly identify the root cause of the problem and solve it faster.
In Build 19 the testing framework indicates ALL GREEN. When we look behind the scenes, we see a big jump in SQL statements as well as CPU usage. What just happened? The developer fixed the functional problem but introduced an architectural regression. This needs to be looked into; otherwise this change will have a negative impact on the application once tested under load.
In Build 20 all these problems are fixed. We still meet our functional goals and are back to an acceptable number of SQL statements, exceptions, CPU usage, ...
Web Architectural Metrics
• # of JS files, # of CSS files, # of redirects
• Size of images
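Metrics like "# of JS files" and "# of CSS files" can be collected from a page with a few lines of parsing. A sketch using Python's html.parser; the page snippet is made up for illustration:

```python
from html.parser import HTMLParser

class ResourceCounter(HTMLParser):
    """Count scripts, stylesheets, and images in a page -- the kind of
    web architectural metrics worth tracking build over build."""
    def __init__(self):
        super().__init__()
        self.metrics = {"js": 0, "css": 0, "img": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and "src" in attrs:
            self.metrics["js"] += 1
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.metrics["css"] += 1
        elif tag == "img":
            self.metrics["img"] += 1

page = """<html><head>
<script src="a.js"></script><script src="b.js"></script>
<link rel="stylesheet" href="main.css">
</head><body><img src="logo.png"></body></html>"""

counter = ResourceCounter()
counter.feed(page)
assert counter.metrics == {"js": 2, "css": 1, "img": 1}
```

Tracked per build like the SQL and exception counts earlier, these numbers flag front-end regressions (one merged JS bundle quietly becoming twelve files) before users feel them.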