StarWest 2013 Performance is not an afterthought – make it a part of your Agile Delivery
This presentation was given at StarWest 2013 in Anaheim, CA, and was also broadcast through the Virtual Conference.
It shows how important it is to focus on performance throughout continuous delivery in order to avoid the most common performance problem patterns, which still cause applications to crash and engineers to spend their nights and weekends in firefighting or war-room situations.

  • Who knows what that is? It’s the FIFA World Cup Trophy.
  • Teams are currently competing in the qualifiers for Brazil 2014.
  • This is “my” Austrian national soccer team. Their GOAL is to qualify for Brazil 2014. After many failed attempts in the past, we hired a new coach whose goal is to form a new team that PERFORMs well enough to qualify.
  • To get there, the team played many test games, which gave them a lot of confidence because they played against teams that were “easier” to beat. After these tests we even started the qualification with some wins against teams we were expected to beat. So at the end of these “test and easy qualification games” we thought: “ALL GOOD. THE ROAD IS OPEN FOR 2014. NOT ONLY WILL WE QUALIFY, BUT WE ALSO BELIEVE WE HAVE SUCH A STRONG TEAM THAT WE WILL ALSO DO WELL AT THE WORLD CUP.”
  • Then reality kicked in when we met our first “real competitor”: the first qualification game against a team whose quality is at the level we have to expect at the World Cup. The opposing team was Germany, and from these images you can see how the game went.
  • The coach is responsible for watching the game and seeing how things are going. As in other sports, soccer has a couple of Key Performance Indicators, such as ball possession, fouls and the actual score. The first 5 minutes actually didn’t look too bad.
  • After the first 5 minutes the game changes, with Germany taking over in their typical way. The KPIs make this very clear. The coach is responsible for reacting based on these values and on how the game goes.
  • The coach should use more data for a detailed analysis of what is going wrong in the game.
  • One of his options is to substitute players, or even to change tactics. Does this succeed, based on the KPIs that we have seen before?
  • Well, not always. Just replacing players with ones who are faster at chasing the ball doesn’t always help.
  • Story:
    A new build is deployed on Thursday evening.
    Everything runs smoothly during the day on Friday.
    An ad campaign hits the air on Friday night.
    The site crashes under load -> ALERTS GO OFF
    Restarting the server -> SERVER DOESN’T START
    Adding more servers -> PROBLEM REMAINS
    Calling in the “app experts” and pizza delivery!
  • What they get:
    Ops’ problem description: “App Server crashed”, “Out of file handles”
    Users’ problem description: “It is slow”, “It crashed”
  • What they get:
    High CPU, memory or bandwidth issues
    Log files: GBs of log files with 99.9% “useless” information
  • There is lots of data, but does high CPU utilization really mean that this machine has a problem and needs to be restarted? What could be the problem if your user experience tool tells you that people have bad response times? And what do we do with all this disconnected data?
  • What they need:
    Application data: executed transactions, load, CPU, memory, disk usage, ...
    Impacted transactions with context information: user actions, call stack, thread overview, method parameters, SQL calls, invoked service calls
    Involved application components: web server, app servers, database
    Impact of service calls: performance, availability, response codes
    Error details: HTTP errors, exceptions, warning/severe log messages
  • Well, I guess there is not much more to say about this. The attitude between these teams doesn’t help in solving issues any faster.
  • We all know this statistic in one form or another, so it is clear that the problems handled in war rooms are VERY EXPENSIVE. BUT these problems are typically not detected earlier because the focus of engineering is on building new features instead of on performance and scalable architecture. What’s interesting, though, is that many of these problems could easily be found earlier on. LET’S have a look at the common problems that we constantly run into …
  • Depending on the audience, you may want to show or hide some of the following slides.
  • Resource pool exhaustion. Typical causes: a misconfiguration or failed deployment (e.g. the default config from dev), or an actual resource leak, which can be identified with unit/integration tests.
  • Resource pool exhaustion (same as before, just a different pool). Using the same deployment tools in Test and Ops can prevent this. Testing with real load can detect it.
  • Deployment issues leading to heavy logging, resulting in high I/O and CPU. Using the same deployment tools in Test and Ops can prevent this. Analyzing log output per component in Dev prevents this problem.
  • Too many and too slow database queries. Dev and Test need to have a “production-like” database; updates on a “sample database” won’t show slow updates. Access patterns such as N+1 can be identified with unit tests.
  • Too much data requested from the database. Dev and Test need to have a “production-like” database; otherwise these problem patterns can only be found in prod. Educate developers on “the power of SQL” instead of loading everything into memory and performing filters/aggregations/… in the app.
  • Memory leaks: too much data in the cache. Can be found in test with “production-like” data sets and tests that do not only run the same “search” query -> get feedback from prod. Educate developers on memory and cache strategies.
  • Synchronization issues caused by deadlocks. Can be found by developers with small-scale performance unit tests. Educate developers on synchronization/multi-threading strategies.
  • Not following WPO (Web Performance Optimization) rules. Non-optimized content, e.g. compression, merging, … Educate developers and automate WPO checks.
  • Not leveraging browser-side caching. Misconfigured CDNs or missing cache settings -> automate cache configuration deployment. Educate developers; educate testers to do “real life” testing (CDN, …).
  • Slow or failing 3rd-party content. Impacts page load time; Ops is required to monitor 3rd-party services. Educate devs to optimize loading; educate testers to include 3rd-party testing.
  • Why is this a problem?
    Biz pushes features. To deliver more features in a more agile way, development adopted agile methodologies to ship more releases with more features in a shorter timeframe.
    To save costs we outsource. Some companies also grew by acquisition, leaving us with dev teams that are distributed across the globe.
    To be faster we use 3rd-party code, as we do not want to reinvent the wheel. However, not every 3rd-party component or service is really fit for the requirements we have in our production environment. It may work well on the workstation for a single user but often fails in a larger environment.
    3rd-party services or content: the average US sports website loads content from 29 (!) domains.
    3rd-party components in application code: Hibernate, Spring, .NET Enterprise Blocks, …; GWT, ExtJS, jQuery; Amazon Web Services, Google API, …
  • Feature richness vs. NO CHANGE
  • It is not well communicated what change is ahead. No “integration” of Ops teams in the agile process.
  • CAMS is taken from the Opscode (creators of Chef) blog: http://www.opscode.com/blog/2010/07/16/what-devops-means-to-me/
    Culture: People and process first. If you don’t have culture, all automation attempts will be fruitless.
    Automation: This is one of the places you start once you understand your culture. At this point, the tools can start to stitch together an automation fabric for Devops. Tools for release management, provisioning, configuration management, systems integration, monitoring and control, and orchestration become important pieces in building a Devops fabric.
    Measurement: If you can’t measure, you can’t improve. A successful Devops implementation will measure everything it can as often as it can… performance metrics, process metrics, and even people metrics.
    Sharing: Sharing is the loopback in the CAMS cycle. Creating a culture where people share ideas and problems is critical. Jody Mulkey, the CIO at Shopzilla, told me that when they get in the war room, the developers and operations teams describe the problem as the enemy, not each other. Another interesting motivation in the Devops movement is the way sharing Devops success stories helps others. First, it attracts talent, and second, there is a belief that by exposing ideas you can create a great open feedback that in the end helps them improve.
    The change that is required is already well understood in the DevOps movement, which has been going on for years, BUT it is important to add performance as a key requirement to Culture, Automation, Measurement and Sharing.
    Culture: PERFORMANCE is a key requirement for everything that is done throughout the delivery chain. We have heard that a lot of the problems that lead to a war room scenario could be found earlier if there were a focus on performance and quality throughout the organization.
    Automation: Automation is key for DevOps and agile development. What needs to change is that performance and architectural problems are automatically detected in the development and delivery process. This can be achieved by focusing automated testing on exactly these problems, whether in CI or in the “traditional” test area.
    Measurement: We can only measure success if we have Key Performance Indicators for each team, e.g. test coverage %, number of tests executed, throughput, response time, number of deployments, … An additional focus must be on measures that allow us to track performance and architectural issues. This allows us to identify and prevent performance regressions as soon as they are introduced.
    Sharing:
  • When we look at the results of our testing framework from build to build, we can easily spot functional regressions. In our example we see that testPurchase fails in Build 18. We notify the developer, the problem gets fixed, and with Build 19 we are back to functional correctness.
    Looking behind the scenes: the problem is that functional testing only verifies the functionality visible to the caller of the tested function. Using dynaTrace we are able to analyze the internals of the tested code. We analyze metrics such as number of executed SQL statements, number of exceptions thrown, time spent on CPU, memory consumption, number of remoting calls, transferred bytes, …
    In Build 18 we can see a nice correlation between exceptions and the failed functional test. We can assume that one of these exceptions caused the problem. For a developer it would be very helpful to get the exception information, which helps to quickly identify the root cause of the problem and solve it faster.
    In Build 19 the testing framework indicates ALL GREEN. When we look behind the scenes we see a big jump in SQL statements as well as in CPU usage. What just happened? The developer fixed the functional problem but introduced an architectural regression. This needs to be looked into; otherwise this change will have a negative impact on the application once it is tested under load.
    In Build 20 all these problems are fixed. We are still meeting our functional goals and are back to an acceptable number of SQL statements, exceptions, CPU usage, …
  • Web architectural metrics: # of JS files, # of CSS files, # of redirects, size of images.
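
The build-over-build comparison described in the notes above can be sketched in a few lines of Python. This is only an illustration, not how dynaTrace or any CI plugin does it: the testPurchase figures are the ones shown on slide 55, while `BuildMetrics`, `find_regressions`, `check_builds` and the 50% tolerance `MAX_GROWTH` are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildMetrics:
    """Architectural metrics captured for one test case in one build."""
    status: str       # functional result reported by the test framework
    sql_count: int    # number of executed SQL statements
    exceptions: int   # number of exceptions thrown
    cpu_ms: int       # time spent on CPU, in milliseconds


# testPurchase figures for builds 17-20 as shown on slide 55.
TEST_PURCHASE = {
    17: BuildMetrics("OK", 12, 0, 120),
    18: BuildMetrics("FAILED", 12, 5, 60),
    19: BuildMetrics("OK", 75, 0, 230),
    20: BuildMetrics("OK", 12, 0, 120),
}

# Hypothetical tolerance: flag a metric that grows by more than 50%
# over the baseline build. A real pipeline would tune this per metric.
MAX_GROWTH = 1.5


def find_regressions(baseline, current):
    """Return readable findings for one build compared to the baseline."""
    findings = []
    if current.status != "OK":
        findings.append("functional: test " + current.status)
    for name in ("sql_count", "exceptions", "cpu_ms"):
        before, after = getattr(baseline, name), getattr(current, name)
        # A zero baseline cannot be scaled, so any increase is flagged.
        limit = before * MAX_GROWTH if before else 0
        if after > limit:
            findings.append(f"architectural: {name} {before} -> {after}")
    return findings


def check_builds(history, baseline_build):
    """Compare every build in history against the chosen baseline build."""
    baseline = history[baseline_build]
    return {
        build: find_regressions(baseline, metrics)
        for build, metrics in sorted(history.items())
        if build != baseline_build
    }


if __name__ == "__main__":
    results = check_builds(TEST_PURCHASE, baseline_build=17)
    for build, findings in results.items():
        print(f"Build {build}:", findings or "no regressions")
```

With Build 17 as the baseline, Build 18 is flagged for the failed test and the exceptions, Build 19 is flagged for SQL count and CPU time even though its tests are green, and Build 20 is clean. A check like this could fail the build in CI instead of only reporting.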

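The notes on database access patterns state that N+1 can be identified with unit tests and that aggregation belongs in SQL. A minimal sketch of such a test follows, assuming an in-memory SQLite database and counting SELECT statements through the standard `sqlite3` trace callback. The `customer`/`orders` schema and all function names are hypothetical and only serve the example.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory schema (hypothetical): customers and orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(1, 6)],
    )
    # Two orders per customer.
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i, 10.0 * i) for i in range(1, 6) for _ in range(2)],
    )
    conn.commit()
    return conn


def count_selects(conn, func):
    """Run func(conn); return (result, number of SELECT statements seen)."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = func(conn)
    finally:
        conn.set_trace_callback(None)  # stop tracing again
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return result, len(selects)


def totals_n_plus_one(conn):
    """Anti-pattern: one query for the customers, then one more per customer."""
    totals = {}
    for cid, name in conn.execute("SELECT id, name FROM customer").fetchall():
        row = conn.execute(
            "SELECT SUM(total) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        totals[name] = row[0]
    return totals


def totals_single_query(conn):
    """Same result with one JOIN; the aggregation is done in SQL."""
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.total)
        FROM customer AS c JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    db = make_db()
    for variant in (totals_n_plus_one, totals_single_query):
        _, selects = count_selects(db, variant)
        print(variant.__name__, "issued", selects, "SELECT statements")
```

A unit test can then assert an upper bound on the statement count for a given use case, so that a change which turns one query into one query per row fails the build long before it meets production data.
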
StarWest 2013 Performance is not an afterthought – make it a part of your Agile Delivery Presentation Transcript

  • 1. Make it part of your Agile Delivery
  • 2.
  • 3.
  • 4.
  • 5. Testing is Important – and gives Confidence
  • 6. But are we ready for “The Real” world?
  • 7. Measure Performance during the game. Minute 1 - 5. Ball Possession: 40 : 60. Fouls: 0 : 0. Score: 0 : 0
  • 8. Measure Performance during the game. Minute 6 - 35. Ball Possession: 80 : 20. Fouls: 2 : 12. Score: 0 : 0
  • 9. Deep Dive Analysis
  • 10. Options “To Fix” the situation
  • 11. Not always a happy ending. Minute 90. Ball Possession: 80 : 20. Fouls: 4 : 25. Score: 3 : 0
  • 12. FRUSTRATED FANS!!
  • 13. How does that relate to Software?
  • 14. From Deploy to … Timeline: Deploy, Promotion/Event, Problems, Ops Playbook, War Room
  • 15. The “War Room” – back then. “Houston, we have a problem”. NASA Mission Control Center, Apollo 13, 1970
  • 16. The “War Room” – NOW. Facebook – December 2012
  • 17. Problem: Unclear End User Problem Descriptions
  • 18. Status Quo: Ops Runbook – System Unresponsive
  • 19. Problem: Unclear Ops Problem Descriptions
  • 20. Status Quo: Ops Runbook – High Resource Usage
  • 21. Lack of data?
  • 22.
  • 23. Answers to the right questions
  • 24. What are the real questions? Individual Users? ALL users? Is it the APP? Or Delivery Chain? Code problem? Infrastructure? One transaction? ALL transactions? In AppServer? In Virtual Machine?
  • 25. Problem: What Devs would like to have
  • 26. Problem: What Devs would like to have. Top Contributor is related to String handling. 99% of that time comes from RegEx Pattern Matching. Page Rendering is the main component.
  • 27. It’s like getting this …
  • 28. … when you need to see this!
  • 29. Problem: Attitudes like this don’t help either. Image taken from https://www.scriptrock.com/blog/devops-whats-hype-about/ Shopzilla CIO (in 2010): “… when they get in the war room - the developers and ops teams describe the problem as the enemy, not each other”
  • 30. Problem: Very “expensive” to work on these issues. ~80% of problems caused by ~20% of patterns. YES, we know this: 80% Dev Time in Bug Fixing, $60B Defect Costs. BUT
  • 31. TOP PROBLEM PATTERNS: Taken From Production Environments
  • 32. Top Problem Patterns: Resource Pools
  • 33. Top Problem Patterns: Resource Pools
  • 34. Deployment Mistakes lead to internal Exceptions
  • 35. Deployment Mistakes lead to high logging overhead
  • 36. Production Deployment leads to Log SYNC Issues
  • 37. Long running SQL with Production Data
  • 38. N+1 Query Problem
  • 39. Memory Leaks in Cache Layer with Production Data. Still crashes. Problem fixed! Fixed Version Deployed
  • 40. BLOATED Web Sites. 17! JS Files – 1.7MB in Size. Useless Information! Even might be a security risk!
  • 41. Missing or incorrectly configured browser caches. 62! Resources not cached. 49! Resources with short expiration
  • 42. SLOW or Failing 3rd Party Content
  • 43. Want MORE of these and more details? http://apmblog.compuware.com
  • 44. Lots of Problems that could have been avoided. BUT WHY are they still making it to Production?
  • 45. Missing Focus on Performance
  • 46. Different Goals for Dev and Ops
  • 47. Disconnected Teams despite “Shared Responsibility”
  • 48. Solution: DevOps + Performance Focus
  • 49. BEST PRACTICES
  • 50. Culture: Become ONE Team
  • 51. Culture: Testability
  • 52. Automate & Measure … Performance
  • 53. Automate & Measure … Scalability
  • 54. Automate Deployment
  • 55. How? Performance Focus in Test Automation
    Test Framework Results              | Architectural Data
    Build #   Test Case     Status      | # SQL   # Excep   CPU
    Build 17  testPurchase  OK          | 12      0         120ms
              testSearch    OK          | 3       1         68ms
    Build 18  testPurchase  FAILED      | 12      5         60ms
              testSearch    OK          | 3       1         68ms
    Build 19  testPurchase  OK          | 75      0         230ms
              testSearch    OK          | 3       1         68ms
    Build 20  testPurchase  OK          | 12      0         120ms
              testSearch    OK          | 3       1         68ms
    Callouts: We identified a regression. Problem solved. Let’s look behind the scenes. Exceptions probably reason for failed tests. Problem fixed but now we have an architectural regression. Now we have the functional and architectural confidence.
  • 56. How? Performance Focus in Test Automation. Analyzing All Unit / Performance Tests. Analyze Perf Metrics. Identify Regressions
  • 57. How? Performance Focus in Test Automation. Cross Impact of KPIs
  • 58. How? Performance Focus in Test Automation. Embed your Architectural Results in Jenkins
  • 59. Share Tools
  • 60. Share Results
  • 61. Getting control over your weekend again … Enjoy a beer with friends? Instead of pizza and soda with your colleagues?
  • 62.
  • 63. YOU HAVE TIME FOR THE REAL …
  • 64. DevOps Automation In-Action: Automate Load Test Analysis and Regression Detection
  • 65. DevOps Automation In-Action: Automate Load Test Analysis and Regression Detection
  • 66. DevOps: Actionable Data to Ops. Input for Capacity and Deployment Planning:
    Number of Requests on The App Server we will need to handle
    Might need to tune GC Settings to reduce GC Overhead
    CPU is going to be tight with these machines – also impacted by GC Activity
    Input on Thread Pool Configuration
    Memory Usage for expected load still provides enough “headroom”
  • 67. IF WE DO ALL THAT: 80% Dev Time for Bug Fixing, $60B Costs by Defects
  • 68. Want MORE of these and more details? http://apmblog.compuware.com
  • 69. © 2011 Compuware Corporation. All Rights Reserved. Simply Smarter