JavaOne - Performance Focused DevOps to Improve Continuous Delivery

These are the slides of my JavaOne presentation. The abstract goes like this:
How do companies developing business-critical Java enterprise Web applications increase releases from 40 to 300 per year and still remain confident about a spike of 1,800 percent in traffic during key events such as Super Bowl Sunday or Cyber Monday? It takes a fundamental change in culture. Although DevOps is often seen as a mechanism for taming the chaos, adopting an agile methodology across all teams is only the first step. This session explores best practices for continuous delivery with higher quality, for improving collaboration between teams by consolidating tools, and for reducing the overhead of fixing issues. It shows how to build a performance-focused culture with tools such as Hudson, Jenkins, Chef, Puppet, Selenium, and Compuware APM/dynaTrace.

Slide notes
  • Who knows what that is? It’s the FIFA World Cup Trophy.
  • Teams are currently playing qualification matches to compete in Brazil 2014.
  • This is “my” Austrian national soccer team. Their GOAL is to qualify for Brazil 2014. After the many failed attempts in the past we hired a new coach whose goal is to form a new team that PERFORMS well enough to qualify.
  • In order to get there the team competed in many test games, which gave them a lot of confidence because they played against teams that were “easier” to beat. At the end of these tests we even started the qualification with some wins against teams that we were expected to beat. So, at the end of these “test and easy qualification games” we thought: “ALL GOOD – THE ROAD IS OPEN FOR 2014 – NOT ONLY WILL WE QUALIFY, WE ALSO BELIEVE WE HAVE SUCH A STRONG TEAM THAT IT WILL DO WELL AT THE WORLD CUP.”
  • Then reality kicked in when we faced our first “real competitor” – the first qualification game against a team whose quality is at the level we have to expect at the World Cup. That team was Germany – and based on these images you can see how the game went.
  • The coach is responsible for watching the game and seeing how things are going. Like other sports, soccer has a couple of Key Performance Indicators such as ball possession, fouls and the actual score. The first 5 minutes actually didn’t look too bad.
  • After the first 5 minutes the game changed, with Germany taking over in their typical way. The KPIs make this very clear. The coach is responsible for reacting based on these values and how the game goes.
  • The coach should use more data for a detailed analysis of what is going wrong in the game.
  • One of his options is to substitute players – or even change tactics. Does this succeed, based on the KPIs that we have seen before?
  • Well – not always. Just replacing players – putting some in that are faster at chasing the ball – doesn’t always help.
  • Story: A new build is deployed on Thursday evening. Everything runs smoothly during Friday daytime. An ad campaign hits the air Friday night. The site crashes under load -> ALERTS GO OFF. Restarting the server -> SERVER DOESN’T START. Adding more servers -> PROBLEM REMAINS. Calling in the “App Experts” and pizza delivery!
  • They get – Ops’ problem description: “App Server crashed”, “Out of file handles”. Users’ problem description: “It is slow”, “It crashed”.
  • They get – high CPU, memory or bandwidth issues. Log files: GBs of log files with 99.9% “useless” information.
  • There is lots of data – but does a high CPU utilization really mean that this machine has a problem and needs to be restarted? What could the problem be if your user experience tool tells you that people see bad response times? And what do we do with all this disconnected data?
  • They need – Application data: executed transactions, load, CPU, memory, disk usage, ...; Impacted transactions with context information: user actions, call stack, thread overview, method parameters, SQL calls, invoked service calls; Involved application components: web server, app servers, database; Impact of service calls: performance, availability, response codes; Error details: HTTP errors, exceptions, warning/severe log messages.
  • 30%: What we hear from talking to people is that a lot of the problems that happen in production happen at times that are not very “developer friendly” -> RUN THROUGH STORY. 60%: Restarting a crashed application server or adding an additional server to handle the load often doesn’t solve the problem either -> that’s when it’s time to call in the application experts (developers or testers). 100%: Devs (and probably anybody else as well) are not happy to get called at 2AM to look at a problem. They also know that it’s not going to be an easy fix because there is probably not enough data available, so it’s going to be a lot of trial and error with a team (Ops) that is reluctant about trial and error.
    More talking points:
    The challenge with outside-business-hours problems: restarts are not the silver bullet; application experts to fix problems are unlikely to be available at 2AM; this leads to “CritSit”, “War Room”, ... including Dev, Test, Ops.
    The challenge with production problem analysis: Ops often doesn’t know what information is required by Dev & Test; Ops typically doesn’t want to give Devs access to machines for triage; this leads to tension between Dev, Test and Ops.
    INTERESTING FACT: the 80/20 rule – 20% of problem patterns are responsible for 80% of problems. Most problems could have been found early on, so PREVENTION is POSSIBLE.
    Because RESTARTING applications IS NOT the solution: best case, you are just “hiding” a problem; worst case, the app doesn’t start anymore.
    Because the ROOT CAUSE is often NOT FOUND in log files: which log files to look at – app server, web server, OS event log, ...? Even Splunk can’t help if there is not sufficient information.
    Because CHANGING APP behavior CANNOT always be done through config files: you can’t turn off a memory leak via a switch; trial & error changes, e.g. increasing pool sizes, will just “shift” the problem.
    Because they (DEV, TEST, ARCHITECTS) are the APPLICATION EXPERTS: they know WHERE to look and WHAT to look for; they can fix the code and advise on other deployment options.
  • Well – I guess there is just not much more to say about this. The attitude between these teams doesn’t help in solving issues any faster.
  • We all know this statistic in one form or another – so it is clear that the problems handled in war rooms are VERY EXPENSIVE. BUT what is interesting is that these problems are typically not detected earlier because the focus of engineering is on building new features instead of performance and a scalable architecture. Many of these problems could easily be found earlier on – let’s have a look at the common problems that we constantly run into ...
  • Depending on the audience you want to show or hide some of the following slides
  • Resource pool exhaustion. Misconfiguration or failed deployment, e.g. a default config from dev. An actual resource leak -> can be identified with unit/integration tests.
  • Resource pool exhaustion (same as before – just a different pool). Using the same deployment tools in Test and Ops can prevent this. Testing with real load can detect it.
  • Deployment issues leading to heavy logging, resulting in high I/O and CPU. Using the same deployment tools in Test and Ops can prevent this. Analyzing log output per component in Dev prevents this problem.
  • Too many and too slow database queries. Dev and Test need a “production-like” database – updates on a “sample database” won’t show slow updates. Access patterns such as N+1 can be identified with unit tests.
  • Too much data requested from the database. Dev and Test need a “production-like” database – otherwise these problem patterns can only be found in prod. Educate developers on “the power of SQL” – instead of loading everything into memory and performing filters/aggregations/... in the app.
  • Memory leaks: too much data in the cache. Can be found in test with “production-like” data sets and tests that do not only run the same “search” query -> get feedback from prod. Educate developers on memory and cache strategies.
  • Synchronization issues caused by deadlocks. Can be found with small-scale performance unit tests by developers. Educate developers on synchronization/multi-threading strategies.
  • Not following WPO (Web Performance Optimization) rules. Non-optimized content, e.g. missing compression, merging, ... Educate developers and automate WPO checks.
  • Not leveraging browser-side caching. Misconfigured CDNs or missing cache settings -> automate cache configuration deployment. Educate developers; educate testers to do “real life” testing (CDN, ...).
  • Slow or failing 3rd-party content. Impacts page load time; Ops is required to monitor 3rd-party services. Educate devs to optimize loading; educate test to include 3rd-party testing.
  • Why is this a problem? Biz pushes features: in order to deliver more features in a more agile way, development adopted agile methodologies to deliver more releases with more features in a shorter timeframe. To save costs we outsource; some companies also grew organically by acquisition, leaving us with dev teams that are distributed across the globe. To be faster we use 3rd-party code, as we do not want to re-invent the wheel. However, not every 3rd-party component or service is really fit for the requirements of our production environment – it may work well on a workstation for a single user but often fails in a larger environment. 3rd-party services or content: the average US sports website loads content from 29! domains. 3rd-party components in application code: Hibernate, Spring, .NET Enterprise Blocks, ...; GWT, ExtJS, jQuery; Amazon Web Services, Google API, ...
  • Feature-richness vs. NO CHANGE
  • Upcoming changes are not well communicated. No “integration” of Ops teams in the agile process.
  • A big step is to tear down these walls between these teams.
  • CAMS is taken from the OpsCode (creators of Chef) blog:
    Culture – People and process first. If you don’t have culture, all automation attempts will be fruitless.
    Automation – This is one of the places you start once you understand your culture. At this point, tools can start to stitch together an automation fabric for DevOps: tools for release management, provisioning, configuration management, systems integration, monitoring and control, and orchestration become important pieces in building a DevOps fabric.
    Measurement – If you can’t measure, you can’t improve. A successful DevOps implementation will measure everything it can as often as it can: performance metrics, process metrics, and even people metrics.
    Sharing – Sharing is the loopback in the CAMS cycle. Creating a culture where people share ideas and problems is critical. Jody Mulkey, the CIO at Shopzilla, told me that when they get in the war room the developers and operations teams describe the problem as the enemy, not each other. Another interesting motivation in the DevOps movement is the way sharing DevOps success stories helps others: first, it attracts talent, and second, there is a belief that by exposing ideas you create great open feedback that in the end helps them improve.
    The change that is required is already well understood in the DevOps movement that’s been going on for years – BUT it is important to add Performance as a key requirement to Culture, Automation, Measurement and Sharing.
    Culture: PERFORMANCE is a key requirement for everything that is done throughout the delivery chain. A lot of the problems that lead to a war-room scenario could be found earlier if there were a focus on performance and quality throughout the organization.
    Automation: Automation is key for DevOps and agile development. What needs to change is that performance and architectural problems are automatically detected in the development and delivery process. This can be achieved by focusing automated testing on exactly these problems – whether in CI or in the “traditional” test area.
    Measurement: We can only measure success if we have Key Performance Indicators for each team, e.g. test coverage %, number of tests executed, throughput, response time, number of deployments, ... An additional focus must be on measures that allow us to track performance and architectural issues. This allows us to identify and prevent performance regressions as soon as they get introduced.
    Sharing:
  • Agile development (stories & tasks) excludes: performance and scalability requirements from Test; testability requirements from Test; deployment and stability requirements from Ops. Requirements are currently mainly brought in by the business side, who demand more features. What is missing are the requirements from Test and Ops.
  • Agile process excludes Test and Ops: not part of standups, reviews, plannings; no active sharing of data, requirements, feedback; no common toolset/platform/metrics that makes sharing easy.
    Collaboration: Test & Ops are not part of the agile process. There is no active involvement in the standups, reviews or planning meetings. The lack of common tools and a different understanding of quality, metrics and requirements also makes it hard to share data.
    Sharing tools: The different teams currently use their own set of tools that help them in their day-to-day work in their “local” environment. Developers focus on development tools that help them with writing code, debugging and analyzing basic problems. Testers use their load testing tools and combine them with some system monitoring tools to e.g. capture CPU, memory and network utilization. Ops uses its tools to analyze network traffic, host health, log analyzers, ...
    When these teams need to collaborate in order to identify the root cause of a problem they typically speak a different language. Developers are used to debuggers, thread and memory dumps. But what they get is things like “the system is slow with that many virtual users on the system where host CPU starts showing a problem”. When there is a production problem, both developers and testers are typically not satisfied with network statistics or operating system event logs that don’t tell them what really went on in the application. Test won’t be able to reproduce the problem with that information, nor will devs be able to debug through their code based on it.
    To make it easier for developers to troubleshoot the issue they would like to install their tools in test and ops – but these tools are typically not fit for high-load and production environments. Debuggers have too much overhead; they require restarts and changes to the system -> Ops doesn’t like change!!
  • Some examples of KPIs:
    Number of SQL statements executed -> tells Ops what to expect in production; tells architects whether to optimize this with a cache or a different DB access strategy.
    Number of log lines -> tells Ops how to size storage for logging; tells architects whether LOG SPAM is happening.
    Memory consumption per user session -> tells Ops how to scale the production environment; tells architects whether you are “wasteful” with heap space.
    Time for a single deployment. Time for rollbacks.
  • Automation: CI currently only executes tests that cover functionality, e.g. unit and maybe integration or some functional tests (Selenium, ...). What is missing is the concept of already executing small-scale performance and scalability tests that would allow us to automatically detect the problem patterns discussed earlier. With that we could already eliminate the need for MOST war-room situations.
  • So – to sum up – here are some action items (to-do list):
    1a: Share and develop tools that are used across team boundaries, e.g. add more diagnostic tools to test, or share deployment tools developed by Test with Ops.
    1b: A critical component is testing early and testing in real-life environments. Test teams need to empower developers by giving them access to their performance test frameworks, which devs can use to also test performance and architectural KPIs early on. Ops needs to work with Test and provide access to production or staging so that Test can perform large-scale load and performance tests in a realistic environment.
    2a: It is important to establish a shared reporting and task management system. Very often we see companies that share wiki instances and task tracking systems, with status report pages from all teams as well as tracking of issues found in dev, test and production.
    2b: It is also important to share the same toolset across all tiers so that Dev, Test and Ops get the data they need – data that can easily be shared and is understood by everybody.
  • Recapping the initial problem that we described and its root causes, we have a good solution for these problems. DevOps is the way to go – BUT it requires a big focus on performance, architecture, scalability and deployment. It requires more automation to find these problems early on. It requires more measurement, as measures allow us to identify these deficiencies throughout the agile process. It requires active sharing of the data, which will bring the teams even closer together so that they are working on a SHARED GOAL. Following all of this will result in 100% confidence when rolling out a production release – without the need for a war room.
  • When we look at the results of our testing framework build over build we can easily spot functional regressions. In our example we see that testPurchase fails in Build 18. We notify the developer, the problem gets fixed, and with Build 19 we are back to functional correctness.
    Looking behind the scenes: the problem is that functional testing only verifies the functionality towards the caller of the tested function. Using dynaTrace we are able to analyze the internals of the tested code. We analyze metrics such as number of executed SQL statements, number of exceptions thrown, time spent on CPU, memory consumption, number of remoting calls, transferred bytes, ...
    In Build 18 we see a nice correlation of exceptions with the failed functional test. We can assume that one of these exceptions caused the problem. For a developer it would be very helpful to get the exception information, which helps to quickly identify the root cause of the problem and solve it faster.
    In Build 19 the testing framework indicates ALL GREEN. When we look behind the scenes we see a big jump in SQL statements as well as CPU usage. What just happened? The developer fixed the functional problem but introduced an architectural regression. This needs to be looked into – otherwise this change will have a negative impact on the application once tested under load.
    In Build 20 all these problems are fixed. We are still meeting our functional goals and are back to an acceptable number of SQL statements, exceptions, CPU usage, ...
  • Web architectural metrics: # of JS files, # of CSS files, # of redirects; size of images.
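The resource-pool notes above say an actual leak can be identified with unit/integration tests. Here is a minimal sketch of that idea; `ConnectionPool` is a hypothetical stand-in that only tracks usage counts, not a real pool class (real setups would instrument the actual pool):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a real connection pool; only tracks usage counts.
class ConnectionPool {
    private final int capacity;
    private final AtomicInteger inUse = new AtomicInteger(0);

    ConnectionPool(int capacity) { this.capacity = capacity; }

    void acquire() {
        if (inUse.incrementAndGet() > capacity) {
            inUse.decrementAndGet();
            throw new IllegalStateException("pool exhausted");
        }
    }

    void release() { inUse.decrementAndGet(); }

    int inUse() { return inUse.get(); }
}

public class PoolLeakTest {
    public static void main(String[] args) {
        ConnectionPool pool = new ConnectionPool(10);
        // Simulate 100 requests; each must release its connection when done.
        for (int i = 0; i < 100; i++) {
            pool.acquire();
            try {
                // ... request handling would happen here ...
            } finally {
                pool.release(); // forgetting this line is the classic leak
            }
        }
        // The leak assertion: after all requests finish, nothing may still be held.
        if (pool.inUse() != 0) throw new AssertionError("connection leak detected");
        System.out.println("no leak: inUse=" + pool.inUse());
    }
}
```

With the `release()` call removed, this small test fails long before production traffic would exhaust the pool, which is exactly the "find it early" point of the note.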
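The N+1 query pattern mentioned above can be made visible in a plain unit test by counting queries. `QueryCountingRepository` is illustrative only; a real project would count statements at the JDBC or ORM layer:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative repository that counts every "query" it runs.
class QueryCountingRepository {
    int queryCount = 0;

    List<Integer> findOrderIds() {
        queryCount++;                 // one query for the id list
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 50; i++) ids.add(i);
        return ids;
    }

    String findOrderDetail(int id) {
        queryCount++;                 // one more query per order -> the "+N"
        return "order-" + id;
    }
}

public class NPlusOneTest {
    public static void main(String[] args) {
        QueryCountingRepository repo = new QueryCountingRepository();
        for (int id : repo.findOrderIds()) {
            repo.findOrderDetail(id); // N+1 anti-pattern: 51 queries for 50 orders
        }
        System.out.println("queries executed: " + repo.queryCount);
        // Architectural assertion: a version using a single JOIN stays within budget.
        if (repo.queryCount > 5) {
            System.out.println("N+1 pattern detected (budget was 5)");
        }
    }
}
```

The query budget (5 here) is an illustrative number; the point is that the count, not the response time, exposes the access pattern on a small sample database.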
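For the cache-related memory leak above, one common remedy is a size-bounded LRU cache. This sketch uses only the JDK (a `LinkedHashMap` in access order with `removeEldestEntry`); production code would more likely reach for a dedicated cache library with the same bounding idea:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded LRU cache: evicts the least recently used entry beyond maxEntries.
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // true = access order, which gives LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

public class CacheDemo {
    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(1000);
        // Production-like data: a stream of mostly distinct search queries.
        // An unbounded HashMap here would grow without limit and eventually OOM.
        for (int i = 0; i < 100_000; i++) {
            cache.put("query-" + i, "result-" + i);
        }
        System.out.println("cache size: " + cache.size()); // stays at 1000
    }
}
```

This is also why the note insists on production-like data sets: a test that repeats the same "search" query never grows the cache and never sees the leak.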
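The deadlock note above maps to the classic unordered two-lock transfer. A small-scale concurrency unit test of the standard fix (always acquire locks in one global order, here by account id) can run in milliseconds on a developer machine; all names are illustrative:

```java
// Two accounts with an intrinsic lock each; concurrent transfers in both directions.
class Account {
    final int id;
    long balance;

    Account(int id, long balance) { this.id = id; this.balance = balance; }
}

public class TransferDemo {
    // A deadlock-prone variant would lock (from, to) directly: thread A holds
    // a and waits for b while thread B holds b and waits for a.
    // Fix: always lock the account with the lower id first.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Account a = new Account(1, 100), b = new Account(2, 100);
        // Small-scale performance unit test: opposite transfer directions at once.
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(b, a, 1); });
        t1.start(); t2.start();
        t1.join(); t2.join();   // the unordered variant can hang forever right here
        System.out.println("total balance: " + (a.balance + b.balance)); // invariant: 200
    }
}
```

In a real test suite the `join` would carry a timeout so a reintroduced deadlock fails the build instead of hanging it.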
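The build-over-build example in the notes (Build 19: functionally green, architecturally red with 75 SQL statements and 230ms CPU) boils down to asserting KPI budgets in CI. A minimal sketch of such a gate; the metric names and budget values are made up for illustration, and a real pipeline would pull the measured values from an APM tool such as dynaTrace:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KpiGate {
    // Returns one violation message per KPI that exceeds its agreed budget.
    static List<String> check(Map<String, Long> measured, Map<String, Long> budget) {
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, Long> e : budget.entrySet()) {
            Long actual = measured.get(e.getKey());
            if (actual != null && actual > e.getValue()) {
                violations.add(e.getKey() + ": " + actual + " exceeds budget " + e.getValue());
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // Build 19 from the slides: testPurchase passes, but the SQL count and
        // CPU time signal an architectural regression. Budgets are illustrative.
        Map<String, Long> build19 = Map.of("sqlCount", 75L, "exceptions", 0L, "cpuMs", 230L);
        Map<String, Long> budget  = Map.of("sqlCount", 15L, "exceptions", 0L, "cpuMs", 150L);
        List<String> violations = check(build19, budget);
        for (String v : violations) System.out.println("KPI violation - " + v);
        if (!violations.isEmpty()) System.out.println("BUILD FAILED on architectural KPIs");
    }
}
```

Wired into Jenkins, such a gate fails Build 19 even though every functional test is green, which is the "functional and architectural confidence" the note asks for.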
  • Transcript

    • 1. 11 Andreas Grabner @grabnerandi
    • 2. 2
    • 3. 3
    • 4. 4
    • 5. 5 Testing is Important – and gives Confidence
    • 6. 6 But are we ready for “The Real” world?
    • 7. 7 Measure Performance during the game Ball Possession: 40 : 60 Fouls: 0 : 0 Score: 0 : 0 Minute 1 - 5
    • 8. 8 Measure Performance during the game Minute 6 - 35 Ball Possession: 80 : 20 Fouls: 2 : 12 Score: 0 : 0
    • 9. 9 Deep Dive Analysis
    • 10. 10 Options “To Fix” the situation
    • 11. 11 Not always a happy ending  Minute 90 Ball Possession: 80 : 20 Fouls: 4 : 25 Score: 3 : 0
    • 12. 12 FRUSTRATED FANS!! 12
    • 13. 13 How does that relate to Software?
    • 14. 1414 From Deploy to … Deploy Promotion/Event Problems Ops Playbook War Room Timeline
    • 15. 1515 The “War Room” – back then. ‘Houston, we have a problem’ – NASA Mission Control Center, Apollo 13, 1970
    • 16. 1616 The “War Room” – NOW Facebook – December 2012
    • 17. 1717 Problem: Unclear End User Problem Descriptions
    • 18. 1818 Status Quo: Ops Runbook – System Unresponsive
    • 19. 1919 Problem: Unclear Ops Problem Descriptions
    • 20. 2020 Status Quo: Ops Runbook – High Resource Usage
    • 21. 2121 Lack of data?
    • 22. 2222
    • 23. 23 Answers to the right questions
    • 24. 2424 What are the real questions? Individual Users? ALL users? Is it the APP? Or Delivery Chain? Code problem? Infrastructure? One transaction? ALL transactions? In AppServer? In Virtual Machine?
    • 25. 2525 Problem: What Devs would like to have
    • 26. 2626 Problem: What Devs would like to have Top Contributor is related to String handling 99% of that time comes from RegEx Pattern Matching Page Rendering is the main component
    • 27. 2727 It’s like getting this …
    • 28. 28 … when you need to see this!
    • 29. 2929 RECAP Status Quo: We don’t like “War Rooms”
    • 30. 3030 Problem: Attitudes like this don’t help either Image taken from Shopzilla CIO (in 2010): “… when they get in the war room - the developers and ops teams describe the problem as the enemy, not each other”
    • 31. 3131 Problem: Very “expensive” to work on these issues. ~80% of problems caused by ~20% patterns. YES we know this. 80% Dev Time in Bug Fixing. $60B Defect Costs. BUT
    • 32. 3232 TOP PROBLEM PATTERNS • Focus on Web and Java
    • 33. 3333 Top Problem Patterns: Resource Pools
    • 34. 3434 Top Problem Patterns: Resource Pools
    • 35. 3535 Deployment Mistakes lead to internal Exceptions
    • 36. 3636 Deployment Mistakes lead to high logging overhead
    • 37. 3737 Production Deployment leads to Log SYNC Issues
    • 38. 3838 Long running SQL with Production Data
    • 39. 3939 N+1 Query Problem
    • 40. 4040 Reading and processing too much data in App
    • 41. 4141 Memory Leaks in Cache Layer with Production Data. Still crashes. Problem fixed! Fixed Version Deployed
    • 42. 4242 Synchronization Issues under real load
    • 43. 4343 BLOATED Web Sites 17! JS Files – 1.7MB in Size Useless Information! Even might be a security risk!
    • 44. 4444 Missing or incorrect configured browser caches 62! Resources not cached 49! Resources with short expiration
    • 45. 4545 SLOW or Failing 3rd Party Content
    • 46. 4646 Want MORE of these and more details?
    • 47. 4747 Lots of Problems that could have been avoided • BUT WHY are they still making it to Production?
    • 48. 4848 Missing Focus on Performance
    • 49. 4949 Different Goals for Dev and Ops
    • 50. 5050 Disconnected Teams despite “Shared Responsibility”
    • 51. 5151
    • 52. 5252 How to make the Enterprise Crew happy?
    • 53. 5353
    • 54. 5454 Solution: DevOps + Performance Focus Culture “Shared Responsibility” Agile Process for ALL Teams Performance as Key Requirement X-Team Collaboration and Education Automation Measurement, Collaboration and Deployment Automate Performance and Architectural Problem Detection Measurement “Visible” KPIs for each Team Focus on Performance, Architectural and Deployment Measures Sharing Expertise, Tool and Data Sharing “Easy” sharing of Performance, Deployment and Production Data
    • 55. 5555 Culture: EXTEND Requirements with … Performance Scalability Testability Deployability Deployability
    • 56. 5656 Sharing: DON’T EXCLUDE anyone from Agile Process Stand-Ups Sharing Tools Feedback
    • 57. 5757 Measurement: Define KPIs accepted by all teams # of SQL Executions # of Log Lines MBs / Uses Time for Deployment Time for Rollback Response TimesPerf Test Code Coverage
    • 58. 5858 AUTOMATION, AUTOMATION, AUTOMATION Performance Scalability Shared Tools Automatic Feedback
    • 59. 5959 DevOps Collaboration – TODO LIST FOR YOU!! Access to Production Data Shared Reporting and Task Management Diagnostic Tools Shared Performance KPIs and Tooling Known How Exchange
    • 60. 6060 Recap – Problem – Root Cause – Solution - Result DevOps + Performance Culture Automation Measurement Collaboration
    • 61. 6161 TIPS FOR DEVS
    • 62. 6262 Performance Focus in Test Automation 12 0 120ms 3 1 68ms Build 20 testPurchase OK testSearch OK Build 17 testPurchase OK testSearch OK Build 18 testPurchase FAILED testSearch OK Build 19 testPurchase OK testSearch OK Build # Test Case Status # SQL # Excep CPU 12 0 120ms 3 1 68ms 12 5 60ms 3 1 68ms 75 0 230ms 3 1 68ms Test Framework Results Architectural Data We identified a regression Problem solved Lets look behind the scenes Exceptions probably reason for failed tests Problem fixed but now we have an architectural regression Now we have the functional and architectural confidence
    • 63. 6363 Performance Focus in Test Automation Analyzing All Unit / Performance Tests Analyzing Metrics such as DB Exec Count Jump in DB Calls from one Build to the next
    • 64. 6464 Performance Focus in Test Automation Cross Impact of KPIs
    • 65. 6565 Performance Focus in Test Automation Embed your Architectural Results in Jenkins
    • 66. 6666 Performance Focus in Test Automation Here is the difference! Compare Build that shows BAD Behavior! With Build that shows GOOD Behavior!
    • 67. 6767 Performance Focus in Test Automation CalculateUserStats is the new Plugin that causes problems
    • 68. 6868 Remember – DevOps requires Cultural Change Share Integrate Collaborate Performance
    • 69. 6969 Elevate our DevOps Investment - REDUCE 80% Dev Time for Bug Fixing $60B Costs by Defects
    • 70. 70 © 2011 Compuware Corporation — All Rights Reserved. Participate in Compuware APM Discussion Forums. Like us on Facebook. Join our LinkedIn group Compuware APM User Group. Follow us on Twitter. Read our Blog About:Performance. Watch our Videos & product Demos. Thank You
    • 71. 71 © 2011 Compuware Corporation — All Rights Reserved Simply Smarter