Background on measurement and metric usage in traditional waterfall projects, the psychology of measuring, the agile response to traditional metrics, and suggested agile metrics.
1. Erik Weber
@erikjweber
Slidesha.re/AgileM
Agile Metrics
Or: How I Learned to
Stop Worrying and
Love Agile
2. ABOUT CENTARE
Agile/ALM Mobile Cloud
Microsoft
2011 Partner of the Year Finalist
ALM Gold Competency
Azure Circle / Cloud Accelerate
Apple / Java / Scrum
iOS iPhone/iPad/Android
Scrum.org Partner
Certified Professional Scrum Trainers
3. AGENDA
Background
Why metrics?
The Psychology of Metrics
Agile Response
Examples of Agile Metrics
Sources
4. ABOUT ME
Work Stuff
Healthcare, Finance, Green Buildings
Huge Conglomerates, Small Employee Owned, Fortune 500
Tester -> Developer -> Automation Dude -> QA Manager -> Project Manager -> Scrum Master -> Scrum Product Owner -> Scrum Coach
Consulting and FTE
Passionate about Agile
Me Stuff
Huge foodie and amateur cook
Wearer of bowties
Homebrewer and beverage imbiber
Passionate about Agile (have multiple kanban boards up in my living room)
6. WE NEED TANGIBLES
As gauges or indicators
- For status, quality, doneness, cost, etc.
As predictors
- What can we expect in the future?
As decision making tools
- Can we release yet?
A visual way to peer into a mostly non-visual world
- Because we don’t completely understand what’s going on in the
software/project and we need to
7. HISTORY TELLS US TO USE METRICS
Tons of research. Mostly from the 80’s and 90’s and based upon
industrial metrics.
Tons of implementation at companies
Research + Implementation has grown exponentially
Hasn’t really affected project success (what a metric!)
[Chart: Metrics Usage (papers, books, companies, etc.) and Software Project Success Rate*, 1980 to 2010]
*Chaos Report from 1995 to 2010
8. WATERFALL IS SCARY WITHOUT THEM
“Metrics are used in waterfall because we had no idea what was
happening, so we tried to measure anything.”
– Ken Schwaber, ALM Chicago Keynote, 2012
Because the system is complex and intangible.
So we worry.
So we want a way to peer into the system and make predictions.
So we take measurements to try to create a window.
But we still worry.
EVERYTHING STILL FEELS RISKY
10. THE MEASUREMENT PARADOX
“Not everything that can be counted counts, and
not everything that counts can be counted”
– Albert Einstein
Software development is a complex system
Metrics used in isolation probably don’t measure
what you think they do
Beware ‘low hanging fruit’
Value of Measurement = 1/Ease of Measuring
11. Number of Test Cases
[Chart: number of test cases per month, December to March; y-axis 0 to 600]
Real Life Example
In reality, we just started focusing on cleaning up
old test cases.
12. THE HAWTHORNE EFFECT
Measuring something will change people’s behavior
When you measure something, you influence it
You can exploit this effect in a positive way
Most traditional metrics have a negative Hawthorne
effect
Gaming = Hawthorne Effect * Deliberate Personal Gain
“Tell me how you will measure me and I will tell you how I will behave”
-Goldratt
13. “Test case TC8364 has failed, the
customer settings page doesn’t work
in Chrome.”
“Tests: Passed - But I wrote a bug for
not being able to use the customer
setting page in Chrome.”
Real Life Example
Same Tester. Same Test. One sprint before test
pass/fail percentage metric put in place, and one
sprint after.
14. MEASURING AT THE WRONG LEVEL
Austin Corollary: You get what you
measure, and only what you measure
Austin Corollary: You tend to lose others you
cannot measure:
collaboration, creativity, happiness, dedication
to customer service …
Suggests “measuring up”
Measure the team, not the individual
Measure the business, not the team
Helps keep focus on outcomes, not output
15. Real Life Example
Defects per Person-Hour went
down! We met our quality goal!
Customer Complaints went up.
Oops.
Pankaj Jalote. Software Project Management in Practice. Tsinghua University Press, 2004. Pages 90-922.
18. INCREMENTS ARE GAME CHANGERS
- Agile projects produce potentially shippable
Increments every few weeks
- The system is no longer intangible
- No need to have tons of predictive metrics
- Reviewing the Increment (sprint review)
- Enables quick adaptation to customer
needs, market concerns, quality issues, etc.
20. SCRUM APPROACH
The only
metric
that really
matters is
what I say
about
your
product.
21. DOES THAT MEAN …
No Metrics?!
Well, OK; no metrics are better than bad metrics.
22. OUR AGILE METRICS MANIFESTO
We no longer view or use metrics as isolated
gauges, predictors, or decision making tools;
rather they indicate a need to investigate
something and have a conversation, nothing
more.
We realize now that the system is more complex
than could ever be modeled by a discrete set of
measurements; we respect this.
We understand there are some behavioral
psychology concepts associated with measuring
[the product of] people’s work; we respect this.
24. CONSIDERATIONS
What really matters?
Listen to the customer
Understand and respect the complex system
Trends over static numbers
Are we measuring at the right level?
How can we make this measurement a bit less isolated?
How can we ensure only the correct audience sees it?
Measure up!
What behaviors are we trying to nurture (or avoid)?
Will this help us be more agile?
No Single Prescription
25. WORKING SOFTWARE
Can everybody confidently give the “thumbs up” to
the increment?
29. SUMMARY
Waterfall makes me anxious
Agile inherently limits risk, renders many
traditional metrics moot
The increment is a game changer
Measuring people influences their behavior
There are useful metrics in agile
Beware traditional metrics and low hanging fruit
Leverage the Hawthorne effect
Measure up
Promote Agile/Lean/XP/good development practices
30. SOURCES
Scrum.org – Professional Scrum Product Owner Course. http://bit.ly/xOccnM
Mike Griffiths – Leading Answers: “Smart Metrics” http://bit.ly/yfV643
Elisabeth Hendrickson – Test Obsessed: “Question from the Mailbox: What Metrics Do You Use in Agile?” http://bit.ly/xtSDdg
Jason Montague – Observations of a Reflective Commuter: “Systems Thinking and Brain Surgery” http://bit.ly/ylBxIn
Ian Spence – Measurements for Agile Software Development Organizations: “Better Faster Cheaper Happier” http://bit.ly/y4UKIt
N.E. Fenton – “Software Metrics: Successes, Failures & New Directions” http://bit.ly/ybwUzA
Failure Rate – “Statistics over IT projects failure rate.” http://bit.ly/xjBRv0
Chad Albrecht – Ballot Debris: “Simple Scrum Diagram” http://bit.ly/yc7yFW
Robert Austin – “Measuring and Managing Performance in Organizations” http://amzn.to/wTfgx3
Mary Poppendieck – Lean Software Development: “Measure Up” http://bit.ly/zppVTC
Jeff Sutherland – Scrum Log: “Happiness Metric – The Wave of the Future” http://bit.ly/xO8ETS
These people are much smarter than I, please read what they have to say!
32. UNIT TEST COVERAGE
Encourages teams to write unit tests, good
xp/agile/development practice
Doesn’t guarantee GOOD tests – careful!
[Chart: unit test coverage per sprint, Sprints 1 to 4, for Team 1, Team 2, and Team 3; y-axis 0% to 120%]
34. TEST CASE LIVELIHOOD
Trend of new or changing test cases
Shows if tests are keeping up with a growing/changing software
Encourages teams to upkeep tests
[Charts: percentage of new or changing test cases per sprint, Sprints 1 to 4, for Team 2 (y-axis 0% to 18%) and Team 3 (y-axis 0% to 10%)]
38. STRATEGIC ALIGNMENT INDEX
Are the features we’re
implementing really the
highest value?
Are the projects we’re
running really the best
ROI?
39. USAGE INDEX
Are the features we’ve implemented being used?
Where should we focus our attention?
[Chart: Feature Usage Index, percent of users using each feature (0 to 1), features A to K]
On a long project (six months, a year, or longer) we need some way to gauge these things. Developing software is a complex system that is mostly intangible, so we use these measurements as a window into that world. What’s going on here? When will we be done? What’s our quality like? And so on. It’s human nature to want to explain things we can’t see.
What do you think about this metric? Actually it’s a really bad one: there are correlation/causation errors going on, and overall “project success” is way too complicated a system to judge based on one metric. Chaos Report from 1995 to 2010: project success rate goes from 16% to 30%.
In waterfall we need gauges and indicators and ways to predict the future, because it’s scary to be on a project with a really long time horizon.
“We all have a need to understand, we all get anxious when we don’t, we all look for ways to explain things that aren’t easy to explain. That’s what these metrics do. And, if you’re the owner of the project, your butt is on the line, so all the more to be anxious about, all the more to try to make the intangible tangible (which is what I think of software development – until you see a product, it is an intangible and intangibles are scary).” – LL
So before we talk about how Agile responds to that, let’s look a bit at how we operate as humans, and how metrics can affect our behavior. This section has one slide of theory and one real life example.
“There are so many possible measures in a software process that a random selection of metrics will not likely turn up something of value” – Watts Humphrey
Metrics used in isolation probably don’t measure what you think they do.
- The system is more complex than this. We’re probably not ever going to be able to measure enough to give us a simple indicator of the system.
- Isolated metrics entice people to draw system-wide conclusions.
-> Primary/Secondary Metric
Beware low hanging fruit. Also, old literature praises low hanging fruit!
-> Just because we can measure something easily doesn’t actually mean it’s meaningful.
Ask: Does everyone agree this is an easy to gather metric? What is this metric really telling us?
Stakeholders: “How come we have less tests than a few sprints ago? That can’t be right. We must not be testing enough.”
Stakeholders: “On my last project we had thousands of tests, why are there only a couple hundred? That can’t be right, we must not be testing enough, I bet this thing is littered with bugs.”
This is an example of things that are easy to measure, and things measured in isolation. The system – the software development machine – is far too complex to be making broad quality statements based on such isolated measurements. But we’re so used to doing that. So you can start to see that some traditional metrics might not really fit the bill. Let’s go on.
Explain the Hawthorne experiment: a select group of workers were told they were being studied, and their productivity changed. All the researchers did was minutely change the lighting levels.
For example, measuring test pass/fail status always causes pass percentage to rise. But it is an artificial rise, due to people not wanting to fail tests or splitting up tests into smaller and smaller units to drive the percentage calculation up (which is just creating waste).
Also called demand characteristics: an experimental artifact where participants form an interpretation of the experiment's purpose and unconsciously change their behavior to fit that interpretation.
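A minimal sketch of the test-splitting effect described above, using made-up numbers (8 passing and 2 failing tests are assumptions, not data from the talk). The software and its defects are identical in both runs; only the way tests are counted changes.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the pass percentage for a test run."""
    total = passed + failed
    if total == 0:
        raise ValueError("no tests were run")
    return 100.0 * passed / total


# Hypothetical sprint: 8 passing tests, 2 failing tests.
before = pass_rate(passed=8, failed=2)

# Same software, same 2 failures, but every passing test was split in two.
after = pass_rate(passed=16, failed=2)

print(f"before splitting: {before:.1f}%")
print(f"after splitting:  {after:.1f}%")
```

The percentage rises even though nothing about the product improved, which is why the number alone should only prompt a conversation.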
I’ve changed the exact wording here to protect the innocent. But here’s a good real life example. Read these two statements and think about what may have changed in the time between them.
Robert Austin, Measuring and Managing Performance in Organizations. Nucor Steel based plant managers’ salaries on productivity of ALL plants, not just theirs.
The obvious example here is defect counts.
W. Edwards Deming, the noted quality expert, insisted that most quality defects are not caused by individuals, but by management systems that make error-free performance all but impossible. Attributing defects to individuals does little to address the systemic causes of defects, and placing blame on individuals when the problem is systemic perpetuates the problem. By aggregating defect counts into an informational measurement, and hiding individual performance measurements, it becomes easier to address the root causes of defects. If an entire development team, testers and developers alike, feel responsible for the defects, then testers will tend to become involved earlier and provide more timely and useful feedback to developers. Defects caused by code integration will become everyone’s problem, not just the unlucky person who wrote the last bit of code.
So what do you think happened here? What was the result? Perhaps their intense focus on defects per person led to no focus on the customer. Perhaps not; it’s too complex to really tell, but the point is that they are probably measuring too low. Are defects-per-person-hour really important to your goal? Probably not. Measure one level up, maybe defects reported by customers.
So we’re at the point where we know that waterfall feels risky, and we know there are some behavioral aspects to metrics that we need to consider.
Agile takes all the worry and all that risk and packages it up into cute little time boxes. Agile inherently limits risk. Even if one of these boxes explodes, the project isn’t a failure. And every few weeks we produce a valuable increment of product, so we have the chance to inspect it and adapt our approach, reprioritize, replan, etc. Managers no longer need to be anxious about predicting project performance over months and months. We have real tangible results every few weeks. We can inspect the increment and determine the ACTUAL characteristics of the product that we used to use metrics to try to get at.
Agile projects inherently limit risk: time boxes, WIP, DoD, AC, fast feedback.
(Lead in) So that’s nice, but how do you define quality on this increment and on the product as a whole?
Two ways. On any single increment we use the above mindset. These are not strict equations, I’m not doing any math here, it’s just a way to think about quality in the agile world.
DoD: Shared definition among the team of what “done” means. Typically you see things like coding standards, unit test coverage, tests pass, deployable, reviewed, etc. Every piece of work must adhere to the DoD.
AC: The Product Owner’s business-language criteria for how a specific piece of work must function. Sometimes written in the GIVEN-WHEN-THEN format, a practice associated with ATDD.
So as we string increments of working software together, how do we get at the quality of the product? We use the mindset at the bottom for this. On the product level, it’s no longer so much about defining quality in a quantitative sense as it is about having a development process that can easily react to change: react to negative customer feedback as well as suggestions for new features and what’s most important to the customer at the moment.
Stakeholders that don't show up at the Sprint Review will still be nervous, and rightly so. The corollary is: every time a manager/stakeholder/etc. asks for a report, instead of giving it to them, stress the importance of showing up at the Sprint Review.
You have clear development principles that help limit risk (DoD, verification) and clear business objectives that help limit risk (Acceptance Criteria, validation). This ensures some base level of quality in your product, and then through frequent stakeholder and customer feedback, we ensure ongoing quality and value of the product. Our chief metric in Scrum is working software. That said, what other metrics do we need? Right?
Agile does indeed negate the need for many traditional metrics. It certainly helps make the complex, rather intangible process of developing software a bit more tangible, one increment at a time. I do suggest starting here. It’s less dangerous than starting with metrics carried over from waterfall. Rip it all down and build it back up. But there are some useful metrics that we could use, so to set that context…
In his 14 Points, Deming said “Eliminate management by numbers and numerical goals. Instead substitute with leadership.” The more we rely on metrics to tell us what happened, the more we distance ourselves from the actual work being done. We realize that measuring a system as complex as the software development machine doesn’t really provide understanding, just data. Sometimes bad data, sometimes good data. And we realize that the obvious answer isn’t always right – like blaming bad developers for buggy products – “it must be the developers” – we respect that there is likely more going on in the system than any one root cause of anything. Further, if we use metrics the wrong way, we build games and systems that reward paying attention to the metric and not the success of the company.
Overall we believe that being agile is important to the goal, our goal being making really good software products that have high value and delight customers. So we will use metrics that help us be agile, that encourage us to embrace lean and XP and good development practices.
Trends over static numbers: tear the labels off the y axis.
Is this setting up stakeholders to draw a system conclusion based on an isolated metric?
No single prescription – figure out what makes sense for you. Take these considerations into account. We’ll go over a bunch of possible metrics next, but I’m not advocating a simple recipe for anyone. I’m certainly not saying you have to use all of these.
Our chief metric is working software. Did we get to the end of the sprint and have potentially shippable product? How do you measure this? A simple thumbs up or thumbs down. Get everyone in a room and do it. Not good enough? Then document it. We keep a running go/no-go document.
Why not just do this in waterfall? Get everyone in a room after a year-long project and give it the thumbs up? Well, in some sense you do: we often ignore all those other metrics we’ve spent so long gathering. We rationalize sev 1’s down to 2’s, etc. In agile you can do this more safely because YOU HAVE CONTEXT. You have really good context and memory within a timebox. The risk is limited.
Indicates team progress. A way to visualize what’s done, what’s WIP, and what’s left to do. A tool to see when we’ll be DONE with a particular chunk of value.
Don’t like hours? Don’t want a graph? Fine: use a task board, count tasks, stories-to-done, whatever. It’s just a tool so that you as a team know how work is progressing, and can visualize that and discuss it as a team.
If it’s not given to management, there is little risk of a negative Hawthorne effect or gaming.
Not individuals
No comparing across teams
Not really for management, certainly not for incentives (risk of gaming)
Helps the business know when a larger chunk of functionality might be DONE. Not really part of Scrum, but also something you usually can’t get away without doing. At least this method of planning is based on empirical evidence of past sprints’ velocity and what’s actually on the backlog now. Also look at the cone of uncertainty there: we’re not promising a date, we’re just giving a forecast as accurately as we can while still being able to sleep at night.
Increments are great, and this tells us when enough increments put together will satisfy some large business objective.
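One way to sketch such a velocity-based forecast, assuming the cone of uncertainty is approximated by the best, average, and worst past velocities (that approximation, the function name, and all numbers below are illustrative assumptions, not the method from the slides):

```python
import math


def forecast_sprints(remaining_points: float, past_velocities: list[float]):
    """Return (optimistic, expected, pessimistic) sprint counts to finish the backlog."""
    if not past_velocities or min(past_velocities) <= 0:
        raise ValueError("need at least one positive past velocity")
    best = max(past_velocities)
    worst = min(past_velocities)
    average = sum(past_velocities) / len(past_velocities)
    # Round up: a partially used sprint still has to be run.
    return (
        math.ceil(remaining_points / best),
        math.ceil(remaining_points / average),
        math.ceil(remaining_points / worst),
    )


# Hypothetical data: points completed in the last four sprints, and the
# points left on the backlog for the business objective in question.
velocities = [18, 22, 20, 25]
remaining = 120

optimistic, expected, pessimistic = forecast_sprints(remaining, velocities)
print(f"Forecast: {optimistic} to {pessimistic} sprints, most likely about {expected}")
```

Reporting the range rather than a single date keeps the forecast honest about its uncertainty.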
Unit testing is a great development practice. If we measure it, we just might encourage that behavior. Pick a target; coverage should never go down.
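The "pick a target, never go down" rule can be sketched as a simple check over a team's coverage history. The function name, the 80% target, and the sample history are assumptions for illustration; the output is meant to start a team conversation, not to grade anyone.

```python
def check_coverage(history: list[float], target: float) -> list[str]:
    """Return messages for sprints where coverage dropped, plus a target check."""
    problems = []
    for i in range(1, len(history)):
        if history[i] < history[i - 1]:
            problems.append(
                f"Sprint {i + 1}: coverage dropped from {history[i - 1]}% to {history[i]}%"
            )
    if history and history[-1] < target:
        problems.append(f"Latest coverage {history[-1]}% is below the {target}% target")
    return problems


# Hypothetical coverage percentages for one team over four sprints.
team_history = [40, 55, 52, 70]
issues = check_coverage(team_history, target=80)
for message in issues:
    print(message)
```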
Don’t discourage check-ins by making this visible at too high a level. Individuals need fast feedback, and sometimes teams can use this in good spirits, but it can start to deter check-ins.
Beware the “math” on this one: as software matures and ceases to change, this percentage approaches 0. But 0 in any one sprint indicates a problem. Rapid fluctuation might indicate some churn or lack of vision around testing (or churn in the software).
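A minimal sketch of the calculation, assuming the metric is defined as new or changed test cases divided by total test cases in a sprint (the slides show only percentages, so this definition and the sample data are assumptions):

```python
def livelihood(new_or_changed: int, total_tests: int) -> float:
    """Percentage of test cases that were added or modified during a sprint."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return 100.0 * new_or_changed / total_tests


# Hypothetical data: (sprint name, new or changed tests, total tests).
sprints = [
    ("Sprint 1", 12, 100),
    ("Sprint 2", 9, 110),
    ("Sprint 3", 0, 110),
    ("Sprint 4", 4, 114),
]

flagged = []
for name, changed, total in sprints:
    pct = livelihood(changed, total)
    # Zero in a single sprint suggests the tests are not keeping up.
    if pct == 0:
        flagged.append(name)
    print(f"{name}: {pct:.1f}%")

print("Worth a conversation:", flagged)
```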
Etsy – optimize everything for employee happiness
http://happiily.com
Encourages self-awareness
Leading indicator
Confidence? When you check in or move something to done
Scale is 1-5. We measure this continuously through a live Google Spreadsheet. People update it approximately once per month. Here are the columns:
- Name
- How happy are you with Crisp? (scale 1-5)
- Last update of this row (timestamp)
- What feels best right now?
- What feels worst right now?
- What would increase your happiness index?
- Other comments
Running Tested Features – XP practice
Positive Hawthorne Effect: We want to deliver more value (but beware gaming – you still have to be DONE)
Measures up: delivered value for the product (not single team or individual)
Little’s Law: queue size ~ queue time
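Little's Law relates long-run averages in a stable system: average items in the queue equals average throughput times average time in the queue. A small sketch with invented numbers shows why a bigger queue means a longer wait at the same throughput (the function name and the figures are assumptions):

```python
def average_lead_time(wip_items: float, throughput_per_week: float) -> float:
    """Little's Law rearranged: average time in system = WIP / throughput."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_week


# Hypothetical team that finishes 5 features per week on average.
lead_small = average_lead_time(wip_items=10, throughput_per_week=5)
lead_large = average_lead_time(wip_items=30, throughput_per_week=5)

print(f"10 items in progress: {lead_small:.1f} weeks average lead time")
print(f"30 items in progress: {lead_large:.1f} weeks average lead time")
```

The relationship holds only for averages over a reasonably stable period, so treat the result as a trend indicator, not a promise for any single feature.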
Here’s one for your metric walls: what are the top three most common customer complaints? Or the three hottest issues right now? Post these on a wall where everyone can see them.
Size of bubbles is TCO (total cost of ownership).
Hopefully in a single project you are up in the magic quadrant.
Across a program/product there might be some things in other places: “have to do’s”, compliance and legal stuff…
For a single feature, you can also drill down one level and look at the number of times per day/week/month a user uses it, and the amount of time spent using the feature.
Why measure this? Are we building the right features? Is a bug in feature “C” more critical than a bug in feature “I”? Feature “K” may have more maintenance costs than value – consider dropping it.
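A rough sketch of how a feature usage index could be computed, assuming it is the fraction of all active users who used each feature in some period. The data structure, user IDs, and feature letters are invented for illustration.

```python
def usage_index(feature_users: dict[str, set[str]], all_users: set[str]) -> dict[str, float]:
    """Fraction of all users who used each feature at least once."""
    if not all_users:
        raise ValueError("all_users must not be empty")
    return {
        feature: len(users & all_users) / len(all_users)
        for feature, users in feature_users.items()
    }


# Hypothetical usage data: which users touched which feature this month.
all_users = {"u1", "u2", "u3", "u4"}
feature_users = {
    "A": {"u1", "u2", "u3", "u4"},
    "C": {"u1", "u2", "u3"},
    "K": {"u4"},
}

index = usage_index(feature_users, all_users)
ranked = sorted(index, key=index.get, reverse=True)
for feature in ranked:
    print(f"Feature {feature}: {index[feature]:.2f}")
```

Ranking features this way gives a starting point for asking where bugs matter most and which features may not be worth their maintenance cost.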
Where is there waste in the system?
What’s the best time for a nominal task?