Join me in this talk about why high workload leads to increasing waiting times and is detrimental to your project’s efficiency. We will not only talk about queueing theory and capacity management, but also about strategies to cope with high utilization and how to start a virtuous circle.
Wait A Moment? How High Workload Kills Efficiency! - Roman Pickl
1. 45 minutes
Hi,
Thank you for being here, I‘m very impressed that so many people are here in
this room right now.
Welcome to my talk
High workload leads to increasing waiting times
detrimental to your efficiency
So how did I end up here?
2. My name is Roman Pickl and for the last half a year I‘ve been a technical project
manager at Elektrobit, which is an automotive software supplier.
Before that I was CTO of Fluidtime, which is a provider of integrated mobility
services, but also a process manager at the Austrian parcel service, which also
deals with continuous delivery, I guess…
I have a background in IT and Business.
CI/CD/DevOps is the sweet spot for me, as I really love how the things I learned
in my Production Management courses are nowadays applied in the IT domain.
3. In September this year I was speaking at DevOpsDays Cairo, and Andrew Shafer
showed this slide during his keynote:
It says:
„I don‘t have time to learn new things because I‘m too busy getting things done!“
- Said the least productive person in the world.
During that time I was also finalizing this talk and it really resonated with me, as I was
reading Tom DeMarco‘s book Slack, which also states that we are all too busy and this
not only kills our efficiency but also our effectiveness.
4. Being effective is about doing the right things, while being efficient is about doing the
things in the right manner.
You are efficient when you do something with minimum waste
You are effective when you‘re doing the right something
It‘s possible to be one without the other, it is also possible to be both
Example: flat tire
If your car breaks down and you have a flat tire,
And you start cleaning the car or change the wrong tire, you could be very efficient but
you are definitely not effective in fixing your car.
On the other hand, if you change the right tire, but it takes you 2 hours, because you
don‘t know exactly how it works and it takes you multiple attempts, you are effective,
but not very efficient.
So we want to be both but often we are neither.
In his book, DeMarco also cites evolutionary biologist Ronald Fisher
Fisher‘s fundamental theorem:
The more highly adapted an organism becomes, the less adaptable it is to any new
change.
So we need to find the right balance and not over optimize.
5. In my career I have found myself in a situation similar to the following multiple times…
6. So I‘m part of an ops or dev team of three and the workload is not evenly distributed:
One person works 100% of his / her normal weekly hours (that‘s me).
One, oftentimes the person with the most experience, skips lunch breaks or has them at
their desk, works on weekends, is the bottleneck and has their hair on fire.
And maybe one more junior person has some slack, but none of the others has
time to show them new stuff, or the others hold on to their dear tasks.
So we experience an overload situation
and show symptoms of overload like decreased team morale, working when sick, being
sick all the time, and so on.
And then the following discussions are getting more frequent…
7. I‘ve changed the pictures here, but other than that, this could be a screenshot of the
discussions we had.
So the colleague says:
We really need to fix the … hell,
But the estimate for doing something about it is at least (!) 2-3 days
And I‘m like
At least(!) 2-3 days… that‘s actually not that much...
And it hurts again and again
We really need to..
use a static code analyzer, improve the deployment pipeline, introduce a better debugger,
…
And you feel like:
8. So I know that there is a better way of doing stuff. I talked with friends about it and I have
seen it at conferences such as DevOpsDays Warsaw, but we are just too busy.
So I go to my manager and ask him/her for support in the DevOps, testing, … domain and
he/she says …
9. Manager: Yeah you know, I don‘t know if we should look for support in this area.
Just do some redistribution of work
And it will certainly take time to find someone.
I‘m not even sure if we would be able to utilize that person up to 100%.
But I‘ll talk to xyz in the other department there is a project currently on hold and one of
those people may help us with the missing 20%
Me: And I‘m like: individual workers are not freely exchangeable / replaceable (Tom
DeMarco: the myth of the fungible resource).
You don‘t want to do that anyway. Utilization is not a good proxy for productivity.
Manager: ?
ME: Let‘s talk about queueing theory.
10. So queueing theory and operations research are very interesting but also quite hard to
understand in detail, and there are different types of queues.
Let‘s look at a G/G/1 queue: a single server where interarrival times have a general (meaning
arbitrary) distribution and service times have a (different) general distribution.
The lead time (time to go through the whole process) of a system is heavily influenced
by the service time, the utilization of the actor and the variation in the process (task
variability, interarrival variation).
This is what the Kingman equation / approximation tells us.
Let‘s assume that software development / operations is a highly variable process,
with a decent amount of variability in both the tasks and the interarrival times of these tasks.
Wait time is the percentage of time busy divided by the percentage of time idle.
If a resource is fifty percent busy, then it’s fifty percent idle.
The wait time is fifty percent divided by fifty percent so one unit of time.
So on average a task would wait in the queue for e.g. one hour before work starts.
With ninety percent utilization it would be in waiting state for 9 times longer.
Note that this graph gets steeper with higher variance and flatter with lower variance,
being a flat line in a perfect world.
So there’s an inherent conflict between service quality and capacity management with
things deteriorating heavily at ~70-80 percent. Depending on the variability of the
process of course
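To make the shape of that curve concrete, here is a minimal sketch of the Kingman approximation for a G/G/1 queue: mean wait ≈ (busy / idle) × (cv_arrival² + cv_service²) / 2 × mean service time. The function names and the sample utilization values are my own choices for illustration; with both coefficients of variation set to 1 the variability term drops out and you get exactly the busy-divided-by-idle rule from above.

```python
def utilization_factor(utilization: float) -> float:
    """Share of time busy divided by share of time idle."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


def kingman_wait(utilization: float, cv_arrival: float, cv_service: float,
                 mean_service_time: float) -> float:
    """Kingman approximation of the mean time a task waits in a G/G/1 queue.

    cv_arrival and cv_service are coefficients of variation
    (standard deviation divided by mean) of interarrival and service times.
    """
    variability = (cv_arrival ** 2 + cv_service ** 2) / 2.0
    return utilization_factor(utilization) * variability * mean_service_time


if __name__ == "__main__":
    # Illustrative values only: one unit of service time, cv = 1 for both.
    for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
        wait = kingman_wait(rho, 1.0, 1.0, 1.0)
        print(f"{rho:.0%} busy -> waits about {wait:.1f} time units")
```

Setting both coefficients of variation to 0 gives a wait of zero at any utilization below 100%, which is the flat line of the perfect world.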
11. A similar idea is Little’s law, which shows a relationship between lead time, work in
progress and throughput. But there are some important assumptions behind it, and if you want
to apply it, you can find more in this presentation from Daniel Vacanti.
The important takeaway is that you should focus on throughput rather than utilization.
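As a rough sketch of Little’s law: average lead time equals average work in progress divided by average throughput. This only holds for long-run averages in a reasonably stable system, and the numbers below are made up for illustration.

```python
def average_lead_time(avg_wip: float, throughput: float) -> float:
    """Little's law: average lead time = average WIP / average throughput.

    avg_wip is the average number of items in progress, throughput is the
    average number of items finished per time unit (e.g. per week).
    """
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return avg_wip / throughput


if __name__ == "__main__":
    # Hypothetical team: 12 items in progress, 3 items finished per week.
    print(average_lead_time(12, 3), "weeks on average")
    # Same throughput, half the work in progress.
    print(average_lead_time(6, 3), "weeks on average")
```

With throughput unchanged, halving the work in progress halves the average lead time, without anyone being busier.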
And if you reach high utilization levels, all your work just piles up and everyone gets
very nervous and calls for status meetings, which adds work. So this escalates quickly.
You then reach a state that Mario Kleinsasser, in a GitOps presentation at Vienna-DevOps-
Security, referred to as quicksand.
12. The more you fight it, the more it pulls you in.
Each time you try to put your leg out, it sucks you in again.
Don‘t panic
I really liked this metaphor and so I googled how to get out of quicksand.
Make yourself as light as possible—toss your bag, jacket, and shoes.
Try to take a few steps backwards.
Keep your arms up and out of the quicksand.
Try to reach for a branch or person’s hand to pull yourself out.
Take deep breaths.
Move slowly and deliberately.
So I go to my manager again and tell him, you know what, we really need more people…
13. And all of them should only be utilized up to 80%, so that they can invest roughly a day
per week to …
-Spur innovation
-Rethink
-Practice new ways
-Master new skills
-Improve efficiency.
Because that‘s how it is or has been done at 3M with their 15% time and Google with
their 20% time.
At Google, Site Reliability Engineering principle number two says that people must have
time to make tomorrow better than today.
So after all this research and listing all the pros I have persuaded my manager, but it is
too late.
What actually happened is that we got even more to do and spent considerable time
interviewing. It was quite uncertain whether we would find someone.
14. So the colleague, you know, the one with the hair on fire, could not bear it
anymore and quit.
Now the other team member and I are also under heavy pressure. And people under
time pressure don‘t think faster.
But at least we found a very skilled professional, who starts today.
YOU
And given that you have just joined the team, you usually cannot jump right in and have
some slack, at least in the beginning.
Congratulations you joined a team in an existential crisis.
15. But that‘s not necessarily a bad thing.
A Seat at the Table: „Transformational projects occur when the amount of debt has
become too much to bear“
Often organizations need to experience an existential crisis before they really take
continuous improvement seriously
A 10% change (e.g. overtime) is not enough anymore
Now a 50%+ improvement is needed
If you can establish
- A sense of urgency (We have to change how we work)
- A safe environment for failure (Things that were already late will shift, but we really
need to try something new)
And have the courage to question the status quo (which is easier if you are new)
Crises are chances / opportunities to change things
Example: iOS app deployment
I‘m getting a little ahead of myself here, but here is a little example.
3 years ago: I supported our ops team and asked how we deploy our iOS apps.
16. So they told me:
For each of the 5 apps do:
Checkout the code on your local pc
Make sure you have the right Xcode version installed
Do these 3 manual changes
Select the right signing certificates
Compile
Upload to the App Store
Enter / update all the relevant metadata
Wait for Apple to review the software
So this takes 1 day
But what if the developers / testers find an error?
We have to start over, and as this is so painful, we release very seldom and asked the
team for fewer releases which are tested better.
Sorry, but I won‘t do it that way.
One year earlier, fastlane was released, which among other things helps to automate the release
process of apps.
Set up fastlane with Jenkins
Click a button (let them do it)
Wait 20 minutes
This took us a week. And ever since, releasing iOS and Android apps has been a non-event and
we did it more frequently with the push of a button.
So it really takes someone like you who questions the status quo.
But where to start?
17. I think first you should quantify the work.
So what I like to do is something called activity accounting.
Where you allocate costs to the activities the team is performing to uncover non-value-adding
work.
You can do this based on your time sheets or by looking at your task board.
But you should not only watch out for categories
But also things that you don‘t do, or do very infrequently (e.g. updates, releases,
retrospectives)
Or multitasking which could mean that people are waiting for feedback (e.g. due to long
build times, or other people not being available)
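A minimal sketch of such an activity accounting, assuming you can export (category, hours) pairs from your time sheets or task board. The categories and hours below are invented for illustration and are not taken from any real team.

```python
from collections import defaultdict

# Hypothetical time sheet entries for one week: (activity category, hours).
ENTRIES = [
    ("new work", 12.0),
    ("unplanned work / rework", 8.0),
    ("customer support", 6.0),
    ("meetings", 10.0),
    ("new work", 4.0),
]


def share_by_category(entries):
    """Return the share of total time spent per activity category."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no hours recorded")
    return {category: hours / grand_total for category, hours in totals.items()}


if __name__ == "__main__":
    shares = share_by_category(ENTRIES)
    # Largest share first, so the biggest time sinks are at the top.
    for category, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{category:<25} {share:.0%}")
```

Categories that never show up in the output (updates, releases, retrospectives) are as interesting as the ones that dominate it.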
As a side note:
What I also like to do and what really helped is making the things your team is currently
working on visible.
So in my last company we put up a monitor in front of our desks with our kanban board
and everyone could see what we were currently working on.
This reduced the number of interruptions (are you already working on „…“), and also
helped to show that we had a lot of things to do and, if priorities differed, people
knew whom to talk to before bugging us.
Here are some numbers from the 2018 Accelerate report:
18. The low performers spend 15% of their time on customer support and 20% on defects identified by
end users, so this sounds like a quality or at least a documentation problem.
What we also see is that low performers only spend 30% of their time on new work as
opposed to 50% in the elite group.
So now you know what you are spending your time on and you should set goals based on
benchmarks, but what takes so long?
19. Lord Kelvin said, what gets measured gets managed.
So you should measure flow time i.e. Lead time and cycle time for the features you
deliver.
How long does it spend waiting, in testing, on the build server, hand offs, waiting to be
released?
(Lead time measures the time elapsed between order and delivery, thus it measures
your production process from your customer's perspective. Cycle time starts when the
actual work begins on the unit and ends when it is ready for delivery.)
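A small sketch of these two measures, assuming each feature carries four timestamps. The field names and dates are hypothetical; your ticket system will call them something else.

```python
from datetime import datetime


def lead_and_cycle_time(ordered, started, ready, delivered):
    """Return (lead time, cycle time) as timedelta objects.

    Lead time: from order to delivery (the customer's perspective).
    Cycle time: from the start of actual work until ready for delivery.
    """
    return delivered - ordered, ready - started


if __name__ == "__main__":
    # Hypothetical feature: waits a week before anyone starts working on it.
    lead, cycle = lead_and_cycle_time(
        ordered=datetime(2024, 3, 1, 9, 0),
        started=datetime(2024, 3, 8, 9, 0),
        ready=datetime(2024, 3, 10, 9, 0),
        delivered=datetime(2024, 3, 12, 9, 0),
    )
    print("lead time:", lead.days, "days")
    print("cycle time:", cycle.days, "days")
    print("spent outside active work:", (lead - cycle).days, "days")
```

The difference between the two is time the feature spends waiting, which is usually where the interesting questions are.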
Additional measures
Bug lead time
Code lead time
Patch lead time
Change success rate
WiP is a leading indicator: the more WiP is in the pipeline, the longer things tend to take
to complete (Little’s law)
Aging reports (look at tickets that have not moved for 30 days)
Demand / ticket inflow
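An aging report can be as simple as the following sketch, assuming you can export the date each ticket last changed state. The ticket IDs and dates are invented.

```python
from datetime import date, timedelta


def stale_tickets(last_moved, today, max_age_days=30):
    """Return IDs of tickets that have not moved for at least max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(ticket for ticket, moved in last_moved.items() if moved <= cutoff)


if __name__ == "__main__":
    # Hypothetical export: ticket ID -> date of the last state change.
    last_moved = {
        "OPS-1": date(2024, 3, 1),
        "OPS-2": date(2024, 4, 20),
        "OPS-3": date(2024, 3, 31),
    }
    print(stale_tickets(last_moved, today=date(2024, 4, 30)))
```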
On the other hand, we want to increase throughput.
So what is the biggest bottleneck?
20. Can we fix it?
Can we bring the pain forward?
Ask why and investigate the root causes. Sometimes the fixes are quite easy, once you
know the problem
21. Example:
Build Time
Jenkins: console log
-> not all had timestamps
Timestamper plugin for Jenkins:
Buy an SSD -> 36% improvement
If your build cycle takes 26 hours and releasing manually takes 1 hour,
start fixing your build cycle first.
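The arithmetic behind this advice, as a small sketch. Applying the 36% figure to the 26-hour example is my own assumption for illustration; the two numbers come from different examples above.

```python
def hours_saved(step_hours: float, improvement: float) -> float:
    """Hours saved per run when a step gets faster by the given fraction."""
    if not 0.0 <= improvement <= 1.0:
        raise ValueError("improvement must be between 0 and 1")
    return step_hours * improvement


if __name__ == "__main__":
    # Assumption: a 36% speed-up applied to the 26-hour build cycle.
    build_saving = hours_saved(26.0, 0.36)
    # Best case for the other step: the 1-hour manual release vanishes entirely.
    release_saving = hours_saved(1.0, 1.0)
    print(f"faster build saves   {build_saving:.2f} h per cycle")
    print(f"automated release saves at most {release_saving:.2f} h per cycle")
```

Even a perfect fix of the small step cannot compete with a moderate improvement of the big one.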
22. When you are in an overload situation, there is the tendency to do more planning but
rather…
There is only one most important thing – let people know
Avoid conflicting priorities
23. Reduce WiP and batch size
Stop starting and start finishing
In Kanban there are WIP limits
In Scrum there is the sprint scope
This will decrease context switching
Increase focus
Reduce cycle time / give faster feedback
Usually quality takes time, so a daring strategy is to reduce quantity.
Problems become visible faster
Say no to additional work, or at least postpone
Reduce variability / increase predictability
Reduce the number of platforms
Kill zombie projects
Fewer versions / branches (trunk-based)
Fix the root causes
Do less
New and unique work (e.g. new features)
50% to 2/3 of new features are never used / do not meet the business intent /
do not improve the key metric
Triage work (find sources of problems)
Smaller batches (fewer changes need to be investigated)
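"Stop starting and start finishing" can be illustrated with a toy model: one person, several tasks, either finished one after the other or worked on in daily rotation. This is a simplified sketch with made-up task sizes, and it assumes switching is free, which it is not in practice.

```python
def finish_days_sequential(durations):
    """Finish day of each task when tasks are done one after the other."""
    finished, clock = [], 0
    for days in durations:
        clock += days
        finished.append(clock)
    return finished


def finish_days_round_robin(durations):
    """Finish day of each task when switching to the next task every day."""
    remaining = list(durations)
    finished = [0] * len(durations)
    clock = 0
    while any(days > 0 for days in remaining):
        for index, days in enumerate(remaining):
            if days > 0:
                clock += 1  # spend one day on this task, then switch
                remaining[index] = days - 1
                if remaining[index] == 0:
                    finished[index] = clock
    return finished


if __name__ == "__main__":
    tasks = [3, 3, 3]  # hypothetical tasks, three days of work each
    for label, result in (("sequential", finish_days_sequential(tasks)),
                          ("round robin", finish_days_round_robin(tasks))):
        average = sum(result) / len(result)
        print(f"{label:<12} finish days {result}, average {average:.1f}")
```

Both variants are done with everything on the same day, but with a WIP limit of one the first results arrive much earlier and the average completion time drops.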
24. Self service platform:
In my last company we set up a CI/CD platform using Jenkins. And if you automate
everything you come to the point where it doesn‘t really make a difference, at least in
the normal case, who presses the button to release software. So it changed from a
matter of having time and the necessary skill to a matter of permission.
Switching to Jenkinsfiles and infrastructure as code also had the benefit that developers
created their own CI/CD scripts based on the existing ones or created pull requests. In
the past everyone was rather unwilling to set up their own Jenkins jobs for fear of breaking
something; now everything was in version control.
25. Make time to create automation
get management support
this may mean ignoring
New and unique work (e.g. new features) – hard to automate
Triage work (find sources of problems)
Automate for consistency
Smaller batches (fewer changes need to be investigated)
Automate repetitive work
Reduce manual work
Increase frequency
Speeds up feedback loop
Consistency leads to fewer errors
Fix the biggest bottleneck first, everything else may make things worse and accumulate
backlog.
It may not even be a technical problem (e.g. person overloaded)
Help scaling
Improve accuracy
Increase repeatability
Improve reliability
26. Save time
Make processes faster
Enable more safeguards
Empower users (reduce handoffs)
Reduce user wait time
Reduce system administrator wait time
What I also liked about our automation and measurement effort is that we could reach
win-win-win situations.
For example: We noticed that our end-to-end tests ran rather slowly, and measuring them in
detail revealed that we had a performance issue / memory leak. Fixing this not only
made our tests run faster, but also helped us debug faster, and our customers
noticed a considerable improvement.
There are different approaches to automation
Leftover principle – automate as much as possible, humans do what is left
Increase quality and frequency of feedback
Reduce time and resources between release branch and production (always releasable)
Improve deployment reliability
27. This view makes the unrealistic assumption that people are infinitely versatile and
adaptable and have no capability limitations.
(Explain graph)
Acquire automation (e.g. open source solution, commercial solution)
Other approaches
Compensatory principle – human / machine which one is better at which task
Complementary principle – improve long-term health of the combined system
But automation is not a silver bullet solution and a word of caution is necessary
28. When the new automation is in place, there is less total work to be done by humans,
but what is left is harder.
Relentless improvement, refactoring and innovation is needed to reach excellence
This is also depicted in this J-Curve of Transformation
This requires spare capacity and slack i.e. you need to invest to be fast and think of
improvement activities as just regular work.
So for example if you use fastlane to deploy iOS apps and it fails, then it takes at least a
day to understand signing iOS apps again
29. A recurring cycle of events, the result of each one being to increase the beneficial effect
of the next.
“Eliminating one bottleneck in a system often highlights another one. As each change
cycle is completed the resulting improvements, standardization, and automation free up
engineering time. Engineering teams now have the space to more closely examine their
systems and identify more pain points, triggering the next cycle of change” – SRE
workbook
So we freed ourselves from quicksand, which as described is a vicious circle that
pulls us in the harder we try to resist.
But what’s next?
If you are not constantly working to improve, then by default, you are gradually getting
worse
(second law of thermodynamics: entropy increases)
So we need to establish a culture of learning and improvements that compound.
Jez Humble says lean works by investing in removing waste so that you can increase
throughput
Prioritize improvements that will do the most to improve flow
Start with the bottleneck / biggest source of waste, fix it, identify the next bottleneck
31. One experimental approach for product development and process improvement is the
Improvement Kata.
You get the direction or challenge
You grasp the current condition
You establish your next target condition
Conduct experiments to get there
Start again
33. Software delivery performance drives Lean product management
Lean product management drives software delivery performance
Setup feedback loops and Iterate: The faster we can experiment, iterate, and integrate
feedback the faster we can learn. How quickly we can integrate our feedback depends
on our ability to deploy and release software
Make adjustments based on learnings from feedback and metrics
Speed up the feedback loop
Reduce Waiting Time
34. So in the end we reached a state where we did way more with way less.
35. I want to conclude with a recap of the things we talked about today.
Get out of the quicksand, but don‘t stop there, the sky is the limit
…
Three Ways of DevOps
The First Way: Workflow
The Second Way: Improve Feedback
The Third Way: Continual Experimentation and Learning
Don‘t forget that it‘s people over process over tools