A Primer: Basics Cloud Engineering Plays. It can be challenging for a technology team to navigate through crazy times with growth, new technologies, many releases, etc. Here are my thoughts on Basic Plays a cloud engineer could use to make sure they always cover the basics. I am sure there are even better ones out there, and I look forward to hearing about them.
A Primer: Basics Cloud Engineering Plays
Introduction
There seem to be more cloud-based services than chanterelles in a Swedish October forest. This means
that there are many excellent engineers and technical operations people with long experience developing
and operating cloud services. So why do I think there is a need for this primer? Let me explain.
Businesses are operating at a faster and faster pace and companies need to ship software faster, requiring
us to build out services and infrastructure faster. Additionally, businesses need to be increasingly more
efficient. This means that we need to do more with less; fewer engineers, less hardware, fewer technical
operations staff, et cetera. We push solutions and technologies to their limits before pausing to
anticipate and address the associated problems. We do not know what will break first, so why fix things
that are not broken? We cannot afford it. And we do not want to waste resources on something that may
not be needed; that would be over engineering. Businesses change strategies and these are easily updated
on a PowerPoint slide, but the bullwhip effect on technology is brutal. The technology team tries to
generalize because the platform is usually reasonably stable, whereas applications and features change
rapidly to accommodate strategy and packaging changes.
In this environment, engineers (devs) and technical operations (ops) do their best to balance the tradeoffs.
They compromise what their instincts tell them are the right things to do from a cloud engineering
perspective for what the business needs to accomplish ridiculously quickly. Management beats down best
practices instincts with aggressive schedules and messages that compel technologists to cut corners.
Management insists that unless we ship on time, we cannot close sales, or we will miss the market
window. Management always wants more, faster. Management asserts that the business needs to cut
costs, so use fewer boxes, and make do with less disk space. This constant pressure wears down devs
and ops until they forsake their instincts to follow basic best practices. When things heat up and a
company or product enters the release tornado, teams sometimes (often?) struggle to follow basic rules
that help them to be successful and continue to execute and scale on a higher level.
This is why this primer is needed: to establish some basic plays for engineers and technical operations.
These best practices represent a code of honor, pride and professional integrity on which a cloud engineer
or cloud ops technician will never compromise. They set a standard that allows a technical team to ensure
they always perform at a level which they can stand by, and a set of rules of engagement that can be
communicated and agreed on with Management as standard operating procedure.
The motivational speaker, Tony Robbins, asserts “The path to success is to take massive, determined
action.” I’d like you to consider adopting the basic plays I am proposing (or your own variation thereof) as
something to which you are committed and determined to follow. Use them as the basic principles on which
you base your professional work as a cloud engineer. Never compromise on them.
The purpose of this primer is not to mandate an ideal state where we do everything right and perfectly.
Doing that would be very difficult, particularly in a growing and often chaotic organization. Rather, the goal
is to establish a baseline, the cloud engineering basics, that are reasonable and rational to embrace and
general enough to afford flexibility. I believe they will help you be successful.
Author: Jari Koister
General Guidelines
There are some things that apply in a general sense across all technology roles.
#1 Customer first.
As an engineer and operations person, you make many decisions every day. You consider them from many
perspectives, such as effort, cost and complexity. It is not always easy to determine the best option given
all of these competing perspectives. My advice is that you take the perspective of customers because it
will often clarify things and make the decision easier. Advocate for your customer in your decisions and you
will at least avoid harming them, even if it happens to be a little more expensive.
#2 Speed is Queen, Predictability is King.
Speed is an advantage and sometimes it is necessary to rush as fast as you can, ignoring everything else,
but it comes with a cost. Most of the time, predictability is just as important as speed. As a developer or
operations person, you know that many people rely on the predictability of your results. Other developers
rely on it, and marketing, sales and, ultimately, customers and users. Being able to predictably ship
software or get the system up is one of the greatest strengths a technology team can have; it is an
important ingredient of a winning recipe. Take time to learn how to break down a problem, assess risk, and
make reasonable estimates. You do not need to be 100% correct all the time, but you need to learn to be in
the zip code. I am always pointing out that as human beings, we tend to focus on the things we know the
best, so estimating those things is easy. The biggest risk is estimating in those areas you do not know. Be
bold, dive in, and take the time to assess the risks to enhance the predictability of a successful result.
#3 Chipping away.
Resist the temptation of thinking you can fix everything at once. Building a great system is a marathon, and
you need to chip away at the issues each day, week, and sprint. Find a process for continuously
identifying, starting and finishing things. Make a positive change every sprint. This will lead you to success
cumulatively and less stressfully.
Operations
Cloud operations are extremely demanding, particularly in a growing cloud company. To deal with the
avalanche of requests and issues, I need something that guides me so that I do not lose track of my main
mission; I need to adhere to the basics. And when I write adhere to, I mean it. Do not let anything come
between you and the principles you adopt. For Operations, some basic rules are so critical that no
compromise is available.
#4 Keep Lights‐on.
“Lights-on” refers to the basic tasks and requirements that can never be compromised. Nothing should be
allowed to take precedence. Lights-on tasks include, but are not limited to:
● Daily backups of all critical data.
● Replication of all data needed for HA and for satisfying the SLA.
● Redundant HW for all vital components.
  ○ 2 spare components of everything; you need to have redundancy even after failure.
● Continuous software redundancy: All software processes with redundancy should be redundant all
  the time. If one crashes, it is highest priority to get it redundant again.
● Religiously performing all security related tasks, such as penetration testing.
As an ops person, lights-on must be your highest priority. Not covering these would be like a firefighter
arriving at a blaze with an empty water tank because he did not fill it up in advance: just a fatal fail. Below,
I list the things that I believe are lights-on tasks. Failure to take care of a lights-on task can result in fatal
consequences for your business and customers. It can get your boss(es) fired and you can lose many of
your customers, or your stock price can tank, making your corporate Board very unhappy. So, maintain
your integrity and protect your lights-on tasks. Nobody will ever disagree with you on this. If they do, you
should consider another employment opportunity where the basics are cherished and respected.
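One way to keep an eye on the "continuous software redundancy" item is a small check that flags anything running without a second instance. This is only a sketch under my own assumptions: the service names, the instance counts and the `min_instances` default of 2 are illustrative, and in practice the counts would come from whatever monitoring you already have.

```python
def services_without_redundancy(running_instances, min_instances=2):
    """Return services with fewer than min_instances healthy instances, sorted by name."""
    return sorted(
        service
        for service, healthy in running_instances.items()
        if healthy < min_instances
    )


# Hypothetical snapshot: service name -> number of healthy instances.
status = {"api": 3, "queue": 1, "scheduler": 0, "db": 2}
print(services_without_redundancy(status))
```

Anything this returns is a lights-on item: it goes to the top of the list until it is redundant again.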
#5 Backups and HA.
This is a lights-on task. You must have backups, and recent ones. Whatever you do, do not compromise
on backups. Never think “it will be ok for a few more days.” A failure can happen any day. You may
accidentally screw something up (more likely than a RAID disk failing you) or an application may
accidentally corrupt something.
Ideally, have everything redundant, at a minimum as a hot standby. Software fails; it is only a matter of
time. Do not accept that you have a single instance of something “just for a few days.” Fix it immediately.
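To make "recent ones" checkable rather than a feeling, here is a minimal sketch of a backup freshness check. The 24-hour limit is my reading of "daily backups", and the data store names and timestamps are made up for illustration; the real timestamps would come from your backup tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "daily backups" means the newest backup is under 24 hours old.
MAX_BACKUP_AGE = timedelta(hours=24)


def stale_backups(last_backup_at, now, max_age=MAX_BACKUP_AGE):
    """Return names of data stores whose newest backup is older than max_age.

    last_backup_at maps a data store name to the time its latest successful
    backup finished, or None if no backup has ever completed.
    """
    stale = []
    for store, finished in sorted(last_backup_at.items()):
        if finished is None or now - finished > max_age:
            stale.append(store)
    return stale


# Hypothetical data stores and backup times.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
backups = {
    "orders-db": now - timedelta(hours=3),
    "user-uploads": now - timedelta(days=2),
    "audit-log": None,
}
print(stale_backups(backups, now))
```

A store that has never been backed up is treated as stale on purpose; "no backup yet" is the worst case, not an exemption.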
#6 Hardware redundancy.
Like software, hardware eventually fails. I know getting money for hardware can be like pulling teeth. But do
not accept not having spare parts and redundant nodes. When “it” hits the fan, Management will tell you
they did not realize that it was so important and that you should have insisted; hence it will still be your
responsibility. Insist on having 2 spares of every component type and redundancy on all vital
system components. You need to be able to replace any failing component within 1 hour. This is a lights-on
task.
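The "2 spares of every component type" rule is easy to turn into a shopping list. A small sketch, assuming you keep some kind of spare-parts inventory; the component names and counts below are hypothetical.

```python
MIN_SPARES = 2  # the "2 spares of every component type" rule from the text


def components_short_on_spares(spares_on_hand, min_spares=MIN_SPARES):
    """Map each under-stocked component type to how many spares are missing."""
    return {
        component: min_spares - count
        for component, count in spares_on_hand.items()
        if count < min_spares
    }


# Hypothetical inventory: component type -> spares currently on the shelf.
inventory = {"disk": 4, "power-supply": 1, "nic": 0, "switch": 2}
print(components_short_on_spares(inventory))
```

The output doubles as the purchase request you bring to Management before things hit the fan, not after.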
#7 Reactive and proactive.
As technical operations, you become very reactive by necessity. Things happen all the time; software
issues occur; you need to track down an error in a log for devs; hardware fails. Concurrently, you need to
be building out things strategically to scale, otherwise you will eventually hit a brick wall as your business
grows. During each sprint, pick a few things that are strategic and fix them. Chip away, and over the course
of months, you will have made a lot of progress. Do not accept the status quo: improve and automate those
areas you need to support your anticipated scale (see #18).
#8 Communicate.
Tell people what is going on. What you do is so vital to everyone. It has such an impact on your users,
sales, developers and others that any disruption is potentially corrosive. Keeping people in the
loop on areas such as upgrades, tests and known increased loads makes everyone more comfortable, and
you will receive fewer distress emails.
#9 You can not please everyone.
It is natural that you want to help everyone. Ops warriors just want to; it’s in their blood; they were born
that way. But do not fool yourself and compromise your main mission by trying to help everyone. Prioritize
and make sure you do that well. I am suggesting lights-on as the basic priority one. During a discrete
period, you may choose development over client services as your second priority, and during another
period, the other way around. Be cognizant of your priorities and use them to simplify decisions in your
daily work.
#10 Protect your system; never lose sight of security.
There is nothing worse than having someone come in and break your system. You will be the one having to
fix it. Make sure you know who has access to what and do not be afraid to revoke access. If they need it,
they will ask you again. No one should have access to things they do not absolutely need, not even
Management.
Never lose sight of security. Review it regularly. Identify improvements; chip away at them. You cannot fix
everything at once, but if you fix a few per quarter, you will have done a lot at the end of the year. This is a lights-on
task.
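A regular access review can start from a simple list of who has access to what and when they last used it. The sketch below is an assumption-laden illustration: the 90-day threshold, the user and resource names, and the tuple layout are all mine, and the real data would come from your identity or audit system.

```python
from datetime import date, timedelta


def revocation_candidates(grants, today, unused_for=timedelta(days=90)):
    """Return (user, resource) pairs whose access has not been used recently.

    grants is a list of (user, resource, last_used) tuples; last_used is a
    date, or None if the access was never used.
    """
    candidates = []
    for user, resource, last_used in grants:
        if last_used is None or today - last_used > unused_for:
            candidates.append((user, resource))
    return candidates


# Hypothetical grants.
today = date(2024, 6, 1)
grants = [
    ("alice", "prod-db", date(2024, 5, 20)),
    ("bob", "prod-db", date(2023, 11, 2)),
    ("carol", "billing-admin", None),
]
print(revocation_candidates(grants, today))
```

Revoking from this list is low risk for exactly the reason given above: if they need it, they will ask you again.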
#11 Protect your data.
Nuff said; never allow customer data to travel somewhere it shouldn’t. Protect your data stores rigorously.
Demand that devs fix any potential security issues immediately. This is a lights-on task. Seriously, do not
ignore it. This is a lights-on task.
Engineering
As an engineer and developer, what are the basic principles I should embrace and apply every day and
never compromise? Here is my advice.
#12 Hit your sprint goals.
As mentioned earlier, predictability is often more important than speed, not always, but often. Predictability
helps everyone make better decisions. It makes coordination easier, reduces waste, and creates a
heightened level of satisfaction. Plan so that you hit your goals most of the time. No one hits them always,
but do not accept that you often miss them. Have stretch goals so you push yourself, yet allocate time for
ensuring adequate quality and fixing bugs. Bad quality means many P0’s, which means missing sprint
goals. Avoid this spiral. If you do get trapped, break out as soon as possible.
#14 Manage your debt.
We always generate debt. We cut corners in our implementation and simplify things to get them out more
quickly. We leave configurations in code that should be externalized, or engineer better orchestration without
moving all processes to it. We cut corners on how many automated tests we implement and test a feature
manually “just this one time.” This is all fine, but you need to identify the high risk, high interest debt you
are accumulating. You need to pay down these high risk debts during sprints. Believe me, you do not want
to accumulate so much debt that it becomes insurmountable. Not everything can or needs to be
fixed, but as an engineer you need to keep track of what you absolutely want to fix and make sure you do
it. No one will stop you. If you believe it is important, just do it. Maintain your integrity.
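Keeping track of the debt you "absolutely want to fix" can be as light as a scored list. The sketch below is one possible way to do it, not a standard method: the 1 to 5 scales, the risk times interest score and the threshold of 12 are all assumptions I picked for illustration, and the debt items are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    risk: int      # assumed scale: 1 (harmless) to 5 (could cause an outage)
    interest: int  # assumed scale: 1 (rarely costs time) to 5 (slows every sprint)

    @property
    def score(self) -> int:
        return self.risk * self.interest


def pay_down_first(items, threshold=12):
    """Return items whose score meets the threshold, highest score first."""
    urgent = [item for item in items if item.score >= threshold]
    return sorted(urgent, key=lambda item: item.score, reverse=True)


# Hypothetical debt register.
register = [
    DebtItem("config hard-coded in source", risk=3, interest=4),
    DebtItem("feature only tested manually", risk=4, interest=5),
    DebtItem("half-migrated orchestration", risk=2, interest=3),
]
for item in pay_down_first(register):
    print(item.score, item.name)
```

The exact numbers matter less than the habit: score the debt when you take it on, and pull the top of the list into each sprint.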
#15 Take action on your Jira’s.
Filing a Jira when an issue occurs, or when you find a bug, is a great habit. But I have seen hundreds, perhaps
thousands of Jira issues that just sit there. You feel good for capturing an issue in Jira, but you actually have not
accomplished anything. It would be like a surgeon being satisfied with identifying that a tumor exists
without bothering to excise it. Make scrubbing and acting on your Jira issues part of your weekly routine. It
counts as work, just like anything else.
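The weekly scrub gets easier if something hands you the list of issues that have gone quiet. This sketch works on plain records rather than calling Jira itself; the issue keys, the field names, the "Done" status and the 14-day idle limit are all hypothetical, and you would fill the list from an export or from whatever client you already use.

```python
from datetime import date, timedelta


def issues_needing_attention(issues, today, max_idle=timedelta(days=14)):
    """Return keys of open issues untouched for longer than max_idle, oldest first."""
    idle = [
        (issue["updated"], issue["key"])
        for issue in issues
        if issue["status"] != "Done" and today - issue["updated"] > max_idle
    ]
    return [key for _, key in sorted(idle)]


# Hypothetical issue export.
today = date(2024, 3, 15)
issues = [
    {"key": "OPS-101", "status": "Open", "updated": date(2024, 3, 10)},
    {"key": "OPS-87", "status": "Open", "updated": date(2023, 12, 1)},
    {"key": "OPS-95", "status": "In Progress", "updated": date(2024, 2, 1)},
    {"key": "OPS-60", "status": "Done", "updated": date(2023, 6, 1)},
]
print(issues_needing_attention(issues, today))
```

For each key it returns, the scrub has three honest outcomes: fix it, schedule it, or close it as something you will not do.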
#16 Have a goal architecture, develop with an end in mind.
It is impossible to build out the architecture you ultimately want to have immediately because it would
usually take too long. You may not be able to afford the software you need, or you may not yet need
everything. So, you take shortcuts and simplify to get product out, and this is very rational. As an example,
you know it is best to run processes on separate nodes, but for the time being you put them in one
process. At some later point in time, this will cause scaling issues, but because you know that you wanted
to separate the processes, the work to improve the situation is hopefully just a few days because you have
kept this goal in mind all along. The key is to develop with an end in mind. The end may change over
time, but that does not mean you should not have a target architecture defined at every point. Do not
compromise on this; always take the time to define the target architecture, and communicate it along with
the tradeoffs and the potential shortcuts. Communicate your assumptions, such as the level of scale you
expect this architecture will support.
#17 Never work on something without clear acceptance criteria.
You need to know when you are done. The biggest issue I have seen in communication amongst product
owners, developers and business people is that they have different views of what the result will be. I know
creating acceptance criteria is painful and takes time, but it is well worth it. Just decide not to accept a
project without a clearly marked goal line.
#18 Establish a baseline for scale.
We always talk about building for scale. Scale means so many different things; it can be number of users,
transactions, size of data, et cetera. As an engineer, you need to decide your design parameters. If you are
producing an enterprise application, you probably do not need to start by designing it for 100 million users.
If it is a startup, perhaps start with designing it for 1000 customer companies. Decide on a goal and design
for that. For each design you do, you need to ask the question “will this work for my scale goal?” Always
design with a defined level of scale in mind and communicate your assumptions.
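Writing the scale goal down as numbers makes the question "will this work for my scale goal?" something you can actually answer. A back-of-the-envelope sketch: only the 1000 customer companies comes from the text above; the users per customer, requests per user, peak factor, measured capacity and 50% headroom are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class ScaleGoal:
    customers: int                  # e.g. 1000 customer companies, as in the text
    users_per_customer: int         # assumption
    requests_per_user_per_day: int  # assumption
    peak_factor: float              # assumption: peak load relative to the daily average

    def peak_requests_per_second(self) -> float:
        daily = self.customers * self.users_per_customer * self.requests_per_user_per_day
        return daily / SECONDS_PER_DAY * self.peak_factor


def fits_scale_goal(goal, measured_capacity_rps, headroom=0.5):
    """True if the goal's peak load leaves `headroom` (a fraction of capacity) unused."""
    return goal.peak_requests_per_second() <= measured_capacity_rps * (1 - headroom)


goal = ScaleGoal(customers=1000, users_per_customer=50,
                 requests_per_user_per_day=200, peak_factor=4.0)
print(round(goal.peak_requests_per_second(), 1))
print(fits_scale_goal(goal, measured_capacity_rps=800))
```

Sharing the `ScaleGoal` values is also a compact way to communicate your assumptions: anyone who disagrees can point at a specific number.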
#19 Think platform, build application
In today’s frantically paced world, applications and features change quickly. The Agile Manifesto states it
well: embrace changing requirements. The lean methodology promotes trying things quickly, failing fast
whilst finding the market fit. These thoughts have significant repercussions for a software engineer. I
believe that as an engineer you need to identify the platform components that will outlive any
features and application components. These are the ones you must nurture and generalize until you arrive
at the right feature or application. Therefore, fundamentally identify your end platform even if you will not
build it out fully immediately. Most likely you build your platform as part of building your user features. If
you know what your essential platform is, you can generalize and reuse in a way that will ultimately benefit
everyone greatly.
#20 Be explicit on what variable you are changing.
The variables are time, QA/technology debt, quality, functionality and resourcing. If you cannot fit a
deliverable, at least one of these variables needs to change. When negotiating, understand and communicate
which one you are changing.
#21 Consider risks carefully.
There is a high level of risk in software development. Anything can go wrong, and most of the time
something does. Never, ever, take your eyes off the ball and compromise basic principles. I have seen
examples where there is an issue with the staging environment, and the devs say “I am sure it will work,
never mind that we cannot test on stage,” or “This component has been working fine, I do not think we need to
test it.” Even if there is an issue in only 1 out of 10 of these situations, the cost of dealing with it is higher
than the combined cost of testing it upfront. Never ever give in; do not take stupid risks even when you are
tired and under pressure. Consider the consequences if things go wrong. Do not compromise your integrity.
Cheat Sheet
#1 Customer first.
#2 Speed is Queen, Predictability is King.
#3 Chipping away.
#4 Keep Lights‐on.
#5 Backups and HA.
#6 Hardware redundancy.
#7 Reactive and proactive.
#8 Communicate.
#9 You can not please everyone.
#10 Protect your system; never lose sight of security.
#11 Protect your data.
#12 Hit your sprint goals.
#14 Manage your debt.
#15 Take action on your Jira’s.
#16 Have a goal architecture, develop with an end in mind.
#17 Never work on something without clear acceptance criteria.
#18 Establish a baseline for scale.
#19 Think platform, build application
#20 Be explicit on what variable you are changing.
#21 Consider risks carefully.