Codestock 2018 - Deliver at Warp Speed

Inmar – do not copy, distribute or use without Inmar written permission, 2018 1

Deploying to production 100x a day,
with no QA and zero downtime
Deliver at Warp Speed
Rex Morgan
Director Software Engineering

How We Do It, And How You Can Too
Building Failsafe Software
• Event Sourcing
• Idempotency
• Async-First
• Circuit Breakers & Auto-Retry
• Real-Time Architecture
• Continuous Deployment
Testing and SDLC
• First-class automated testing
• Deploy then test (not the other way around)
• Decide what to test using Science™
• Trunk-based development
Operations
• You Build It, You Run It
• Monitoring and Alarms
• Ops Standup
• Shoot things just because you can

What is Inmar?
Ever used a paper or digital
coupon? Ever redeemed a rebate?
Ever filled a prescription, returned
something to a store, or online;
stayed at a hospital, read a blog, or
talked to a chatbot on social
media?
If so, you’ve touched Inmar.
> $1b / week processed
>100b consumer touchpoints
100s of deployments / day

Client’s TRUST
is Paramount
How do we go
FAST with
CONFIDENCE?
What do we
GAIN?

I am an Engineer
• 3 years as Director of Engineering at Inmar
• 5 years as Architect at Qorvo and Volvo
• 12 years as a professional software
developer
• At Inmar, managers, directors – all the way
up to the CTO – are technical – we code!
• I get to work with some of the most talented
and sharp engineers around. Want to talk
about joining us? Hit me up after!
AND YOU?
MY TEAM:

The Game
Our leaders are always asking,
begging us to deliver more, faster.
Why?
Competition is always about trying
to gain an advantage.
THE STAKES:
We win:
• More jobs
• Bigger projects
• Higher pay
We lose:
• Fewer jobs
• Stressful workplace
• No $ for raises

[Chart: business value delivered over time elapsed. Phases: Planning, Architecture (ooo fun), Development, QA/UAT, then BAM! RELEASE]

[Chart: business value delivered over time elapsed, across Sprint 1, Sprint 2, Sprint 3, Sprint 4, Sprint 5, Sprint 6, …]
Agile advantages:
• Greater cumulative value delivered
• Team velocity increases over time

OODA
• Observe: Measure our environment. Produces what we know about the environment.
• Orient ourselves: Fit what we know about the environment with our strengths and weaknesses. Produces possible paths to go down.
• Decide: Choose the best course of action. Produces action plan.
• Act: Execute the plan! Produces effects on the environment.

OODA
• This is how we Learn: by Shipping
• Business leaders are playing chess –
they need to make a move to see its
effect. They need us to deliver.
• Code that isn’t live, being exercised
to deliver real value, is just practice.
• If one full loop takes too long, the plan you are
executing is based on out-of-date
observations!
• If we aren’t looping as fast as we
possibly can, we aren’t maximizing
our own potential.

After a year, the team that OODA
loops 100x a day will have
accelerated far beyond the team
that does so once every 3 weeks.

If it’s painful, you should do it more often
Amount of fear:
Unknown > Known
When you start doing things that hurt more
often, they go from “I fear it will fail in some
horrible way” to “I see that it fails in this
very specific way”

Calculated Chaos
• When you go through this cycle of pressing
on your system in ways that hurt, you
discover, describe, quantify, and correct the
fragile parts of your system. Once you
overcome one, you’ll find another, and
another, but they will be more and more rare.
• The really amazing thing that happens is: as
you remove fragilities and they become
more rare, your system becomes more
resilient generally. Because you
proactively found and fixed the specific ways
in which it will break, your system can now
withstand that entire class of failures.

Your MARGIN is
my OPPORTUNITY
Countless companies out
there, large and small,
are looking for weakness
where they can slip into
the margins where you
are vulnerable, and take
your business.
Our responsibility as
professionals is to
proactively defend our
companies’ turf in our
area of expertise:
delivering technology.

Building Failsafe Software

Event Sourcing
• Traditional CRUD system:
– Model your entities
– Use a DAL for Create, Update, Delete, and Read
– What if I need to know what happened to the entity in the past?
– Oh, right… add an audit table
• Problems with that approach
– DELETE is destructive (obviously), but so is UPDATE
– The audit table approach is not failsafe. The primary action is guaranteed, but the audit can break.
– So even with audit tables, every UPDATE is potentially destroying data – specifically, the fully-complete answer to the question “what was it before the update?”

Event Sourcing (cont’d)
• Get rid of UPDATEs. In fact, get rid of directly storing your entities!
• Instead, just write the intended change directly to the audit table
• When I want to know what the entity is, I just replay all the audit records in order
• When I want to know what the entity was, at any point in time, I just replay up to that point.
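
The replay idea can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the `Event` shape, the event names ("submitted", "amount_set") and the `replay` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int     # position in the append-only log
    kind: str    # hypothetical event name
    amount: int  # dollars; only meaningful for "amount_set"


def apply(state: dict, event: Event) -> dict:
    """Return the next state without mutating the previous one."""
    new_state = dict(state)
    if event.kind == "submitted":
        new_state["status"] = "submitted"
    elif event.kind == "amount_set":
        new_state["amount_due"] = event.amount
    return new_state


def replay(events: list[Event], up_to: Optional[int] = None) -> dict:
    """Rebuild the entity by applying events in order, optionally stopping at seq `up_to`."""
    state: dict = {}
    for event in sorted(events, key=lambda e: e.seq):
        if up_to is not None and event.seq > up_to:
            break
        state = apply(state, event)
    return state


log = [
    Event(0, "submitted", 0),
    Event(1, "amount_set", 20),
    Event(2, "amount_set", 50),  # a correction is a new event, never an UPDATE
]
current = replay(log)                  # what the entity is now
before_correction = replay(log, up_to=1)  # what the entity was at seq 1
```

Nothing is overwritten: the $20 answer is still recoverable after the $50 correction is recorded.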

Event Sourcing (cont’d)
• Every financial institution in the world uses it,
and they seem to manage transactional
volume decently well
• The most common performance booster is
snapshotting. Every nth record (even every
record) we cache the resulting entity
alongside the event, so we don’t have to
replay from the beginning of time.
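
A rough sketch of snapshotting, assuming a toy entity whose state is a running balance and whose events are integer deltas. The `Stream` class and the `snapshot_every` parameter are made up for illustration.

```python
class Stream:
    """Append-only event stream that caches the resulting state every nth event."""

    def __init__(self, snapshot_every: int = 3):
        self.events: list[int] = []          # deltas, in append order
        self.snapshots: dict[int, int] = {}  # event index -> state after that event
        self.snapshot_every = snapshot_every

    def append(self, delta: int) -> None:
        self.events.append(delta)
        count = len(self.events)
        if count % self.snapshot_every == 0:
            self.snapshots[count - 1] = self.state_at(count - 1)

    def state_at(self, index: int) -> int:
        """State after the event at `index`, replaying from the nearest earlier snapshot."""
        candidates = [i for i in self.snapshots if i <= index]
        if candidates:
            start = max(candidates)
            balance, first = self.snapshots[start], start + 1
        else:
            balance, first = 0, 0  # no snapshot yet: replay from the beginning
        for delta in self.events[first:index + 1]:
            balance += delta
        return balance


stream = Stream(snapshot_every=3)
for delta in [20, 30, -5, 10, 1]:
    stream.append(delta)
```

Here `stream.state_at(4)` starts from the snapshot stored at index 2 and replays only the last two events.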

Let’s Build A Rebates System!
1. Shopper makes a purchase (in-store or online)
2. Submits a rebate form (paper or online)
3. Submission
• Did the consumer buy the right product?
• Do they meet all the criteria?
• How much $ are they due?
4. “We owe you $20!”
5. Settlement
• How do we remit payment?
• Do we have enough $ in the bank?
• What happens if it comes back unclaimed?
6. “You paid me $20 but I should have gotten $50”
7. Correction: “We owe you $50!”

Idempotency
Old-school:
POST creates an object… we use an
incrementing primary key. So if I POST the
exact same object again, it will create a
duplicate record with a higher ID. That is
almost never what we want.
In computer science, the term idempotent is
used more comprehensively to describe an
operation that will produce the same results if
executed once or multiple times.
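
One common way to make a create operation idempotent is to key it on an identifier the caller supplies instead of an auto-incrementing ID. The sketch below assumes that approach; the `SubmissionStore` class and the `idempotency_key` parameter are hypothetical names, and the slide does not say which technique the speaker's team uses.

```python
class SubmissionStore:
    """In-memory store whose create() is safe to call repeatedly with the same key."""

    def __init__(self):
        self._by_key: dict[str, dict] = {}
        self._next_id = 1

    def create(self, idempotency_key: str, payload: dict) -> dict:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing  # repeat request: same result, no duplicate record
        record = {"id": self._next_id, **payload}
        self._next_id += 1
        self._by_key[idempotency_key] = record
        return record

    def count(self) -> int:
        return len(self._by_key)


store = SubmissionStore()
first = store.create("receipt-123", {"amount": 20})
second = store.create("receipt-123", {"amount": 20})  # e.g. a retried POST
```

A retry after a timeout therefore cannot create a second submission.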

Async-First
• Myth: most business
operations need to block until
the operation is completely
executed.
• Fact: most business operations
only need to acknowledge the
request to execute was
received.
• Being synchronous is an
additional constraint – a
temporal constraint – on any
feature. Constraints limit
implementation options – and
increase cost.
• It must be a business
requirement – with justification
and value attached – for an
operation to be synchronous.
Otherwise, everything is
implemented async by default.
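
A small sketch of "acknowledge now, execute later", using only the Python standard library. The handler name, the response shape and the in-process queue are assumptions for illustration; a real system would more likely use a durable queue or message bus.

```python
import queue
import threading
import uuid

work: queue.Queue = queue.Queue()
results: dict[str, str] = {}


def handle_request(payload: dict) -> dict:
    """Acknowledge that the request was received; do not wait for the work itself."""
    request_id = str(uuid.uuid4())
    work.put((request_id, payload))
    return {"status": "accepted", "request_id": request_id}


def worker() -> None:
    """Drain the queue in the background; None is the shutdown signal."""
    while True:
        item = work.get()
        if item is None:
            work.task_done()
            break
        request_id, payload = item
        results[request_id] = f"processed {payload['form']}"
        work.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

ack = handle_request({"form": "rebate-42"})  # returns immediately
work.join()                                  # only here so the demo can inspect results
work.put(None)
thread.join()
```

The caller gets a `request_id` back right away and can look up the outcome later.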

Combine Event Sourcing with Message Bus
[Diagram, five numbered steps: a request to create or update a submission reaches the Submission API, which holds an event stream (Event 0 to Event 4) and publishes the event to the bus. A worker determines if a settlement is needed, and calls the Settlement API, which holds its own event stream (Event 0 to Event 4) and publishes the event to the bus.]
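
The combination can be sketched as follows. Everything here is a simplification under stated assumptions: the bus is a synchronous in-process stand-in for a real message bus, and `Bus`, `EventSourcedApi` and `settlement_worker` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable


class Bus:
    """Toy publish/subscribe bus; handlers run inline for simplicity."""

    def __init__(self):
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


class EventSourcedApi:
    """Appends each change to its own event stream, then publishes it."""

    def __init__(self, name: str, bus: Bus):
        self.name, self.bus, self.events = name, bus, []

    def record(self, event: dict) -> None:
        self.events.append(event)           # the stream is the source of truth
        self.bus.publish(self.name, event)  # tell everyone else it happened


bus = Bus()
submissions = EventSourcedApi("submission", bus)
settlements = EventSourcedApi("settlement", bus)


def settlement_worker(event: dict) -> None:
    """Decide whether a settlement is needed and, if so, call the Settlement API."""
    if event.get("amount_due", 0) > 0:
        settlements.record({"submission_id": event["submission_id"], "pay": event["amount_due"]})


bus.subscribe("submission", settlement_worker)
submissions.record({"submission_id": "s-1", "amount_due": 20})
submissions.record({"submission_id": "s-2", "amount_due": 0})
```

The Submission API never calls the Settlement API directly; the worker is the only piece that knows about both.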

Auto-Retry and Circuit Breakers

Twine
When your solution is made of many processes
(functions, or microservices, or whatever), a lot
more of your regular coding involves Inter-Process
Calls (IPC)
• LINQ was made because working with sets is
very common in business apps – gives us a
standard model for thinking about sets
• Twine (and others) were made because in
microservices, IPCs are very common. Gives us
a standard model for thinking about IPC.
• Service discovery
• Protocol abstraction
• Load-balancing
• Auto-retry and backoff
• Auto-fallback and failover
• Circuit breakers
• Authentication (like JWT)
• Tracing
• Completely pluggable and extensible
• Implementations in .NET/C#, JavaScript, and Go
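
To make two of the listed features concrete, here is a generic sketch of auto-retry with exponential backoff and a simple circuit breaker. It does not use or imitate Twine's API; `CircuitBreaker`, `retry` and their parameters are illustrative names only.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit
        return result


def retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn, retrying failures with exponential backoff; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap a remote call as `retry(lambda: breaker.call(do_request))`, where `do_request` is whatever function performs the IPC.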

Builds & One-Click Deploy

• Make sure the package can be deployed to any environment (makes no assumptions)
• Separate Build from Deploy
• Setting environment-specific config or values is part of the deploy stage.
• With every previous build already packaged, one-click deploy makes it easy to roll back
Build: Check-in change ► Master ► Trigger automated build ► Build produces a deployable package to sit on a shelf forever
Deploy:
• Apply environment-specific config
• Remove previous version from target environment
• Add desired version to target environment
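
The build/deploy separation can be modeled in a few lines. This is a toy model, not a real pipeline: `Package`, `Environment`, `shelf` and the config values are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Package:
    """Immutable build output; contains nothing environment-specific."""
    version: str


@dataclass
class Environment:
    name: str
    config: dict                                  # environment-specific values
    running: dict = field(default_factory=dict)   # what is currently deployed


shelf: dict[str, Package] = {}  # every package ever built stays available


def build(version: str) -> Package:
    shelf[version] = Package(version)
    return shelf[version]


def deploy(package: Package, env: Environment) -> None:
    """Apply the environment's config, then swap the previous version for this one."""
    env.running = {"version": package.version, "config": dict(env.config)}


prod = Environment("prod", {"db_url": "db.prod.example"})
build("1.0")
build("1.1")
deploy(shelf["1.1"], prod)
deploy(shelf["1.0"], prod)  # rollback is just another deploy of an older package
```

Because the same package can go to any environment, rolling back needs no rebuild.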

• If you don’t have this, start here
• So many tools - already solved for any shop,
any platform
– TFS / VSO / Release Manager
– TeamCity
– Jenkins

Failsafe software == Fearless engineers
• Combining each of these patterns and
sticking to them religiously, we can
absolutely wreck our system and
everything will be OK
• When you know the chances of
breaking something are small, you can
be fearless

Testing and SDLC

Traditional Testing in an SDLC
1. Product owner writes user story and team refines
2a. Engineer implements the story
2b. Tester writes test cases
3. QA tests
4. Release
“The Loop”: a story cycles between the engineer and QA until it passes

• Track which engineer owns each story, and track each time it comes back from QA
• Number of cycles went way down
• Stories NEVER came back from QA due to failed acceptance criteria (ACs)
• All failures were non-obvious and unexpected downstream impacts outside the scope of the story

Two Kinds of QA…
Testing
• Exploratory
• Requires real domain expertise
• As well as historical knowledge of the
specific system
• And good judgement
Checking
• Verifying the business requirements are met
• Covers functional and non-functional
requirements
• As long as user stories are decent, anyone
with basic domain knowledge can do it

Don’t do this This is better…
But it’s still slow and unpredictable.
More predictable ► More rigorous ► Even slower

Gates are a Cost
• Engineering
accountability drives
down QA failures –
most of what we submit
to QA now passes… but
we still test it.
• We pay the cost of testing every time, but we only realize the benefit that one rare time testing catches a problem.

Automated Testing After Deployment – No Gates
1. Product owner writes user story and team refines, including defining each positive and negative test
2. Engineer implements the story, including building the automated tests
3a. Engineer adds the new or updated automated tests to the battery
3b. Deploys the changes (Release)
4. Automated test battery runs continuously against production: continuous feedback, 24/7

Why continuously, and why in production?
• Rules for acceptable and unacceptable behavior must hold not just when you change the software, but also when the conditions in which your software runs change (different and more data, different loads, etc.)
• The code changes you’re deploying are
simply one dimension of continuous change
• Fail-safe software won’t do anything
destructive
• Get from concept to effects on the real world
as quickly as possible

Why continuously, and why in production? (cont’d)
You can start to do more interesting things,
such as:
• Combine the one-step deployable artifacts
with automated tests to do automated
rollbacks
• Use live service tracing with automated tests to create an early-warning system (catch cascading failures before they happen)
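
An automated rollback can be as small as the sketch below: deploy, run the test battery against the live system, and redeploy the previous artifact if the battery fails. The function name and the two callables are hypothetical hooks into whatever deploy tool and test runner a team already has.

```python
from typing import Callable


def deploy_then_test(
    deploy: Callable[[str], None],
    run_test_battery: Callable[[], bool],
    new_version: str,
    previous_version: str,
) -> str:
    """Deploy first, test second; return the version left running."""
    deploy(new_version)
    if run_test_battery():
        return new_version
    deploy(previous_version)  # one-step artifacts make this a cheap operation
    return previous_version


# Demo with stand-ins: "deploying" appends to a list, and the battery
# fails whenever the hypothetical bad build is live.
live: list[str] = []
outcome = deploy_then_test(
    deploy=live.append,
    run_test_battery=lambda: live[-1] != "2.0-bad",
    new_version="2.0-bad",
    previous_version="1.9",
)
```

In the demo the bad build goes live briefly, the battery fails, and version 1.9 is restored without a human in the loop.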

Engineers have to write test cases?
• Every other kind of engineering (mechanical,
structural, etc.) characterizes and quantifies
the failure modes of the thing they’re
engineering.
• In what ways can it fail?
• What are the stressors and limits?
• What is the risk, impact, and mitigation of
each type of potential failure?

Make Quantifying Testing Part of User Story Requirements
• Describe each potential failure mode (“what is
the worst possible bug I could introduce?”)
• Y-Axis: Impact of Problem
1. Moderate impact
2. Significant but correctable impact
3. Irreparable harm to reputation or integrity, or
unrecoverable loss of cash
• X-Axis: Invisibility of Problem
1. Very unlikely to be missed during normal dev-
test cycle
2. Discoverable in diligent manual testing pass
3. Likely to be missed - subtle or complex behavior
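
The two axes can be captured as data so failure modes are comparable across stories. The slide does not say how the axes are combined, so multiplying them into a single score is purely an assumption of this sketch, as are the class and function names and the example failure modes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    impact: int        # Y-axis, 1 (moderate) to 3 (irreparable)
    invisibility: int  # X-axis, 1 (hard to miss) to 3 (likely to be missed)

    def __post_init__(self):
        for axis in ("impact", "invisibility"):
            if getattr(self, axis) not in (1, 2, 3):
                raise ValueError(f"{axis} must be 1, 2, or 3")

    @property
    def score(self) -> int:
        # Assumption: impact x invisibility as a rough priority, not the speaker's formula.
        return self.impact * self.invisibility


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Highest score first: these are the failure modes most worth an automated test."""
    return sorted(modes, key=lambda m: m.score, reverse=True)


modes = [
    FailureMode("Typo in confirmation email", impact=1, invisibility=1),
    FailureMode("Rebate paid twice", impact=3, invisibility=3),
    FailureMode("Form rejects valid receipt", impact=2, invisibility=1),
]
ranked = prioritize(modes)
```

With these example inputs the double payment ranks first and the typo last.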

Don’t Need Big Fancy Test Frameworks or Platforms
• Postman to craft your test calls
• Save out the Postman call definitions as files
right into your source control alongside the
thing they test
• newman is a CLI to programmatically run
Postman calls
• Use whatever code you want to look at the
response and decide whether it’s good or
bad
• Save the results to a DB or something
• Report on it!
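
The steps above can be sketched with Python's standard library around the newman CLI. Assumptions to note: newman is installed and on the PATH, its JSON reporter writes assertion counts under `run.stats.assertions`, and the table name, file paths and helper names are invented for the example.

```python
import json
import sqlite3
import subprocess


def newman_command(collection_path: str, report_path: str) -> list[str]:
    """Build the CLI call that runs a saved Postman collection and exports a JSON report."""
    return ["newman", "run", collection_path,
            "--reporters", "json", "--reporter-json-export", report_path]


def summarize(report: dict) -> dict:
    """Decide good or bad from the report (assumed layout: run.stats.assertions)."""
    assertions = report["run"]["stats"]["assertions"]
    return {"total": assertions["total"],
            "failed": assertions["failed"],
            "passed": assertions["failed"] == 0}


def save(db: sqlite3.Connection, collection: str, summary: dict) -> None:
    """Save the result so it can be reported on later."""
    db.execute("CREATE TABLE IF NOT EXISTS test_runs "
               "(collection TEXT, total INTEGER, failed INTEGER)")
    db.execute("INSERT INTO test_runs VALUES (?, ?, ?)",
               (collection, summary["total"], summary["failed"]))
    db.commit()


def run_collection(collection_path: str, report_path: str, db: sqlite3.Connection) -> dict:
    # check=False: a failing test run is a result to record, not a crash.
    subprocess.run(newman_command(collection_path, report_path), check=False)
    with open(report_path) as handle:
        summary = summarize(json.load(handle))
    save(db, collection_path, summary)
    return summary
```

Calling `run_collection` on a schedule against production gives the continuous battery; the stored rows are the raw material for the report.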

Where does QA fit?
Examples
• Mobile apps
• Hardware
Characteristics
• High MTTR* Components
• Qualitative, not quantitative
• User-facing only
• Not “is it broken?” (automated tests validate
that), but “is the experience as good as it
should be?”
* Mean Time To Resolution

Testing Maturity Ladder
1. Manual Testing: QA manually gates deployment
2. Test Engineers: Testers take on responsibility of automating their work
3. Engineers Test: Creating validation is a first-class part of every engineer’s daily work, and computers execute it.

Optimize your SDLC for Speed
When you automate validation during
development, you:
• Know any future change that has
unexpected and non-obvious
downstream breakage will be caught
deterministically
• Free up humans to do higher-value
work than verification
• Have continuous feedback on end-
to-end health of every potential failure
mode

Operations

“You Build It, You Run It”
• Make no distinction between
operations and development.
They are literally the same thing.
• “Operations” involves:
– Configuring the runtime environment
– Deploying the software
– Monitoring (and responding to) the
health of the runtime environment

Operations Standup
Operations Standup
• What errors got logged or tests failed in the
last 24 hours?
• Who will own resolving each one today?
Traditional Agile Standup
• What did you do yesterday?
• What do you plan to do today?
• What impediments stand in your way?
Mandate: zero errors unaccounted for!
No “oh yeah I’ve seen that error, it’s not
actually a problem”

Shoot Things, Just Because You Can
Fully Automated Infrastructure
• Servers have no names. We never even log
into them, except for forensics.
• Make changes in the middle of the day,
because the whole team is online and 100%
engaged
• Kill resources daily, just to force them to be
automatically replaced and verify everything
still works
Traditional Ops/Infrastructure
• “Is the error coming from SEGOT-APP-2503
or SEGOT-APP-2505? You know 03 tends to
get a little wobbly sometimes”
• Make changes in the middle of the night on
the weekend to “minimize impact”
• Reboot things very carefully
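
The daily kill can be illustrated with a small simulation. It is only a model of the idea: `Pool` stands in for whatever autoscaling or orchestration layer replaces instances, and the names and health check are hypothetical.

```python
import itertools
import random


class Pool:
    """Simulated pool of nameless servers kept at a desired size."""

    def __init__(self, desired: int):
        self._ids = itertools.count(1)
        self.desired = desired
        self.instances: set[str] = set()
        self.reconcile()

    def reconcile(self) -> None:
        """Automatically replace whatever is missing."""
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{next(self._ids)}")

    def kill_random(self, rng: random.Random) -> str:
        victim = rng.choice(sorted(self.instances))
        self.instances.remove(victim)
        return victim


def daily_kill(pool: Pool, rng: random.Random, healthy) -> bool:
    """Kill one instance on purpose, let it be replaced, then verify everything still works."""
    victim = pool.kill_random(rng)
    pool.reconcile()
    return victim not in pool.instances and healthy(pool)


pool = Pool(desired=3)
survived = daily_kill(pool, random.Random(0), healthy=lambda p: len(p.instances) == p.desired)
```

If `daily_kill` ever returns False, the replacement path is broken, and it is better to learn that at midday with the whole team online.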


THANK YOU! @rexm rexm
