Distributed Systems Are a UX Problem

@tyler_treat
Distributed Systems Are a UX Problem
Tyler Treat / O’Reilly Software Architecture Conference / October 30, 2018

@tyler_treat
Tyler Treat 
tyler.treat@realkinetic.com

@tyler_treat
I like distributed systems.

@tyler_treat
Disclaimer: 
I know approximately nothing about UX…

@tyler_treat
…other than when I’m the user, I know when
my experience is good and when it’s bad.

@tyler_treat
[Diagram: UX, Systems, and Business, with "This Talk" marked]

@tyler_treat
The Yin and Yang of
UX and Architecture

@tyler_treat
[Diagram: many services, each labeled "Service"]

@tyler_treat
Good old days
book trip -> Trip Service -> Trip Database (transaction)

@tyler_treat
Microservices
book trip -> Trip Service
Trip Service -> Airline Service (transaction, ACID)
Trip Service -> Hotel Service (transaction, ACID)
Trip Service -> Car Service (transaction, ACID)

@tyler_treat
UX Implications of Microservices
• Data consistency
• Race conditions
• Performance
• Partial failure

@tyler_treat
So are microservices bad?

@tyler_treat
Microservices are about 
people scale.

@tyler_treat
A Study of Transparency and Adaptability of Heterogeneous
Computer Networks with TCP/IP and IPv6 Protocols 
Das, 2012
“Any change in a computing system, such as a new feature or new
component, is transparent if the system after change adheres to
previous external interface as much as possible while changing its
internal behavior.”

@tyler_treat
High Transparency: NFS
Low Transparency: FTP

@tyler_treat
Types of Transparencies
Access transparency
Location transparency
Migration transparency
Relocation transparency
Replication transparency
Concurrent transparency
Failure transparency
Persistence transparency
Security transparency

@tyler_treat
Transparency is about usability.

@tyler_treat
Usability vs. Control

@tyler_treat
RPC
Simplicity vs. Flexibility, Performance, Correctness

@tyler_treat
Erlang Message Passing
Simplicity vs. Flexibility, Performance, Correctness

@tyler_treat
RPC vs. Erlang Message Passing

@tyler_treat
Translating UX for developers:
APIs

@tyler_treat
Transparencies simplify the API
of a system.

@tyler_treat
UX is about deciding what
knobs to expose.

@tyler_treat
The Truth is Prohibitively Expensive
Balancing Consistency and UX

@tyler_treat
Good old days
book trip -> Trip Service -> Trip Database (transaction)
Transparency

@tyler_treat
Microservices
book trip -> Trip Service
Trip Service -> Airline Service (transaction, ACID)
Trip Service -> Hotel Service (transaction, ACID)
Trip Service -> Car Service (transaction, ACID)
Transparency

@tyler_treat
Spreadsheet service
Document service
Presentation service

@tyler_treat
Spreadsheet service
Document service
IAM service
consistent

@tyler_treat
Consistency is about ordering of
events in a distributed system.

@tyler_treat
Why is this hard?

@tyler_treat
So what can we do?

@tyler_treat
2PC Prepare
book trip -> Trip Service
Trip Service -> Airline Service, Hotel Service, Car Service: propose

@tyler_treat
2PC Prepare
Airline Service, Hotel Service, Car Service -> Trip Service: vote

@tyler_treat
2PC Commit
Trip Service -> Airline Service, Hotel Service, Car Service: commit/abort

@tyler_treat
2PC Commit
Airline Service, Hotel Service, Car Service -> Trip Service: done
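
The prepare and commit phases above can be sketched roughly as follows. This is a toy, in-process illustration only: the Participant class, its can_commit flag, and book_trip are hypothetical names, and real participants would sit behind network calls that can time out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    """Hypothetical stand-in for the airline, hotel, or car service."""
    name: str
    can_commit: bool = True
    state: str = "idle"

    def prepare(self) -> bool:
        # Phase 1: stage the work and vote yes or no.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def book_trip(participants: List[Participant]) -> bool:
    """Coordinator (the trip service): commit only on a unanimous yes."""
    # Phase 1 (prepare): ask every participant for a vote.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (commit): send the same decision to every participant.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Every booking waits on the slowest vote, and nothing in this sketch covers the coordinator failing between the two phases.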

@tyler_treat
Problems with 2PC
• Chatty protocol: beholden to network latency
• Limited throughput
• Transaction coordinator: single point of failure
• Blocking protocol: susceptible to deadlock

@tyler_treat
Three-Phase Commit

@tyler_treat
atomic clocks
NTP
GPS
TrueTime

@tyler_treat
Good news: 
we solved physics.

@tyler_treat
Bad news: 
it costs all the money.

@tyler_treat
Spanner: Google’s Globally-Distributed Database 
Corbett et al.

@tyler_treat
TrueTime forces that uncertainty to the
surface, and Spanner provides a
transparency over it.
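
One way to picture "forcing uncertainty to the surface" is a clock that returns an interval rather than an instant, plus a commit that waits until its timestamp has definitely passed. The sketch below follows the TT.now / TT.after idea described in the Spanner paper, but it is not Spanner's implementation: TrueTimeSketch, epsilon, and the injected clock and sleep functions are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTimeSketch:
    """Toy clock that reports an interval instead of a single instant."""

    def __init__(self, clock: Callable[[], float], epsilon: float):
        self._clock = clock      # hypothetical local clock source
        self._epsilon = epsilon  # assumed worst-case clock error, in seconds

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self._epsilon, t + self._epsilon)

    def after(self, timestamp: float) -> bool:
        # True only once the timestamp has definitely passed.
        return self.now().earliest > timestamp


def commit_wait(tt: TrueTimeSketch, sleep: Callable[[float], None]) -> float:
    """Pick a commit timestamp, then wait out the uncertainty before exposing it."""
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):
        sleep(0.001)
    return commit_ts
```

The wait is roughly twice epsilon, so the smaller the clock uncertainty, the less latency callers pay for it.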

@tyler_treat
Spanner doesn’t avoid trade-offs,
it just minimizes their probability.

@tyler_treat
Spanner is expensive and
proprietary.

@tyler_treat
But it’s not the end of the story…

@tyler_treat
Unless every service is backed by the
same database, you probably still have
to deal with consistency problems.

@tyler_treat
Challenges to Adopting Stronger Consistency at Scale 
Ajoux et al., 2015
“The biggest barrier to providing stronger consistency guarantees…is
that the consistency mechanism must integrate consistency across
many stateful services.”

@tyler_treat
Coordination is expensive because
processes can’t make progress
independently.

@tyler_treat
Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design

@tyler_treat
And what about partial failure?

@tyler_treat
Memories, Guesses, and Apologies
Dealing with Partial Knowledge

@tyler_treat
The cost of knowing the “truth”
can be prohibitively expensive.

@tyler_treat
And partial failure means the
“truth” is also fragile.

@tyler_treat
Where does this leave us?

@tyler_treat
We could go
back to the
monolith.

@tyler_treat
We could build
expensive data centers
with fancy hardware…

@tyler_treat
…or we could
rethink our
transparencies.

@tyler_treat
Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

@tyler_treat
Exception Handling in
Asynchronous Systems

@tyler_treat
Exception Handling in Asynchronous Systems
• Write-off
• Retry
• Compensating action
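
A minimal sketch of how the three strategies might fit together in code. The handle_failure function, its parameters, and the returned labels are hypothetical, and which branch applies is a business decision rather than a technical one.

```python
from typing import Callable


def handle_failure(
    action: Callable[[], None],
    compensate: Callable[[], None],
    max_retries: int = 3,
    worth_compensating: bool = True,
) -> str:
    """Try an action; on repeated failure, either compensate or write it off."""
    for _ in range(1 + max_retries):
        try:
            action()
            return "ok"
        except Exception:
            continue  # Retry: assumes the action is safe to repeat.
    if worth_compensating:
        compensate()  # Compensating action: undo work already done elsewhere.
        return "compensated"
    return "written off"  # Write-off: accept the loss and move on.
```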

@tyler_treat
Revisiting Two-Phase Commit

@tyler_treat
Sagas 
Garcia-Molina & Salem, 1987
“A long-lived transaction is a saga if it can be written as a sequence of
transactions that can be interleaved with other transactions…Either all
the transactions in a saga are successfully completed or
compensating transactions are run to amend a partial execution.”

@tyler_treat
Sagas split long-lived transactions into
individual, interleaved sub-transactions:
T = T1, T2, . . . , Tn

@tyler_treat
And each sub-transaction has a
compensating transaction:
C1, C2, . . . , Cn

@tyler_treat
Sagas guarantee one of two execution sequences:
T1, T2, . . . , Tn
T1, T2, . . . , Tj, Cj, . . . , C2, C1

@tyler_treat
book trip -> Trip Service
Trip Service -> Airline Service (transaction)
Trip Service -> Hotel Service (transaction)
Trip Service -> Car Service (transaction)

@tyler_treat
• Book flight
• Book hotel
• Book car
• Charge money
T = T1, T2, . . . , Tn

@tyler_treat
• Cancel flight
• Cancel hotel
• Cancel car
• Refund money
C1, C2, . . . , Cn
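
The trip booking saga could be sketched as below: run each Ti in order and, if one fails, run the compensations for the completed steps in reverse. The Step tuple, run_saga, and the log strings are hypothetical names for illustration, and a real saga would also persist its progress so it can resume after a crash.

```python
from typing import Callable, List, Tuple

# (name, T_i, C_i): a sub-transaction and its compensating transaction.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run T1..Tn; if a step fails, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, transaction, _ = step
        try:
            transaction()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, _, compensation in reversed(completed):
                compensation()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log
```

If the car booking fails after the flight and hotel succeed, the hotel is cancelled first and then the flight, which matches the second guaranteed sequence.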

@tyler_treat
Compensating transactions
must be idempotent.
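
Idempotence matters because a compensation may be retried or delivered more than once. A rough sketch, assuming a hypothetical in-memory HotelBookings service in which a repeated cancel is a no-op:

```python
class HotelBookings:
    """Hypothetical in-memory hotel service used only for illustration."""

    def __init__(self) -> None:
        self.active = set()
        self.refunds_issued = 0

    def book(self, booking_id: str) -> None:
        self.active.add(booking_id)

    def cancel(self, booking_id: str) -> None:
        # Idempotent: cancelling an already-cancelled booking does nothing,
        # so a retried or duplicated cancel never issues a second refund.
        if booking_id not in self.active:
            return
        self.active.remove(booking_id)
        self.refunds_issued += 1
```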

@tyler_treat
Sagas trade off isolation for
availability.

@tyler_treat
[Diagram: Trip Service, Airline Service, Hotel Service, and Car Service exchanging events]

@tyler_treat
System Properties Business Rules

@tyler_treat
Sean T. Allen
“People don’t want distributed transactions,
they just want the guarantees that distributed
transactions give them.”

@tyler_treat
CAP Theorem
• Consistency, Availability, Partition Tolerance
• When a partition occurs, do we:
  • Choose availability and give up consistency?
  - or -
  • Choose consistency and give up availability?
  (or YOLO it)

@tyler_treat
The CAP theorem is a UX
question…

@tyler_treat
When a partial failure occurs, how do
you want the application to behave?

@tyler_treat
We can choose consistency and
sacrifice availability…

@tyler_treat
…or we can choose availability by making
local decisions with the knowledge at
hand and designing the UX accordingly.

@tyler_treat
Managing partial failure is a matter
of dealing with partial knowledge…

@tyler_treat
…and managing risk.

@tyler_treat
Our risk appetite can drive business rules.
Check value < $10,000?
• yes: Clear locally
• no: Double check with all replicas before clearing
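
The check-clearing rule could be written as a small function. The $10,000 threshold comes from the slide; clear_check, local_balance, and confirm_with_replicas are hypothetical names, and the local balance test is an assumed stand-in for whatever "clear locally" means in a given system.

```python
from decimal import Decimal
from typing import Callable

LOCAL_CLEARING_LIMIT = Decimal("10000")  # threshold taken from the slide


def clear_check(
    amount: Decimal,
    local_balance: Decimal,
    confirm_with_replicas: Callable[[Decimal], bool],
) -> bool:
    """Decide whether to clear a check, coordinating only when the stakes are high."""
    if amount < LOCAL_CLEARING_LIMIT:
        # Low stakes: guess from local knowledge and accept the risk.
        return amount <= local_balance
    # High stakes: pay the coordination cost before clearing.
    return confirm_with_replicas(amount)
```

Raising or lowering the limit changes how often the system pays for coordination, which is why it reads as a business rule rather than a system property.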

@tyler_treat
Memories, guesses, and
apologies

@tyler_treat
Computers operate with partial
knowledge.

@tyler_treat
Either there’s a
disconnect with
the “real world”…

@tyler_treat
…or there’s a
disconnect
between systems.

@tyler_treat
Systems don’t make decisions,
they make guesses.

@tyler_treat
Systems have memory.

@tyler_treat
Memories help systems make
better guesses in the future.

@tyler_treat
Forgetfulness is a business
decision.

@tyler_treat
Sometimes the system guesses
wrong.

@tyler_treat
Systems need the capacity to
apologize.
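
As a rough illustration of memories, guesses, and apologies working together, consider a seller that decides from a possibly stale local count, remembers what it promised, and makes amends once the real number arrives. SeatSeller and all of its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeatSeller:
    local_view_of_seats: int  # possibly stale copy of the truth
    memory: List[str] = field(default_factory=list)
    apologies: List[str] = field(default_factory=list)

    def sell(self, order_id: str) -> bool:
        # Guess: decide using only local knowledge.
        if self.local_view_of_seats <= 0:
            return False
        self.local_view_of_seats -= 1
        self.memory.append(order_id)  # Memory: remember what was promised.
        return True

    def reconcile(self, actual_seats: int) -> None:
        # Apology: once the truth arrives, make amends for wrong guesses.
        oversold = len(self.memory) - actual_seats
        for _ in range(max(0, oversold)):
            order_id = self.memory.pop()
            self.apologies.append(f"apology owed for {order_id}")
```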

@tyler_treat
Customers judge you not by your
failures, but by how you handle your
failures.

@tyler_treat
Are you building systems that never
fail or systems that fail gracefully?

@tyler_treat
Businesses need both code and
people to manage apologies.

@tyler_treat
It becomes less about trying to build the
perfect system and more about how we
cope with an imperfect one.

@tyler_treat
Wrapping Up
Summary and Observations

@tyler_treat
System Properties
• ACID
• distributed transactions
• exactly-once delivery
• ordered delivery
• serializable isolation
• linearizability
Business Rules / Application Invariants
• negative account balance
• two users sharing same ID
• room double-booked
• balance reconciles
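
An application invariant such as "no room is double-booked" can be checked directly in application code, whatever the infrastructure underneath promises. A minimal sketch, assuming a hypothetical booking record of (room, night):

```python
from collections import Counter
from typing import Iterable, List, Tuple

Booking = Tuple[str, str]  # (room, night); a hypothetical record shape


def double_booked(bookings: Iterable[Booking]) -> List[Booking]:
    """Return every (room, night) slot that appears more than once."""
    counts = Counter(bookings)
    return sorted(slot for slot, n in counts.items() if n > 1)
```

A periodic check like this does not prevent a violation, but it finds one so the business can compensate.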

@tyler_treat
We put ourselves at the mercy of our
infrastructure and hope it makes good
on its promises.

@tyler_treat
It often doesn’t.
Kyle Kingsbury, 2015 http://jepsen.io

@tyler_treat
When do we actually need
consistency?

@tyler_treat
We can use consistency when the
stakes are high and the cost is worth it.

@tyler_treat
And design our transparencies
accordingly.

@tyler_treat
We could try to build perfect
systems.

@tyler_treat
Should we build perfect
systems or pragmatic systems?

@tyler_treat
Systems that can compensate.

@tyler_treat
Systems that can recover.

@tyler_treat
Systems that can apologize.

@tyler_treat
Data Consistency
Race Conditions
Performance
Partial Failure
Transparency
Informs

@tyler_treat
Thank You
bravenewgeek.com 
realkinetic.com

@tyler_treat
References
• https://gotocon.com/dl/goto-chicago-2015/slides/CaitieMcCaffrey_ApplyingTheSagaPattern.pdf
• http://ijcsits.org/papers/vol2no62012/42vol2no6.pdf
• http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
• https://queue.acm.org/detail.cfm?id=2745385
• https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf
• http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf
• https://bravenewgeek.com/distributed-systems-are-a-ux-problem/
• http://www.cs.princeton.edu/~wlloyd/papers/challenges-hotos15.pdf
• https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf
• https://www.youtube.com/watch?v=lsKaNDj4TrE
• Starbucks photo - https://www.geekwire.com/2015/starbucks-mobile-ordering-now-blankets-the-u-s-with-coverage-in-san-francisco-new-york-and-more-coming-today/
• Friction image - https://byjus.com/physics/friction-in-automobiles/
• Carbon copy forms - http://www.rainiercopy.com/forms.html
• Rosetta Stone photo - https://en.wikipedia.org/wiki/Rosetta_Stone#/media/File:Rosetta_Stone.JPG
