SlideShare ist ein Scribd-Unternehmen logo
1 von 72
Fault tolerance 101
Joe Armstrong
Monday, March 3, 2014
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/fault-tolerance-101-qcon-london
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Fault
• “behaves as per specification”
• “does not crash”
Monday, March 3, 2014
Many systems have no
specification
Monday, March 3, 2014
Programming is the act of
turning an inexact
description of something
(the specification) into an
exact description of the thing
(the program)
Monday, March 3, 2014
A program is the most precise
description of the problem
that we have
Monday, March 3, 2014
• The ability to behave in a sensible manner in
the presence of failure. Consumer so$ware,
websites, ...
• The ability to behave exactly as specified
despite failures. Air traffic control, nuclear power
station control.
What is fault tolerance?
Exact specification is
extremely difficult
“In a sensible manner” is
rather wooly
When there is no spec -
“in a sensible manner”
means - does not crash
Monday, March 3, 2014
• History
• Hardware Fault Tolerance
• Software Fault Tolerance
• Specifications and code
• Erlang FT
• Demo
Monday, March 3, 2014
We cannot prevent failures
Monday, March 3, 2014
Automata Studies
ed. C. Shannon
Princ. Univ. Press 1956
Monday, March 3, 2014
Q: Can we make reliable systems
that behave reasonably from
unreliable components?
A: Yes
Monday, March 3, 2014
The Cornerstones of FT
• Detect Errors
• Correct Errors
• Stop Errors from Propagating
Monday, March 3, 2014
Needs > 1 computer
Computer 1
does the job
Computer 2
watches computer 1
Computer 3
watches computer 1
Computer 3
watches computer 1
Computer ...
watches computer 1
Error detection must work across
machine boundaries
Must write distributed programs
Programs run in para&el Decoupling and separation helps
stop errors 'om propagating
Monday, March 3, 2014
Things to ponder
• Hardware can fail
• Software either complies with
a spec = works or does not do
what the spec says = fails
• What should the software do
when the system behaves in a
way that is not described in
the spec?
• What do we do when we don’t
have a spec?
• Can we make reliable systems
that behave reasonably from
unreliable components?
• Detecting or masking errors?
• Correcting errors
• Propagation of errors
• Error firewalls
• Self-repairing zones
• Static/Dynamic error
detection
Monday, March 3, 2014
Hardware fault tolerance
• System that mask (hide) errors and use
redundancy to mask errors.
Examples: RAID disks, error correcting bits
in memory hardware etc.
Monday, March 3, 2014
Tandem nonstop II (1981)
Monday, March 3, 2014
Tandem ...
Tandem Computers, Inc. was the
dominant manufacturer of fault-
tolerant computer systems for ATM
networks,banks, stock exchanges,
telephone switching centers, and
other similar commercial transaction
processing applications requiring
maximum uptime and zero data loss.
To contain the scope of failures and of corrupted
data, these multi-computer systems have no
shared central components, not even main
memory. Conventional multi-computer systems all
use shared memories and work directly on shared
data objects. Instead, NonStop processors
cooperate by exchanging messages across a
reliable fabric, and software takes periodic
snapshots for possible rollback of program
memory state.
Besides handling failures well, this "shared-nothing"
messaging system design also scales extremely well
to the largest commercial workloads. Each doubling of
the total number of processors would double system
throughput, up to the maximum configuration of 4000
processors. In contrast, the performance of
conventional multiprocessor systems is limited by the
speed of some shared memory, bus, or switch. Adding
more than 4–8 processors that way gives no further
system speedup. NonStop systems have more often
been bought to meet scaling requirements than for
extreme fault tolerance. They compete well against
IBM's largest mainframes, despite being built from
simpler minicomputer technology.
A& quotes 'om Wikipedia
Monday, March 3, 2014
1.10 on tuesday dec 10
Monday, March 3, 2014
Monday, March 3, 2014
Monday, March 3, 2014
What do we do when we
detect an error?
• Mask it (try again)
• Do nothing (crash later - not a tota&y bri&iant
idea)
• Or ...
Monday, March 3, 2014
LET
IT
CRASH
Monday, March 3, 2014
Programming the Ericsson
Diavox (1976)
If you’re in a three-
way call at any time
you can press the #
key then press 1 to
talk to party 1
2 to talk to party 2
or * to enter a
conference call
Monday, March 3, 2014
if(state == 3waycall && key == “#”){
key = get_next_key();
if(key==”1”){
park(2);
connect([self,1]);
} elseif(key==”2”){
park(1);
connect([self,2]);
} elseif (key==”*”){
connect([self,1,2]);
} elseif(key=”onhook”){
/* Uuugh what do I do here */
}
Defensive
programming
Monday, March 3, 2014
• The Spec tells what to do when things happen
• The Spec does not say what to do when the
behavior goes “off-spec”
• The number of ways we can go “off spec” is
huge
• Most specifications do not include failure
analysis, and do not say what to do when you
are “off spec”
Oh Dear
Monday, March 3, 2014
Joe: “So what happens if we’re in a 3-way conference,
and the guy processes hash and then puts the hook
down, and doesn’t press 1 2 or star?”
Bernt: “So what you do is stop the conference, send the
phone a ring tone and when they answer go back to the
point where you were expecting them to enter 1 2 or
star.”
Joe: “But that’s not in the spec.”
Bernt: “But everybody knows.”
Joe: “I didn’t know.”
Monday, March 3, 2014
Calls are “files”
• If a process crashes the OS closes all files
opened by the process
• If a call crashes the OS closes all calls opened
by the process
• The OS’s job is to “keep files safe” (ie it
maintains invariants)
Monday, March 3, 2014
Let it crash philosophy
• If a processes crashes the OS detects this
• The OS protects the resources being used by
the process
• Programs should crash when going off spec
Monday, March 3, 2014
if(state == 3waycall && key == “#”){
key = get_next_key();
if(key==”1”){
park(2);
connect([self,1]);
} elseif(key==”2”){
park(1);
connect([self,2]);
} elseif (key==”*”){
connect([self,1,2]);
} else{
exit(out_of_spec1);
}
}
Defensive
programming
Monday, March 3, 2014
confcall(“#”) ->
case get_next_key() of
”1” ->
park(2);
connect([self,1]);
”2” ->
park(1);
connect([self,2]);
”*” ->
connect([self,1,2])
end.
Failed Patten
matching provides
the exit
Non defensive
programming -
there is no error
detection or
correction code
Monday, March 3, 2014
Are hardware
and software
faults are
fundamentally
different?
Monday, March 3, 2014
Are there any pure functions?
Monday, March 3, 2014
Class (a) functions: If computing f(X)
fails and f is a pure function computing
f(X) will always fail.
Class (b) functions: If computing f(X)
fails and f is a non-pure function it
might succeed if we call f(X) again.
Monday, March 3, 2014
Is this a pure function?
function f(){
int a = 10,
int b = 2,
return a/b
}
Monday, March 3, 2014
function f(){
int a = 10,
int b = 2,
return a/b
}
Cosmic ray hits the memory
ce& where b is stored and
changes the 2 into zero
A heisenbug
Monday, March 3, 2014
Monday, March 3, 2014
• Heisenbug - Bug that that seems to disappear or alter its
behavior when one attempts to study it
• Bohrbug - A "good, solid bug". Like the deterministic Bohr
atom model, they do not change their behavior and are
relatively easily detected.
• Mandelbug - (named after Benoît Mandelbrot's fractal) is a
bug whose causes are so complex it defies repair, or makes its
behavior appear chaotic or even non-deterministic.
• Schrödinbug (named after Erwin Schrödinger and his
thought experiment) is a bug that manifests itself in running
software after a programmer notices that the code should
never have worked in the first place.
• Hindenbug (named after Hindenburg disaster) is a bug with
catastrophic behavior.
Source: wikipedia
Monday, March 3, 2014
• If a process fails restart it (fixes many heisenbugs,
especia&y those due to subtle timing errors)
• If you have tried restarting a process more than
N times in K seconds, then give up. Try and do
something simpler instead.
• Build trees of processes, if low-level nodes fail
and cannot be restarted fail higher up the tree
Monday, March 3, 2014
Supervision trees
workers
supervisors
Don’t forget the manual
backup :-)
Monday, March 3, 2014
The failure model
is part of the specification
(especially for air-traffic
control software etc.)
The customer should
understand the failure model
Monday, March 3, 2014
I want fault tolerant storage
That’s impossible
We’ll make three copies of your data,
on three different machines. We’ll
guarantee that if one machine crashes
you’ll never lose any data
what happens if 2 machines crash
at the same time
You can still save data on the third
machine, but it will be unsafe. Our
guarantee will not apply.
But I want more safety
Monday, March 3, 2014
We’ll make five copies of your data, on
five different machines. We’ll
guarantee that if two machines crashes
you’ll never lose any data
what happens if 3 machines crash
at the same time
You can still save data on machine 4
and 5, but it will be unsafe. Our
guarantee will not apply.
Why is it unsafe? - it’s stored on two machines
Because when machines 1,2,3 come
back to life they might outvote the
changes on machines 4 and 5
Monday, March 3, 2014
You have to explain in the
contract the failure
assumptions and what will
happen if these failures occur.
If a failure occurs that is not
planned it is not covered by
the contract.
“act of God”
Monday, March 3, 2014
Detecting
Errors
Monday, March 3, 2014
Sequential Languages
function c(){
...
if(...){
throw ...
}
}
function a(){
try {
b();
} catch (...) {
...
throw ...
}
}
function b(){
x();
c();
y();
}
• Function calls put call frames
on the stack
• Try instruction put
catchpoints on the stack
• Exceptions unwind the stack
to the last catchpoint
Monday, March 3, 2014
Uncaught Exceptions
• What happens if the exception gets to the top of
the stack and no catchpoint handlers is found?
Java: print a stack trace and exit
C: core dumped
Erlang: Process dies some other process on the same or
some other machine possibly catches the error
Monday, March 3, 2014
Sequential Languages
C
program
File 1 File 2
Operating System
Crash
close close
When a process crashes the
OS notices this and closes any
resources owned by the
process
Monday, March 3, 2014
Erlang
Operating System
When an Erlang process crashes the
Erlang VM notices this and sends
messages to any linked processes
Process45
Crash
Process23
process 45 crashed
Process92
process 45 crashed
Erlang VM
Monday, March 3, 2014
Erlang
Unix OS
Erlang VM
P10
Windows
Erlang VM
P245
Crash process 10 crashed
Monday, March 3, 2014
Demo
1. Start a process on one machine. Send it a
message so it crashes.
2. Start a process on one machine. Send it a
message so it crashes. Detect the crash
3.Start a process on a remote machine. Send it a
message so it crashes. Detect the error on a
remote machine.
Monday, March 3, 2014
prog1.erl
-module(prog1).
-export([loop/0]).
loop() ->
receive
! N ->
! io:format("node=~p 1/~p = ~p~n",
[node(), N, 1/N]),
! loop()
end.
Monday, March 3, 2014
One machine
$ erl
Eshell V5.10.1 (abort with ^G)
1> P = spawn(prog1, loop, []).
<0.34.0>
2> P ! 12.
node=nonode@nohost 1/12 = 0.08333333333333333
12
3> P ! 0.
0
4>
=ERROR REPORT==== 29-Nov-2013::13:07:26 ===
Error in process <0.34.0> with exit value:
{badarith,[{prog1,loop,0,[{file,"prog1.erl"},{line,7}]}]}
4> P ! 12.
12
Monday, March 3, 2014
monitor.erl
-module(monitor).
-export([process/1]).
process(Pid) ->
spawn(fun() ->
! ! process_flag(trap_exit, true),
! ! link(Pid),
! ! monitor(Pid)
! end).
monitor(Pid) ->
receive
! Any ->
! io:format("Monitor ~p received ~p~n",[Pid,Any]),
! monitor(Pid)
end.
Monday, March 3, 2014
One machine + Monitor
Eshell V5.10.1 (abort with ^G)
1> P = spawn(prog1, loop, []).
<0.34.0>
2> monitor:process(P).
<0.36.0>
3> P ! 12.
node=nonode@nohost 1/12 = 0.08333333333333333
12
4> P ! 0.
Monitor <0.34.0> received
{'EXIT',<0.34.0>,
{badarith,
[{prog1,loop,0,
[{file,"prog1.erl"},{line,7}]}]}}
The process dies and a
message is sent to the
monitor process
Monday, March 3, 2014
Two machines and a monitor
$ erl -sname one
(one@joe)1> P = spawn('two@joe', prog1, loop, []).
<6803.43.0>
(one@joe)2> monitor:process(P).
<0.47.0>
(one@joe)4> P ! 10.
10
node=two@joe 1/10 = 0.1
(one@joe)5> P ! 0.
0
Monitor <6803.43.0> received
{'EXIT',<6803.43.0>,
{badarith,
[{prog1,loop,0,
[{file,"prog1.erl"},{line,7}]}]}}
$ erl -sname two
(two@joe)1>
Or we could ki&
the machine?
Monday, March 3, 2014
Reminder
Operating System
When an Erlang process
crashes the Erlang notices
this and te&s and linked
processes
Process 200
Crash
Process300
process 200 crashed
Erlang VM
Monday, March 3, 2014
Monday, March 3, 2014
Defensive
programming
is a consequence of a
bad concurrency
model
Monday, March 3, 2014
We’ve detected an error
what do we do next?
Monday, March 3, 2014
I’ve detected an error, what should I do?
Try again - it might be a heisenbug
Ok - give up, and tell you’re boss you
gave up. You did your best, nobody will
blame you.
I tried again ten time but it didn’t
help
.... *@!%$!!**&%%%!!!%$#@*** #$@
We have a problem Huston
Monday, March 3, 2014
Do not fail silently
if you cannot do exactly what
you are supposed to do crash.
Somebody else will fix the
problem
Monday, March 3, 2014
Summary
• No shared memory
• Pure message passing
• Remote Error Detection
• Replicated hardware and software on separated machines
• Crash when you get an error
• Do not fail silently
• Some other process fixes the error
Monday, March 3, 2014
Does this
strategy work?
Monday, March 3, 2014
•2002 Alexey Shchepin started building an XMPP server
fully in Erlang
•2005 Process One Founded
•2007 Facebook Chat (build on ejabberd) "the only chat
server with built-in clustering"
•2008 Facebook chat in Erlang
•2009 Feb 175M active users (Dropped and rewrite in C++)
•2009 June 8 Jan Koum gets ejabberd working
•2013 2 Jan - 18 B messages/day
•2013 Feb - Chef11 used by facebook/google/Amazon
•2014 19 Feb -19B$ WhatsApp bought by facebook
Monday, March 3, 2014
Monday, March 3, 2014
Monday, March 3, 2014
Monday, March 3, 2014
Monday, March 3, 2014
Finally
• Design with small isolated components
• Fault Tolerant = Scalable
• Small components = Understandable
Monday, March 3, 2014
Questions
Monday, March 3, 2014
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/fault-
tolerance-101-qcon-london

Weitere ähnliche Inhalte

Was ist angesagt?

Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
Renato Lucindo
 
Components in real time systems
Components in real time systemsComponents in real time systems
Components in real time systems
Saransh Garg
 
dos mutual exclusion algos
dos mutual exclusion algosdos mutual exclusion algos
dos mutual exclusion algos
Akhil Sharma
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
guest61205606
 
Real Time Operating System
Real Time Operating SystemReal Time Operating System
Real Time Operating System
vivek223
 

Was ist angesagt? (20)

Fault tolerance and computing
Fault tolerance  and computingFault tolerance  and computing
Fault tolerance and computing
 
Fault tolerance in Information Centric Networks
Fault tolerance in Information Centric NetworksFault tolerance in Information Centric Networks
Fault tolerance in Information Centric Networks
 
Fault tolerance techniques
Fault tolerance techniquesFault tolerance techniques
Fault tolerance techniques
 
9 fault-tolerance
9 fault-tolerance9 fault-tolerance
9 fault-tolerance
 
Reliability and clock synchronization
Reliability and clock synchronizationReliability and clock synchronization
Reliability and clock synchronization
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 
Principles of Compiler Design
Principles of Compiler DesignPrinciples of Compiler Design
Principles of Compiler Design
 
Components in real time systems
Components in real time systemsComponents in real time systems
Components in real time systems
 
dos mutual exclusion algos
dos mutual exclusion algosdos mutual exclusion algos
dos mutual exclusion algos
 
Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)Dependable Systems - Summary (16/16)
Dependable Systems - Summary (16/16)
 
Distributed Operating System_2
Distributed Operating System_2Distributed Operating System_2
Distributed Operating System_2
 
Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
 
Distributed systems and scalability rules
Distributed systems and scalability rulesDistributed systems and scalability rules
Distributed systems and scalability rules
 
4. system models
4. system models4. system models
4. system models
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Application and Website Security -- Designer Edition: Using Formal Specificat...
Application and Website Security -- Designer Edition:Using Formal Specificat...Application and Website Security -- Designer Edition:Using Formal Specificat...
Application and Website Security -- Designer Edition: Using Formal Specificat...
 
Real time operating systems
Real time operating systemsReal time operating systems
Real time operating systems
 
Security in distributed systems
Security in distributed systems Security in distributed systems
Security in distributed systems
 
Real Time Operating System
Real Time Operating SystemReal Time Operating System
Real Time Operating System
 
Functions of the Operating System
Functions of the Operating SystemFunctions of the Operating System
Functions of the Operating System
 

Andere mochten auch

Mi presentacion
Mi presentacionMi presentacion
Mi presentacion
AlejaPotts
 
DTZ CHINA INSIGHT Internet +-CHN
DTZ CHINA INSIGHT Internet +-CHNDTZ CHINA INSIGHT Internet +-CHN
DTZ CHINA INSIGHT Internet +-CHN
Han Zhang
 

Andere mochten auch (17)

NTE Seg3A-B-C
NTE Seg3A-B-CNTE Seg3A-B-C
NTE Seg3A-B-C
 
Mi presentacion
Mi presentacionMi presentacion
Mi presentacion
 
QoS -Aware Spectrum Sharing for Multi-Channel Vehicular Network
QoS -Aware Spectrum Sharing for Multi-Channel Vehicular NetworkQoS -Aware Spectrum Sharing for Multi-Channel Vehicular Network
QoS -Aware Spectrum Sharing for Multi-Channel Vehicular Network
 
Initial covers I made
Initial covers I madeInitial covers I made
Initial covers I made
 
Testing and Performance of Parabolic Trough Collector in Indian climate
Testing and Performance of Parabolic Trough Collector in Indian climateTesting and Performance of Parabolic Trough Collector in Indian climate
Testing and Performance of Parabolic Trough Collector in Indian climate
 
PARTES RICOH 1515
PARTES RICOH 1515PARTES RICOH 1515
PARTES RICOH 1515
 
DEFINICIÓN DE HARDWARE Y SOFTWARE.
DEFINICIÓN DE HARDWARE Y SOFTWARE.DEFINICIÓN DE HARDWARE Y SOFTWARE.
DEFINICIÓN DE HARDWARE Y SOFTWARE.
 
Poster mock ups
Poster mock upsPoster mock ups
Poster mock ups
 
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
Making Distributed Data Persistent Services Elastic (Without Losing All Your ...
 
Performance Characteristic of Infinitely Short Bearing
Performance Characteristic of Infinitely Short BearingPerformance Characteristic of Infinitely Short Bearing
Performance Characteristic of Infinitely Short Bearing
 
Dealing with Substance Abuse in the Workplace
Dealing with Substance Abuse in the WorkplaceDealing with Substance Abuse in the Workplace
Dealing with Substance Abuse in the Workplace
 
Informe de inversion publicitaria en internet iab peru - 2015 semestre 1 pa...
Informe de inversion publicitaria en internet   iab peru - 2015 semestre 1 pa...Informe de inversion publicitaria en internet   iab peru - 2015 semestre 1 pa...
Informe de inversion publicitaria en internet iab peru - 2015 semestre 1 pa...
 
Direccion ipv4
Direccion ipv4Direccion ipv4
Direccion ipv4
 
SE_6_Edn
SE_6_EdnSE_6_Edn
SE_6_Edn
 
Guano de la isla
Guano de la islaGuano de la isla
Guano de la isla
 
La Europa feudal. S. IX -XI Tema 5 2ºeso curso 2017
La Europa feudal. S. IX -XI Tema 5 2ºeso curso 2017La Europa feudal. S. IX -XI Tema 5 2ºeso curso 2017
La Europa feudal. S. IX -XI Tema 5 2ºeso curso 2017
 
DTZ CHINA INSIGHT Internet +-CHN
DTZ CHINA INSIGHT Internet +-CHNDTZ CHINA INSIGHT Internet +-CHN
DTZ CHINA INSIGHT Internet +-CHN
 

Ähnlich wie Fault Tolerance 101

DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
Alex Cruise
 
5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing
Mary Clemons
 

Ähnlich wie Fault Tolerance 101 (20)

Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015Debugging Complex Systems - Erlang Factory SF 2015
Debugging Complex Systems - Erlang Factory SF 2015
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
 
Eric Proegler Early Performance Testing from CAST2014
Eric Proegler Early Performance Testing from CAST2014Eric Proegler Early Performance Testing from CAST2014
Eric Proegler Early Performance Testing from CAST2014
 
Computer converted
Computer convertedComputer converted
Computer converted
 
What to Expect When You're Expecting (to Own Production)
What to Expect When You're Expecting (to Own Production)What to Expect When You're Expecting (to Own Production)
What to Expect When You're Expecting (to Own Production)
 
sat_presentation
sat_presentationsat_presentation
sat_presentation
 
Seacon Continuous Delivery Pipeline Tools Track
Seacon Continuous Delivery Pipeline Tools TrackSeacon Continuous Delivery Pipeline Tools Track
Seacon Continuous Delivery Pipeline Tools Track
 
Solving the 3 Biggest Questions in Continuous Testing
Solving the 3 Biggest Questions in Continuous TestingSolving the 3 Biggest Questions in Continuous Testing
Solving the 3 Biggest Questions in Continuous Testing
 
Unleashing the power of Unit Testing - Franck Ninsabira.pdf
Unleashing the power of Unit Testing - Franck Ninsabira.pdfUnleashing the power of Unit Testing - Franck Ninsabira.pdf
Unleashing the power of Unit Testing - Franck Ninsabira.pdf
 
Chaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field GuideChaos Engineering 101: A Field Guide
Chaos Engineering 101: A Field Guide
 
DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
 
5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing5-Ways-to-Revolutionize-Your-Software-Testing
5-Ways-to-Revolutionize-Your-Software-Testing
 
Lab 1 reference manual
Lab 1 reference manualLab 1 reference manual
Lab 1 reference manual
 
Disaster Proof
Disaster ProofDisaster Proof
Disaster Proof
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped Software
 
Survey Presentation About Application Security
Survey Presentation About Application SecuritySurvey Presentation About Application Security
Survey Presentation About Application Security
 
Basic Computer Skills.pptx
Basic Computer Skills.pptxBasic Computer Skills.pptx
Basic Computer Skills.pptx
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Software Bugs A Software Architect Point Of View
Software Bugs    A Software Architect Point Of ViewSoftware Bugs    A Software Architect Point Of View
Software Bugs A Software Architect Point Of View
 
Case of the Unexplained Support Issue – Troubleshooting steps for diagnosing ...
Case of the Unexplained Support Issue – Troubleshooting steps for diagnosing ...Case of the Unexplained Support Issue – Troubleshooting steps for diagnosing ...
Case of the Unexplained Support Issue – Troubleshooting steps for diagnosing ...
 

Mehr von C4Media

Mehr von C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

Fault Tolerance 101

  • 1. Fault tolerance 101 Joe Armstrong Monday, March 3, 2014
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /fault-tolerance-101-qcon-london
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Fault • “behaves as per specification” • “does not crash” Monday, March 3, 2014
  • 5. Many systems have no specification Monday, March 3, 2014
  • 6. Programming is the act of turning an inexact description of something (the specification) into an exact description of the thing (the program) Monday, March 3, 2014
  • 7. A program is the most precise description of the problem that we have Monday, March 3, 2014
  • 8. • The ability to behave in a sensible manner in the presence of failure. Consumer so$ware, websites, ... • The ability to behave exactly as specified despite failures. Air traffic control, nuclear power station control. What is fault tolerance? Exact specification is extremely difficult “In a sensible manner” is rather wooly When there is no spec - “in a sensible manner” means - does not crash Monday, March 3, 2014
  • 9. • History • Hardware Fault Tolerance • Software Fault Tolerance • Specifications and code • Erlang FT • Demo Monday, March 3, 2014
  • 10. We cannot prevent failures Monday, March 3, 2014
  • 11. Automata Studies ed. C. Shannon Princ. Univ. Press 1956 Monday, March 3, 2014
  • 12. Q: Can we make reliable systems that behave reasonably from unreliable components? A: Yes Monday, March 3, 2014
  • 13. The Cornerstones of FT • Detect Errors • Correct Errors • Stop Errors from Propagating Monday, March 3, 2014
  • 14. Needs > 1 computer Computer 1 does the job Computer 2 watches computer 1 Computer 3 watches computer 1 Computer 3 watches computer 1 Computer ... watches computer 1 Error detection must work across machine boundaries Must write distributed programs Programs run in para&el Decoupling and separation helps stop errors 'om propagating Monday, March 3, 2014
  • 15. Things to ponder • Hardware can fail • Software either complies with a spec = works or does not do what the spec says = fails • What should the software do when the system behaves in a way that is not described in the spec? • What do we do when we don’t have a spec? • Can we make reliable systems that behave reasonably from unreliable components? • Detecting or masking errors? • Correcting errors • Propagation of errors • Error firewalls • Self-repairing zones • Static/Dynamic error detection Monday, March 3, 2014
  • 16. Hardware fault tolerance • System that mask (hide) errors and use redundancy to mask errors. Examples: RAID disks, error correcting bits in memory hardware etc. Monday, March 3, 2014
  • 17. Tandem nonstop II (1981) Monday, March 3, 2014
  • 18. Tandem ... Tandem Computers, Inc. was the dominant manufacturer of fault- tolerant computer systems for ATM networks,banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state. Besides handling failures well, this "shared-nothing" messaging system design also scales extremely well to the largest commercial workloads. Each doubling of the total number of processors would double system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors that way gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete well against IBM's largest mainframes, despite being built from simpler minicomputer technology. A& quotes 'om Wikipedia Monday, March 3, 2014
  • 19. 1.10 on tuesday dec 10 Monday, March 3, 2014
  • 22. What do we do when we detect an error? • Mask it (try again) • Do nothing (crash later - not a tota&y bri&iant idea) • Or ... Monday, March 3, 2014
  • 24. Programming the Ericsson Diavox (1976) If you’re in a three- way call at any time you can press the # key then press 1 to talk to party 1 2 to talk to party 2 or * to enter a conference call Monday, March 3, 2014
  • 25. if(state == 3waycall && key == “#”){ key = get_next_key(); if(key==”1”){ park(2); connect([self,1]); } elseif(key==”2”){ park(1); connect([self,2]); } elseif (key==”*”){ connect([self,1,2]); } elseif(key=”onhook”){ /* Uuugh what do I do here */ } Defensive programming Monday, March 3, 2014
  • 26. • The Spec tells what to do when things happen • The Spec does not say what to do when the behavior goes “off-spec” • The number of ways we can go “off spec” is huge • Most specifications do not include failure analysis, and do not say what to do when you are “off spec” Oh Dear Monday, March 3, 2014
  • 27. Joe: “So what happens if we’re in a 3-way conference, and the guy processes hash and then puts the hook down, and doesn’t press 1 2 or star?” Bernt: “So what you do is stop the conference, send the phone a ring tone and when they answer go back to the point where you were expecting them to enter 1 2 or star.” Joe: “But that’s not in the spec.” Bernt: “But everybody knows.” Joe: “I didn’t know.” Monday, March 3, 2014
  • 28. Calls are “files” • If a process crashes the OS closes all files opened by the process • If a call crashes the OS closes all calls opened by the process • The OS’s job is to “keep files safe” (ie it maintains invariants) Monday, March 3, 2014
  • 29. Let it crash philosophy • If a processes crashes the OS detects this • The OS protects the resources being used by the process • Programs should crash when going off spec Monday, March 3, 2014
  • 30. if(state == 3waycall && key == “#”){ key = get_next_key(); if(key==”1”){ park(2); connect([self,1]); } elseif(key==”2”){ park(1); connect([self,2]); } elseif (key==”*”){ connect([self,1,2]); } else{ exit(out_of_spec1); } } Defensive programming Monday, March 3, 2014
  • 31. confcall(“#”) -> case get_next_key() of ”1” -> park(2); connect([self,1]); ”2” -> park(1); connect([self,2]); ”*” -> connect([self,1,2]) end. Failed Patten matching provides the exit Non defensive programming - there is no error detection or correction code Monday, March 3, 2014
  • 32. Are hardware and software faults are fundamentally different? Monday, March 3, 2014
  • 33. Are there any pure functions? Monday, March 3, 2014
  • 34. Class (a) functions: If computing f(X) fails and f is a pure function computing f(X) will always fail. Class (b) functions: If computing f(X) fails and f is a non-pure function it might succeed if we call f(X) again. Monday, March 3, 2014
  • 35. Is this a pure function? function f(){ int a = 10, int b = 2, return a/b } Monday, March 3, 2014
  • 36. function f(){ int a = 10, int b = 2, return a/b } Cosmic ray hits the memory ce& where b is stored and changes the 2 into zero A heisenbug Monday, March 3, 2014
  • 38. • Heisenbug - Bug that that seems to disappear or alter its behavior when one attempts to study it • Bohrbug - A "good, solid bug". Like the deterministic Bohr atom model, they do not change their behavior and are relatively easily detected. • Mandelbug - (named after Benoît Mandelbrot's fractal) is a bug whose causes are so complex it defies repair, or makes its behavior appear chaotic or even non-deterministic. • Schrödinbug (named after Erwin Schrödinger and his thought experiment) is a bug that manifests itself in running software after a programmer notices that the code should never have worked in the first place. • Hindenbug (named after Hindenburg disaster) is a bug with catastrophic behavior. Source: wikipedia Monday, March 3, 2014
  • 39. • If a process fails restart it (fixes many heisenbugs, especia&y those due to subtle timing errors) • If you have tried restarting a process more than N times in K seconds, then give up. Try and do something simpler instead. • Build trees of processes, if low-level nodes fail and cannot be restarted fail higher up the tree Monday, March 3, 2014
  • 40. Supervision trees workers supervisors Don’t forget the manual backup :-) Monday, March 3, 2014
  • 41. The failure model is part of the specification (especially for air-traffic control software etc.) The customer should understand the failure model Monday, March 3, 2014
  • 42. I want fault tolerant storage That’s impossible We’ll make three copies of your data, on three different machines. We’ll guarantee that if one machine crashes you’ll never lose any data what happens if 2 machines crash at the same time You can still save data on the third machine, but it will be unsafe. Our guarantee will not apply. But I want more safety Monday, March 3, 2014
  • 43. We’ll make five copies of your data, on five different machines. We’ll guarantee that if two machines crashes you’ll never lose any data what happens if 3 machines crash at the same time You can still save data on machine 4 and 5, but it will be unsafe. Our guarantee will not apply. Why is it unsafe? - it’s stored on two machines Because when machines 1,2,3 come back to life they might outvote the changes on machines 4 and 5 Monday, March 3, 2014
  • 44. You have to explain in the contract the failure assumptions and what will happen if these failures occur. If a failure occurs that is not planned it is not covered by the contract. “act of God” Monday, March 3, 2014
  • 46. Sequential Languages function c(){ ... if(...){ throw ... } } function a(){ try { b(); } catch (...) { ... throw ... } } function b(){ x(); c(); y(); } • Function calls put call frames on the stack • Try instruction put catchpoints on the stack • Exceptions unwind the stack to the last catchpoint Monday, March 3, 2014
  • 47. Uncaught Exceptions • What happens if the exception gets to the top of the stack and no catchpoint handlers is found? Java: print a stack trace and exit C: core dumped Erlang: Process dies some other process on the same or some other machine possibly catches the error Monday, March 3, 2014
  • 48. Sequential Languages C program File 1 File 2 Operating System Crash close close When a process crashes the OS notices this and closes any resources owned by the process Monday, March 3, 2014
  • 49. Erlang Operating System When an Erlang process crashes the Erlang VM notices this and sends messages to any linked processes Process45 Crash Process23 process 45 crashed Process92 process 45 crashed Erlang VM Monday, March 3, 2014
  • 50. Erlang Unix OS Erlang VM P10 Windows Erlang VM P245 Crash process 10 crashed Monday, March 3, 2014
  • 51. Demo 1. Start a process on one machine. Send it a message so it crashes. 2. Start a process on one machine. Send it a message so it crashes. Detect the crash 3.Start a process on a remote machine. Send it a message so it crashes. Detect the error on a remote machine. Monday, March 3, 2014
  • 52. prog1.erl -module(prog1). -export([loop/0]). loop() -> receive ! N -> ! io:format("node=~p 1/~p = ~p~n", [node(), N, 1/N]), ! loop() end. Monday, March 3, 2014
  • 53. One machine $ erl Eshell V5.10.1 (abort with ^G) 1> P = spawn(prog1, loop, []). <0.34.0> 2> P ! 12. node=nonode@nohost 1/12 = 0.08333333333333333 12 3> P ! 0. 0 4> =ERROR REPORT==== 29-Nov-2013::13:07:26 === Error in process <0.34.0> with exit value: {badarith,[{prog1,loop,0,[{file,"prog1.erl"},{line,7}]}]} 4> P ! 12. 12 Monday, March 3, 2014
  • 54. monitor.erl -module(monitor). -export([process/1]). process(Pid) -> spawn(fun() -> ! ! process_flag(trap_exit, true), ! ! link(Pid), ! ! monitor(Pid) ! end). monitor(Pid) -> receive ! Any -> ! io:format("Monitor ~p received ~p~n",[Pid,Any]), ! monitor(Pid) end. Monday, March 3, 2014
  • 55. One machine + Monitor Eshell V5.10.1 (abort with ^G) 1> P = spawn(prog1, loop, []). <0.34.0> 2> monitor:process(P). <0.36.0> 3> P ! 12. node=nonode@nohost 1/12 = 0.08333333333333333 12 4> P ! 0. Monitor <0.34.0> received {'EXIT',<0.34.0>, {badarith, [{prog1,loop,0, [{file,"prog1.erl"},{line,7}]}]}} The process dies and a message is sent to the monitor process Monday, March 3, 2014
  • 56. Two machines and a monitor $ erl -sname one (one@joe)1> P = spawn('two@joe', prog1, loop, []). <6803.43.0> (one@joe)2> monitor:process(P). <0.47.0> (one@joe)4> P ! 10. 10 node=two@joe 1/10 = 0.1 (one@joe)5> P ! 0. 0 Monitor <6803.43.0> received {'EXIT',<6803.43.0>, {badarith, [{prog1,loop,0, [{file,"prog1.erl"},{line,7}]}]}} $ erl -sname two (two@joe)1> Or we could ki& the machine? Monday, March 3, 2014
  • 57. Reminder Operating System When an Erlang process crashes the Erlang notices this and te&s and linked processes Process 200 Crash Process300 process 200 crashed Erlang VM Monday, March 3, 2014
  • 59. Defensive programming is a consequence of a bad concurrency model Monday, March 3, 2014
  • 60. We’ve detected an error what do we do next? Monday, March 3, 2014
  • 61. I’ve detected an error, what should I do? Try again - it might be a heisenbug Ok - give up, and tell you’re boss you gave up. You did your best, nobody will blame you. I tried again ten time but it didn’t help .... *@!%$!!**&%%%!!!%$#@*** #$@ We have a problem Huston Monday, March 3, 2014
  • 62. Do not fail silently if you cannot do exactly what you are supposed to do crash. Somebody else will fix the problem Monday, March 3, 2014
  • 63. Summary • No shared memory • Pure message passing • Remote Error Detection • Replicated hardware and software on separated machines • Crash when you get an error • Do not fail silently • Some other process fixes the error Monday, March 3, 2014
  • 65. •2002 Alexey Shchepin started building an XMPP server fully in Erlang •2005 Process One Founded •2007 Facebook Chat (build on ejabberd) "the only chat server with built-in clustering" •2008 Facebook chat in Erlang •2009 Feb 175M active users (Dropped and rewrite in C++) •2009 June 8 Jan Koum gets ejabberd working •2013 2 Jan - 18 B messages/day •2013 Feb - Chef11 used by facebook/google/Amazon •2014 19 Feb -19B$ WhatsApp bought by facebook Monday, March 3, 2014
  • 70. Finally • Design with small isolated components • Fault Tolerant = Scalable • Small components = Understandable Monday, March 3, 2014
  • 72. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/fault- tolerance-101-qcon-london